
Statistisches Data Mining (StDM), Week 12
Oliver Dürr
Institut für Datenanalyse und Prozessdesign, Zürcher Hochschule für Angewandte Wissenschaften
oliver.duerr@zhaw.ch
Winterthur, 6 December 2016

Multitasking reduces learning efficiency: no laptops during the theory lecture. Lids closed or almost closed (sleep mode).

Overview of classification (until the end of the semester)

Classifiers:
- K-Nearest-Neighbors (KNN)
- Logistic Regression
- Linear Discriminant Analysis
- Support Vector Machine (SVM)
- Classification Trees
- Neural Networks (NN)
- Deep Neural Networks (e.g. CNN, RNN)

Evaluation:
- Cross-validation
- Performance measures
- ROC analysis / lift charts

Theoretical guidance / general ideas:
- Bayes classifier
- Bias-variance trade-off (overfitting)

Combining classifiers:
- Bagging
- Random Forest
- Boosting

Feature engineering:
- Feature extraction
- Feature selection

Overview of Ensemble Methods

Many instances of the same classifier:
- Bagging (bootstrapping & aggregating): create new data sets using the bootstrap, train a classifier on each, and average over all predictions (e.g. take the majority vote).
- Bagged trees: use bagging with decision trees.
- Random Forest: bagged trees with a special trick.
- Boosting: an iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.

Combining classifiers:
- Weighted averaging over predictions
- Stacking classifiers: use the output of the classifiers as input for a new classifier

Bagging / Random Forest (Chapter 8.2 in ISLR)

Bagging: Bootstrap Aggregating
Step 1: Create multiple data sets D_1, D_2, ..., D_{t-1}, D_t from the original training data D via the bootstrap.
Step 2: Build multiple classifiers C_1, C_2, ..., C_{t-1}, C_t, one on each data set.
Step 3: Combine the classifiers into C* by aggregating: mean or majority vote.
Source: Tan, Steinbach, Kumar

Why does it work? Suppose there are 25 base classifiers, each with error rate ε = 0.35, and assume the classifiers are independent (that's the hardest assumption). Take the majority vote: it is wrong only if 13 or more of the 25 classifiers are wrong. The number of wrong predictors is X ~ Bin(size = 25, p = 0.35):

> 1 - pbinom(12, size = 25, prob = 0.35)
[1] 0.06044491

$\sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$

Why does it work? (Figure: ensemble error vs. base error rate for 25 base classifiers.) An ensemble is only better than a single classifier if each classifier is better than random guessing!
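
The claim can be checked directly in R. A minimal sketch, assuming (as above) 25 independent base classifiers, that computes the majority-vote error as a function of the individual error rate eps:

# Majority-vote error of 25 independent base classifiers as a function of
# the individual error rate eps: the vote is wrong if >= 13 classifiers are wrong.
eps <- seq(0, 1, by = 0.05)
ensemble.err <- 1 - pbinom(12, size = 25, prob = eps)
round(cbind(eps, single = eps, ensemble = ensemble.err), 3)
# Below eps = 0.5 the ensemble beats a single classifier, above 0.5 it is worse.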

Reminder: Trees

library(rpart)
fit <- rpart(Kyphosis ~ ., data = kyphosis)
plot(fit)
text(fit, use.n = TRUE)
pred <- predict(fit, newdata = kyphosis)
> head(pred, 4)   # probabilities
     absent   present
1 0.4210526 0.5789474
2 0.8571429 0.1428571
3 0.4210526 0.5789474
4 0.4210526 0.5789474

Bagging trees
Decision trees suffer from high variance! If we randomly split the training data into 2 parts and fit a decision tree on each part, the results can be quite different.
Averaging reduces variance: for independent quantities, var ~ σ²/n (this does not hold strictly here).
Bagging for trees:
- Take B bootstrap samples from the training set.
- Train a decision tree on each bootstrap sample.
- For the test set, take the majority vote of all trained classifiers (or average their probabilities).
(Diagram as on the bagging slide: D → D_1, ..., D_t → C_1, ..., C_t → C*.)

Reminder: Bootstrapping
Resampling of the observed data set: each bootstrap sample Z*1, Z*2, ..., Z*B has the same size as the observed data set Z and is obtained by random sampling with replacement from the original data; the statistic of interest (here α̂*1, ..., α̂*B) is recomputed on each sample.
(Figure: original data Z with columns Obs, X, Y and bootstrap samples Z*1, ..., Z*B with the corresponding estimates α̂*1, ..., α̂*B.)
Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7
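
A minimal R sketch of the idea, using hypothetical data and the standard error of the median as the statistic of interest (the slide uses an estimate α̂):

# Bootstrap sketch: resample the observed data with replacement (same size as
# the original) and recompute the statistic on each bootstrap sample.
set.seed(1)
x <- rnorm(50)        # observed data set (hypothetical)
B <- 1000             # number of bootstrap samples
boot.medians <- replicate(B, {
  idx <- sample(length(x), size = length(x), replace = TRUE)
  median(x[idx])
})
sd(boot.medians)      # bootstrap estimate of the standard error of the median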

Solution to the exercise: bagging by hand

library(rpart)
library(MASS)   # for the Boston data set
Boston$crim <- Boston$crim > median(Boston$crim)

# Split into training and test set
idt <- sample(nrow(Boston), size = floor(nrow(Boston) * 3/4))
data.train <- Boston[idt, ]
data.test  <- Boston[-idt, ]

# Single tree
fit <- rpart(as.factor(crim) ~ ., data = data.train)
sum(predict(fit, data.test, type = 'class') == Boston$crim[-idt]) / nrow(data.test)  # 0.88

n.trees <- 100
preds <- rep(0, nrow(data.test))
for (j in 1:n.trees) {
  # Draw the bootstrap sample
  index.train <- sample(nrow(data.train), size = nrow(data.train), replace = TRUE)
  # Fit a tree to the bootstrap sample
  fit <- rpart(as.factor(crim) ~ ., data = data.train[index.train, ])
  # Predict the test set
  tree.fit.test <- predict(fit, data.test, type = 'class')
  preds <- preds + ifelse(tree.fit.test == TRUE, 1, -1)   # trick with +/-1 votes
}
sum(Boston$crim[-idt] == (preds > 0)) / nrow(data.test)   # 0.96

How to classify a new observation? (Figure: trees t = 1, t = 2, t = 3.) Two ways to derive the ensemble result:
1) Each tree has a winner class: take the class that was most often the winner (majority vote).
2) Average the predicted probabilities.
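
A minimal sketch of both aggregation rules, assuming the bagged trees were collected in a list fits (the loop on the previous slide only keeps the accumulated votes, so this list is hypothetical) and that data.test is the test set from the same example:

# 1) Majority vote over the per-tree winner classes
votes <- sapply(fits, function(f) as.character(predict(f, newdata = data.test, type = "class")))
vote.pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# 2) Average the predicted class probabilities, then threshold
probs <- sapply(fits, function(f) predict(f, newdata = data.test, type = "prob")[, 2])  # column 2 = P(positive class)
prob.pred <- rowMeans(probs) > 0.5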

A Comparison of Error Rates. Here the green line represents a simple majority-vote approach, and the purple line corresponds to averaging the probability estimates. Both do far better than a single tree (dashed red) and get close to the Bayes error rate (dashed grey). Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7

Example 2: Car Seat Data. The red line represents the test error rate using a single tree. The black line corresponds to the bagging error rate using the majority vote, while the blue line averages the probabilities. (Figure: test error rate vs. number of bootstrap data sets, 0 to 100.) Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7

Out-of-Bag Estimation: cross-validation on the fly. Since bootstrapping selects a random subset of observations to build each training data set, the non-selected observations can serve as test data for that tree. On average, each bagged tree makes use of around 2/3 of the observations, so roughly 1/3 of the observations are left over for testing. Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7
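
In R this comes for free with randomForest; a short sketch, reusing data.train from the bagging-by-hand example:

library(randomForest)
fit <- randomForest(as.factor(crim) ~ ., data = data.train)
fit                              # the printed summary includes the OOB estimate of the error rate
head(fit$predicted)              # OOB predictions (each observation predicted only by trees that did not see it)
fit$err.rate[fit$ntree, "OOB"]   # OOB error after all trees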

The importance of independence. It is a quite strong assumption that all classifiers are independent. Since the same features are used in each bag, the resulting classifiers are quite dependent; we don't really get new classifiers or trees. Averaging such highly correlated trees yields only a limited reduction in variance. De-correlate by any means! → Random Forest

Random Forest

Random Forests
It is a very efficient statistical learning method. It builds on the idea of bagging, but provides an improvement because it de-correlates the trees.
How does it work? Build a number of decision trees on bootstrapped training samples, but when building these trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors (usually m ≈ √p).

Random Forest: learning algorithm. At each split, sample m variables out of the p available; as usual for trees, choose the best splitting variable (and split point) among these m.

Why are we considering a random sample of m predictors instead of all p predictors for splitting? Suppose we have one very strong predictor in the data set along with a number of other moderately strong predictors; then in the collection of bagged trees, most or all of them will use the very strong predictor for the first split! All bagged trees will look similar, and hence all the predictions from the bagged trees will be highly correlated. Averaging many highly correlated quantities does not lead to a large variance reduction. Random forests therefore de-correlate the bagged trees, leading to a larger reduction in variance. This makes the individual trees weaker but enhances the overall performance.

Random Forest with different values of m. Note that when random forests are built using m = p, this amounts simply to bagging. (Figure: test classification error vs. number of trees, for m = p, m = p/2 and m = √p.)
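
A short sketch of this comparison via the OOB error, again reusing data.train from the bagging-by-hand example (m is controlled by the mtry argument):

library(randomForest)
p <- ncol(data.train) - 1                      # number of predictors
fit.bag <- randomForest(as.factor(crim) ~ ., data = data.train, mtry = p, ntree = 500)               # m = p: bagging
fit.rf  <- randomForest(as.factor(crim) ~ ., data = data.train, mtry = floor(sqrt(p)), ntree = 500)  # m = sqrt(p)
fit.bag$err.rate[500, "OOB"]   # OOB error, bagged trees
fit.rf$err.rate[500, "OOB"]    # OOB error, random forest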

Random Forest in R (basic)

library(randomForest)
fit <- randomForest(as.factor(crim) ~ ., data = data.train)
predict(fit, data.test)

####
# Parameters
randomForest(as.factor(crim) ~ ., data = data.train, ntree = 100, mtry = 42)
# ntree: number of trees grown
# mtry:  number of features considered at each split; if set to p → bagged trees

Variable Importance Measure. Bagging typically improves the accuracy over prediction using a single tree, but it is now hard to interpret the model! We have hundreds of trees, and it is no longer clear which variables are most important to the procedure. Thus bagging improves prediction accuracy at the expense of interpretability. But we can still get an overall summary of the importance of each predictor using relative influence plots. Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7

Variable Importance 1: performance loss by permutation. Determine the permutation importance of v4:
1) Obtain the standard OOB performance.
2) Randomly permute the values of v4 in the OOB samples and compute the OOB performance again. If the variable is not important, nothing happens. Hence:
3) Use the decrease in performance as the importance measure of v4.
(Figure: data matrix with columns v1, ..., v7.)
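
A hand-rolled sketch of the same idea, reusing data.train and data.test from the bagging example. For brevity the accuracy drop is measured on the test set rather than on the OOB samples (which is what randomForest itself does), and lstat is picked as an arbitrary example variable:

library(randomForest)
fit <- randomForest(as.factor(crim) ~ ., data = data.train)
acc <- function(d) mean(as.character(predict(fit, newdata = d)) == as.character(d$crim))
baseline <- acc(data.test)

d.perm <- data.test
d.perm$lstat <- sample(d.perm$lstat)   # permute one column, destroying its information
baseline - acc(d.perm)                 # drop in accuracy = permutation importance of lstat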

Variable Importance 2: score improvement at each split. (Figure: example trees with splits on v1, v3, v4, v5, v6.) At each split where a variable, e.g. v3, is used, the improvement of the score (decrease of the Gini index) is measured. The average over all these v3-involving splits in all trees is the Gini importance measure of v3.

Variable importance in R (varImpPlot)

library(randomForest)
library(ElemStatLearn)
heart <- SAheart

# importance = TRUE, otherwise no accuracy-based importance is computed
fit <- randomForest(as.factor(chd) ~ ., data = heart, importance = TRUE)
predict(fit, newdata = heart, type = 'prob')
varImpPlot(fit)

Interpreting Variable Importance
- RF in the standard implementation is biased towards continuous variables.
- Problems with correlations: if two variables are highly correlated, removing one yields no degradation in performance. The same holds for the Gini-index measure.
- Not specific to random forests: correlation often poses problems for importance measures.

Proximity: similarity between two observations according to a supervised RF. Given a pair of observations in the test data, the random forest determines their proximity by counting in how many trees both observations end up in the same leaf. Since the RF was built in a supervised mode, the proximity is also influenced by the class labels of the observations.
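
In R, proximities are obtained with the proximity argument of randomForest; a short sketch, reusing the SAheart data (heart) from the variable-importance slide:

library(randomForest)
fit <- randomForest(as.factor(chd) ~ ., data = heart, proximity = TRUE)
dim(fit$proximity)       # n x n matrix of pairwise proximities
fit$proximity[1:3, 1:3]  # fraction of trees in which observations i and j share a leaf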

Boosting
Consider a binary problem with classes coded as +1, -1. A sequence of classifiers C_m is learnt; before each step the training data are reweighted (depending on whether an example was misclassified).
Slides: Trevor Hastie

An experiment. (Figure: decision boundary of a single tree.)

An experiment (continued). (Figures: decision boundary after 1, 3, 20 and 100 boosting iterations; misclassified points receive strong weights.)

Idea of the AdaBoost algorithm

Details of the AdaBoost algorithm. Regarding step c), the classifier weight is
$\alpha_m = \log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$,
where err_m is the weighted training error of classifier C_m. A small error gives a large weight; for err_m > 0.5 the opposite prediction is taken (quite uncommon, only for small m).
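
The weight-update mechanism can be made concrete with a hand-rolled sketch of AdaBoost with rpart stumps. This is an illustrative toy version, not code from the slides; the function names ada_boost and ada_predict and the simulated data are made up (for serious use see e.g. the gbm or adabag packages).

library(rpart)

ada_boost <- function(X, y, M = 50) {            # y coded as +1 / -1
  n <- nrow(X)
  w <- rep(1 / n, n)                             # initial observation weights
  d <- data.frame(y = factor(y), X)
  fits <- vector("list", M); alpha <- numeric(M)
  for (m in 1:M) {
    fit <- rpart(y ~ ., data = d, weights = w,
                 control = rpart.control(maxdepth = 1))   # a stump
    pred <- ifelse(predict(fit, d, type = "class") == "1", 1, -1)
    err <- sum(w * (pred != y)) / sum(w)         # weighted training error err_m
    err <- min(max(err, 1e-10), 1 - 1e-10)       # guard against 0 or 1
    alpha[m] <- log((1 - err) / err)             # classifier weight alpha_m
    w <- w * exp(alpha[m] * (pred != y))         # up-weight misclassified points
    w <- w / sum(w)
    fits[[m]] <- fit
  }
  list(fits = fits, alpha = alpha)
}

ada_predict <- function(model, X) {
  scores <- sapply(seq_along(model$fits), function(m) {
    pred <- ifelse(predict(model$fits[[m]], data.frame(X), type = "class") == "1", 1, -1)
    model$alpha[m] * pred
  })
  sign(rowSums(scores))                          # weighted majority vote
}

# Usage on simulated data:
set.seed(1)
X <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
y <- ifelse(X$x1 + X$x2 > 0, 1, -1)
model <- ada_boost(X, y, M = 20)
mean(ada_predict(model, X) == y)                 # training accuracy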

Performance of Boosting. Boosting is most frequently used with trees (though this is not necessary). The trees are typically only grown to a small depth (1 or 2). If trees of depth 2 perform better than trees of depth 1 (stumps), interactions between features are important.
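
This depth comparison can be sketched with the gbm package (not used elsewhere in these slides), reusing the Boston-derived data.train and data.test from the bagging example; gbm's "bernoulli" loss expects a 0/1 response:

library(gbm)
train <- data.train
train$crim <- as.numeric(train$crim)             # logical -> 0/1 for "bernoulli"
fit1 <- gbm(crim ~ ., data = train, distribution = "bernoulli",
            n.trees = 500, interaction.depth = 1, shrinkage = 0.1)   # stumps
fit2 <- gbm(crim ~ ., data = train, distribution = "bernoulli",
            n.trees = 500, interaction.depth = 2, shrinkage = 0.1)   # allows pairwise interactions
p1 <- predict(fit1, newdata = data.test, n.trees = 500, type = "response") > 0.5
p2 <- predict(fit2, newdata = data.test, n.trees = 500, type = "response") > 0.5
mean(p1 == data.test$crim)   # accuracy with depth-1 trees
mean(p2 == data.test$crim)   # accuracy with depth-2 trees
# A clearly better depth-2 model would indicate relevant feature interactions.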