
Big Data Methods
Chapter 5: Machine learning
Big Data Methods, Chapter 5, Slide 1

5.1 Introduction to machine learning

What is machine learning? It is concerned with the study and development of algorithms that learn from and make predictions on data. Such algorithms make data-driven predictions or decisions by building a model from sample inputs. Examples: prediction of outcomes (e.g. voting behavior), spam filtering, pattern recognition (e.g. of medical diseases or fraudulent behavior).

Supervised learning is based on a predictive model for some outcome Y as a function of some regressors X: which combination and functional form of X best predicts Y?

Unsupervised learning: there is no outcome Y; the data are to be descriptively summarized or clustered in an interesting way (e.g. shopping patterns, market basket analysis).

Big Data Methods, Chapter 5, Slide 2

General approach of machine learning

Split the original data randomly into three data sets: training data, validation data, and test data.
Training data: estimate a model (e.g. estimate coefficients).
Validation data: refine and tune the trained model to obtain optimal predictions in the validation data (e.g. based on re-estimating the coefficients from the training data).
Test data: generate the final predictions based on the refined, trained model (e.g. using the optimized coefficients).
Training and validation data are used for model selection, the test data for (out-of-sample) model assessment.

[Flow chart: the data are split into training, validation, and test data; model building on the training data, model tuning/validation and picking the best model on the validation data, model assessment on the test data.]

Big Data Methods, Chapter 5, Slide 3
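A minimal sketch of such a three-way split, assuming scikit-learn and purely illustrative data and split fractions (nothing here comes from the slides themselves):

```python
# Illustrative sketch: split simulated data into training, validation, and test parts.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # 1000 observations, 5 regressors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=1000)

# First split off the test data (20% of the sample), then split the remainder
# into training and validation data (60% / 20% of the original sample).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)
```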

5.2 Classification and regression trees

How to grow trees (1)

Idea: sequentially partition the space of regressors (X) in the training data into subspaces in a way that reduces the sum of squared residuals (SSR) of the outcome (Y) by the next partitioning step as much as possible. The aim is to partition the data into distinct strata or leaves within which observations are comparable in terms of X.

The outcome is predicted by the average of Y within each leaf (i.e. the conditional mean of Y given the leaf):

$$\hat{E}(Y \mid X = x) = \sum_{m=1}^{M} \bar{y}_m \, I\{x \in L_m\}, \qquad \bar{y}_m = \frac{1}{\sum_{i=1}^{N} I\{x_i \in L_m\}} \sum_{i=1}^{N} y_i \, I\{x_i \in L_m\},$$

where $L_m$ denotes a particular leaf or stratum and M is the number of leaves.

A stopping rule/tuning parameter for the partitioning is required to prevent overfitting (too many leaves, such that the variance explodes), for instance based on cross-validation (minimizing the MSE).

Source of illustration: Wikipedia on «Decision tree learning» (March 2017). Big Data Methods, Chapter 5, Slide 4

How to grow trees (2)

Trees are comparable to regression with discretized X and use the data to optimally choose what/where to discretize.

Example to convey the intuition: assume a sample of size N and two regressors $X_1, X_2$.

The sum of squared residuals (SSR) of Y is $\sum_{i=1}^{N} (y_i - \bar{y})^2$, where $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$.

One tries to split the sample based on either $x_{i1}$ or $x_{i2}$ in a way that minimizes the SSR over the newly created subsamples. Possible splits: $x_{i1} \le c$ versus $x_{i1} > c$, OR $x_{i2} \le c$ versus $x_{i2} > c$. One jointly chooses (1) the regressor and (2) the threshold value c that minimize the SSR.

After the first split one looks at the two strata (or «leaves» of the tree) and considers the next SSR-minimizing split. In the simplest version of a regression tree, one stops splitting once the SSR is below a particular threshold.

Big Data Methods, Chapter 5, Slide 5
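A minimal sketch of this first split, searching over both regressors and candidate thresholds for the SSR-minimizing combination (illustrative numpy code under simulated data, not the slides' own implementation):

```python
# Find the first SSR-minimizing split of a regression tree by brute force.
import numpy as np

def ssr(y):
    # Sum of squared residuals around the subsample mean (0 for an empty subsample).
    return np.sum((y - y.mean()) ** 2) if y.size > 0 else 0.0

def best_first_split(X, y):
    best = (None, None, np.inf)                    # (regressor index, threshold c, total SSR)
    for j in range(X.shape[1]):                    # loop over regressors
        for c in np.unique(X[:, j]):               # loop over candidate thresholds
            left, right = y[X[:, j] <= c], y[X[:, j] > c]
            total = ssr(left) + ssr(right)
            if total < best[2]:
                best = (j, c, total)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # N = 200, two regressors X1, X2
y = np.where(X[:, 0] > 0, 2.0, -1.0) + rng.normal(scale=0.5, size=200)
print(best_first_split(X, y))                      # should split on regressor 0 near c = 0
```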

How to grow trees (3)

Rather than starting with a small tree, it is more sophisticated to first build («grow») a large tree and then prune (delete) leaves that have little impact on the SSR. This avoids missing initial splits that would lead to important subsequent splits, even if the initial splits per se do not importantly improve the SSR. One might use a stopping rule to not grow the large tree beyond a minimum number of observations in each leaf.

Tuning parameter for the tree finally considered: the number of leaves, e.g. picked by k-fold cross-validation (see chapter 4).

Cross-validation: divide the training data into K folds. For each k = 1, ..., K, estimate the tree with various choices of the number of leaves in all but the kth fold and compute the MSE for each choice when using the kth fold for prediction. Take the averages of the choice-specific MSEs over the K steps and pick the number of leaves that minimizes this average.

Big Data Methods, Chapter 5, Slide 6
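A sketch of this tuning step, assuming scikit-learn, where the number of leaves is controlled by the max_leaf_nodes parameter and chosen by 5-fold cross-validation on simulated data (the slides describe the generic procedure, not this particular library):

```python
# Choose the number of leaves of a regression tree by k-fold cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_leaf_nodes": [2, 4, 8, 16, 32, 64]},  # candidate numbers of leaves
    cv=5,                                                   # K = 5 folds
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)          # the (cross-validated) MSE-minimizing number of leaves
```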

Graphical illustration Big Data Methods, Chapter 5, Slide 7

Pros and cons of classification and regression trees

Advantages:
Easy to interpret: within a leaf (easy to follow graphically), the prediction is a sample mean.
Flexible and able to capture substantial non-linearities.
Partitions the data naturally where the largest changes in the outcome as a function of the covariates occur (no need to manually create interaction terms etc.). No need to create new variables.

Disadvantages:
Can get computationally very expensive if the dimension of X is large.
Predictions may be unstable: small changes in the sample can lead to very different trees.
Other, more continuous methods dominate classification and regression trees in terms of prediction accuracy (but lack the nice graphical interpretation).

Big Data Methods, Chapter 5, Slide 8

5.3 Bootstrap aggregating (bagging)

Basic idea: bagging is based on generating bootstrap samples by sampling with replacement from the original training data. The machine learning method (e.g. a tree) is applied to each of the bootstrap samples. The prediction of the outcome is obtained by averaging over the predictions in the individual bootstrap samples.

Big Data Methods, Chapter 5, Slide 9

Example: bagging classification/regression trees

Procedure: draw many bootstrap samples and apply a classification/regression tree to each sample. Trees are not pruned but fully grown to some minimum leaf size, such that the bias is low but the variance is high within each sample. The final prediction is obtained by averaging over the predictions in the bootstrap samples, which also entails variance reduction:

$$\hat{E}(Y \mid X = x) = \frac{1}{B} \sum_{b=1}^{B} \sum_{m=1}^{M_b} \bar{y}_m^b \, I\{x \in L_m^b\}, \qquad \bar{y}_m^b = \frac{1}{\sum_{i=1}^{N} I\{x_i^b \in L_m^b\}} \sum_{i=1}^{N} y_i^b \, I\{x_i^b \in L_m^b\},$$

where B is the number of bootstrap samples and b indexes a specific sample, b ∈ {1, ..., B}.

Remark 1: A lot of bootstrap samples are required because the bootstrap trees are correlated, as the bootstrap samples overlap substantially.
Remark 2: Continuous estimator, because averaging over the bootstrap samples smooths out the discrete steps in the individual trees.

Big Data Methods, Chapter 5, Slide 10
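A minimal sketch of this procedure on simulated data, assuming scikit-learn trees and an arbitrary choice of B = 100 bootstrap samples (illustrative only):

```python
# Bagging regression trees: grow one deep tree per bootstrap sample and average the predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)
X_new = rng.uniform(-3, 3, size=(5, 2))              # points to predict

B = 100
preds = np.zeros((B, X_new.shape[0]))
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))        # bootstrap: sample with replacement
    tree = DecisionTreeRegressor(min_samples_leaf=5)  # fully grown up to a minimum leaf size
    tree.fit(X[idx], y[idx])
    preds[b] = tree.predict(X_new)

y_hat = preds.mean(axis=0)                            # average over the bootstrap predictions
print(y_hat)
```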

5.4 Random forests

Random forests for prediction

Among the most competitive methods for prediction. Based on model averaging: the prediction is the average over hundreds or thousands of distinct regression trees.

Similarity to bagging: the regression trees are estimated in bootstrap samples (or subsamples of smaller size than the original data) and fully grown.

Difference from bagging: at each partitioning step, only a random (and small) subset of the regressors (rather than all of them) is considered as potential variables for further partitioning. Randomly picking regressors reduces the correlation of trees across bootstrap samples (a problem in bagging) and is computationally attractive.

Big Data Methods, Chapter 5, Slide 11
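A minimal sketch, assuming scikit-learn: the random subset of regressors considered at each split is controlled by the max_features argument, which is the main practical difference from plain bagging (all settings below are illustrative choices):

```python
# Random forest: bagged trees with a random subset of regressors at each split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] ** 2 - X[:, 1] * X[:, 2] + rng.normal(size=1000)

forest = RandomForestRegressor(
    n_estimators=500,      # hundreds of trees, each grown on a bootstrap sample
    max_features="sqrt",   # random subset of regressors considered at each split
    min_samples_leaf=5,    # trees are grown deep, up to a minimum leaf size
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:3]))   # averaged prediction over the trees
```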

Random forests for causal effects of binary variables

Aim: estimate the causal effect of a binary variable (denoted by W) on Y given X, rather than doing mere prediction of Y.

Idea: for each regression tree, estimate the effect of W on Y within each leaf defined by X and average over the trees.

Assumption: conditional on X, there must not exist any unobservables jointly affecting W and Y. This implies that W is as good as randomly assigned given X.

Under this assumption, causal random forests can be consistent and asymptotically normal. This allows one to derive confidence intervals for the effects and to do hypothesis testing (in contrast to many predictive machine learning approaches, for which no asymptotic theory is available).

Big Data Methods, Chapter 5, Slide 12
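A highly simplified sketch of the within-leaf idea only; it is not the honest causal forest of Wager and Athey and has no inference properties. All names, tuning choices, and the simulated data are illustrative assumptions:

```python
# Naive illustration: average the within-leaf difference in mean outcomes
# between W = 1 and W = 0 over many bootstrapped trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
N = 2000
X = rng.normal(size=(N, 3))
W = rng.integers(0, 2, size=N)                         # binary "treatment", assigned at random
y = X[:, 0] + W * (1 + X[:, 1]) + rng.normal(size=N)   # true effect of W is 1 + X2

def leaf_effect_forest(X, W, y, x_new, B=200):
    effects = []
    for b in range(B):
        idx = rng.integers(0, N, size=N)               # bootstrap sample
        tree = DecisionTreeRegressor(min_samples_leaf=50).fit(X[idx], y[idx])
        leaves = tree.apply(X[idx])                     # leaf id of each bootstrap observation
        leaf_new = tree.apply(x_new.reshape(1, -1))[0]  # leaf containing the evaluation point
        in_leaf = leaves == leaf_new
        w_b, y_b = W[idx][in_leaf], y[idx][in_leaf]
        if w_b.min() < w_b.max():                       # need both W = 0 and W = 1 in the leaf
            effects.append(y_b[w_b == 1].mean() - y_b[w_b == 0].mean())
    return np.mean(effects)

print(leaf_effect_forest(X, W, y, np.array([0.0, 1.0, 0.0])))  # roughly 1 + 1 = 2
```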

Random forests for causal effects: algorithm

Estimated conditional mean of Y given X:

$$\hat{E}(Y \mid X = x) = \frac{1}{B} \sum_{b=1}^{B} \sum_{m=1}^{M_b} \bar{y}_m^b \, I\{x \in L_m^b\}$$

Estimated conditional mean effect of the binary variable W on Y given X:

$$\hat{\theta}(x) = \frac{1}{B} \sum_{b=1}^{B} \sum_{m=1}^{M_b} \left( \bar{y}_{m,W=1}^b - \bar{y}_{m,W=0}^b \right) I\{x \in L_m^b\},$$

where $\bar{y}_{m,W=1}^b$ and $\bar{y}_{m,W=0}^b$ denote the means of Y among observations with W = 1 and W = 0, respectively, in leaf $L_m^b$.

Taken from Wager and Athey (2016): Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. The algorithm is implemented in the causalTree package for R.

Big Data Methods, Chapter 5, Slide 13

5.5 Further machine learning approaches

Classifiers, support vector machines, neural networks

Classifiers: in the spirit of trees in the sense that they split the data, but they do not split the data sequentially.

Support vector machines: categorize observations into groups such that the observations in the separate categories are divided by a gap or distance that is as large as possible (i.e. maximized).

Artificial neural networks: rudimentarily mimic the neural structure of a biological brain. Neural units are connected with other neural units in a system. Each unit may receive weighted signals (the weights being similar to coefficients) from other units and passes on a signal itself as a function of the incoming signals. Each unit is therefore a function of other units (apart from the input variables, which are not functions of other units); this can be thought of as nested regression models. At the end of the system stands the predicted outcome of interest as a function of the previous units.

Big Data Methods, Chapter 5, Slide 14
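A minimal sketch of this "nested regression" view of a neural network, with one hidden layer and random weights purely to illustrate the structure (an illustrative assumption, not a trained model):

```python
# Forward pass of a tiny neural network: hidden units are functions of the inputs,
# and the output is a function of the hidden units.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                                   # input variables (regressors)

W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=3)     # weights of 3 hidden units
w2, b2 = rng.normal(size=3), rng.normal()                # weights of the output unit

hidden = np.tanh(W1 @ x + b1)   # each hidden unit: a transformed weighted sum of the inputs
y_hat = w2 @ hidden + b2        # the prediction: a weighted sum of the hidden units
print(y_hat)
```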

Boosting

A way to improve (simple) machine learning methods. As an example, assume that we estimate the conditional mean of Y by a simple tree with just two partitions. This predictor is likely to perform poorly.

Idea of boosting: repeatedly apply a poor predictor. After the first application of the simple tree, we calculate the residuals (the difference between the actual outcome and its prediction). We then apply the simple tree to the residuals instead of the original outcomes. We repeat this many times, each time applying the simple tree to the residuals from the previous stage.

Repeating a simple method many times allows approximating the regression function in a flexible way: again, model averaging improves on single methods.

Big Data Methods, Chapter 5, Slide 15
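A minimal sketch of this idea on simulated data: a two-leaf tree is repeatedly fit to the current residuals and the fitted pieces are added up, damped by a small learning rate (the learning rate and number of rounds are illustrative choices, not part of the slides):

```python
# Boosting a very simple tree (two leaves) by repeatedly fitting it to the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

prediction = np.zeros_like(y)
learning_rate = 0.1
for _ in range(200):
    residuals = y - prediction                         # what the current model still misses
    stump = DecisionTreeRegressor(max_leaf_nodes=2).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)     # add the newly fitted simple tree

print(np.mean((y - prediction) ** 2))                  # in-sample MSE after boosting
```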