Big Data Methods
Chapter 5: Machine Learning
Big Data Methods, Chapter 5, Slide 1
5.1 Introduction to machine learning
What is machine learning?
- Machine learning is concerned with the study and development of algorithms that learn from and make predictions on data.
- Algorithms make data-driven predictions or decisions by building a model from sample inputs.
- Examples: prediction of outcomes (e.g. voting behavior), spam filtering, pattern recognition (e.g. of medical diseases or fraudulent behavior).
- Supervised learning is based on a predictive model for some outcome Y as a function of some regressors X: which combination and functional form of X best predicts Y?
- Unsupervised learning: there is no outcome Y; the data are to be descriptively summarized or clustered in an interesting way (e.g. shopping patterns, market basket analysis).
Big Data Methods, Chapter 5, Slide 2
General approach of machine learning
Split the original data randomly into three data sets: training data, validation data, and test data.
- Training data: estimate a model (e.g. estimate coefficients).
- Validation data: refine and tune the trained model to obtain optimal predictions in the validation data (e.g. by re-estimating the coefficients obtained from the training data).
- Test data: generate the final predictions based on the refined, trained model (e.g. using the optimized coefficients).
Training/validation data are used for model selection, test data for (out-of-sample) model assessment. A code sketch of the split follows below.
[Flowchart: the data are split into training, validation, and test data; model building uses the training data, model validation/tuning and picking the best model use the validation data, model assessment uses the test data.]
Big Data Methods, Chapter 5, Slide 3
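As an illustration (not part of the original slides), the following Python sketch performs such a three-way split with scikit-learn; the 60/20/20 proportions and the simulated data are assumptions made purely for the example.

```python
# Illustrative three-way split into training, validation and test data (assumed 60/20/20 shares).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # regressors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=1000)    # outcome

# First split off the test data (20%), then split the rest into training and validation data.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

print(len(y_train), len(y_val), len(y_test))          # 600 200 200
```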
5.2 Classification and regression trees
How to grow trees (1)
Idea: sequentially partition the space of regressors (X) in the training data into subspaces, such that each partitioning step reduces the sum of squared residuals (SSR) in the outcome (Y) as much as possible.
The aim is to partition the data into distinct strata or leaves within which observations are comparable in terms of X. The outcome is predicted by the average of Y within each leaf (i.e. the conditional mean of Y given the leaf):

\hat{E}(Y \mid X = x) = \sum_{m=1}^{M} \bar{y}_m I\{x \in L_m\}, \qquad \bar{y}_m = \frac{1}{\sum_{i=1}^{N} I\{x_i \in L_m\}} \sum_{i=1}^{N} y_i I\{x_i \in L_m\}

where L_m denotes a particular leaf or stratum and M is the number of leaves.
A stopping rule/tuning parameter for the partitioning is required to prevent overfitting (too many leaves, such that the variance explodes), for instance based on cross-validation (minimizing the MSE).
Source: Wikipedia on «Decision tree learning» (March 2017)
Big Data Methods, Chapter 5, Slide 4
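A short Python sketch (not from the slides) illustrating that a fitted regression tree's prediction equals the leaf-specific sample mean of Y; the simulated data, the choice of four leaves, and the use of scikit-learn's DecisionTreeRegressor are assumptions for illustration only.

```python
# Sketch: a regression tree's prediction is the mean of Y within the leaf that x falls into.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y)   # M = 4 leaves (illustrative choice)

# Verify that the prediction equals the leaf-specific sample mean of y.
leaf_of_obs = tree.apply(X)                 # leaf index L_m of each training observation
x_new = np.array([[0.8, 0.3]])
leaf_of_new = tree.apply(x_new)[0]          # leaf containing the new observation
leaf_mean = y[leaf_of_obs == leaf_of_new].mean()
print(tree.predict(x_new)[0], leaf_mean)    # the two numbers coincide
```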
How to grow trees (2)
Trees are comparable to regressions with discretized X and use the data to optimally choose what/where to discretize.
Example to convey the intuition: assume a sample of size N and two regressors X_1, X_2.
The sum of squared residuals (SSR) of Y is \sum_{i=1}^{N} (y_i - \bar{y})^2, where \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i.
One tries to split the sample based on either x_{i1} or x_{i2} in a way that minimizes the SSR over the newly created subsamples.
Possible splits: x_{i1} \le c versus x_{i1} > c, OR x_{i2} \le c versus x_{i2} > c.
One jointly chooses (1) the regressor and (2) the threshold value c that minimize the SSR.
After the first split, one looks at the two strata (or «leaves» of the tree) and considers the next SSR-minimizing split.
In the simplest version of a regression tree, one stops splitting once the SSR is below a particular threshold.
Big Data Methods, Chapter 5, Slide 5
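A minimal sketch of this first split, assuming simulated data; the helper names ssr and best_first_split are illustrative and not from the slides.

```python
# Sketch of the first split: jointly choose the regressor and cutoff c that minimize
# the SSR summed over the two resulting subsamples.
import numpy as np

def ssr(v):
    # Sum of squared residuals around the subsample mean (0 for an empty subsample).
    return np.sum((v - v.mean()) ** 2) if len(v) else 0.0

def best_first_split(X, y):
    best = (None, None, np.inf)            # (regressor index, cutoff c, SSR)
    for j in range(X.shape[1]):            # candidate regressor x_1 or x_2
        for c in np.unique(X[:, j]):       # candidate cutoffs c
            left, right = y[X[:, j] <= c], y[X[:, j] > c]
            total = ssr(left) + ssr(right)
            if total < best[2]:
                best = (j, c, total)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 1] > 0.7, 1.0, 0.0) + rng.normal(scale=0.1, size=200)
print(best_first_split(X, y))              # picks regressor 1 with a cutoff near 0.7
```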
How to grow trees (3)
Rather than starting with a small tree, it is more sophisticated to first build («grow») a large tree and then prune (delete) leaves that have little impact on the SSR.
This avoids missing initial splits that would lead to important subsequent splits, even if the initial splits per se do not importantly improve the SSR.
One might use a stopping rule so as not to grow the large tree beyond a minimum number of observations in each leaf.
Tuning parameter for the tree finally considered: the number of leaves, e.g. picked by k-fold cross-validation (see chapter 4).
Cross-validation:
- Divide the training data into K folds.
- For each k = 1, ..., K, estimate the tree with various choices of the number of leaves in all but the k-th fold and compute the MSE for each choice when using the k-th fold for prediction.
- Take the averages of the choice-specific MSEs over the K steps and pick the number of leaves that minimizes this average.
Big Data Methods, Chapter 5, Slide 6
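A sketch of this cross-validation, assuming K = 5 folds, a grid of candidate leaf numbers, and simulated data (all illustrative choices, not from the slides).

```python
# Sketch: pick the number of leaves by K-fold cross-validation (K = 5 is an illustrative choice).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.2, size=500)

leaf_grid = [2, 4, 8, 16, 32, 64]          # candidate numbers of leaves
cv_mse = []
for m in leaf_grid:
    fold_mse = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
        tree = DecisionTreeRegressor(max_leaf_nodes=m).fit(X[train_idx], y[train_idx])
        fold_mse.append(np.mean((y[val_idx] - tree.predict(X[val_idx])) ** 2))
    cv_mse.append(np.mean(fold_mse))        # average MSE over the K folds for this leaf number

best_m = leaf_grid[int(np.argmin(cv_mse))]
print("number of leaves minimizing the cross-validated MSE:", best_m)
```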
Graphical illustration Big Data Methods, Chapter 5, Slide 7
Pros and cons of classification and regression trees
Advantages:
- Easy to interpret: within a leaf (easy to follow graphically), the prediction is a sample mean.
- Flexible and able to capture substantial non-linearities.
- Partitions the data naturally where the largest changes in the outcome as a function of the covariates occur (no need to manually create interaction terms etc.). No need to create new variables.
Disadvantages:
- Can get computationally very expensive if the dimension of X is large.
- Predictions may be unstable: small changes in the sample can lead to very different trees.
- Other, more continuous methods dominate classification and regression trees in terms of prediction accuracy (but lack the nice graphical interpretation).
Big Data Methods, Chapter 5, Slide 8
5.3 Bootstrap aggregating (bagging)
Basic idea
- Bagging is based on generating bootstrap samples by sampling with replacement out of the original training data.
- The machine learning method (e.g. a tree) is applied to each of the bootstrap samples.
- The prediction of the outcome is obtained by averaging over the predictions in the individual bootstrap samples.
Big Data Methods, Chapter 5, Slide 9
Example: bagging classification/regression trees
Procedure:
- Draw many bootstrap samples and apply a classification/regression tree to each sample. Trees are not pruned but fully grown to some minimum leaf size, such that the bias is low but the variance is high within each sample.
- The final prediction is obtained by averaging over the predictions in the bootstrap samples, which also entails a variance reduction:

\hat{E}(Y \mid X = x) = \frac{1}{B} \sum_{b=1}^{B} \sum_{m=1}^{M_b} \bar{y}_m^b I\{x \in L_m^b\}, \qquad \bar{y}_m^b = \frac{1}{\sum_{i=1}^{N} I\{x_i^b \in L_m^b\}} \sum_{i=1}^{N} y_i^b I\{x_i^b \in L_m^b\}

where B is the number of bootstrap samples and b indexes a specific sample, b \in \{1, ..., B\}.
Remark 1: a lot of bootstrap samples are required because the bootstrap trees are correlated, as the bootstrap samples overlap substantially.
Remark 2: the bagged estimator is continuous, because averaging over the bootstrap samples smooths out the discrete steps in the individual trees.
Big Data Methods, Chapter 5, Slide 10
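A sketch of the bagging procedure, assuming B = 200 bootstrap samples, a minimum leaf size of 5, and simulated data (all illustrative choices, not from the slides).

```python
# Sketch of bagging regression trees: fully grown trees on bootstrap samples, predictions averaged.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.2, size=500)

B = 200                                    # number of bootstrap samples (illustrative choice)
N = len(y)
trees = []
for b in range(B):
    idx = rng.integers(0, N, size=N)       # bootstrap sample: draw N observations with replacement
    # Fully grown tree up to a minimum leaf size of 5 observations (low bias, high variance).
    trees.append(DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx]))

x_new = np.array([[0.5, 0.5]])
prediction = np.mean([t.predict(x_new)[0] for t in trees])  # average over the B bootstrap trees
print(prediction)
```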
5.4 Random forests
Random forests for prediction
- Among the most competitive methods for prediction.
- Based on model averaging: the prediction is the average of hundreds or thousands of distinct regression trees.
- Similarity to bagging: the regression trees are estimated in bootstrap samples (or subsamples of smaller size than the original data) and fully grown.
- Difference to bagging: at each partitioning step, only a random (and small) subset of regressors (rather than all of them) is considered as potential variables for further partitioning.
- Randomly picking regressors prevents the trees from being strongly correlated across bootstrap samples (as they are in bagging) and is computationally attractive.
Big Data Methods, Chapter 5, Slide 11
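A sketch contrasting the two approaches in scikit-learn, where max_features controls the random subset of regressors considered at each split; the specific settings and data are illustrative assumptions.

```python
# Sketch: random forest in scikit-learn; max_features controls the random subset of
# regressors considered at each partitioning step ('sqrt' is an illustrative choice).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 10))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=1000)

# Random forest: only a random subset of regressors is considered at each split.
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                           min_samples_leaf=5, random_state=1)
# Setting max_features=None would consider all regressors at each split, i.e. plain bagging.
rf.fit(X, y)
print(rf.predict([[0.5] * 10])[0])
```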
Random forests for causal effects of binary variables
Goal: estimate the causal effect of a binary variable (denoted by W) on Y given X, rather than doing mere prediction of Y.
[Causal diagram: covariates X, treatment, outcome Y, and unobservables U.]
Idea: for each regression tree, estimate the effect of W on Y within each leaf defined by X and average over the trees.
Assumption: conditional on X, there must not exist any unobservables jointly affecting W and Y. This implies that W is as good as randomly assigned given X.
Under this assumption, causal random forests can be consistent and asymptotically normal. This allows deriving confidence intervals on the effects and doing hypothesis testing (in contrast to many predictive machine learning approaches, for which no asymptotic theory is available).
Big Data Methods, Chapter 5, Slide 12
Random forests for causal effects: algorithm
Estimated conditional mean of Y given X (averaging leaf means of Y over the B trees, analogously to bagging; L_b(x) denotes the leaf of tree b containing x):

\hat{\mu}(x) = \frac{1}{B} \sum_{b=1}^{B} \frac{\sum_{i=1}^{N} Y_i I\{X_i \in L_b(x)\}}{\sum_{i=1}^{N} I\{X_i \in L_b(x)\}}

Estimated conditional mean effect of the binary variable W on Y given X (difference of the mean outcomes of treated and non-treated observations within the leaf containing x, averaged over the B trees):

\hat{\tau}(x) = \frac{1}{B} \sum_{b=1}^{B} \left( \frac{\sum_{i: X_i \in L_b(x)} W_i Y_i}{\sum_{i: X_i \in L_b(x)} W_i} - \frac{\sum_{i: X_i \in L_b(x)} (1 - W_i) Y_i}{\sum_{i: X_i \in L_b(x)} (1 - W_i)} \right)

Taken from Wager and Athey (2016): «Estimation and Inference of Heterogeneous Treatment Effects using Random Forests». The algorithm is implemented in the causaltree package for R.
Big Data Methods, Chapter 5, Slide 13
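A strongly simplified Python sketch of this idea (the slides point to the causaltree package for R; this sketch is not that implementation and omits, for example, the honest sample splitting and the tailored splitting criterion of Wager and Athey). The function name and all settings are illustrative assumptions.

```python
# Minimal sketch: within-leaf difference in mean outcomes between W=1 and W=0, averaged over trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def causal_forest_effect(X, W, Y, x_new, B=500, seed=0):
    rng = np.random.default_rng(seed)
    N = len(Y)
    effects = []
    for b in range(B):
        idx = rng.integers(0, N, size=N)                 # bootstrap sample
        tree = DecisionTreeRegressor(min_samples_leaf=20).fit(X[idx], Y[idx])
        leaves = tree.apply(X[idx])                      # leaf of each bootstrap observation
        leaf_new = tree.apply(x_new.reshape(1, -1))[0]   # leaf containing x_new
        in_leaf = leaves == leaf_new
        w, y = W[idx][in_leaf], Y[idx][in_leaf]
        if w.sum() > 0 and (1 - w).sum() > 0:            # need treated and controls in the leaf
            effects.append(y[w == 1].mean() - y[w == 0].mean())
    return np.mean(effects)                              # average of within-leaf effects over trees

rng = np.random.default_rng(1)
X = rng.uniform(size=(2000, 3))
W = (rng.uniform(size=2000) < 0.5).astype(int)           # randomly assigned binary treatment
Y = 2 * W * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=2000)   # true effect is 2*x1
print(causal_forest_effect(X, W, Y, np.array([0.8, 0.5, 0.5])))    # should be close to 1.6
```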
5.5 Further machine learning approaches
Classifiers, support vector machines, neural networks
Classifiers
- In the spirit of trees in the sense that they split the data, but they do not split the data sequentially.
Support vector machines
- Categorize observations into groups such that the observations in the separate categories are divided by a gap or distance that is as large as possible (i.e. maximized).
Artificial neural networks
- Rudimentarily mimic the neural structure of a biological brain: neural units are connected with other neural units in a system.
- Each unit may receive signals (weighted by parameters similar to coefficients) from other units and passes on a signal itself as a function of the incoming signals.
- Each unit is therefore a function of other units (apart from the input variables, which are not functions of other units); this can be thought of as nested regression models.
- At the end of the system is the predicted outcome of interest as a function of the previous units.
Big Data Methods, Chapter 5, Slide 14
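A toy sketch of a neural network written as nested regressions; the weights below are made-up illustrative numbers, not estimated values, and the helper name unit is an assumption for the example.

```python
# Sketch: a one-hidden-layer neural network as nested regressions (numpy only; made-up weights).
import numpy as np

def unit(inputs, weights, bias):
    """A neural unit: passes on a signal as a function of the weighted incoming signals."""
    return np.tanh(inputs @ weights + bias)

x = np.array([0.5, -1.0, 2.0])                  # input variables (not functions of other units)

# Hidden units: each is a function of the input variables.
h1 = unit(x, np.array([0.2, -0.5, 0.1]), 0.0)
h2 = unit(x, np.array([-0.3, 0.8, 0.4]), 0.1)

# Output: a function of the hidden units, i.e. a regression nested inside another regression.
y_hat = np.array([h1, h2]) @ np.array([1.5, -0.7]) + 0.2
print(y_hat)
```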
Boosting
A way to improve (simple) machine learning methods.
As an example, assume that we estimate the conditional mean of Y by a simple tree with just two partitions. This predictor is likely to perform poorly.
Idea of boosting: repeatedly apply a poor predictor.
- After the first application of the simple tree, we calculate the residuals (the difference between the actual outcome and its prediction).
- We then apply the simple tree to the residuals instead of the original outcomes.
- We repeat this many times, each time applying the simple tree to the residuals from the previous stage.
Repeating a simple method many times allows approximating regression functions in a flexible way: again, aggregating over models improves upon a single simple method.
Big Data Methods, Chapter 5, Slide 15
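A sketch of this residual-fitting loop using two-leaf trees ("stumps"); the shrinkage (learning rate) factor is a common refinement that the slides do not mention, and the number of stages is an illustrative choice.

```python
# Sketch of boosting: a simple two-leaf tree is fitted repeatedly to the residuals of the
# previous stage; the learning rate of 0.1 and 200 stages are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.sin(6 * X[:, 0]) + rng.normal(scale=0.2, size=500)

learning_rate, n_stages = 0.1, 200
prediction = np.zeros(len(y))
stumps = []
for _ in range(n_stages):
    residuals = y - prediction                            # residuals from the previous stage
    stump = DecisionTreeRegressor(max_leaf_nodes=2).fit(X, residuals)  # simple two-leaf tree
    prediction += learning_rate * stump.predict(X)        # update the fit with the new stump
    stumps.append(stump)

print(np.mean((y - prediction) ** 2))                     # in-sample MSE shrinks over the stages
```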