MS&E 226: Small Data
Lecture 13: The bootstrap (v3)
Ramesh Johari (ramesh.johari@stanford.edu)

Resampling

Sampling distribution of a statistic

For this lecture: there is a population model that gives the distribution (cdf) G of an outcome Y. We observe data Y drawn from this population model. We compute a statistic T(Y) from the data; an example is an estimate of a parameter. The sampling distribution of T(Y) is the distribution of values of the statistic obtained across many parallel universes in which the same process is repeated. In this lecture we develop a flexible and powerful approach to estimating the sampling distribution of any statistic: the bootstrap.

Idea behind the bootstrap

The basic idea behind the bootstrap is straightforward:
- The data Y are a sample from the population model.
- If we resample (with replacement) from Y, this mimics sampling from the population model.
- In the bootstrap, we draw B new samples (each of size n) from the original data and treat each of these as a parallel universe. We can then pretend the resulting distribution is the sampling distribution, and compute what we want from it (e.g., standard errors, confidence intervals, etc.).

Why bootstrap?

We've seen that the sampling distribution of the MLE is asymptotically normal, and we can characterize its mean (it is unbiased) and variance (in terms of the Fisher information). So why do we need the bootstrap?
- Sample sizes might be small (asymptopia is not a good assumption).
- Assumptions we made in the population model might have been violated.
- We might want sampling distributions for much more complex estimation procedures, where no closed-form expression exists.

Formal definition

The bootstrap algorithm

We are given a sample Y of n observations. For 1 ≤ b ≤ B:
- Draw n samples uniformly at random, with replacement, from Y. Denote the i-th observation in the b-th sample by Ỹ_i^(b).
- Compute the value of the statistic on the b-th sample as T_b = T(Ỹ^(b)).
The histogram (i.e., empirical distribution) of {T_b : 1 ≤ b ≤ B} is an estimate of the sampling distribution. We call this the bootstrap distribution.
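As a minimal sketch (not from the original slides), the algorithm can be written directly in R for a generic statistic; stat below plays the role of T, and Y is the data vector:

# Generic bootstrap: returns B replicates of stat(Y) computed on resamples of Y.
bootstrap = function(Y, stat, B = 10000) {
  n = length(Y)
  sapply(1:B, function(b) {
    Y.b = sample(Y, n, replace = TRUE)  # resample with replacement
    stat(Y.b)                           # statistic in the b-th parallel universe
  })
}

# Example usage: bootstrap distribution of the sample mean.
# T.boot = bootstrap(Y, mean)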

The bootstrap algorithm

[Figure: a picture of the bootstrap resampling procedure.]

An analogy

The following analogy is helpful to keep in mind. For the sampling distribution, the statistic is computed on samples drawn from the population model. Bootstrapping treats the sample Y as if it were the true population model: bootstrap samples are drawn from Y just as Y itself was drawn from the population.

When is the bootstrap problematic?

From the preceding slides, several things have to happen for the bootstrap distribution to be close to the sampling distribution:
- B should be large enough that randomness due to resampling is not consequential in our estimated distribution. (What is large enough? In practice, B is limited primarily by computation time.)
- The original data sample Y should be generated as i.i.d. draws from the population model, and Y should be large enough to be representative of the population model (sufficient sample size, no sampling bias).
The first we can control. The second requires contextual knowledge of how Y was actually generated.

When is the bootstrap problematic?

Bootstrap estimation is a very powerful technique for approximating the sampling distribution, because it makes very few assumptions about the nature of the population model. That said, it is not immune to yielding inaccurate results: for extreme value statistics (e.g., when T is the maximum or minimum of the data), the bootstrap can yield badly biased estimates. In practice, for estimating things like standard errors of estimators, the bootstrap is a fairly reliable technique (assuming the two conditions on the preceding slide hold).

Standard errors

We use the bootstrap distribution just as we use the sampling distribution. For example, the bootstrap estimate of the standard error, ŜE_BS, is the standard deviation of the bootstrap distribution.

Confidence intervals

Since we have an estimate of the sampling distribution, we can use it to construct confidence intervals. Two approaches to building 95% confidence intervals:
- The normal interval: [T(Y) − 1.96 ŜE_BS, T(Y) + 1.96 ŜE_BS]. This approach assumes that the sampling distribution of T(Y) is normal, and uses the bootstrap standard error accordingly.
- The percentile interval: let T_q be the q-th quantile of the bootstrap distribution. Then the 95% percentile bootstrap interval is [T_0.025, T_0.975]. The percentile interval generally works well when the sampling distribution is symmetric, but not necessarily normal.
(There are many other types of intervals as well.)
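As a minimal sketch (assuming the bootstrap replicates are stored in a vector T.boot and the original statistic in T.hat, both hypothetical names), the two intervals above can be computed as:

se.bs = sd(T.boot)                                              # bootstrap standard error
normal.interval = c(T.hat - 1.96 * se.bs, T.hat + 1.96 * se.bs) # normal interval
percentile.interval = quantile(T.boot, c(0.025, 0.975))         # percentile interval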

Example 1: Mean of in-class midterm scores

The mean score on the 226 in-class midterm was μ̂ = 33.8 (with n = 108). What is the standard error? By the central limit theorem, the sample mean has standard error approximately σ̂/√n ≈ 0.60, where σ̂ = 6.25 is the sample standard deviation. What does the bootstrap suggest?

Example 1: Mean of in-class midterm scores

Here is how we can directly code the bootstrap in R for the sample mean:

n_reps = 10000
# Test scores are in "scores"; here n = length(scores) = 108
n = length(scores)
boot.out = sapply(1:n_reps, function(i) {
  mean(sample(scores, n, replace = TRUE))
})

Now boot.out contains n_reps replicates of the sample mean.

Example 1: Mean of in-class midterm scores

Code to run the bootstrap in R (using the boot library):

library(boot)

# Create the statistic function: the mean of the resampled scores
mean.boot = function(data, indices) {
  d = data[indices]
  return(mean(d))
}

# Test scores are in "scores"
boot.out = boot(scores, mean.boot, 10000)

boot.out$t contains the means across parallel universes. Use boot.ci for confidence intervals.
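For example (a usage sketch): boot.ci(boot.out, type = c("norm", "perc")) returns the normal and percentile 95% confidence intervals described earlier.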

Example 1: Mean of in-class midterm scores

Results: [histogram of the bootstrap distribution of the mean of test scores, over roughly 32 to 36]

ŜE_BS = 0.60, matching the normal approximation (because the central limit theorem is pretty good here).

Example 2: Median of in-class midterm scores

For the median, we don't have an easy way to approximate the standard error, so we turn to the bootstrap.

Results: [histogram of the bootstrap distribution of the median of test scores, over roughly 32 to 36]

ŜE_BS = 1.87. Note that the actual median in the class was 32.5.

Example 3: Funnels (survival analysis)

Suppose as an online retailer you have a three-stage checkout flow: customers (1) add an item to their cart, (2) enter their credit card info, and (3) hit Purchase. In practice, customers might abandon before completing these activities.

Suppose you've collected data on n customers, where Y_i ∈ {1, 2, 3} denotes the latest stage customer i completed.

Let γ_s be the probability that a customer who completes stage s also completes stage s + 1, for s = 1, 2. We estimate these as follows:

γ̂_1 = |{i : Y_i ≥ 2}| / |{i : Y_i ≥ 1}|;   γ̂_2 = |{i : Y_i = 3}| / |{i : Y_i ≥ 2}|.

But standard errors are not easy to compute, since these are ratios of counts; the bootstrap is an easy way to get standard errors.
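A minimal sketch of bootstrapping these standard errors with the boot library (the vector Y holding each customer's latest completed stage is an assumed name):

library(boot)

# Statistic: the two conversion-rate estimates computed on a resample of customers.
gamma.boot = function(data, indices) {
  d = data[indices]
  g1 = sum(d >= 2) / sum(d >= 1)   # estimate of gamma_1
  g2 = sum(d == 3) / sum(d >= 2)   # estimate of gamma_2
  c(g1, g2)
}

boot.out = boot(Y, gamma.boot, 10000)
apply(boot.out$t, 2, sd)           # bootstrap standard errors for gamma_1, gamma_2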

Example 4: Clustered data

This example came up when working on a project with Netflix; it is a common example of computing standard errors correctly when you have clustered observations. It is also a useful example of how you can sometimes apply the bootstrap even in settings where the data are not perfectly i.i.d.

Example 4: Clustered data

Suppose we collect data on the viewing behavior of n = 1000 users. Each user i has k_i sessions, and the average bit rate of session j of user i is r_ij. So our data are {r_ij : 1 ≤ i ≤ 1000, 1 ≤ j ≤ k_i}.

I generated synthetic data in which each user has k_i = 10 sessions; for each i, user i's mean rate is µ_i ~ Exp(1/1000), and each of user i's k_i sessions has rate drawn from N(µ_i, 1).

We estimate the average delivered bit rate as:

µ̂ = ( Σ_{i=1}^n Σ_{j=1}^{k_i} r_ij ) / ( Σ_{i=1}^n k_i ).

This estimate is 908.0.

Example 4: Clustered data

What is the standard error of this estimate?

Naive approach: take the sample standard deviation of the per-session bit rates, divided by the square root of the number of sessions. For my synthetic dataset this gave ŜE = 9.22.

This is not quite right, however, because sessions are not independent: sessions belonging to the same user are likely to have similar bit rates. How do we adjust for this correlation?

Example 4: Clustered data

Recreate the data generating process with a block bootstrap:
- Sample users (n = 1000) with replacement.
- Each time a user is sampled, include all of their sessions.
The bootstrap distribution is the resulting histogram of µ̂^(b) over the bootstrap samples b. Using this approach gave a standard error of ŜE_BS = 91.6. (Why is this ten times larger than the naive ŜE?)
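A minimal sketch of this block bootstrap, assuming the sessions are stored in a data frame sessions with columns user and rate (hypothetical names):

n_reps = 10000
users = unique(sessions$user)

boot.out = sapply(1:n_reps, function(b) {
  # Sample users with replacement; keep all sessions of each sampled user.
  sampled = sample(users, length(users), replace = TRUE)
  rates = unlist(lapply(sampled, function(u) sessions$rate[sessions$user == u]))
  mean(rates)   # overall mean rate: sum of r_ij over sum of k_i
})

sd(boot.out)    # block bootstrap standard error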

Bootstrap for regression modeling

Resampling the data

As usual for regression, we assume we are given n data points, with observations Y and covariates X. There are two ways to bootstrap in this case:
- Resample new bootstrap observations Ỹ^(b), holding X fixed.
- Resample new data X̃^(b) and Ỹ^(b).
We'll focus on the second (also called case resampling), since it is both more likely to be relevant to you in practice and more robust to modeling errors (though it can be less efficient).

Linear regression via OLS

Suppose we want bootstrap standard errors for OLS coefficients; we can get these as follows. For 1 ≤ b ≤ B:
- Draw n samples uniformly at random, with replacement, from (X, Y). Denote the i-th outcome (resp. covariate) in the b-th sample by Ỹ_i^(b) (resp. X̃_i^(b)).
- Given the resulting data (X̃^(b), Ỹ^(b)) in the b-th sample, run OLS and compute the resulting coefficient vector β̂^(b).
This gives the bootstrap distribution of β̂, and we can use it to, e.g., compute standard errors, confidence intervals, or even bias for each of the coefficients.
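A direct sketch of this procedure in R without the boot package (the data frame df with columns X and Y is an assumed name; a version using the boot library appears on a later slide):

B = 10000
n = nrow(df)
beta.boot = t(sapply(1:B, function(b) {
  idx = sample(1:n, n, replace = TRUE)    # resample rows (cases) with replacement
  coef(lm(Y ~ 1 + X, data = df[idx, ]))   # refit OLS on the b-th bootstrap sample
}))
apply(beta.boot, 2, sd)                   # bootstrap SEs for intercept and slope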

Example: Heteroskedastic data

Suppose for 1 ≤ i ≤ 100, X_i ~ N(0, 1), and Y_i = X_i + ε_i, where ε_i ~ N(0, X_i^4). This data exhibits heteroskedasticity: the error variance is not constant, so the linear normal model assumptions are violated.

What is the effect on the standard error that R reports? We use the bootstrap to find out.
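A minimal sketch of how such data could be simulated (the vectors X and Y are assumed to match the data frame built on the next slide; the seed is arbitrary):

set.seed(1)
n = 100
X = rnorm(n)                        # X_i ~ N(0, 1)
eps = rnorm(n, mean = 0, sd = X^2)  # Var(eps_i) = X_i^4, so sd = X_i^2
Y = X + eps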

Example: Heteroskedastic data

Here is the code to run the bootstrap:

library(boot)

df = data.frame(X, Y)

# Statistic: the OLS coefficients fit on a resample of the rows (cases)
coef.boot = function(data, indices) {
  fm = lm(data = data[indices, ], Y ~ 1 + X)
  return(coef(fm))
}

boot.out = boot(df, coef.boot, 50000)
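The bootstrap standard error of the coefficient on X can then be read off as, for example, sd(boot.out$t[, 2]) (a usage sketch; the second column of boot.out$t holds the slope replicates).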

Example: Heteroskedastic data

If we run lm(data = df, Y ~ 1 + X), R reports a standard error of 0.26 on the coefficient on X (the coefficient estimate itself is 1.42). But using the bootstrap approach, we compute a standard error of 0.62!

Because of the failure of the constant-variance assumption, the bootstrap reports a standard error that is higher than the (optimistic) standard error computed under the assumption of constant variance.

Other applications

Note that you can use the bootstrap in a similar way to compute standard errors, confidence intervals, etc., for regularized regression (ridge, lasso).

The bootstrap is an incredibly flexible tool, with many variations developed to make it applicable in a wide range of settings; in fact, it can even be used to estimate test error in a prediction setting (instead of cross-validation).