The Bootstrap and Jackknife

Summer 2017 Summer Institutes

Bootstrap & Jackknife: Motivation

In scientific research, interest often focuses on the estimation of some unknown parameter, θ. The parameter θ can represent, for example, the mean weight of a certain strain of mice, a heritability index, a genetic component of variation, or a mutation rate. Two key questions need to be addressed:

1. How do we estimate θ?
2. Given an estimator for θ, how do we estimate its precision/accuracy?

We assume Question 1 can be reasonably well specified by the researcher. Question 2, for our purposes, will be addressed via the estimation of the estimator's standard error.

What is a standard error?

Suppose we want to estimate a parameter θ (e.g., the mean, median, or squared-log-mode) of a distribution.

- Our sample is random, so...
- any function of our sample is random, so...
- our estimate, $\hat{\theta}$, is random!

So if we collected a new sample, we'd get a new estimate; the same holds for another sample, and another... Our estimate therefore has a distribution, called the sampling distribution. The standard deviation of that distribution is the standard error.
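To make the sampling distribution concrete, here is a minimal simulation sketch in Python with NumPy (the normal population, sample size, and number of replications are illustrative assumptions, not from the slides): draw many samples from the same population and look at the spread of the resulting estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw many independent samples from the same hypothetical population
# (normal with mean 10, SD 2) and compute the sample mean of each.
estimates = [rng.normal(loc=10, scale=2, size=25).mean() for _ in range(5000)]

# The standard deviation of these estimates is the standard error.
# Theory says it should be close to sigma / sqrt(n) = 2 / 5 = 0.4.
print(np.std(estimates, ddof=1))
```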

Bootstrap: Motivation

Challenges: Answering Question 2 can be quite challenging even for relatively simple estimators (e.g., ratios and other non-linear functions of estimators). For most estimators, the theoretical standard error is mathematically intractable or too complicated to derive (with or without advanced training in statistical inference).

However: Great strides in computing, particularly in the last 25 years, have made computationally intensive calculations feasible. We will investigate how the bootstrap allows us to obtain robust estimates of precision for an estimator $\hat{\theta}$, using a simple example.

Bootstrap Estimation: Estimating the precision of the sample mean

A dataset of n observations provides more than an estimate of the population mean (denoted here as $\bar{X}$), where

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.$$

It also gives an estimate of the precision of $\bar{X}$, namely

$$\widehat{se}(\bar{X}) = \sqrt{\hat{\sigma}^2 / n}, \quad \text{where} \quad \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

is an estimate of the population variance. The limitation of this standard error estimate is that it does not extend to estimators other than $\bar{X}$.

(Recall: the standard deviation is the square root of the variance; the standard error is an estimate of the standard deviation of an estimator's sampling distribution.)
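As a quick numeric check of these formulas, a small sketch in the same style (the data values are hypothetical):

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])  # hypothetical data
n = len(x)

xbar = x.sum() / n                              # sample mean, X-bar
sigma2_hat = ((x - xbar) ** 2).sum() / (n - 1)  # estimated population variance
se_xbar = np.sqrt(sigma2_hat / n)               # estimated SE of X-bar

print(xbar, se_xbar)
```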

Bootstrap Estimation: Estimating the precision of the sample mean

From the formulas on the previous page, we can obtain an estimate of precision for $\bar{X}$ by estimating the population variance and plugging it into the formula for the standard error estimate.

Question: What if you did not know the formula for the standard error of the sample mean, but you had access to a modern PC? How might you obtain an estimate of precision?

Answer: The bootstrap!

Bootstrap Algorithm: Bootstrapping

Assuming the sample accurately reflects the population from which it was drawn:

1. Generate a large number of bootstrap samples by resampling (with replacement) from the dataset. Resample with the same structure (dependence, sample sizes) as used in the original sample.
2. Compute your estimator, $\hat{\theta}$ (here, $\hat{\theta} = \bar{X}$), for each of the bootstrap samples.
3. Compute the standard deviation of the statistics calculated above.

Bootstrap Algorithm

Bootstrap sample → Bootstrap estimate

1: $(X_1^{(1)}, X_2^{(1)}, \ldots, X_n^{(1)})$ → $\hat{\theta}(X_1^{(1)}, X_2^{(1)}, \ldots, X_n^{(1)}) = \bar{X}^{(1)}$
2: $(X_1^{(2)}, X_2^{(2)}, \ldots, X_n^{(2)})$ → $\hat{\theta}(X_1^{(2)}, X_2^{(2)}, \ldots, X_n^{(2)}) = \bar{X}^{(2)}$
⋮
B: $(X_1^{(B)}, X_2^{(B)}, \ldots, X_n^{(B)})$ → $\hat{\theta}(X_1^{(B)}, X_2^{(B)}, \ldots, X_n^{(B)}) = \bar{X}^{(B)}$

Compute

$$\hat{\sigma}_b^2 = \frac{\sum_{j=1}^{B} (\bar{X}^{(j)} - \bar{X}^{(.)})^2}{B - 1}, \quad \text{where} \quad \bar{X}^{(.)} = \frac{1}{B} \sum_{j=1}^{B} \bar{X}^{(j)}.$$

The bootstrap standard error is $\widehat{se}_{boot}(\bar{X}) = \sqrt{\hat{\sigma}_b^2}$. For other estimators, simply replace $\bar{X}$ with the $\hat{\theta}$ of your choice.
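A minimal implementation sketch of this algorithm in Python/NumPy (the function name, B, and the example data are my own choices, not from the slides):

```python
import numpy as np

def bootstrap_se(x, estimator, B=1000, seed=0):
    """Bootstrap standard error of `estimator` applied to 1-D data `x`."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Resample n observations with replacement, B times, and
    # recompute the estimator on each bootstrap sample.
    stats = np.array([estimator(rng.choice(x, size=n, replace=True))
                      for _ in range(B)])
    # Standard deviation of the bootstrap estimates (ddof=1 gives B - 1).
    return stats.std(ddof=1)

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])  # hypothetical data
print(bootstrap_se(x, np.mean))  # bootstrap SE of the sample mean
```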

Bootstrap Estimation: Examples

What is the variance of the sample median? No idea! => Use the bootstrap!

Bootstrapped estimates of the standard error for the sample median:

Data                                 Median
Original sample: {1, 5, 8, 3, 7}     5
Bootstrap 1:     {1, 7, 1, 3, 7}     3
Bootstrap 2:     {7, 3, 8, 8, 3}     7
Bootstrap 3:     {7, 3, 8, 8, 3}     7
Bootstrap 4:     {3, 5, 5, 1, 5}     5
Bootstrap 5:     {1, 1, 5, 1, 8}     1
...
Bootstrap B (= 1000)

Bootstrap Estimation: Examples

Bootstrapped estimates of the standard error for the sample median (cont.)

Descriptive statistics for the sample medians from 1000 bootstrap samples:

B                        1000
Mean                     4.964
Standard deviation       1.914
Median                   5
Minimum, maximum         1, 8
25th, 75th percentile    3, 7

We estimate the standard error for the sample median as 1.914. A 95% asymptotic confidence interval (questionable with n = 5, and using the 0.975 quantile of the standard normal distribution) is 5 ± 1.96(1.914) = (1.25, 8.75).
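The median example above can be reproduced, up to Monte Carlo error, with the same resampling idea; a sketch using the slide's original sample {1, 5, 8, 3, 7}:

```python
import numpy as np

x = np.array([1, 5, 8, 3, 7])  # original sample from the slide
rng = np.random.default_rng(42)

medians = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                    for _ in range(1000)])

se_median = medians.std(ddof=1)   # should land near the slides' 1.914
ci = (np.median(x) - 1.96 * se_median, np.median(x) + 1.96 * se_median)
print(se_median, ci)              # interval near (1.25, 8.75)
```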

Bootstrap Estimation: Examples

Bootstrapped estimates of the standard error for the sample relative risk:

r = P[D | Exposed] / P[D | Not exposed]

Cross-classification of Framingham men by high systolic blood pressure and heart disease:

                       Heart Disease
High Systolic BP       No      Yes
No                     915     48
Yes                    322     44

The sample estimate of the relative risk is r = (44/366) / (48/963) = 2.412.

Bootstrap Estimation: Examples

Bootstrapped estimates of the standard error for the relative risk (cont.)

Descriptive statistics for the sample relative risks:

B                      100,000
Bootstrap mean, $\bar{r}$    2.464
Bootstrap median       2.412
Standard deviation     0.507

The bootstrap standard error for the estimated relative risk is 0.507. A 95% bootstrap confidence interval is 2.412 ± 1.96(0.507) = (1.42, 3.41).
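One way to bootstrap the relative risk is to rebuild individual-level records from the 2×2 table and resample men with replacement; the sketch below assumes that scheme (the helper names and B = 2,000, chosen for speed rather than the slides' 100,000, are mine):

```python
import numpy as np

# Rebuild one record per man from the 2x2 table:
# each row is (high systolic BP, heart disease), coded 0/1.
counts = {(0, 0): 915, (0, 1): 48, (1, 0): 322, (1, 1): 44}
rows = []
for (bp, d), k in counts.items():
    rows += [(bp, d)] * k
data = np.array(rows)

def relative_risk(recs):
    exposed = recs[recs[:, 0] == 1]
    unexposed = recs[recs[:, 0] == 0]
    return exposed[:, 1].mean() / unexposed[:, 1].mean()

rng = np.random.default_rng(0)
n = len(data)
rrs = np.array([relative_risk(data[rng.integers(0, n, size=n)])
                for _ in range(2000)])

print(relative_risk(data))  # about 2.412
print(rrs.std(ddof=1))      # bootstrap SE, about 0.5
```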

Bootstrap Summary

Advantages:
- An all-purpose, computer-intensive method useful for statistical inference.
- Bootstrap estimates of precision do not require knowledge of the theoretical form of an estimator's standard error, no matter how complicated it is.

Disadvantages:
- Typically not useful for correlated (dependent) data.
- Missing data, censoring, and data with outliers are also problematic.
- Often used incorrectly.

Jackknife Estimation

The jackknife (or "leave one out") method, invented by Quenouille (1949), is an alternative resampling method to the bootstrap. The method is based on sequentially deleting one observation from the dataset and recomputing the estimator, here $\hat{\theta}_{(i)}$, n times. That is, there are exactly n jackknife estimates in a sample of size n.

Like the bootstrap, the jackknife method provides a relatively easy way to estimate the precision of an estimator of θ. The jackknife is generally less computationally intensive than the bootstrap.

Jackknife Algorithm: Jackknifing

For a dataset with n observations, compute n estimates by sequentially omitting each observation from the dataset and estimating $\hat{\theta}$ on the remaining n − 1 observations. Using the n jackknife estimates $\hat{\theta}_{(1)}, \hat{\theta}_{(2)}, \ldots, \hat{\theta}_{(n)}$, we estimate the standard error of the estimator as

$$\widehat{se}_{jack} = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} \left(\hat{\theta}_{(i)} - \hat{\theta}_{(.)}\right)^2}, \quad \text{where} \quad \hat{\theta}_{(.)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)}.$$

Unlike the bootstrap, the jackknife standard error estimate will not change for a given sample: there is no resampling randomness.
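A minimal jackknife sketch in the same style (helper name and data are mine); note that for the sample mean, the jackknife standard error reproduces the usual $s/\sqrt{n}$ exactly:

```python
import numpy as np

def jackknife_se(x, estimator):
    """Jackknife standard error: leave each observation out once."""
    n = len(x)
    # theta_hat_(i): the estimator computed with observation i removed.
    theta_i = np.array([estimator(np.delete(x, i)) for i in range(n)])
    theta_dot = theta_i.mean()
    return np.sqrt((n - 1) / n * ((theta_i - theta_dot) ** 2).sum())

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])  # hypothetical data
print(jackknife_se(x, np.mean))  # equals x.std(ddof=1) / sqrt(len(x))
```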

Jackknife Summary

Advantages:
- A useful method for estimating and compensating for bias in an estimator.
- Like the bootstrap, the methodology does not require knowledge of the theoretical form of an estimator's standard error.
- Generally less computationally intensive than the bootstrap method.

Disadvantages:
- The jackknife method is more conservative than the bootstrap method; that is, its estimated standard error tends to be slightly larger.
- Performs poorly when the estimator is not sufficiently smooth; the median is one non-smooth statistic for which the jackknife performs poorly.