
Lecture 12. Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Johns Hopkins University. August 23, 2007.

Outline:
1. The jackknife
2. Introducing the bootstrap
3. The bootstrap algorithm
4. Example bootstrap calculations
5. Discussion

The jackknife is a tool for estimating standard errors and the bias of estimators. As its name suggests, the jackknife is a small, handy tool; the bootstrap, in contrast, is the moral equivalent of a giant workshop full of tools. Both the jackknife and the bootstrap involve resampling data; that is, repeatedly creating new data sets from the original data.

The jackknife deletes each observation in turn and calculates an estimate based on the remaining n - 1 of them. It uses this collection of estimates to do things like estimate the bias and the standard error. Note that estimating the bias and standard error is not needed for statistics like sample means, which we know are unbiased estimates of population means and whose standard errors we know.

We'll consider the jackknife for univariate data. Let $X_1, \ldots, X_n$ be a collection of data used to estimate a parameter $\theta$. Let $\hat{\theta}$ be the estimate based on the full data set. Let $\hat{\theta}_i$ be the estimate of $\theta$ obtained by deleting observation $i$. Let $\bar{\theta} = \frac{1}{n}\sum_{i=1}^n \hat{\theta}_i$.

Then the jackknife estimate of the bias is
$$(n-1)\left(\bar{\theta} - \hat{\theta}\right)$$
(how far the average delete-one estimate is from the actual estimate), and the jackknife estimate of the standard error is
$$\left[\frac{n-1}{n}\sum_{i=1}^n \left(\hat{\theta}_i - \bar{\theta}\right)^2\right]^{1/2}$$
(the deviation of the delete-one estimates from the average delete-one estimate).
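To make these formulas concrete, here is a minimal, self-contained sketch on simulated data (the data are made up for illustration; the lecture's gray matter example follows below). For the sample mean, the bias estimate comes out as zero, up to floating-point rounding, consistent with the earlier remark that means do not need a bias correction:

set.seed(1)
x <- rnorm(20)                                   # illustrative data
n <- length(x)
thetahat <- mean(x)                              # full-data estimate
loo <- sapply(1:n, function(i) mean(x[-i]))      # delete-one estimates
thetabar <- mean(loo)                            # average delete-one estimate
(n - 1) * (thetabar - thetahat)                  # bias estimate: 0 up to rounding
sqrt((n - 1) * mean((loo - thetabar)^2))         # jackknife standard error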

Example. Consider the data set of 630 measurements of gray matter volume for workers from a lead manufacturing plant. The median gray matter volume is around 589 cubic centimeters. We want to estimate the bias and standard error of the median.

Example. The gist of the code:

n <- length(gmvol)
theta <- median(gmvol)                            # full-data estimate
jk <- sapply(1:n, function(i) median(gmvol[-i]))  # delete-one medians
thetabar <- mean(jk)                              # average delete-one estimate
biasest <- (n - 1) * (thetabar - theta)           # jackknife bias estimate
seest <- sqrt((n - 1) * mean((jk - thetabar)^2))  # jackknife standard error

Example. Or, using the bootstrap package:

library(bootstrap)
out <- jackknife(gmvol, median)
out$jack.se
out$jack.bias

Example. Both methods (of course) yield an estimated bias of 0 and a standard error of 9.94. Odd little fact: the jackknife estimate of the bias for the median is always 0 when the number of observations is even. It has been shown that the jackknife is a linear approximation to the bootstrap. Generally, do not use the jackknife for sample quantiles like the median, as it has been shown to have some poor properties.
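That odd fact is easy to check by simulation (a throwaway sketch with made-up data): with an even number of observations, every delete-one median is one of the two middle order statistics, and those average back to the full-sample median, so the bias estimate vanishes:

set.seed(2)
x <- rnorm(10)                                   # even number of observations
n <- length(x)
jk <- sapply(1:n, function(i) median(x[-i]))     # delete-one medians
(n - 1) * (mean(jk) - median(x))                 # 0 up to floating-point rounding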

Pseudo observations. Another interesting way to think about the jackknife uses pseudo observations. Let
$$\text{Pseudo Obs}_i = n\hat{\theta} - (n-1)\hat{\theta}_i.$$
Think of these as whatever observation $i$ contributes to the estimate of $\theta$. Note that when $\hat{\theta}$ is the sample mean, the pseudo observations are the data themselves. The sample standard error of these observations is the jackknife estimated standard error from before, and the mean of these observations is a bias-corrected estimate of $\theta$.
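A short sketch of the pseudo-observation computation, again on simulated data (the vector x is made up for illustration): for any estimator, the standard error of the mean of the pseudo observations reproduces the jackknife standard error, and their mean gives the bias-corrected estimate:

set.seed(3)
x <- rnorm(20)                                   # illustrative data
n <- length(x)
thetahat <- median(x)
loo <- sapply(1:n, function(i) median(x[-i]))    # delete-one estimates
pseudo <- n * thetahat - (n - 1) * loo           # pseudo observations
mean(pseudo)                                     # bias-corrected estimate of theta
sd(pseudo) / sqrt(n)                             # matches the jackknife standard error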

Example: Tom's notes

The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics. For example, how would one derive a confidence interval for the median? The procedure follows from the so-called bootstrap principle.

Suppose that I have a statistic that estimates some population parameter, but I don't know its sampling distribution. The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution.

The bootstrap in practice. In practice, the bootstrap is always carried out using simulation. We will cover only a few aspects of bootstrap resampling. The general procedure follows by first simulating complete data sets from the observed data, sampling with replacement. This is approximately drawing from the sampling distribution of the statistic, at least as far as the data are able to approximate the true population distribution. Calculate the statistic for each simulated data set. Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error.

Example. Consider again the data set of 630 measurements of gray matter volume for workers from a lead manufacturing plant. The median gray matter volume is around 589 cubic centimeters. We want a confidence interval for the median of these measurements.

Bootstrap procedure for calculating a confidence interval for the median from a data set of n observations:

i. Sample n observations with replacement from the observed data, resulting in one simulated complete data set.
ii. Take the median of the simulated data set.
iii. Repeat these two steps B times, resulting in B simulated medians.
iv. These medians are approximately draws from the sampling distribution of the median of n observations; therefore we can:
- Draw a histogram of them.
- Calculate their standard deviation to estimate the standard error of the median.
- Take the 2.5th and 97.5th percentiles as a confidence interval for the median.

Example code:

B <- 1000                                # number of bootstrap resamples
n <- length(gmvol)
# each row of the matrix is one resampled data set of size n
resamples <- matrix(sample(gmvol, n * B, replace = TRUE), B, n)
medians <- apply(resamples, 1, median)   # median of each resample
sd(medians)                              # bootstrap standard error estimate
[1] 3.148706
quantile(medians, c(.025, .975))         # percentile confidence interval
    2.5%    97.5%
582.6384 595.3553

[Figure: density of the bootstrapped medians; x-axis: Gray Matter Volume (roughly 580 to 605 cc), y-axis: density.]
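A figure like the one above can be reproduced from the simulated medians (assuming the medians vector from the example code is in the workspace):

plot(density(medians), xlab = "Gray Matter Volume", main = "")
# or, for a histogram on the density scale:
hist(medians, freq = FALSE, xlab = "Gray Matter Volume")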

Notes on the bootstrap. The bootstrap described here is non-parametric. However, the theoretical arguments proving the validity of the bootstrap rely on large samples. Better percentile bootstrap confidence intervals correct for bias. There are lots of variations on bootstrap procedures; the book An Introduction to the Bootstrap by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information.

library(boot)
stat <- function(x, i) median(x[i])      # statistic as a function of the data and resampled indices
boot.out <- boot(data = gmvol, statistic = stat, R = 1000)
boot.ci(boot.out)

Level      Percentile            BCa
95%    (583.1, 595.2)    (583.2, 595.3)
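By default, boot.ci reports several interval types; the percentile and BCa intervals shown above can be requested explicitly with the type argument (part of the boot package's documented interface):

boot.ci(boot.out, conf = 0.95, type = c("perc", "bca"))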