Introduction to the Bootstrap and Robust Statistics

PSY711/712 Winter Term 2009

1 t-based Confidence Intervals

This document shows how to calculate confidence intervals for a sample mean using the t statistic in R.

1.1 using a for loop

Here is some R code that uses a for loop to generate a confidence interval with the percentile-t bootstrap method. The first thing we need to do is figure out how to extract a t value from the output of t.test:

> n <- 20
> mu <- 50
> sigma <- 10
> set.seed(321)
> mysample <- rnorm(n=n, mean=mu, sd=sigma)
> t.test.result <- t.test(mysample)
> names(t.test.result)
[1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"
[6] "null.value"  "alternative" "method"      "data.name"
> (the.t.value <- t.test.result$statistic)
      t
27.5021

Now that we can extract the t statistic, we can do the bootstrap. In the following code, notice that the call to t.test inside the for loop uses the sample mean, m, as an estimate of the population mean, µ. Although using m in this way may seem puzzling, it is consistent with the general bootstrap technique of using the sample as a stand-in for the actual population of scores.

> # need sample mean and sd:
> m <- mean(mysample)  # mean of original sample
> s <- sd(mysample)
> # now do the bootstrap:
> R <- 1999
> boot.t.value <- rep(0, R)  # array for holding bootstrapped t's
> for (k in 1:R){
+   boot.sample <- sample(mysample, replace=TRUE)  # bootstrapped sample
+   # do t test:
+   boot.t.test <- t.test(boot.sample, mu=m)  # notice that mu is the sample mean!
+   boot.t.value[k] <- boot.t.test$statistic  # extract t value
+ }
> # confidence interval:
> alpha <- .05
> critical.t.low <- quantile(boot.t.value, alpha/2)
> critical.t.high <- quantile(boot.t.value, 1-alpha/2)
> ci.95.bootstrap <- c(m - critical.t.high*s/sqrt(n), m - critical.t.low*s/sqrt(n))
> names(ci.95.bootstrap) <- c("2.5%","97.5%")
> ci.95.bootstrap
    2.5%    97.5%
49.17219 57.62760

For comparison, here is the confidence interval calculated using classical methods:

> t.low <- qt(p=alpha/2, df=n-1)
> t.high <- qt(p=1-alpha/2, df=n-1)
> ci.95 <- c(m - t.high*s/sqrt(n), m - t.low*s/sqrt(n))
> names(ci.95) <- c("2.5%","97.5%")
> ci.95
    2.5%    97.5%
48.91693 56.97581
> # alternative method:
> t.test(mysample)$conf.int[1:2]
[1] 48.91693 56.97581

The fact that the two intervals are so similar inspires some confidence in the bootstrap.

1.2 using boot and boot.ci

This section describes how to use the functions boot and boot.ci in the boot package to calculate percentile-t confidence intervals. To use boot we need to write a function that calculates the statistic of interest (in this case, the mean) on each bootstrapped sample. To calculate a percentile-t confidence interval, we have to construct our function so that it also returns the variance of the estimated statistic. The function boot.mean fulfills these requirements:

> boot.mean <- function(thedata, ids){
+   boot.sample <- thedata[ids]  # bootstrapped sample
+   m <- mean(boot.sample)       # bootstrapped mean (the statistic of interest)
+   n <- length(boot.sample)
+   s <- sd(boot.sample)         # pop sd estimated from bootstrapped sample
+   sem <- s/sqrt(n)             # standard error of the bootstrapped mean
+   m.var <- sem^2   # IMPORTANT: boot.ci requires the estimated VARIANCE of the statistic
+   results <- c(m, m.var)  # boot expects the statistic to be 1st and variance to be 2nd
+   return(results)
+ }

Notice that boot.mean returns two values: the mean of the bootstrapped sample and the variance of the sample mean. Also note that the order of the values matters: boot expects the statistic to be given first, and the variance of the statistic second. Next, we use boot to do the bootstrap and boot.ci to calculate the confidence interval. The percentile-t confidence interval is requested with type="stud", for "studentized" confidence interval.
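Before running the full bootstrap, a quick sanity check on boot.mean can't hurt. boot calls our function with a vector of resampled indices as the second argument, so with the identity indices 1:n it should simply return the original sample mean and the squared standard error of that mean. A minimal sketch (the toy sample x, and the repeated definition of boot.mean, are only there to make the snippet self-contained):

```r
# boot.mean, repeated here so the snippet runs on its own:
boot.mean <- function(thedata, ids){
  boot.sample <- thedata[ids]
  m <- mean(boot.sample)
  n <- length(boot.sample)
  sem <- sd(boot.sample)/sqrt(n)
  c(m, sem^2)  # statistic first, its variance second
}
x <- c(2, 4, 6, 8, 10)            # toy sample, for illustration only
check <- boot.mean(x, 1:length(x))
# with identity indices the "bootstrapped" sample IS the original sample,
# so check[1] equals mean(x) and check[2] equals var(x)/length(x):
stopifnot(all.equal(check[1], mean(x)))
stopifnot(all.equal(check[2], var(x)/length(x)))
```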

> library(boot)  # load package
> R <- 1999
> boot.results <- boot(mysample, statistic=boot.mean, R=R)
> boot.ci(boot.results, type="stud")  # studentized, or "stud", equals percentile-t
boot.ci(boot.out = boot.results, type = "stud")
95%   (49.39, 57.53 )

Not surprisingly, the confidence interval is similar to the one calculated with the classical method.

1.3 iterative bootstraps

Using a bootstrap does not eliminate the problems caused by outliers. Consider, for example, the confidence interval of the mean for the following data:

> set.seed(32)
> mysample <- rnorm(n=20)
> b <- boot(mysample, boot.mean, R=1999)
95%   (-0.0388, 0.4724 )
> mysample[20] <- 1000
> b <- boot(mysample, boot.mean, R=1999)
95%   ( -10.91, 28982.05 )

Introducing a single outlier results in a dramatic increase in the width of the confidence interval. What is needed here, obviously, is a more robust estimate of central tendency. One possibility is the sample median. Why not alter boot.mean so that it calculates the median and estimates the variance of the sample median? Well, one problem is that it is not obvious how to calculate the variance of a sample median.¹ One way to get around this problem is to do a bootstrap on the bootstrapped sample. The following code shows how to do this bootstrap-within-a-bootstrap.

> boot.median <- function(thedata, ids){
+   boot.sample <- thedata[ids]  # bootstrapped sample
+   m <- median(boot.sample)     # bootstrapped median
+   n <- length(boot.sample)
+   R <- 199
+   # a for loop is TOO SLOW:
+   # reboot.medians <- rep(0, R)
+   # for(k in 1:R){
+   #   reboot.sample <- sample(boot.sample, replace=TRUE)
+   #   reboot.medians[k] <- median(reboot.sample)
+   # }
+   # the following three lines are MUCH faster than the for loop:
+   # create n x R matrix of resampled values:
+   reboot.samples <- matrix(data=sample(boot.sample, size=R*n, replace=TRUE), nrow=n)
+   # calculate the median of each n=20 column:
+   reboot.medians <- apply(reboot.samples, MARGIN=2, median)
+   # calculate variance of medians:
+   m.var <- var(reboot.medians)
+   # return results:
+   results <- c(m, m.var)
+   return(results)
+ }
> b <- boot(mysample, boot.median, R=1999)
95%   (-0.0217, 0.6303 )

Even without the for loop, the bootstrap takes about 1 minute to run on my computer (a MacBook Pro running OS X 10.5.5 with a 2.53 GHz Intel Core 2 Duo CPU). As expected, the confidence interval of the median is much narrower than the interval for the mean.

¹ There are, in fact, equations that can be used to estimate the standard error, and hence variance, of a sample median. However, for the purpose of this exercise we will ignore those equations.
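For the curious, here is a sketch of the kind of equation the footnote sets aside: asymptotically, SE(median) ≈ 1/(2·f(med)·√n), where f is the population density at the median. The density is unknown, so this sketch estimates it with a kernel density estimate (an extra assumption, not part of the handout's method) and compares the result with a direct bootstrap of the median on a toy sample:

```r
# asymptotic formula (sketch): SE(median) ~ 1 / (2 * f(med) * sqrt(n)),
# with f estimated by a kernel density estimate
set.seed(32)
x <- rnorm(100)                        # toy sample
n <- length(x)
med <- median(x)
d <- density(x)                        # kernel density estimate of f
f.med <- approx(d$x, d$y, xout=med)$y  # estimated density height at the median
se.med <- 1/(2*f.med*sqrt(n))
# compare with a direct bootstrap estimate of the same standard error:
boot.medians <- replicate(1999, median(sample(x, replace=TRUE)))
c(formula=se.med, bootstrap=sd(boot.medians))  # the two should be similar
```

Both estimates target the same quantity; the bootstrap version is the one the handout builds on because it needs no density estimate.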

1.4 limitations of t-based confidence intervals

There are many cases where our data consist of numbers that are ≥ 0. For instance, we may be measuring the number of offspring in different litters; the number of defects in silicon chips; the interval between the eruptions of a geyser; etc. In such cases, the sample mean must be ≥ 0. Notice, however, that confidence intervals calculated using t do not place any restriction on the lower value. In other words, a confidence interval based on t may include negative numbers even in cases where it is impossible to obtain such values. In other contexts, confidence intervals may exceed a maximum possible value for our measurements. The following R code illustrates this problem with the classical method. In this example, we construct a data vector that must contain non-negative values:

> thedata <- c(0,0,1,1,2,3,15,35)
> m <- mean(thedata)
> s <- sd(thedata)
> n <- length(thedata)
> alpha <- .05
> t.low <- qt(p=alpha/2, df=n-1)
> t.high <- qt(p=1-alpha/2, df=n-1)
> ci.95 <- c(m - t.high*s/sqrt(n), m - t.low*s/sqrt(n))
> names(ci.95) <- c("2.5%","97.5%")
> ci.95
     2.5%     97.5%
-3.157305 17.407305

A similar problem can occur with the percentile-t bootstrap:

> set.seed(1)
> b <- boot(thedata, boot.mean, R=999)
Based on 999 bootstrap replicates
95%   (-0.367, 94.388 )

although, because of the random nature of the bootstrap, another bootstrap on the same data may yield a plausible confidence interval:

> set.seed(121)
> b <- boot(thedata, boot.mean, R=999)
Based on 999 bootstrap replicates

95%   ( 0.156, 84.319 )

Clearly, t-based methods can yield implausible confidence intervals. The problem here is that the requirement that the data be ≥ 0 is a gross violation of the normality assumption that underlies the use of t to construct a confidence interval. Interestingly, confidence intervals based on the percentile bootstrap do not suffer from this shortcoming. The following code re-creates the bootstrap that produced the implausible studentized confidence interval and uses it to calculate the percentile and BCa confidence intervals:

> set.seed(1)
> b <- boot(thedata, boot.mean, R=999)
> boot.ci(b, type=c("stud","perc","bca"))
Based on 999 bootstrap replicates
boot.ci(boot.out = b, type = c("stud", "perc", "bca"))
      Studentized           Percentile           BCa
95%   (-0.367, 94.388 )     ( 0.875, 16.250 )    ( 1.375, 19.500 )
Some BCa intervals may be unstable

Unlike the studentized interval, the percentile and BCa intervals are plausible, although boot.ci returns a warning message suggesting that these intervals may not be reliable (probably because R is too small).
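To see why the percentile interval respects the ≥ 0 constraint, note that it is (apart from small interpolation refinements made by boot.ci) just the empirical 2.5% and 97.5% quantiles of the bootstrapped statistics, so its endpoints can never fall outside the range of the bootstrapped means. A sketch of the idea, using base R's sample and replicate rather than the boot package (so the numbers will differ slightly from boot.ci's output):

```r
# sketch: a percentile interval is just empirical quantiles of the
# bootstrapped statistics, here computed without the boot package
set.seed(1)
thedata <- c(0, 0, 1, 1, 2, 3, 15, 35)   # the non-negative data from above
boot.means <- replicate(999, mean(sample(thedata, replace=TRUE)))
perc.ci <- quantile(boot.means, c(.025, .975))
perc.ci
# every bootstrapped mean is an average of non-negative values,
# so both endpoints are necessarily >= 0:
stopifnot(all(perc.ci >= 0))
```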