Laboratory #11. Bootstrap Estimates

Similar documents
Minitab Guide for MA330

a. divided by the. 1) Always round!! a) Even if class width comes out to a, go up one.

The Bootstrap and Jackknife

GenStat for Schools. Disappearing Rock Wren in Fiordland

Quantitative - One Population

Lecture 12. August 23, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

1. Select[] is used to select items meeting a specified criterion.

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Bootstrap Confidence Interval of the Difference Between Two Process Capability Indices

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea

Excel 2010 with XLSTAT

Evaluating generalization (validation) Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support

Central Limit Theorem Sample Means

Notes on Simulations in SAS Studio

Chapter 3: Data Description Calculate Mean, Median, Mode, Range, Variation, Standard Deviation, Quartiles, standard scores; construct Boxplots.

Practical 2: Using Minitab (not assessed, for practice only!)

Stat 528 (Autumn 2008) Density Curves and the Normal Distribution. Measures of center and spread. Features of the normal distribution

Page 1. Graphical and Numerical Statistics

In Minitab interface has two windows named Session window and Worksheet window.

A. Incorrect! This would be the negative of the range. B. Correct! The range is the maximum data value minus the minimum data value.

Question. Dinner at the Urquhart House. Data, Statistics, and Spreadsheets. Data. Types of Data. Statistics and Data

Statistics 528: Minitab Handout 1

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016

Table Of Contents. Table Of Contents

Week 7: The normal distribution and sample means

L E A R N I N G O B JE C T I V E S

Averages and Variation

You will begin by exploring the locations of the long term care facilities in Massachusetts using descriptive statistics.

Chapter 3. Descriptive Measures. Slide 3-2. Copyright 2012, 2008, 2005 Pearson Education, Inc.

STA Module 4 The Normal Distribution

STA /25/12. Module 4 The Normal Distribution. Learning Objectives. Let s Look at Some Examples of Normal Curves

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)

Fly wing length data Sokal and Rohlf Box 10.1 Ch13.xls. on chalk board


Chapter 2 Modeling Distributions of Data

Descriptive Statistics and Graphing

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc.

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

1. Estimation equations for strip transect sampling, using notation consistent with that used to

Cross-validation and the Bootstrap

CHAPTER 2 Modeling Distributions of Data

Selected Introductory Statistical and Data Manipulation Procedures. Gordon & Johnson 2002 Minitab version 13.

Your Name: Section: 2. To develop an understanding of the standard deviation as a measure of spread.

Measures of Dispersion

Chapters 5-6: Statistical Inference Methods

Descriptive Statistics, Standard Deviation and Standard Error

= = P. IE 434 Homework 2 Process Capability. Kate Gilland 10/2/13. Figure 1: Capability Analysis

An Introduction to the Bootstrap

Chapter 2: The Normal Distribution

Chapter 2: The Normal Distributions

Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users

IT 403 Practice Problems (1-2) Answers

3 Graphical Displays of Data

Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University

Minitab 17 commands Prepared by Jeffrey S. Simonoff

STATISTICAL THINKING IN PYTHON II. Generating bootstrap replicates

packet-switched networks. For example, multimedia applications which process

BASIC SIMULATION CONCEPTS

Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D.

STAT 113: Lab 9. Colin Reimer Dawson. Last revised November 10, 2015

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

Curve fitting. Lab. Formulation. Truncation Error Round-off. Measurement. Good data. Not as good data. Least squares polynomials.

Guide to Running TEMAP2.R. The R-program, TEMAP2.R, fits the TE models to the data; it can also

CHAPTER 3: Data Description

Minitab Notes for Activity 1

Problem Set #8. Econ 103

MS&E 226: Small Data

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

STA 570 Spring Lecture 5 Tuesday, Feb 1

Statistical Graphics

225L Acceleration of Gravity and Measurement Statistics

Confidence Intervals. Dennis Sun Data 301

Introduction to the Monte Carlo Method Ryan Godwin ECON 7010

Cluster Randomization Create Cluster Means Dataset

Unit 1: The One-Factor ANOVA as a Generalization of the Two-Sample t Test

Cross-validation and the Bootstrap

Sample some Pi Monte. Introduction. Creating the Simulation. Answers & Teacher Notes

Chapter 2 Describing, Exploring, and Comparing Data

Lab 7 Statistics I LAB 7 QUICK VIEW

Multivariate Capability Analysis

STATS PAD USER MANUAL

Harris: Quantitative Chemical Analysis, Eight Edition CHAPTER 05: QUALITY ASSURANCE AND CALIBRATION METHODS

Introduction to Minitab 1

On the Reliability of System Identification: Applications of Bootstrap Theory

More Summer Program t-shirts

Elementary Statistics: Looking at the Big Picture

Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition

IQC monitoring in laboratory networks

Lecture 6: Chapter 6 Summary

Probability and Statistics for Final Year Engineering Students

Chapter 2: Frequency Distributions

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Bootstrap confidence intervals Class 24, Jeremy Orloff and Jonathan Bloom

Example how not to do it: JMP in a nutshell 1 HR, 17 Apr Subject Gender Condition Turn Reactiontime. A1 male filler

Problem 1: Hello World!

An Introduction to Minitab Statistics 529

And the benefits are immediate minimal changes to the interface allow you and your teams to access these

Chapter 3 - Displaying and Summarizing Quantitative Data

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys

Minitab on the Math OWL Computers (Windows NT)

Transcription:

Name Laboratory #11. Bootstrap Estimates Randomization methods so far have been used to compute p-values for hypothesis testing. Randomization methods can also be used to place confidence limits around estimates. Randomization methods can be used to place a confidence limit around any statistic, not just those that have known statistical distributions. As can be seen by referring to the key for choosing a frequency distribution (see Handouts), randomization methods using the "bootstrap" method are especially useful when errors around an estimate are not normally distributed, or when the statistical distribution for a statistic is unknown. The bootstrap is a technique for estimating a statistic and its distribution. It uses resampling. The resampling for randomization tests (Lab 4) was carried out WITHOUT replacement. The values of the response variable (in Minitab column c1, for example) were sampled without replacement into a new arrangement (in Minitab column c2, for example). Each value in the first arrangement (c1) occurs only once in the second (c2). The advantage of repeated sampling and calculation, of course, was that a statistical analysis could be undertaken without making assumptions about the distribution of outcomes. Bootstrap methods use sampling WITH replacement. That is, any individual observation in one column can be sampled repeatedly. Any single observation in column c1 can appear repeatedly in c2. Or it may not appear at all in c2. The bootstrap was invented by Bradley Efron (Annals of Statistics 7:1-26). It is described in Manly (1991: Randomization and Monte Carlo Methods in Biology Chapman Hall). The purpose of this lab is: 6 to describe the bootstrap 6 to show how to compute a bootstrap estimate using Minitab 6 to show how to devise an accurate computational procedure in Minitab. To complete this lab, you will need a package that returns a random sample of values from a column of data. Most statistical packages will do this. The basic version of commercially available spreadsheets do not have functions to do this directly. You will also need a statistical package or spreadsheet that will execute a command repeatedly in order to accumulate the values of a statistic from repeated runs into a single column. This can be done readily with line code commands. It is hard to execute a batch file from a graphics interface.

Laboratory #11. Bootstrap Estimates The Bootstrap. The idea behind the bootstrap is simple. A sample of n values of a variable quantity Q = Q 1, Q 2,..., Q n is taken from a population and used to estimate some parameter, such as the mean of the population, or the skewness of the population. The true value of the parameter can never be known exactly unless the entire population is sampled, but of course the parameter can be estimated from the n observations. The sampling variation in this estimate is assessed by taking random samples of size n. This collection of random samples of size n can be as large as we like. We are going to regard this observed distribution as the best approximation of the true distribution of all possible samples from the population. The samples that make up this approximate distribution of Q are called bootstrap samples. Each sample is used to make a bootstrap estimate of the true parameter (mean, slope, variance, diversity index, etc). The distribution of these bootstrap estimates, which we require for hypothesis testing and stating confidence limits, can be obtained by repeated sampling and then constructing an observed frequency distribution. Example: Mandible lengths of male golden jackals (Manly 1991) Q = [ 120 107 110 116 114 111 113 117 114 112 ] @ mm The observed mean value is mean(q) = 114.3 mm. This is an estimate of a parameter, the true mean. To demonstrate how bootstrap estimates work we will to obtain a bootstrap estimate of the mean. This will be done by taking 10 samples (WITH replacement) from the collection of 10 observations, computing the mean from each sample, and then collecting 500 of these estimates. To do this, we will use Minitab to sample WITH replacement. The next page shows a series of steps for determining how to accomplish this in Minitab.

1. State the computation, in words. 2. Find a set of commands in your package to execute the procedure 3. Check the commands with a sample set of made-up data. 4. If it is not correct, look for another set of commands. 5. Repeat until a correct procedure is found. Laboratory #11. Bootstrap Estimates Name Table 11.1. Generic recipe for devising computational procedure in any statistical package. Step 1. State the computation. sample with replacement from c1 into c2 Step 2. Use the help file in your package to find the commands. MTB > Help Commands You may need to choose a collection of commands: MTB > Help commands 15 E Step 3. Which command are you going to use to sample with replacement from c1 into c2? Now display the help file for this command. MTB > set into c1 DATA> 3 4 5 6 7 8 DATA> end Next, see if this command works on a small data set. Put some data into a column for a test run.

Laboratory #11. Bootstrap Estimates If you sample from c1 into c2 with replacement, what do expect to see in c2? (this will be marked as present absent, not on whether it is correct) Now execute the command to sample from c1 into c2. Then display c1 c2. You should see some of the numbers in c1 several times in c2. You should also see that some numbers from c1 are missing from c2. If this is not happening, the computational procedure is probably not correct. Step 4. If it is not correct, look for another set of commands. Step 5. Repeat until an accurate procedure is found. * * * Bootstrap estimates. Mean mandible lengths of golden jackals. Now that we have a computational procedure for sampling with replacement, we can move on to making bootstrap estimates of a statistic. The first statistic will be the mean for mandible lengths of male golden jackals. Example: Mandible lengths of male golden jackals (Manly 1991) Q = [ 120 107 110 116 114 111 113 117 114 112 ] @ mm Set these 10 observations into column 1 of your statistical spreadsheet Define Data from keyboard Next, place a random sample of 10 observations from c1 into c2.

Next, calculate the mean value of the random sample in c2 and store this in k1. You may need to use the help facility to do this. The mean in c1 and c2 should differ slightly. Laboratory #11. Bootstrap Estimates Name Next, store your bootstrap estimate of the difference in two means at the top of column 3 You may need the help facility. Once you have several values (random means) in column c3, write a batch file to add a bootstrap estimate to column 3 every time you execute the file. MTB > Store Jackal.ctl STOR> Create control file Make sure the batch file is adding a new and slight different estimate to c3 each time the batch file is executed. Once the batch file is working correctly, then Tape, paste, or write out by hand the batch file:

Laboratory #11. Bootstrap Estimates Bootstrap estimates. Confidence intervals. Name MTB > n c3 Now accumulate 500 bootstrap estimates of the mean in c3. To make sure you have 500 observations Compute the mean value of the bootstrap estimates. How does this compare to the estimate of 113.4? A bootstrap estimate of a mean is hardly necessary because simply taking the sum and dividing by the number of observations will produce a perfectly good estimate. But what about the confidence limits around the mean? A theoretical distribution may not be available for obtaining a confidence limit, in which case a bootstrap estimate of the confidence limits will be useful. You will recall that the confidence limit is a range around an estimate that will include the true value of the parameter 95% of the time (or 99% of the time, or whatever). To obtain a rough idea of the 95% confidence limits around the bootstrap estimate of the mean, construct a histogram of the bootstrap estimates in c3. Tape or paste your histogram here

MTB > describe c1 N MEAN MEDIAN TRMEAN STDEV SEMEAN C1 10 113.40 113.50 113.37 3.72 1.18 MTB > invcdf.95; SUBC> t 9. 0.9500 1.8331 Laboratory #11. Bootstrap Estimates Name Based on this histogram, make a rough guess and write out the upper and lower limits that include 95% of the bootstrap estimates (answer does not have to be exact). Upper limit = mm Lower limit = mm Next, try to think of a way to determine the 95% limits exactly, from the 500 bootstrap estimates in c3 (this will be marked as present/absent not on whether it is correct). There are several ways of computing the 95% limits exactly for the 500 bootstrap estimates in c3. One is to use the percentile function found in many packages, including spreadsheets. Another is to use the sort function, found in almost all packages. With the sort function you will need to count downward in the sorted column, to reach the value for which 2.5% of the values are smaller.. You will then need to count upward in the column to reach the value for which 2.5% of the values are larger. Try the percentile method. If you package does not make the computation, try the sort method. Compute the upper and lower limits that include 95% of the bootstrap estimates. Upper limit = mm Lower limit = mm Now compute then compare these bootstrap estimates of the 95% confidence limits to the theoretical limits using the t-distribution. Here are the computations: What is wrong with this invcdf command wrong?

Laboratory #11. Bootstrap Estimates Name MTB > invcdf.975; SUBC> t 9. 0.9750 2.2622 Now compute the theoretically based upper and lower limits using: the mean, the standard deviation, sample number n, and the t-statistic: MTB > let k1 = 113.4-2.2622*stdev(c1)/sqrt(10) MTB > let k2 = 113.4 + 2.2622*stdev(c1)/sqrt(10) MTB > print k1 k2 K1 110.740 K2 116.060 Write up for this laboratory: 1. Work through the bootstrap example shown above, filling in all of the blanks. 2. Use bootstrap methods to set 95% confidence limits on the proportion of ants in the diet of short-horned lizards Phrynosoma douglassi brevirostre, from near Bow Island, Alberta during the month of June (data from Manly 1991 p 247). Use 1000 bootstrap estimates to obtain a good estimate of the upper and lower limits. p = [ 33 100 43 100 14 64 40 0.0 47 ] @ % Upper limit = mm Lower limit = mm Compute the upper and lower limit 95% confidence limits using a t-distribution. Upper limit = mm Lower limit = mm Should you report the first or the second pair of confidence intervals in a thesis or other document? Why?