ST697F: Topics in Regression. Spring 2007.
4.7 Extensions; 4.8 References; 5 Bootstrapping; worked examples.


4.7 Extensions

The approach above can be readily extended to the case where we are interested in inverse prediction or regulation for one predictor, given that the other predictors are known (for inverse prediction) or set at fixed values (for regulation). For example, suppose there are two predictors $x_1$ and $x_2$ and the model for the mean is $\beta_0 + \beta_1 x_1 + \beta_2 x_2$. For inverse prediction, suppose there is a new unit with a known $x_2$ value, say $x_{20}$, but an unknown value of $x_1$, say $x_{10}$. We would estimate the unknown $x_{10}$ via $\hat{x}_{10} = (Y_0 - \hat\beta_0 - \hat\beta_2 x_{20})/\hat\beta_1$. Similarly, for regulation we could ask: for what $x_1$ is the expected value of $Y$ equal to a specified constant $c$ when the second predictor is set at $x_{20}$? The answer is $\rho = (c - \beta_0 - \beta_2 x_{20})/\beta_1$, which would be estimated by $\hat\rho = (c - \hat\beta_0 - \hat\beta_2 x_{20})/\hat\beta_1$. Both of these problems have the form of a ratio, so we can apply Fieller's result or the delta method with appropriate definitions of $\sigma_{11}$, $\sigma_{22}$ and $\sigma_{12}$. The problem of doing inverse prediction or regulation for multiple x's is more complicated, but notice that you can always attack this by trying to invert prediction or confidence intervals.
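To make the delta-method calculation concrete, here is a minimal PROC IML sketch (in the style of the programs later in these notes) for the regulation estimate $\hat\rho$ and its approximate standard error. The numbers in bhat and covb are hypothetical stand-ins for output from a fitted model, not values from any example in these notes.

proc iml;
/* hypothetical coefficient estimates (b0,b1,b2) and their estimated
   covariance matrix from a fitted model; illustration only */
bhat = {2.10, 0.80, -0.35};
covb = { 0.040 -0.002  0.001,
        -0.002  0.010 -0.003,
         0.001 -0.003  0.008};
c   = 5;      /* specified mean value */
x20 = 3;      /* fixed value of the second predictor */
num = c - bhat[1] - bhat[3]*x20;   /* numerator of the ratio */
den = bhat[2];                     /* denominator: b1 */
rhohat = num/den;
/* sigma11 = var(num), sigma22 = var(den), sigma12 = cov(num,den) */
s11 = covb[1,1] + x20*x20*covb[3,3] + 2*x20*covb[1,3];
s22 = covb[2,2];
s12 = -covb[1,2] - x20*covb[2,3];
/* delta-method variance of a ratio a/b */
varrho = (s11 - 2*rhohat*s12 + rhohat*rhohat*s22)/(den*den);
se = sqrt(varrho);
print rhohat se;
quit;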

4.8 References

Kutner et al., Section 4.6 (inverse prediction/calibration).
Greene (p. 61), Mood, Graybill and Boes (p. 181), Casella and Berger (p. 240) for approximations for nonlinear functions.
Fieller's Theorem: Buonaccorsi (1998, 2001).
Graybill and Iyer: Section 6.4 covers inverse prediction and regulation in simple linear regression (although they give a different way to compute them, the results are the same as ours), while Sections 6.6 and 6.7 give two other situations where a ratio of linear combinations of coefficients is of interest.

5 Bootstrapping

Bootstrapping has become a popular way to carry out statistical inferences. The basic idea is to mimic the original sampling method in order to generate the sampling distribution of some estimator or test statistic, and then obtain estimates of bias and standard error, confidence intervals, and tests based on these results. The simulation done to mimic this sampling uses the data to create a population or model from which to sample. There is a huge literature and a plethora of recent books on the subject; see the references. While the bootstrap is relatively simple to describe, there are many subtle and complex issues around the performance of bootstrap-based confidence intervals and tests that are beyond the scope of this text. Our objective here is to describe the basic ideas and illustrate their application in regression contexts. The fundamental concepts are most easily motivated in the context of a single random sample.

5.1 Bootstrapping for random samples (the i.i.d. case)

The univariate, single parameter case. Let $W_1,\dots,W_n$ be a random sample from some distribution $F$. That is, the $W_i$ are independent and each $W_i$ has the same distribution, defined by the CDF $F$. Let $\theta$ be any parameter of interest associated with the population distribution $F$. Examples include the mean, the standard deviation, the median, some percentile, etc. The parameter $\theta$ will be estimated by $\hat\theta = g(\mathbf{W}) = g(W_1,\dots,W_n)$, where $\mathbf{W}$ contains $W_1,\dots,W_n$. (Somewhat confusingly, but as is standard, we use $\hat\theta$ to denote either the estimator $g(\mathbf{W})$, which is a random variable, or the estimate $g(\mathbf{w})$, which is the actual number observed. It should be clear from the context how it is being used.)

The population CDF is $F(w) = P(W \le w)$, while the empirical CDF $\hat F$ is defined by $\hat F(w) = (\#\text{ of } w_i \le w)/n$. Often the parameter $\theta$ can be viewed as some function of $F$, say $t(F)$. The plug-in estimator of $\theta$ is $t(\hat F)$. That is, if $\hat\theta$ is the plug-in estimator of $\theta$, it is calculated in the same way $\theta$ would be determined from $F$, but using $\hat F$ as if it were $F$. For example, $\bar W = \sum_i W_i/n$ is the plug-in estimator of the population mean $\mu$, while the plug-in estimator of the population standard deviation $\sigma$ is

$\hat\sigma = \left[\sum_i (W_i - \bar W)^2/n\right]^{1/2}$.   (6)

Note the slight difference from the usual sample standard deviation $s = [\sum_i (W_i - \bar W)^2/(n-1)]^{1/2}$ because of the division by $n$ rather than $n-1$.

Often we evaluate how good an estimator is via its bias $= E(\hat\theta) - \theta$ and its standard deviation/standard error, denoted $\sigma_{\hat\theta}$. In many problems the bias and standard error can be written explicitly (often as functions of parameters) and can be estimated directly from the data. In addition, exact or approximate confidence intervals and tests can often be obtained using exact distributional results or large sample arguments. For example, with $\theta = \mu$ and $\hat\theta = \bar W$, we know $E(\bar W) = \mu$, so the bias is 0, and the standard error of $\bar W$ is exactly $\sigma_{\bar W} = \sigma/n^{1/2}$, typically estimated via $s/n^{1/2}$. Under normality, exact confidence intervals for $\mu$ are based on the t distribution, while for large sample sizes approximate confidence intervals are based on the normal. In many cases where there are no exact expressions for the bias, standard error or sampling distribution, some asymptotic or approximate result is used. For example, we previously used a Taylor series based method to approximate the bias and standard error of nonlinear functions; see Section 4.3. Approximate confidence intervals are often found by assuming that $(\hat\theta - \theta)/\hat\sigma_{\hat\theta}$ is approximately standard normal. The validity of the approximations for the bias and standard error, and of the normality assumption for the estimators, is always in question. Can we find a way to estimate the bias and standard error of $\hat\theta$, and more generally the sampling distribution of $\hat\theta$, that does not depend on assuming $F$ is of a particular type (normal, exponential, etc.) and does not require an analytical expression for the bias or standard error?

The bootstrap estimate of standard error. The standard error of $\hat\theta$, $\sigma_{\hat\theta}$, is itself some function of $F$, say $\sigma_{\hat\theta} = h(F)$. By definition, the bootstrap estimate of the standard error is $\hat\sigma_{\hat\theta} = h(\hat F)$. If we know explicitly what the function $h(F)$ is, then we can calculate the bootstrap estimate of standard error directly. For example, with $\theta = \mu$ and $\hat\theta = \bar W$, we noted above that $\sigma_{\bar W} = \sigma/n^{1/2}$. This can be viewed as $h(F)$, since it is a function of $F$ via $\sigma$, the population standard deviation associated with $F$. So the bootstrap estimate of $\sigma_{\bar W}$ is $\hat\sigma/n^{1/2}$, where $\hat\sigma$ is given in (6).
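As a quick illustration of the direct calculation, here is a minimal PROC IML sketch comparing the usual $s/n^{1/2}$ with the theoretical bootstrap standard error $\hat\sigma/n^{1/2}$ of (6); the data values are hypothetical.

proc iml;
w = {2, 5, 3, 8, 4, 6, 5, 7};          /* hypothetical sample */
n = nrow(w);
wbar = sum(w)/n;
s      = sqrt( ssq(w-wbar)/(n-1) );    /* usual sample standard deviation */
sighat = sqrt( ssq(w-wbar)/n );        /* plug-in estimate, equation (6) */
se_usual = s/sqrt(n);                  /* usual estimate of se of the mean */
se_boot  = sighat/sqrt(n);             /* bootstrap estimate h(Fhat) */
print se_usual se_boot;
quit;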
Usually we do not have an analytical expression for how $\sigma_{\hat\theta}$ is a function of $F$, in which case the definition of the bootstrap estimate of standard error above is not directly useful. We can, however, calculate the bootstrap estimate of standard error via simulation without knowing what $h$ is. The algorithm for a random sample with observed data $w_1,\dots,w_n$ proceeds as follows. For $b = 1$ to $B$, where $B$ is a large value:

1. Take a random sample of size $n$ WITH REPLACEMENT from the values $w_1,\dots,w_n$. Denote this bootstrap sample by $w^*_{b1},\dots,w^*_{bn}$. Notice that some of the original values in the sample will typically occur more than once in the bootstrap sample.

2. Compute your estimate in the same way as you did for the original data, now using the bootstrap sample: $\hat\theta^*_b = g(w^*_{b1},\dots,w^*_{bn})$.

This leads to $B$ bootstrap estimates $\hat\theta^*_1,\dots,\hat\theta^*_B$. The bootstrap estimate of standard error is calculated as

$\hat\sigma_{B,\hat\theta} = \left[ \sum_{b=1}^B (\hat\theta^*_b - \bar\theta^*)^2 / (B-1) \right]^{1/2}$,

where $\bar\theta^* = \sum_{b=1}^B \hat\theta^*_b / B$. (Technically the bootstrap estimate of standard error is the limit of the above as $B \to \infty$, but it is common to use the term in this way.) The bootstrap estimate of bias is $\bar\theta^* - t(\hat F)$.

REMARKS:

- Note that even if the original estimator $\hat\theta$ is not the plug-in estimator, the bias is calculated using the plug-in estimator $t(\hat F)$. If we don't know $t(\cdot)$, we cannot calculate the bootstrap estimate of bias.
- The mean of the bootstrap values should NOT be taken as the estimate of $\theta$.
- The bootstrap estimate of the distribution of $\hat\theta$ is simply the distribution of the $B$ values $\hat\theta^*_1,\dots,\hat\theta^*_B$. We'll call this the Empirical Bootstrap Distribution and denote its empirical CDF by $\hat G$; note that this is an estimate of the CDF of $\hat\theta$. Typically we will use a histogram, smoothed histogram or stem-and-leaf plot to represent this distribution rather than give it in CDF form.

5.1.1 General multivariate and/or multiparameter case

The univariate single parameter case is easily generalized to a random sample where there are multiple parameters of interest and/or each observation is multivariate. Let $W_1,\dots,W_n$ be i.i.d. with some distribution $F$, where the $W_i$ can now be vector valued. Let $\theta$ be a collection of $q$ parameters of interest, with estimator $\hat\theta$; $q$ could be 1, as when we are interested in a single parameter even if the data are multivariate. We resample with replacement as before from the observed $w_1,\dots,w_n$ and write $\hat\theta^*_b$ for the estimate of $\theta$ from the $b$th bootstrap sample. The bootstrap mean is $\bar\theta^* = \sum_b \hat\theta^*_b/B$ and the bootstrap covariance matrix is

$S_B = \sum_b (\hat\theta^*_b - \bar\theta^*)(\hat\theta^*_b - \bar\theta^*)' / (B-1)$.

$S_B$ is the bootstrap estimate of $\mathrm{Cov}(\hat\theta)$, the covariance matrix of $\hat\theta$. The square roots of the diagonal elements are the bootstrap standard errors of the individual components.
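As a small illustration of the algorithm, here is a minimal PROC IML sketch of the bootstrap standard error and bias for the sample median, using the same resampling and sorting idiom as the full program in Section 7. The data are hypothetical, and n is taken even so the median is the average of the two middle order statistics; in the multivariate case the same loop would simply save a row of estimates per replicate and form $S_B$ from them.

proc iml;
w = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};   /* hypothetical sample, n even */
n = nrow(w);
ws = w;  ws[rank(w)] = w;             /* sorted copy of the data */
thetahat = (ws[n/2] + ws[n/2+1])/2;   /* sample median = plug-in t(Fhat) */
B = 1000;
th = j(B,1,0);                        /* will hold the B bootstrap medians */
wb = w;
do b = 1 to B;
  do i = 1 to n;                      /* resample WITH replacement */
    k = int(uniform(0)*n + 1);
    wb[i] = w[k];
  end;
  wbs = wb;  wbs[rank(wb)] = wb;
  th[b] = (wbs[n/2] + wbs[n/2+1])/2;  /* median of the bootstrap sample */
end;
thbar = sum(th)/B;
se_boot   = sqrt( ssq(th-thbar)/(B-1) );  /* bootstrap standard error */
bias_boot = thbar - thetahat;             /* bootstrap estimate of bias */
print thetahat se_boot bias_boot;
quit;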

5.2 Bootstrap Confidence Intervals

As noted earlier, a common method of obtaining approximate confidence intervals is to assume $(\hat\theta - \theta)/\hat\sigma_{\hat\theta}$ is approximately standard normal, leading to an approximate confidence interval for $\theta$ of the form $\hat\theta \pm z(1-\alpha/2)\hat\sigma_{\hat\theta}$, where $\hat\sigma_{\hat\theta}$ is the estimated standard error of $\hat\theta$. Nonparametric confidence intervals can be found using the bootstrap in a variety of ways; there is plenty of discussion of the different methods in the cited literature. The two most commonly used methods that have emerged in practice are the percentile method and the BC_a method.

The Percentile Method: Consider $\alpha_1$ and $\alpha_2$ with $\alpha_1 + \alpha_2 = \alpha$, where usually $\alpha_1 = \alpha_2 = \alpha/2$. The percentile confidence interval for $\theta$ is $[L, U]$, where $L = \hat G^{-1}(\alpha_1)$ and $U = \hat G^{-1}(1-\alpha_2)$. That is, $100\alpha_1\%$ of the bootstrap values are less than or equal to $L$ and $100(1-\alpha_2)\%$ of the bootstrap values are less than or equal to $U$. Note that if $\hat G$ is normal with mean $\hat\theta$, then the bootstrap percentile interval agrees with $\hat\theta \pm z(1-\alpha/2)\hat\sigma_{B,\hat\theta}$.

Notice that with the percentile method, if $[L, U]$ is the percentile interval for $\theta$, then the percentile interval for any monotone increasing function of $\theta$, say $\phi = g(\theta)$, is simply $[g(L), g(U)]$. This is a nice property that does not hold for the delta method interval, where confidence intervals for nonlinear functions do not simply transform in this way.

It is not transparent why the percentile method works. It can be shown that it works well if there is some transformation $g$ and a constant $c$ for which $(g(\hat\theta) - g(\theta))/c$ follows approximately a standard normal distribution. (It is not necessary to know what $g$ is.) The percentile method, which is very easy to calculate, has been found to work well in many cases, but it can encounter problems. In practice an improved bootstrap method called the BC_a method (bias-corrected accelerated method) is often preferred. The BC_a bootstrap interval is given by $L = \hat G^{-1}(\alpha_1^*)$ and $U = \hat G^{-1}(1-\alpha_2^*)$. This resembles the percentile method but uses adjusted quantities $\alpha_1^*$ and $\alpha_2^*$. These are somewhat involved to calculate and depend on two quantities: a bias correction term and an acceleration term. The first accounts for potential bias (actually median bias) in $\hat\theta$, while the acceleration term addresses the fact that the standard error of $\hat\theta$ may itself be a function of $\theta$. See, for example, Efron and Tibshirani for details (but note there are some notational differences with our treatment).
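Given the B bootstrap values, the percentile interval just reads off two order statistics of the empirical bootstrap distribution. A minimal sketch, using the same rough percentile convention as the program in Section 7; stand-in normal values are generated here (with hypothetical mean and standard deviation) only so the sketch runs on its own, and in practice th would come from a resampling loop like the one in Section 5.1.

proc iml;
call randseed(697);
th = j(1000,1,0);
call randgen(th, "Normal", 10, 2);    /* stand-in for bootstrap estimates */
B = nrow(th);
ths = th;  ths[rank(th)] = th;        /* sorted bootstrap values */
alpha = 0.10;                         /* 90% interval, alpha1=alpha2=.05 */
L = ths[ int(B*(alpha/2)) ];          /* Ghat^{-1}(alpha/2), roughly */
U = ths[ int(B*(1-alpha/2)) ];        /* Ghat^{-1}(1-alpha/2), roughly */
print L U;
quit;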

5.3 Bootstrapping in Regression Models

The methods of using bootstrap estimates for bias, standard error and confidence intervals are the same as in the random sample setting. What changes is the manner in which the bootstrap samples are generated. We continue to assume, as we have done to this point, that the observations on different units are independent.

5.3.1 Random Regressors: bootstrapping (Y, X) together

Suppose all the regressors are random quantities, so when we choose the $i$th unit in the sample we obtain the response and all of the predictors, collected in $W_i = (Y_i, X_{i1}, \dots, X_{i,p-1})'$. A regression model is specified for $E(Y|x)$. A variance model may be specified, possibly with heteroscedasticity depending on the X's, but need not be. If $W_1,\dots,W_n$ arise from a random sample of individual units (i.e., can be treated as i.i.d.), then we can proceed to use the bootstrap as in Section 5.1.1 for any estimator or collection of estimators of interest. This could be the coefficients, variance parameters, correlations, or any functions of these, including nonlinear ones.

We illustrate first with $\beta$, which is estimated by $\hat\beta$, either through least squares or some type of weighted least squares (possibly iterative). In the $b$th bootstrap sample the estimator is $\hat\beta^*_b$, and the bootstrap estimate of the covariance of $\hat\beta$ is $S_B$. The square roots of the diagonal elements of $S_B$ give the bootstrap standard errors for the individual coefficients. For any one of the coefficients, the bootstrap estimates can be used to get the bootstrap estimate of bias and standard error and nonparametric confidence intervals. For estimating a linear combination of interest, say $\theta = c'\beta$, because of the linearity the bootstrap standard error can be calculated via $(c' S_B c)^{1/2}$. To get the empirical distribution and confidence intervals, though, we need to obtain each $\hat\theta^*_b = c'\hat\beta^*_b$. If the interest is in some nonlinear function of $\beta$, say $\theta = g(\beta)$, then one calculates $\hat\theta^*_b = g(\hat\beta^*_b)$ for $b = 1$ to $B$.

REMARKS:

- Since under constant variance we know the exact covariance of the least squares estimator and how to get an unbiased estimate of it, there is no need to use the bootstrap for this purpose. However, the bootstrap is useful for getting nonparametric confidence intervals for the coefficients and functions of them. These provide an alternative to the usual t-based intervals or the delta method intervals for nonlinear functions.
- Since the homogeneity of variance assumption does not have to hold in this case, if we are using least squares the bootstrap estimate $S_B$ provides an alternative to White's robust estimate of the covariance (Section 3.2).

5.3.2 Bootstrapping residuals

Consider fixing the x values and suppose the model is

$Y_i = m(x_i, \beta) + \epsilon_i$   (7)

where the $\epsilon_i$ are assumed to be independent and identically distributed (i.i.d.) with mean 0 from some distribution $F$. (Note that this implies the errors have constant variance $\sigma^2$ and are uncorrelated.) The $i$th residual is $r_i = Y_i - \hat Y_i$. The $b$th bootstrap sample is generated by setting $Y^*_{bi} = m(x_i, \hat\beta) + r^*_{bi}$, $i = 1$ to $n$, where $r^*_{b1},\dots,r^*_{bn}$ are bootstrapped residuals obtained by sampling with replacement from some collection of values that reflect the distribution of the original $\epsilon_i$. Recall that this distribution is assumed to have mean 0 and variance $\sigma^2$.

One possibility is to just sample with replacement from the residuals $r_1,\dots,r_n$. Notice that when we sample with replacement from the residuals, the variance of the resulting value is $\sum_i r_i^2/n = (n-p)\mathrm{MSE}/n$, where MSE is our unbiased estimator of $\sigma^2$. This suggests a modification, namely to sample from the modified residuals $(n/(n-p))^{1/2} r_i$; when we sample from these, the variance equals MSE. This means that $r^*_{bi}$ is generated by sampling one of the residuals, and if the $k$th residual is selected then $r^*_{bi} = (n/(n-p))^{1/2} r_k$. We will always use this modification, although it is clearly not very important as $n$ gets large. For the linear case with an intercept in the model, $\bar r = \sum_i r_i/n = 0$, so sampling from the residuals or modified residuals is sampling from a distribution with mean 0. Without an intercept $\bar r$ is not zero, so we should resample from the centered residuals $r_i - \bar r$.

The $b$th bootstrap sample consists of $(y^*_{b1}, x_1),\dots,(y^*_{bn}, x_n)$, from which we get $\hat\beta^*_b$ and any other estimate of interest. It can be shown that, using the modified residuals, as $B \to \infty$, $S_B$ (the sample covariance among the bootstrap estimates $\hat\beta^*_1,\dots,\hat\beta^*_B$) converges to $\mathrm{MSE}(X'X)^{-1}$. So bootstrapping from the modified residuals does exactly the right thing in terms of estimating the covariance of $\hat\beta$. As noted before, the bootstrap is not needed in order to estimate $\Sigma_{\hat\beta}$ under constant variance, since $\mathrm{MSE}(X'X)^{-1}$ provides an unbiased estimator. The usefulness of the bootstrap here is that it allows us to construct confidence intervals for the coefficients, and for linear combinations of them, that do not depend on normality of $\hat\beta$. We can also do other things, such as carry out inferences for $\sigma^2$ (which under the normality assumption are based on a chi-square distribution with $n-p$ degrees of freedom) or handle inferences for nonlinear functions of the parameters.

5.3.3 Bootstrap Prediction Intervals

Suppose we want to predict $Y_0$ at $x_0$. It is the distribution of $\hat Y_0 - Y_0$ that we need, where $\hat Y_0 = x_0'\hat\beta$. If the CDF of this distribution is denoted $H$, then

$P(H^{-1}(\alpha/2) \le \hat Y_0 - Y_0 \le H^{-1}(1-\alpha/2)) = 1-\alpha$,

implying

$P(\hat Y_0 - H^{-1}(1-\alpha/2) \le Y_0 \le \hat Y_0 - H^{-1}(\alpha/2)) = 1-\alpha$.

This means that $[\hat Y_0 - H^{-1}(1-\alpha/2),\ \hat Y_0 - H^{-1}(\alpha/2)]$ is a $100(1-\alpha)\%$ prediction interval for $Y_0$. Since we don't know $H$, we estimate it via the bootstrap as follows. For $b = 1$ to $B$:

i) generate $\hat\beta^*_b$ as in Section 5.3.2 and construct $\hat Y^*_{0b} = x_0'\hat\beta^*_b$;

ii) generate $Y^*_{0b} = \hat Y_0 + r^*_{0b}$, where $r^*_{0b}$ is a newly generated bootstrap residual;

iii) construct $D^*_b = \hat Y^*_{0b} - Y^*_{0b}$.

Consider the empirical distribution of $D^*_1,\dots,D^*_B$ and denote its percentiles by $\hat H^{-1}(\alpha/2)$ (the value with $\alpha/2$ of the distribution to the left) and $\hat H^{-1}(1-\alpha/2)$. The prediction interval for $Y_0$ (based on the percentile method) is $(\hat Y_0 - \hat H^{-1}(1-\alpha/2),\ \hat Y_0 - \hat H^{-1}(\alpha/2))$.
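A minimal PROC IML sketch of steps i)-iii) for a straight-line fit, with modified residuals as in Section 5.3.2. The data, x0 and B are hypothetical, and the rough percentile convention of the Section 7 program is used.

proc iml;
/* hypothetical data for a simple linear regression */
x = {1, 2, 3, 4, 5, 6, 7, 8};
y = {2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8};
n    = nrow(x);
xmat = j(n,1,1) || x;                    /* design matrix with intercept */
p    = ncol(xmat);
bhat = inv(t(xmat)*xmat) * t(xmat)*y;
r    = y - xmat*bhat;
rmod = sqrt(n/(n-p)) * r;                /* modified residuals */
x0   = {1, 5.5};                         /* hypothetical x0 (with intercept) */
yhat0 = t(x0)*bhat;
B = 1000;
D = j(B,1,0);
yb = y;
do b = 1 to B;
  do i = 1 to n;                         /* i) residual-bootstrap responses */
    k = int(uniform(0)*n + 1);
    yb[i] = xmat[i,]*bhat + rmod[k];
  end;
  bb     = inv(t(xmat)*xmat) * t(xmat)*yb;
  yhat0b = t(x0)*bb;
  k    = int(uniform(0)*n + 1);          /* ii) one new bootstrap residual */
  y0b  = yhat0 + rmod[k];
  D[b] = yhat0b - y0b;                   /* iii) D_b */
end;
Ds = D;  Ds[rank(D)] = D;                /* sorted D's */
alpha = 0.10;                            /* rough percentiles, as in Sec. 7 */
lo = yhat0 - Ds[int(B*(1-alpha/2))];
hi = yhat0 - Ds[int(B*(alpha/2))];
print yhat0 lo hi;
quit;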

5.3.4 Bootstrapping the residuals for the heteroscedastic case

Consider the variance model $V(\epsilon_i) = \sigma^2 a_i^2$, where $a_i$ is known. Equivalently, we can view $\epsilon_i$ as $a_i \delta_i$, where the $\delta_i$ are assumed to be independent and identically distributed from some distribution having mean 0 and variance $\sigma^2$. So $\delta_i = (w_i)^{1/2}\epsilon_i$, where $w_i = 1/a_i^2$ is the weight. We can estimate the distribution of the $\delta_i$ by using the modified weighted residuals $\hat\delta_i = (n/(n-p))^{1/2}(w_i)^{1/2}(Y_i - m(x_i,\hat\beta))$. The weighted residuals do not necessarily add to zero, however, since there isn't an intercept in the transformed model. With $r_i = (w_i)^{1/2}(Y_i - m(x_i,\hat\beta))$, in general it is better to resample from $\hat\delta_i = (n/(n-p))^{1/2}(r_i - \bar r)$ rather than from $\hat\delta_i = (n/(n-p))^{1/2} r_i$. All this really does is change the intercept used in the bootstrapping. In practice the mean of these weighted residuals is often small, so it doesn't make much of a difference.

The bootstrap sample is generated by $Y^*_{bi} = m(x_i,\hat\beta) + a_i d^*_{bi}$, where the $d^*_{bi}$ are sampled with replacement from $\hat\delta_1,\dots,\hat\delta_n$. Notice that the $a_i$ is fixed to the $i$th position in the sample, while $d^*_{bi}$ comes from selecting one of the weighted residuals. The $\hat\beta$ could come from either ordinary or weighted least squares, depending on which is being used. If there is a parametric model for the variance, $a_i^2 = v_i(\beta,\lambda)$, then $\lambda$ needs to be estimated in order to do the bootstrapping. This leads to the use of $\hat\delta_i = (n/(n-p))^{1/2}(\hat w_i)^{1/2}(Y_i - m(x_i,\hat\beta))$, where $\hat w_i = 1/v_i(\hat\beta,\hat\lambda)$, and $Y^*_{bi} = m(x_i,\hat\beta) + (v_i(\hat\beta,\hat\lambda))^{1/2} d^*_{bi}$. If the $\hat\beta$ being used in the original analysis is two-stage or iteratively reweighted least squares, then within each bootstrap sample you would carry out this same procedure; that is, do two-stage or iteratively reweighted least squares on each bootstrap sample.

5.3.5 Bootstrapping with replication

Suppose there are $k$ distinct collections of regressors, say $x_1,\dots,x_k$, with $n_j$ observations at $x_j$, so $\sum_j n_j = n$. In this case, with the x's treated as fixed, we can still resample in a way which allows for changing variances. One option is to use the fitted model (which means we believe we have the right function for the mean); then, when we generate an observation at $x_j$, we resample from just the residuals at $x_j$. If the residuals at $x_j$ are denoted $r_{j1},\dots,r_{jn_j}$, then we would create modified residuals

$\tilde r_{jm} = [n_j/(n_j-1)]^{1/2}(r_{jm} - \bar r_j)$,

where $\bar r_j$ is the mean of $r_{j1},\dots,r_{jn_j}$. The resampling is from these modified residuals. Another option is to generate the $n_j$ responses at $x_j$ by resampling $n_j$ times with replacement from the original $n_j$ responses at $x_j$. If there are only a few replicates at each distinct x, this may not work very well.

5.4 Bootstrap Hypothesis Testing

Hypothesis testing is a popular way of carrying out inferences, and there are a couple of ways to carry out bootstrap tests of hypotheses.

Method 1: Try to generate the null distribution. One way is to emulate carrying out a test based on some test statistic. Suppose the test is based on a test statistic $Q$ and rejects $H_0$ if $Q$ is large (almost all tests can be put into this form). The bootstrap approach simulates the null distribution of the test statistic $Q$. With this approach, the bootstrap samples must be generated under the null model (the model incorporating the null hypothesis). Then for each bootstrap sample the test statistic is calculated, and the empirical distribution of these test statistics over the $B$ bootstrap samples is an estimate of the null distribution. Suppose the test statistic has observed value $Q_{obs}$ and the bootstrap values are $Q^*_1,\dots,Q^*_B$ (so $Q^*_b$ is the value of the test statistic from the $b$th bootstrap sample). The bootstrap P-value of the test is

$P_{boot} = (\text{number of times } Q^*_b \ge Q_{obs})/B$.

The null hypothesis is rejected if $P_{boot}$ is less than $\alpha$, the desired level of the test (e.g., .05 for a test at the 5% level). This method of bootstrapping mimics our usual approach to testing based on the null distribution and is a popular way to approach testing. In more complicated situations it can be difficult to figure out how to properly resample under the null. Even when you can do so, this method can have problems. Oftentimes there are parameters involved in the model which must be estimated, and the distribution of the test statistic under the null hypothesis may depend on these parameters; that is, the test statistic is not what is known as a pivotal quantity. For regression problems, in the case of uncorrelated errors and constant variance, where the bootstrap is being used to protect against non-normal errors, this approach does fairly well. More serious problems can arise in more complicated models, including those with heteroscedasticity or correlated errors. This issue has not received the attention it deserves in practice, and further work is needed on the magnitude of the problem.
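A minimal PROC IML sketch of Method 1 for testing $H_0: \beta_1 = 0$ in a straight-line model, taking $Q$ to be the squared t statistic for the slope and resampling modified residuals from the fitted null (mean-only) model; the data are hypothetical.

proc iml;
x = {1, 2, 3, 4, 5, 6, 7, 8};           /* hypothetical data */
y = {2.3, 2.9, 3.1, 2.2, 3.8, 3.0, 4.1, 3.6};
n = nrow(x);
xmat = j(n,1,1) || x;
p = ncol(xmat);
xpxi = inv(t(xmat)*xmat);
bhat = xpxi*(t(xmat)*y);
mse  = ssq(y - xmat*bhat)/(n-p);
Qobs = bhat[2]*bhat[2]/(mse*xpxi[2,2]);  /* squared t statistic for slope */
ybar = sum(y)/n;
r0 = sqrt(n/(n-1))*(y - ybar);           /* modified null-model residuals */
B = 1000;
count = 0;
yb = y;
do b = 1 to B;
  do i = 1 to n;                         /* generate data under H0 */
    k = int(uniform(0)*n + 1);
    yb[i] = ybar + r0[k];
  end;
  bb   = xpxi*(t(xmat)*yb);
  mseb = ssq(yb - xmat*bb)/(n-p);
  Qb   = bb[2]*bb[2]/(mseb*xpxi[2,2]);
  if Qb >= Qobs then count = count + 1;
end;
pboot = count/B;                         /* bootstrap P-value */
print Qobs pboot;
quit;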

Method 2: Invert a confidence interval. For a single parameter, hypothesis testing can also be carried out using the bootstrap confidence interval. We illustrate for the two-sided test of $H_0: \theta = \theta_0$ versus $H_A: \theta \ne \theta_0$. An approximate test of size $\alpha$ is to reject $H_0$ if the $100(1-\alpha)\%$ confidence interval for $\theta$ does not contain $\theta_0$. A test at level .05 would use a 95% confidence interval, a level .10 test a 90% confidence interval, etc. A P-value for the test can be obtained by finding the smallest level at which the null hypothesis is rejected. Equivalently, this means finding the largest value $C$ for which a $100C\%$ confidence interval contains $\theta_0$ and then taking the P-value to be $1 - C$. For one-sided tests, a similar approach can be taken using one-sided confidence intervals. Because of the potential problems mentioned earlier with obtaining a P-value by directly bootstrapping the test statistic under the null, we recommend carrying out the test via confidence intervals when possible.

5.4.1 Summary

Briefly, the main points in this section are:

- Bootstrapping the residuals requires that we have a model for the mean and the variance, except when we have replication.
- If we have a random sample of units, then you can either bootstrap the $(y_i, x_i)$ sets, or bootstrap the residuals if you believe you have the right model for $Y|x$. Bootstrapping the $(y_i, x_i)$ sets does not require a model for the variance, and it allows heteroscedasticity.
- If the x's are fixed, then generally you should bootstrap the residuals, although there are some cases where bootstrapping the sets works okay.
- If you have random sampling of units that is not a single random sample (i.e., it comes from something like a stratified or multi-stage sampling design), then you can bootstrap the residuals or bootstrap the (y, x) sets in a way which reflects the original sampling scheme.

5.5 References

Kutner et al. (Section 11.5).
Efron and Tibshirani (1993). A comprehensive look but not too technical; Chapter 9 handles regression in detail.
Diaconis and Efron (1983), Efron and Tibshirani (1986), Leger et al. (1992, p. 378), Davison and Hinkley (1997), Manly (2000).

7 Univariate bootstrap - food expenditures

This example illustrates the use of the bootstrap in a single sample, where we use the sample mean, variance, standard deviation, coefficient of variation (standard deviation/mean), median and 90th percentile for illustration. The data consist of food expenditures in dollars for 40 randomly sampled households in Ohio. (Data from Exploring Statistics by Kitchens, originally from the Bureau of Labor Statistics.) B = 1000 bootstrap samples were used.

The original data have sample mean $\bar x$ = , standard deviation s = , median 3165 and 90th percentile . The sample mean is known to be an unbiased estimator of the population mean $\mu$, with exact standard error $\sigma/n^{1/2}$, usually estimated by $s/n^{1/2}$ (Std Error Mean in the output). An approximate confidence interval for $\mu$, based on the approximate normality of the sample mean, is found using $\bar x \pm t(1-\alpha/2, n-1)s/n^{1/2}$. (We usually use the t rather than a normal value, even when we think the population is not normal but the sample size is large, in order to be conservative. There is little difference between the t and z values for even moderate n, though.) The 95% interval here is ( , ).

The mean of the 1000 bootstrap sample means is . As we know the sample mean is unbiased, there is no need to estimate bias. If we do, the bootstrap estimate of bias is . This is small relative to the estimate, and the only reason it is not 0 is the use of a finite number (1000) of bootstrap samples; as you increase B it will go to 0. The bootstrap estimate of the standard error of $\bar x$ from the 1000 samples (the standard deviation of the 1000 sample means) is . The theoretical bootstrap estimate of the standard error, what it converges to as $B \to \infty$, is $\hat\sigma/n^{1/2}$ (see the notes for $\hat\sigma$), which can be computed directly and differs only modestly from $s/n^{1/2}$. The bootstrap is not needed for assessing bias or getting the standard error of $\bar x$; it is done here for illustration and to show agreement with the usual results. For a nonparametric confidence interval, the 90% percentile interval uses the 5th and 95th percentiles of the bootstrap distribution as endpoints. This yields (3256.8, 4034.2). This is not that different from the earlier interval, which is not surprising given the normality of $\bar x$ as demonstrated by the empirical bootstrap distribution of the sample mean.

There are analytical procedures (often approximate) for treating the variance, standard deviation or coefficient of variation. Note that the sample variance is unbiased for $\sigma^2$, but s is biased for $\sigma$. The bootstrap estimate of the bias in s as an estimator of $\sigma$ is . (Note that the plug-in estimator here is $\hat\sigma = (39/40)^{1/2}s$.) For the variance, if we assume the population is normal, then an exact confidence interval is available based on the chi-square with $n-1$ degrees of freedom, and taking square roots gives a confidence interval for $\sigma$. The 90% intervals are given by proc univariate under the normality assumption and are ( , ) for the variance and (1276, 1859) for the standard deviation. Without the normality assumption, there are large sample results that can be used, based on the approximate normality of the sample variance or sample standard deviation; these involve an analytical expression for the asymptotic standard error. Note that if you use the approximate normality of the sample variance and then separately work with the standard deviation, the interval for the standard deviation is not just the square root of the interval for the variance. The bootstrap percentile intervals, though, transform directly.
The 90% bootstrap percentile confidence interval for $\sigma$ is ( , ). The sample coefficient of variation is $s/\bar x = .418$. As an estimator, $s/\bar x$ is a ratio, and we cannot get an exact expression for its expected value and hence for its bias (although we can approximate it using our earlier methods). The bootstrap estimate of bias is , indicating bias is not a serious issue. (The plug-in estimator of the CV here is $\hat\sigma/\bar x = (39/40)^{1/2}s/\bar x$.) It is possible, using a multivariate central limit theorem and the delta method, to determine that for large sample sizes the sample coefficient of variation is approximately normal, with mean equal to the population coefficient of variation and a standard deviation that depends on a number of parameters that must be estimated. This is one way to approach the problem, but it relies on approximations and on estimation of the unknowns in the approximate standard error. Using the bootstrap, the 90% bootstrap percentile confidence interval for the population CV is (.32, .48).

Notice from the empirical distributions that the sampling distributions are approximately normal (as large sample theory tells us they will be). In these cases the intervals that come out of the bootstrap percentile method will be close to what is obtained using a normal approximation, i.e., intervals of the form estimate $\pm z(1-\alpha/2)$SE, where SE is the bootstrap standard error. For the standard deviation, which has the least normal looking of the sampling distributions, the resulting interval is (1092.3, ), which is not too different from the percentile method.

Proc univariate gives approximate confidence intervals for population percentiles, both under normality assumptions and without normality. The distribution-free intervals are based on the order statistics; these are described in the SAS online documentation. The median can be addressed in a similar manner. The bootstrap estimate of bias ( ) is relatively small. The 90% bootstrap percentile interval for the median is (2837.5, 3679).

The 90th percentile shows some difficulty with employing the bootstrap. Only certain values can end up as the 90th percentile of a bootstrap sample, leading to a very discrete distribution. While this is not particularly problematic in estimating the bias or the standard deviation (though it could be in small sample sizes), it poses problems with the confidence intervals. Notice that the 95th and 99th percentiles of the empirical distribution are the same. A 90% percentile interval is (4367, 7580), while a 98% interval is (3970, 7580), which is a bit unsatisfactory with the upper point staying the same. One way to deal with this problem, and a general strategy that can be employed in bootstrapping, is to smooth the data before resampling, so the resampling is from a continuous distribution rather than from a set of points.

[SAS output: PROC UNIVARIATE results for the original data (variable expend): moments; basic statistical measures; 90% basic confidence limits assuming normality for the mean, standard deviation and variance; tests for location (Student's t, sign, signed rank).]

[SAS output: PROC UNIVARIATE quantile estimates (Definition 5) for expend, with 90% confidence limits assuming normality and distribution-free limits based on the order statistics, followed by a stem-and-leaf plot and box plot of the data.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap sample means (variable SMEAN): moments, quantiles, histogram and box plot. The corresponding output for the variance was eliminated.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap standard deviations (variable SD): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap coefficients of variation (variable CV): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap medians (variable MEDIAN): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap 90th percentiles (variable P90): moments, quantiles, histogram and box plot; the discreteness discussed in the text is visible in the histogram.]

title 'Bootstrap with a single sample';
options pagesize=60 linesize=80;
/* THIS IS A PROGRAM TO BOOTSTRAP WITH A SINGLE RANDOM SAMPLE USING THE
   MEAN, VARIANCE, STANDARD DEVIATION, COEFFICIENT OF VARIATION, MEDIAN,
   90TH PERCENTILE */
filename bb 'boot.out';
/* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE values */
data values;
  infile 'food.dat';
  input expend;
title 'descriptive statistics on original sample';
proc univariate cibasic cipctlnormal cipctldf plot alpha=.10;
/* START INTO IML WHERE BOOTSTRAPPING WILL BE DONE */
proc iml;
  /* put data into vector x */
  use values;
  read all var{expend} into x;
  close values;
  n=nrow(x);       /* = sample size */
  xb=x;            /* initializes xb to be the same size as x */
  nboot = 1000;    /* specify number of bootstrap replicates */
  do j=1 to nboot;
    /* get the n samples with replacement. i indexes the sampling within
       bootstrap replicate j. The generated k is a discrete uniform over
       1 to n; the function int takes the integer part */
    do i= 1 to n;
      uv=uniform(0);
      k=int(uv*n+1);
      xb[i]=x[k];
    end;
    /* xb contains the n values in the bootstrap sample */
    /* compute statistics of interest; sum and ssq are matrix functions
       that do the sum and sum of squares */
    smean = sum(xb)/n;                       /* sample mean */
    svar=(ssq(xb) - (n*(smean**2)))/(n-1);   /* sample variance */
    sd = sqrt(svar);                         /* sample s.dev. */
    cv = sd/smean;                           /* coefficient of variation */

    /* compute median and 90th percentile */
    b=xb;               /* initializes b */
    xb[rank(xb)] = b;   /* xb now has the ranked (sorted) values */
    c1=int(n/2);
    c2=c1+1;
    median=(xb[c1]+xb[c2])/2;                  /* use if n is even */
    diff = c1 - (n/2);
    if diff < 0 then median = xb[c2];          /* if n is odd */
    d = int(.9*n);
    p90 = xb[d];    /* rough 90th percentile. Can be refined */
    /* the next two commands put the results to file bb, which is aliased
       with the external file boot.out through the filename statement at
       the beginning. The +1 in the put statement says to skip one space. */
    file bb;
    put smean +1 svar +1 sd +1 cv +1 median +1 p90;
  end;
quit;
/* Get descriptive statistics via proc univariate */
data new;
  infile 'boot.out';
  input smean svar sd cv median p90;
proc univariate plot;

8 Bootstrap Regression Sets - Esterase Assay

Here we demonstrate the bootstrap where it is assumed that the n units in the study are a random sample (independent and identically distributed), so we can resample the (Y, x) sets. We demonstrate using the Esterase Assay data in order to compare with our earlier results. This assumes there is a sample of 106 individuals, and for each individual we get the true esterase concentration via some exact method and at the same time get a binding count from running the radioimmunoassay. This would not be the right way to proceed if this were a designed experiment using standards with known concentrations; that would involve bootstrapping residuals. Here we use the bootstrap to get an estimated covariance matrix for the least squares estimates of the coefficients and to get confidence intervals. Using least squares does not require that we model the variance, but the usual estimated covariance is known to be wrong; analytically, one option was to use White's robust estimator. See Example 4.1 for the least squares results and the robust estimate of covariance (labeled "consistent covariance of estimates").

The least squares estimates are unbiased (assuming the linear model is right), so we don't need the bootstrap to assess bias. The bootstrap estimates of standard error are 21.1 for the intercept and 1.33 for the slope; these are the square roots of the diagonal elements of the bootstrap estimate of $\Sigma_{\hat\beta}$, which is labeled "Covariance Matrix" below. Notice the similarity between this and White's robust estimator. The empirical bootstrap distributions demonstrate the approximate normality of the estimators, and intervals based on the normal approximation using the bootstrap standard errors should be reasonable. The 90% percentile intervals are (-53.1, 15.6) for $\beta_0$ and (15, 19.3) for $\beta_1$.

[SAS output: least squares estimates (BHAT) and MSE; bootstrap covariance matrix of the coefficient estimates; PROC UNIVARIATE summary of the bootstrap intercepts (variable B0): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summary of the bootstrap slopes (variable B1): moments, quantiles, histogram and box plot.]

title 'Esterase Hormone data subset - bootstrap';
options pagesize=60 linesize=80;
/* THIS IS A PROGRAM TO BOOTSTRAP IN REGRESSION WITH RESAMPLING OF THE
   X AND Y'S TOGETHER. NEXT LINE SETS UP CORRESPONDENCE BETWEEN FILE bb
   INSIDE SAS AND THE EXTERNAL FILE bhat.out */
filename bb 'bhat.out';
/* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE a */
data a;
  infile 'ester.dat';
  input ester count;
  con=1.0;
proc iml;
  /* put data into y vector and x matrix */
  use a;
  read all var {count} into y;
  read all var {con ester} into x;
  close a;
  /* need to do the next two lines just to define the vector yb and
     matrix xb that will be used in the bootstrap */
  yb=y;
  xb=x;
  /* get the usual least squares estimators and mean squared error using
     matrix forms. bhat is beta(hat) and mse is the mean square error;
     t(x) stands for the transpose of the matrix x; r is the vector of
     residuals and ssq is a function which gets the sum of squares of
     the vector in its argument */
  xpxinv=inv(t(x)*x);
  bhat=xpxinv*(t(x)*y);
  yhat=x*bhat;
  r=y-yhat;
  sse=ssq(r);
  df=nrow(x)-ncol(x);
  mse=sse/df;
  n=nrow(x);
  print bhat mse;
  /* j indexes the number of bootstrap replicates */
  do j=1 to 500;

    /* get the n samples with replacement. i indexes the sampling within
       bootstrap replicate j; the generated k is a discrete uniform over
       1 to n; the function int takes the integer part */
    do i= 1 to n;
      uv=uniform(0);
      k=int(uv*n+1);
      yb[i]=y[k];
      xb[i,1]=x[k,1];
      xb[i,2]=x[k,2];
    end;
    /* now do least squares with xb the new x matrix and yb the new
       response vector; bhatb and mseb hold the results (the b at the
       end stands for bootstrap) */
    xpxinvb=inv(t(xb)*xb);
    bhatb=xpxinvb*(t(xb)*yb);
    yhatb=xb*bhatb;
    rb=yb-yhatb;
    sseb=ssq(rb);
    mseb=sseb/df;
    b0=bhatb[1];
    b1=bhatb[2];
    /* the next two commands put the results to file bb, which is aliased
       with the external file bhat.out through the filename statement at
       the beginning. The +1 in the put statement says to skip one space;
       if you don't do this, there are no blanks between variables. */
    file bb;
    put b0 +1 b1 +1;
  end;
quit;
/* Now get descriptive statistics through proc corr and proc univariate.
   proc corr is run with the cov option so we can get the estimated
   covariance of beta(hat); this is the sample covariance of the bhatb's
   over the bootstrap samples. */
data new;
  infile 'bhat.out';
  input b0 b1;
proc corr cov;
proc univariate plot;

9 Bootstrap Regression: Residuals - Esterase Assay/weighted

Here we demonstrate the bootstrap by resampling the residuals, allowing fixed weights to be specified for weighted least squares. We demonstrate using the Esterase Assay data, where it is assumed that $V(\epsilon_i) = x_i^2\sigma^2$. We fit this model and got estimated standard errors in Example 4.2. There is no need for the bootstrap for those purposes, but the bootstrap is useful for assessing the distribution and getting confidence intervals. In addition to working with the coefficients, we also estimate the mean value at x = 20, called M20 in the output.

NOTE: In the SAS code, the ith component of ynew is $Y_i^* = w_i^{1/2} Y_i$ and the ith row of xnew is $w_i^{1/2} x_i'$. The ith component of yhatw is $w_i^{1/2} x_i'\hat\beta$, and the ith component of rw is $Y_i^* - w_i^{1/2} x_i'\hat\beta = w_i^{1/2}(Y_i - x_i'\hat\beta)$. If we multiply this by $(n/(n-p))^{1/2}$, this is what is called $\hat\delta_i$ in Section 5.3.4 of the notes.

The estimated coefficients and MSE agree with the weighted analysis in Example 4.2. The estimated covariance matrix of the coefficients (and associated standard errors) differs modestly from the covariance matrix and standard errors in Example 4.2, in part due to the use of B = 1000. The estimators are all approximately normal, and the normal-based confidence intervals will be very similar to the percentile intervals.

[SAS output: weighted least squares estimates (BHATW) and MSE; bootstrap covariance matrix of the coefficient estimates (DF = 999); PROC UNIVARIATE summary of the bootstrap intercepts (variable B0): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summaries of the bootstrap slopes (variable B1) and of the bootstrap estimated means at x = 20 (variable M20): moments, quantiles, histograms and box plots.]

title 'Esterase Hormone data - bootstrap';
options pagesize=60 linesize=80;
/* THIS IS A PROGRAM TO BOOTSTRAP IN REGRESSION WITH BOOTSTRAPPING ON THE
   RESIDUALS.
   - VARIANCE IS ASSUMED OF THE FORM SIGMA^2*a_i^2.
   - USES WEIGHTED LEAST SQUARES.
   - IF EQUAL VARIANCE, SET a = 1 AND LEAVE THE REST THE SAME. */
filename bb 'bhat2.out';
/* DOING ESTERASE ASSAY EXAMPLE WITH VARIANCE PROPORTIONAL TO X SQUARED. */
data a;
  infile 'ester.dat';
  input ester count;
  con=1.0;
  a2 = ester**2;
  wt=1/a2;
  ystar=count*sqrt(wt);
  x1star=1*sqrt(wt);
  x2star=ester*sqrt(wt);
proc iml;
  /* put transformed data into the ynew vector and xnew matrix */
  use a;
  read all var {ystar} into ynew;
  read all var {x1star x2star} into xnew;
  close a;
  yb=ynew;   /* initialize */
  xb=xnew;
  /* get the weighted least squares estimators and MSE */
  xpxinv=inv(t(xnew)*xnew);
  bhatw=xpxinv*(t(xnew)*ynew);
  yhatw=xnew*bhatw;
  rw=ynew-yhatw;
  sse=ssq(rw);
  df=nrow(xnew)-ncol(xnew);
  mse=sse/df;
  n=nrow(xnew);
  print bhatw mse;
  nboot = 1000;
  do j=1 to nboot;
    /* Resample from the residuals and add to the fitted value; note that
       the weighting is already built in and the residual is modified. */
    do i= 1 to n;
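      /* The listing in these notes is cut off at this point. A sketch of
         how the program presumably continues, following Section 5.3.4;
         the recentering and the M20 calculation below are assumptions
         filled in to match the output shown earlier, not the original
         code. */
      uv=uniform(0);
      k=int(uv*n+1);
      /* modified (and recentered) weighted residual, Section 5.3.4 */
      yb[i] = yhatw[i] + sqrt(n/df)*(rw[k] - sum(rw)/n);
    end;
    /* weighted least squares on the bootstrap responses */
    bhatb = xpxinv*(t(xnew)*yb);
    b0 = bhatb[1];
    b1 = bhatb[2];
    m20 = b0 + b1*20;   /* estimated mean at x = 20 (M20 in the output) */
    file bb;
    put b0 +1 b1 +1 m20;
  end;
quit;
data new;
  infile 'bhat2.out';
  input b0 b1 m20;
proc corr cov;
proc univariate plot;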


COPYRIGHTED MATERIAL CONTENTS

COPYRIGHTED MATERIAL CONTENTS PREFACE ACKNOWLEDGMENTS LIST OF TABLES xi xv xvii 1 INTRODUCTION 1 1.1 Historical Background 1 1.2 Definition and Relationship to the Delta Method and Other Resampling Methods 3 1.2.1 Jackknife 6 1.2.2

More information

4.5 The smoothed bootstrap

4.5 The smoothed bootstrap 4.5. THE SMOOTHED BOOTSTRAP 47 F X i X Figure 4.1: Smoothing the empirical distribution function. 4.5 The smoothed bootstrap In the simple nonparametric bootstrap we have assumed that the empirical distribution

More information

Lecture 12. August 23, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

Lecture 12. August 23, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University. Lecture 12 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University August 23, 2007 1 2 3 4 5 1 2 Introduce the bootstrap 3 the bootstrap algorithm 4 Example

More information

Chapter 2: The Normal Distributions

Chapter 2: The Normal Distributions Chapter 2: The Normal Distributions Measures of Relative Standing & Density Curves Z-scores (Measures of Relative Standing) Suppose there is one spot left in the University of Michigan class of 2014 and

More information

Cluster Randomization Create Cluster Means Dataset

Cluster Randomization Create Cluster Means Dataset Chapter 270 Cluster Randomization Create Cluster Means Dataset Introduction A cluster randomization trial occurs when whole groups or clusters of individuals are treated together. Examples of such clusters

More information

Subset Selection in Multiple Regression

Subset Selection in Multiple Regression Chapter 307 Subset Selection in Multiple Regression Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that

More information

Fathom Dynamic Data TM Version 2 Specifications

Fathom Dynamic Data TM Version 2 Specifications Data Sources Fathom Dynamic Data TM Version 2 Specifications Use data from one of the many sample documents that come with Fathom. Enter your own data by typing into a case table. Paste data from other

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup For our analysis goals we would like to do: Y X N (X, 2 I) and then interpret the coefficients

More information

Chapter 6: Linear Model Selection and Regularization

Chapter 6: Linear Model Selection and Regularization Chapter 6: Linear Model Selection and Regularization As p (the number of predictors) comes close to or exceeds n (the sample size) standard linear regression is faced with problems. The variance of the

More information

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 1 2 Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 2. How to construct (in your head!) and interpret confidence intervals.

More information

Chapters 5-6: Statistical Inference Methods

Chapters 5-6: Statistical Inference Methods Chapters 5-6: Statistical Inference Methods Chapter 5: Estimation (of population parameters) Ex. Based on GSS data, we re 95% confident that the population mean of the variable LONELY (no. of days in past

More information

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem. STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,

More information

Week 10: Heteroskedasticity II

Week 10: Heteroskedasticity II Week 10: Heteroskedasticity II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Dealing with heteroskedasticy

More information

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order. Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good

More information

Multivariate Analysis Multivariate Calibration part 2

Multivariate Analysis Multivariate Calibration part 2 Multivariate Analysis Multivariate Calibration part 2 Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Linear Latent Variables An essential concept in multivariate data

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

IQR = number. summary: largest. = 2. Upper half: Q3 =

IQR = number. summary: largest. = 2. Upper half: Q3 = Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number

More information

Statistical Matching using Fractional Imputation

Statistical Matching using Fractional Imputation Statistical Matching using Fractional Imputation Jae-Kwang Kim 1 Iowa State University 1 Joint work with Emily Berg and Taesung Park 1 Introduction 2 Classical Approaches 3 Proposed method 4 Application:

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1

More information

1 Methods for Posterior Simulation

1 Methods for Posterior Simulation 1 Methods for Posterior Simulation Let p(θ y) be the posterior. simulation. Koop presents four methods for (posterior) 1. Monte Carlo integration: draw from p(θ y). 2. Gibbs sampler: sequentially drawing

More information

Splines and penalized regression

Splines and penalized regression Splines and penalized regression November 23 Introduction We are discussing ways to estimate the regression function f, where E(y x) = f(x) One approach is of course to assume that f has a certain shape,

More information

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K. GAMs semi-parametric GLMs Simon Wood Mathematical Sciences, University of Bath, U.K. Generalized linear models, GLM 1. A GLM models a univariate response, y i as g{e(y i )} = X i β where y i Exponential

More information

Macros and ODS. SAS Programming November 6, / 89

Macros and ODS. SAS Programming November 6, / 89 Macros and ODS The first part of these slides overlaps with last week a fair bit, but it doesn t hurt to review as this code might be a little harder to follow. SAS Programming November 6, 2014 1 / 89

More information

LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT

LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT NAVAL POSTGRADUATE SCHOOL LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT Statistics (OA3102) Lab #2: Sampling, Sampling Distributions, and the Central Limit Theorem Goal: Use R to demonstrate sampling

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

STATS PAD USER MANUAL

STATS PAD USER MANUAL STATS PAD USER MANUAL For Version 2.0 Manual Version 2.0 1 Table of Contents Basic Navigation! 3 Settings! 7 Entering Data! 7 Sharing Data! 8 Managing Files! 10 Running Tests! 11 Interpreting Output! 11

More information

5.5 Regression Estimation

5.5 Regression Estimation 5.5 Regression Estimation Assume a SRS of n pairs (x, y ),..., (x n, y n ) is selected from a population of N pairs of (x, y) data. The goal of regression estimation is to take advantage of a linear relationship

More information

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will Lectures Recent advances in Metamodel of Optimal Prognosis Thomas Most & Johannes Will presented at the Weimar Optimization and Stochastic Days 2010 Source: www.dynardo.de/en/library Recent advances in

More information

Generalized Additive Model

Generalized Additive Model Generalized Additive Model by Huimin Liu Department of Mathematics and Statistics University of Minnesota Duluth, Duluth, MN 55812 December 2008 Table of Contents Abstract... 2 Chapter 1 Introduction 1.1

More information

Introduction to hypothesis testing

Introduction to hypothesis testing Introduction to hypothesis testing Mark Johnson Macquarie University Sydney, Australia February 27, 2017 1 / 38 Outline Introduction Hypothesis tests and confidence intervals Classical hypothesis tests

More information

Factorial ANOVA with SAS

Factorial ANOVA with SAS Factorial ANOVA with SAS /* potato305.sas */ options linesize=79 noovp formdlim='_' ; title 'Rotten potatoes'; title2 ''; proc format; value tfmt 1 = 'Cool' 2 = 'Warm'; data spud; infile 'potato2.data'

More information

Tree-based methods for classification and regression

Tree-based methods for classification and regression Tree-based methods for classification and regression Ryan Tibshirani Data Mining: 36-462/36-662 April 11 2013 Optional reading: ISL 8.1, ESL 9.2 1 Tree-based methods Tree-based based methods for predicting

More information

Chapter 1. Math review. 1.1 Some sets

Chapter 1. Math review. 1.1 Some sets Chapter 1 Math review This book assumes that you understood precalculus when you took it. So you used to know how to do things like factoring polynomials, solving high school geometry problems, using trigonometric

More information

One way ANOVA when the data are not normally distributed (The Kruskal-Wallis test).

One way ANOVA when the data are not normally distributed (The Kruskal-Wallis test). One way ANOVA when the data are not normally distributed (The Kruskal-Wallis test). Suppose you have a one way design, and want to do an ANOVA, but discover that your data are seriously not normal? Just

More information

Lecture 13: Model selection and regularization

Lecture 13: Model selection and regularization Lecture 13: Model selection and regularization Reading: Sections 6.1-6.2.1 STATS 202: Data mining and analysis October 23, 2017 1 / 17 What do we know so far In linear regression, adding predictors always

More information

Bland-Altman Plot and Analysis

Bland-Altman Plot and Analysis Chapter 04 Bland-Altman Plot and Analysis Introduction The Bland-Altman (mean-difference or limits of agreement) plot and analysis is used to compare two measurements of the same variable. That is, it

More information

Chapter 7: Dual Modeling in the Presence of Constant Variance

Chapter 7: Dual Modeling in the Presence of Constant Variance Chapter 7: Dual Modeling in the Presence of Constant Variance 7.A Introduction An underlying premise of regression analysis is that a given response variable changes systematically and smoothly due to

More information

Introduction to Mixed Models: Multivariate Regression

Introduction to Mixed Models: Multivariate Regression Introduction to Mixed Models: Multivariate Regression EPSY 905: Multivariate Analysis Spring 2016 Lecture #9 March 30, 2016 EPSY 905: Multivariate Regression via Path Analysis Today s Lecture Multivariate

More information

EXST3201 Mousefeed01 Page 1

EXST3201 Mousefeed01 Page 1 EXST3201 Mousefeed01 Page 1 3 /* 4 Examine differences among the following 6 treatments 5 N/N85 fed normally before weaning and 85 kcal/wk after 6 N/R40 fed normally before weaning and 40 kcal/wk after

More information

ON SOME METHODS OF CONSTRUCTION OF BLOCK DESIGNS

ON SOME METHODS OF CONSTRUCTION OF BLOCK DESIGNS ON SOME METHODS OF CONSTRUCTION OF BLOCK DESIGNS NURNABI MEHERUL ALAM M.Sc. (Agricultural Statistics), Roll No. I.A.S.R.I, Library Avenue, New Delhi- Chairperson: Dr. P.K. Batra Abstract: Block designs

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors

More information

Chapter 1. Looking at Data-Distribution

Chapter 1. Looking at Data-Distribution Chapter 1. Looking at Data-Distribution Statistics is the scientific discipline that provides methods to draw right conclusions: 1)Collecting the data 2)Describing the data 3)Drawing the conclusions Raw

More information

STAT 503 Fall Introduction to SAS

STAT 503 Fall Introduction to SAS Getting Started Introduction to SAS 1) Download all of the files, sas programs (.sas) and data files (.dat) into one of your directories. I would suggest using your H: drive if you are using a computer

More information

Statistical Analysis of List Experiments

Statistical Analysis of List Experiments Statistical Analysis of List Experiments Kosuke Imai Princeton University Joint work with Graeme Blair October 29, 2010 Blair and Imai (Princeton) List Experiments NJIT (Mathematics) 1 / 26 Motivation

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

Use of Extreme Value Statistics in Modeling Biometric Systems

Use of Extreme Value Statistics in Modeling Biometric Systems Use of Extreme Value Statistics in Modeling Biometric Systems Similarity Scores Two types of matching: Genuine sample Imposter sample Matching scores Enrolled sample 0.95 0.32 Probability Density Decision

More information

Pair-Wise Multiple Comparisons (Simulation)

Pair-Wise Multiple Comparisons (Simulation) Chapter 580 Pair-Wise Multiple Comparisons (Simulation) Introduction This procedure uses simulation analyze the power and significance level of three pair-wise multiple-comparison procedures: Tukey-Kramer,

More information

Land Cover Stratified Accuracy Assessment For Digital Elevation Model derived from Airborne LIDAR Dade County, Florida

Land Cover Stratified Accuracy Assessment For Digital Elevation Model derived from Airborne LIDAR Dade County, Florida Land Cover Stratified Accuracy Assessment For Digital Elevation Model derived from Airborne LIDAR Dade County, Florida FINAL REPORT Submitted October 2004 Prepared by: Daniel Gann Geographic Information

More information

Two-Stage Least Squares

Two-Stage Least Squares Chapter 316 Two-Stage Least Squares Introduction This procedure calculates the two-stage least squares (2SLS) estimate. This method is used fit models that include instrumental variables. 2SLS includes

More information

range: [1,20] units: 1 unique values: 20 missing.: 0/20 percentiles: 10% 25% 50% 75% 90%

range: [1,20] units: 1 unique values: 20 missing.: 0/20 percentiles: 10% 25% 50% 75% 90% ------------------ log: \Term 2\Lecture_2s\regression1a.log log type: text opened on: 22 Feb 2008, 03:29:09. cmdlog using " \Term 2\Lecture_2s\regression1a.do" (cmdlog \Term 2\Lecture_2s\regression1a.do

More information

NCSS Statistical Software. Robust Regression

NCSS Statistical Software. Robust Regression Chapter 308 Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that chapter for in depth coverage of multiple

More information

Assessing the Quality of the Natural Cubic Spline Approximation

Assessing the Quality of the Natural Cubic Spline Approximation Assessing the Quality of the Natural Cubic Spline Approximation AHMET SEZER ANADOLU UNIVERSITY Department of Statisticss Yunus Emre Kampusu Eskisehir TURKEY ahsst12@yahoo.com Abstract: In large samples,

More information

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,

More information

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business Section 3.4: Diagnostics and Transformations Jared S. Murray The University of Texas at Austin McCombs School of Business 1 Regression Model Assumptions Y i = β 0 + β 1 X i + ɛ Recall the key assumptions

More information

Applied Survey Data Analysis Module 2: Variance Estimation March 30, 2013

Applied Survey Data Analysis Module 2: Variance Estimation March 30, 2013 Applied Statistics Lab Applied Survey Data Analysis Module 2: Variance Estimation March 30, 2013 Approaches to Complex Sample Variance Estimation In simple random samples many estimators are linear estimators

More information

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a Chapter 3 Missing Data 3.1 Types of Missing Data ˆ Missing completely at random (MCAR) ˆ Missing at random (MAR) a ˆ Informative missing (non-ignorable non-response) See 1, 38, 59 for an introduction to

More information

PubHlth 640 Intermediate Biostatistics Unit 2 - Regression and Correlation. Simple Linear Regression Software: Stata v 10.1

PubHlth 640 Intermediate Biostatistics Unit 2 - Regression and Correlation. Simple Linear Regression Software: Stata v 10.1 PubHlth 640 Intermediate Biostatistics Unit 2 - Regression and Correlation Simple Linear Regression Software: Stata v 10.1 Emergency Calls to the New York Auto Club Source: Chatterjee, S; Handcock MS and

More information

Estimation of Item Response Models

Estimation of Item Response Models Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:

More information