ST697F: Topics in Regression. Spring 2007.
4.7 Extensions; 4.8 References; 5 Bootstrapping; worked examples.


4.7 Extensions

The approach above can be readily extended to the case where we are interested in inverse prediction or regulation for one predictor, given that the other predictors are known (for inverse prediction) or set at fixed values (for regulation). For example, suppose there are two predictors $x_1$ and $x_2$ and the model for the mean is $\beta_0 + \beta_1 x_1 + \beta_2 x_2$. For inverse prediction, suppose there is a new unit with a known $x_2$ value, say $x_{20}$, but an unknown value of $x_1$, say $x_{10}$. We would estimate the unknown $x_{10}$ via $\hat{x}_{10} = (Y_0 - \hat\beta_0 - \hat\beta_2 x_{20})/\hat\beta_1$. Similarly, for regulation we could ask: for what $x_1$ is the expected value of $Y$ equal to a specified constant $c$ when the second predictor is set at $x_{20}$? The answer is $\rho = (c - \beta_0 - \beta_2 x_{20})/\beta_1$, which would be estimated by $\hat\rho = (c - \hat\beta_0 - \hat\beta_2 x_{20})/\hat\beta_1$. Both of these problems have the form of a ratio, so we can apply Fieller's result or the delta method with appropriate definitions of $\sigma_{11}$, $\sigma_{22}$ and $\sigma_{12}$. The problem of doing inverse prediction or regulation for multiple x's is more complicated, but notice that you can always attack this by trying to invert prediction or confidence intervals.
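To make the delta-method calculation concrete, here is a minimal PROC IML sketch (in the style of the programs later in these notes) for the regulation estimate $\hat\rho$ and its approximate standard error. The numbers in bhat and covb are hypothetical stand-ins for output from a fitted model, not values from any example in these notes.

proc iml;
/* hypothetical coefficient estimates (b0,b1,b2) and their estimated
   covariance matrix from a fitted model; illustration only */
bhat = {2.10, 0.80, -0.35};
covb = { 0.040 -0.002  0.001,
        -0.002  0.010 -0.003,
         0.001 -0.003  0.008};
c   = 5;      /* specified mean value */
x20 = 3;      /* fixed value of the second predictor */
num = c - bhat[1] - bhat[3]*x20;   /* numerator of the ratio */
den = bhat[2];                     /* denominator: b1 */
rhohat = num/den;
/* sigma11 = var(num), sigma22 = var(den), sigma12 = cov(num,den) */
s11 = covb[1,1] + x20*x20*covb[3,3] + 2*x20*covb[1,3];
s22 = covb[2,2];
s12 = -covb[1,2] - x20*covb[2,3];
/* delta-method variance of a ratio a/b */
varrho = (s11 - 2*rhohat*s12 + rhohat*rhohat*s22)/(den*den);
se = sqrt(varrho);
print rhohat se;
quit;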

4.8 References

Kutner et al., Section 4.6 (inverse prediction/calibration).
Greene (p. 61), Mood, Graybill and Boes (p. 181), Casella and Berger (p. 240) for approximations for nonlinear functions.
Fieller's Theorem: Buonaccorsi (1998, 2001).
Graybill and Iyer: Section 6.4 covers inverse prediction and regulation in simple linear regression (although they give a different way to compute them, the results are the same as ours), while Sections 6.6 and 6.7 give two other situations where a ratio of linear combinations of coefficients is of interest.

5 Bootstrapping

Bootstrapping has become a popular way to carry out statistical inferences. The basic idea is to mimic the original sampling method in order to generate the sampling distribution of some estimator or test statistic, and then obtain estimates of bias and standard error, confidence intervals, and tests based on these results. The simulation done to mimic this sampling uses the data to create a population or model from which to sample. There is a huge literature and a plethora of recent books on the subject; see the references. While the bootstrap is relatively simple to describe, there are many subtle and complex issues around the performance of bootstrap-based confidence intervals and tests that are beyond the scope of this text. Our objective here is to describe the basic ideas and illustrate their application in regression contexts. The fundamental concepts are most easily motivated in the context of a single random sample.

5.1 Bootstrapping for random samples (the i.i.d. case)

The univariate, single parameter case. Let $W_1,\dots,W_n$ be a random sample from some distribution $F$. That is, the $W_i$ are independent and each $W_i$ has the same distribution, defined by the CDF $F$. Let $\theta$ be any parameter of interest associated with the population distribution $F$. Examples include the mean, the standard deviation, the median, some percentile, etc. The parameter $\theta$ will be estimated by $\hat\theta = g(\mathbf{W}) = g(W_1,\dots,W_n)$, where $\mathbf{W}$ contains $W_1,\dots,W_n$. (Somewhat confusingly, but as is standard, we use $\hat\theta$ to denote either the estimator $g(\mathbf{W})$, which is a random variable, or the estimate $g(\mathbf{w})$, which is the actual number observed. It should be clear from the context how it is being used.)

The population CDF is $F(w) = P(W \le w)$, while the empirical CDF $\hat F$ is defined by $\hat F(w) = (\#\text{ of } w_i \le w)/n$. Often the parameter $\theta$ can be viewed as some function of $F$, say $t(F)$. The plug-in estimator of $\theta$ is $t(\hat F)$. That is, if $\hat\theta$ is the plug-in estimator of $\theta$, it is calculated in the same way $\theta$ would be determined from $F$, but using $\hat F$ as if it were $F$. For example, $\bar W = \sum_i W_i/n$ is the plug-in estimator of the population mean $\mu$, while the plug-in estimator of the population standard deviation $\sigma$ is

$\hat\sigma = \left[\sum_i (W_i - \bar W)^2/n\right]^{1/2}$.   (6)

Note the slight difference from the usual sample standard deviation $s = [\sum_i (W_i - \bar W)^2/(n-1)]^{1/2}$ because of the division by $n$ rather than $n-1$.

Often we evaluate how good an estimator is via its bias $= E(\hat\theta) - \theta$ and its standard deviation/standard error, denoted $\sigma_{\hat\theta}$. In many problems the bias and standard error can be written explicitly (often as functions of parameters) and can be estimated directly from the data. In addition, exact or approximate confidence intervals and tests can often be obtained using exact distributional results or large sample arguments. For example, with $\theta = \mu$ and $\hat\theta = \bar W$, we know $E(\bar W) = \mu$, so the bias is 0, and the standard error of $\bar W$ is exactly $\sigma_{\bar W} = \sigma/n^{1/2}$, typically estimated via $s/n^{1/2}$. Under normality, exact confidence intervals for $\mu$ are based on the t distribution, while for large sample sizes approximate confidence intervals are based on the normal. In many cases where there are no exact expressions for the bias, standard error or sampling distribution, some asymptotic or approximate result is used. For example, we previously used a Taylor series based method to approximate the bias and standard error of nonlinear functions; see Section 4.3. Approximate confidence intervals are often found by assuming that $(\hat\theta - \theta)/\hat\sigma_{\hat\theta}$ is approximately standard normal. The validity of the approximations for the bias and standard error, and of the normality assumption for the estimators, is always in question. Can we find a way to estimate the bias and standard error of $\hat\theta$, and more generally the sampling distribution of $\hat\theta$, that does not depend on assuming $F$ is of a particular type (normal, exponential, etc.) and does not require an analytical expression for the bias or standard error?

The bootstrap estimate of standard error. The standard error of $\hat\theta$, $\sigma_{\hat\theta}$, is itself some function of $F$, say $\sigma_{\hat\theta} = h(F)$. By definition, the bootstrap estimate of the standard error is $\hat\sigma_{\hat\theta} = h(\hat F)$. If we know explicitly what the function $h(F)$ is, then we can calculate the bootstrap estimate of standard error directly. For example, with $\theta = \mu$ and $\hat\theta = \bar W$, we noted above that $\sigma_{\bar W} = \sigma/n^{1/2}$. This can be viewed as $h(F)$, since it is a function of $F$ via $\sigma$, the population standard deviation associated with $F$. So the bootstrap estimate of $\sigma_{\bar W}$ is $\hat\sigma/n^{1/2}$, where $\hat\sigma$ is given in (6).
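As a quick illustration of the direct calculation, here is a minimal PROC IML sketch comparing the usual $s/n^{1/2}$ with the theoretical bootstrap standard error $\hat\sigma/n^{1/2}$ of (6); the data values are hypothetical.

proc iml;
w = {2, 5, 3, 8, 4, 6, 5, 7};          /* hypothetical sample */
n = nrow(w);
wbar = sum(w)/n;
s      = sqrt( ssq(w-wbar)/(n-1) );    /* usual sample standard deviation */
sighat = sqrt( ssq(w-wbar)/n );        /* plug-in estimate, equation (6) */
se_usual = s/sqrt(n);                  /* usual estimate of se of the mean */
se_boot  = sighat/sqrt(n);             /* bootstrap estimate h(Fhat) */
print se_usual se_boot;
quit;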
Usually we do not have an analytical expression for how $\sigma_{\hat\theta}$ is a function of $F$, in which case the definition of the bootstrap estimate of standard error above is not directly useful. We can, however, calculate the bootstrap estimate of standard error via simulation without knowing what $h$ is. The algorithm for a random sample with observed data $w_1,\dots,w_n$ proceeds as follows. For $b = 1$ to $B$, where $B$ is a large value:

1. Take a random sample of size $n$ WITH REPLACEMENT from the values $w_1,\dots,w_n$. Denote this bootstrap sample by $w^*_{b1},\dots,w^*_{bn}$. Notice that some of the original values in the sample will typically occur more than once in the bootstrap sample.

2. Compute your estimate in the same way as you did for the original data, now using the bootstrap sample: $\hat\theta^*_b = g(w^*_{b1},\dots,w^*_{bn})$.

This leads to $B$ bootstrap estimates $\hat\theta^*_1,\dots,\hat\theta^*_B$. The bootstrap estimate of standard error is calculated as

$\hat\sigma_{B,\hat\theta} = \left[ \sum_{b=1}^B (\hat\theta^*_b - \bar\theta^*)^2 / (B-1) \right]^{1/2}$,

where $\bar\theta^* = \sum_{b=1}^B \hat\theta^*_b / B$. (Technically the bootstrap estimate of standard error is the limit of the above as $B \to \infty$, but it is common to use the term in this way.) The bootstrap estimate of bias is $\bar\theta^* - t(\hat F)$.

REMARKS:

- Note that even if the original estimator $\hat\theta$ is not the plug-in estimator, the bias is calculated using the plug-in estimator $t(\hat F)$. If we don't know $t(\cdot)$, we cannot calculate the bootstrap estimate of bias.
- The mean of the bootstrap values should NOT be taken as the estimate of $\theta$.
- The bootstrap estimate of the distribution of $\hat\theta$ is simply the distribution of the $B$ values $\hat\theta^*_1,\dots,\hat\theta^*_B$. We'll call this the Empirical Bootstrap Distribution and denote its empirical CDF by $\hat G$; note that this is an estimate of the CDF of $\hat\theta$. Typically we will use a histogram, smoothed histogram or stem-and-leaf plot to represent this distribution rather than give it in CDF form.

5.1.1 General multivariate and/or multiparameter case

The univariate single parameter case is easily generalized to a random sample where there are multiple parameters of interest and/or each observation is multivariate. Let $W_1,\dots,W_n$ be i.i.d. with some distribution $F$, where the $W_i$ can now be vector valued. Let $\theta$ be a collection of $q$ parameters of interest, with estimator $\hat\theta$; $q$ could be 1, as when we are interested in a single parameter even if the data are multivariate. We resample with replacement as before from the observed $w_1,\dots,w_n$ and write $\hat\theta^*_b$ for the estimate of $\theta$ from the $b$th bootstrap sample. The bootstrap mean is $\bar\theta^* = \sum_b \hat\theta^*_b/B$ and the bootstrap covariance matrix is

$S_B = \sum_b (\hat\theta^*_b - \bar\theta^*)(\hat\theta^*_b - \bar\theta^*)' / (B-1)$.

$S_B$ is the bootstrap estimate of $\mathrm{Cov}(\hat\theta)$, the covariance matrix of $\hat\theta$. The square roots of the diagonal elements are the bootstrap standard errors of the individual components.
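As a small illustration of the algorithm, here is a minimal PROC IML sketch of the bootstrap standard error and bias for the sample median, using the same resampling and sorting idiom as the full program in Section 7. The data are hypothetical, and n is taken even so the median is the average of the two middle order statistics; in the multivariate case the same loop would simply save a row of estimates per replicate and form $S_B$ from them.

proc iml;
w = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};   /* hypothetical sample, n even */
n = nrow(w);
ws = w;  ws[rank(w)] = w;             /* sorted copy of the data */
thetahat = (ws[n/2] + ws[n/2+1])/2;   /* sample median = plug-in t(Fhat) */
B = 1000;
th = j(B,1,0);                        /* will hold the B bootstrap medians */
wb = w;
do b = 1 to B;
  do i = 1 to n;                      /* resample WITH replacement */
    k = int(uniform(0)*n + 1);
    wb[i] = w[k];
  end;
  wbs = wb;  wbs[rank(wb)] = wb;
  th[b] = (wbs[n/2] + wbs[n/2+1])/2;  /* median of the bootstrap sample */
end;
thbar = sum(th)/B;
se_boot   = sqrt( ssq(th-thbar)/(B-1) );  /* bootstrap standard error */
bias_boot = thbar - thetahat;             /* bootstrap estimate of bias */
print thetahat se_boot bias_boot;
quit;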

5.2 Bootstrap Confidence Intervals

As noted earlier, a common method of obtaining approximate confidence intervals is to assume $(\hat\theta - \theta)/\hat\sigma_{\hat\theta}$ is approximately standard normal, leading to an approximate confidence interval for $\theta$ of the form $\hat\theta \pm z(1-\alpha/2)\hat\sigma_{\hat\theta}$, where $\hat\sigma_{\hat\theta}$ is the estimated standard error of $\hat\theta$. Nonparametric confidence intervals can be found using the bootstrap in a variety of ways; there is plenty of discussion of the different methods in the cited literature. The two most commonly used methods that have emerged in practice are the percentile method and the BC_a method.

The Percentile Method: Consider $\alpha_1$ and $\alpha_2$ with $\alpha_1 + \alpha_2 = \alpha$, where usually $\alpha_1 = \alpha_2 = \alpha/2$. The percentile confidence interval for $\theta$ is $[L, U]$, where $L = \hat G^{-1}(\alpha_1)$ and $U = \hat G^{-1}(1-\alpha_2)$. That is, $100\alpha_1\%$ of the bootstrap values are less than or equal to $L$ and $100(1-\alpha_2)\%$ of the bootstrap values are less than or equal to $U$. Note that if $\hat G$ is normal with mean $\hat\theta$, then the bootstrap percentile interval agrees with $\hat\theta \pm z(1-\alpha/2)\hat\sigma_{B,\hat\theta}$.

Notice that with the percentile method, if $[L, U]$ is the percentile interval for $\theta$, then the percentile interval for any monotone increasing function of $\theta$, say $\phi = g(\theta)$, is simply $[g(L), g(U)]$. This is a nice property that does not hold for the delta method interval, where confidence intervals for nonlinear functions do not simply transform in this way.

It is not transparent why the percentile method works. It can be shown that it works well if there is some transformation $g$ and a constant $c$ for which $(g(\hat\theta) - g(\theta))/c$ follows approximately a standard normal distribution. (It is not necessary to know what $g$ is.) The percentile method, which is very easy to calculate, has been found to work well in many cases, but it can encounter problems. In practice an improved bootstrap method called the BC_a method (bias-corrected accelerated method) is often preferred. The BC_a bootstrap interval is given by $L = \hat G^{-1}(\alpha_1^*)$ and $U = \hat G^{-1}(1-\alpha_2^*)$. This resembles the percentile method but uses adjusted quantities $\alpha_1^*$ and $\alpha_2^*$. These are somewhat involved to calculate and depend on two quantities: a bias correction term and an acceleration term. The first accounts for potential bias (actually median bias) in $\hat\theta$, while the acceleration term addresses the fact that the standard error of $\hat\theta$ may itself be a function of $\theta$. See, for example, Efron and Tibshirani for details (but note there are some notational differences with our treatment).
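Given the B bootstrap values, the percentile interval just reads off two order statistics of the empirical bootstrap distribution. A minimal sketch, using the same rough percentile convention as the program in Section 7; stand-in normal values are generated here (with hypothetical mean and standard deviation) only so the sketch runs on its own, and in practice th would come from a resampling loop like the one in Section 5.1.

proc iml;
call randseed(697);
th = j(1000,1,0);
call randgen(th, "Normal", 10, 2);    /* stand-in for bootstrap estimates */
B = nrow(th);
ths = th;  ths[rank(th)] = th;        /* sorted bootstrap values */
alpha = 0.10;                         /* 90% interval, alpha1=alpha2=.05 */
L = ths[ int(B*(alpha/2)) ];          /* Ghat^{-1}(alpha/2), roughly */
U = ths[ int(B*(1-alpha/2)) ];        /* Ghat^{-1}(1-alpha/2), roughly */
print L U;
quit;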

5.3 Bootstrapping in Regression Models

The methods of using bootstrap estimates for bias, standard error and confidence intervals are the same as in the random sample setting. What changes is the manner in which the bootstrap samples are generated. We continue to assume, as we have done to this point, that the observations on different units are independent.

5.3.1 Random Regressors: bootstrapping (Y, X) together

Suppose all the regressors are random quantities, so when we choose the $i$th unit in the sample we obtain the response and all of the predictors, collected in $W_i = (Y_i, X_{i1}, \dots, X_{i,p-1})'$. A regression model is specified for $E(Y|x)$. A variance model may be specified, possibly with heteroscedasticity depending on the X's, but need not be. If $W_1,\dots,W_n$ arise from a random sample of individual units (i.e., can be treated as i.i.d.), then we can proceed to use the bootstrap as in Section 5.1.1 for any estimator or collection of estimators of interest. This could be the coefficients, variance parameters, correlations, or any functions of these, including nonlinear ones.

We illustrate first with $\beta$, which is estimated by $\hat\beta$, either through least squares or some type of weighted least squares (possibly iterative). In the $b$th bootstrap sample the estimator is $\hat\beta^*_b$, and the bootstrap estimate of the covariance of $\hat\beta$ is $S_B$. The square roots of the diagonal elements of $S_B$ give the bootstrap standard errors for the individual coefficients. For any one of the coefficients, the bootstrap estimates can be used to get the bootstrap estimate of bias and standard error and nonparametric confidence intervals. For estimating a linear combination of interest, say $\theta = c'\beta$, because of the linearity the bootstrap standard error can be calculated via $(c' S_B c)^{1/2}$. To get the empirical distribution and confidence intervals, though, we need to obtain each $\hat\theta^*_b = c'\hat\beta^*_b$. If the interest is in some nonlinear function of $\beta$, say $\theta = g(\beta)$, then one calculates $\hat\theta^*_b = g(\hat\beta^*_b)$ for $b = 1$ to $B$.

REMARKS:

- Since under constant variance we know the exact covariance of the least squares estimator and how to get an unbiased estimate of it, there is no need to use the bootstrap for this purpose. However, the bootstrap is useful for getting nonparametric confidence intervals for the coefficients and functions of them. These provide an alternative to the usual t-based intervals or the delta method intervals for nonlinear functions.
- Since the homogeneity of variance assumption does not have to hold in this case, if we are using least squares the bootstrap estimate $S_B$ provides an alternative to White's robust estimate of the covariance (Section 3.2).

5.3.2 Bootstrapping residuals

Consider fixing the x values and suppose the model is

$Y_i = m(x_i, \beta) + \epsilon_i$   (7)

where the $\epsilon_i$ are assumed to be independent and identically distributed (i.i.d.) with mean 0 from some distribution $F$. (Note that this implies the errors have constant variance $\sigma^2$ and are uncorrelated.) The $i$th residual is $r_i = Y_i - \hat Y_i$. The $b$th bootstrap sample is generated by setting $Y^*_{bi} = m(x_i, \hat\beta) + r^*_{bi}$, $i = 1$ to $n$, where $r^*_{b1},\dots,r^*_{bn}$ are bootstrapped residuals obtained by sampling with replacement from some collection of values that reflect the distribution of the original $\epsilon_i$. Recall that this distribution is assumed to have mean 0 and variance $\sigma^2$.

One possibility is to just sample with replacement from the residuals $r_1,\dots,r_n$. Notice that when we sample with replacement from the residuals, the variance of the resulting value is $\sum_i r_i^2/n = (n-p)\mathrm{MSE}/n$, where MSE is our unbiased estimator of $\sigma^2$. This suggests a modification, namely to sample from the modified residuals $(n/(n-p))^{1/2} r_i$; when we sample from these, the variance equals MSE. This means that $r^*_{bi}$ is generated by sampling one of the residuals, and if the $k$th residual is selected then $r^*_{bi} = (n/(n-p))^{1/2} r_k$. We will always use this modification, although it is clearly not very important as $n$ gets large. For the linear case with an intercept in the model, $\bar r = \sum_i r_i/n = 0$, so sampling from the residuals or modified residuals is sampling from a distribution with mean 0. Without an intercept $\bar r$ is not zero, so we should resample from the centered residuals $r_i - \bar r$.

The $b$th bootstrap sample consists of $(y^*_{b1}, x_1),\dots,(y^*_{bn}, x_n)$, from which we get $\hat\beta^*_b$ and any other estimate of interest. It can be shown that, using the modified residuals, as $B \to \infty$, $S_B$ (the sample covariance among the bootstrap estimates $\hat\beta^*_1,\dots,\hat\beta^*_B$) converges to $\mathrm{MSE}(X'X)^{-1}$. So bootstrapping from the modified residuals does exactly the right thing in terms of estimating the covariance of $\hat\beta$. As noted before, the bootstrap is not needed in order to estimate $\Sigma_{\hat\beta}$ under constant variance, since $\mathrm{MSE}(X'X)^{-1}$ provides an unbiased estimator. The usefulness of the bootstrap here is that it allows us to construct confidence intervals for the coefficients, and for linear combinations of them, that do not depend on normality of $\hat\beta$. We can also do other things, such as carry out inferences for $\sigma^2$ (which under the normality assumption are based on a chi-square distribution with $n-p$ degrees of freedom) or handle inferences for nonlinear functions of the parameters.

5.3.3 Bootstrap Prediction Intervals

Suppose we want to predict $Y_0$ at $x_0$. It is the distribution of $\hat Y_0 - Y_0$ that we need, where $\hat Y_0 = x_0'\hat\beta$. If the CDF of this distribution is denoted $H$, then

$P(H^{-1}(\alpha/2) \le \hat Y_0 - Y_0 \le H^{-1}(1-\alpha/2)) = 1-\alpha$,

implying

$P(\hat Y_0 - H^{-1}(1-\alpha/2) \le Y_0 \le \hat Y_0 - H^{-1}(\alpha/2)) = 1-\alpha$.

This means that $[\hat Y_0 - H^{-1}(1-\alpha/2),\ \hat Y_0 - H^{-1}(\alpha/2)]$ is a $100(1-\alpha)\%$ prediction interval for $Y_0$. Since we don't know $H$, we estimate it via the bootstrap as follows. For $b = 1$ to $B$:

i) generate $\hat\beta^*_b$ as in Section 5.3.2 and construct $\hat Y^*_{0b} = x_0'\hat\beta^*_b$;

ii) generate $Y^*_{0b} = \hat Y_0 + r^*_{0b}$, where $r^*_{0b}$ is a newly generated bootstrap residual;

iii) construct $D^*_b = \hat Y^*_{0b} - Y^*_{0b}$.

Consider the empirical distribution of $D^*_1,\dots,D^*_B$ and denote its percentiles by $\hat H^{-1}(\alpha/2)$ (the value with $\alpha/2$ of the distribution to the left) and $\hat H^{-1}(1-\alpha/2)$. The prediction interval for $Y_0$ (based on the percentile method) is $(\hat Y_0 - \hat H^{-1}(1-\alpha/2),\ \hat Y_0 - \hat H^{-1}(\alpha/2))$.
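A minimal PROC IML sketch of steps i)-iii) for a straight-line fit, with modified residuals as in Section 5.3.2. The data, x0 and B are hypothetical, and the rough percentile convention of the Section 7 program is used.

proc iml;
/* hypothetical data for a simple linear regression */
x = {1, 2, 3, 4, 5, 6, 7, 8};
y = {2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8};
n    = nrow(x);
xmat = j(n,1,1) || x;                    /* design matrix with intercept */
p    = ncol(xmat);
bhat = inv(t(xmat)*xmat) * t(xmat)*y;
r    = y - xmat*bhat;
rmod = sqrt(n/(n-p)) * r;                /* modified residuals */
x0   = {1, 5.5};                         /* hypothetical x0 (with intercept) */
yhat0 = t(x0)*bhat;
B = 1000;
D = j(B,1,0);
yb = y;
do b = 1 to B;
  do i = 1 to n;                         /* i) residual-bootstrap responses */
    k = int(uniform(0)*n + 1);
    yb[i] = xmat[i,]*bhat + rmod[k];
  end;
  bb     = inv(t(xmat)*xmat) * t(xmat)*yb;
  yhat0b = t(x0)*bb;
  k    = int(uniform(0)*n + 1);          /* ii) one new bootstrap residual */
  y0b  = yhat0 + rmod[k];
  D[b] = yhat0b - y0b;                   /* iii) D_b */
end;
Ds = D;  Ds[rank(D)] = D;                /* sorted D's */
alpha = 0.10;                            /* rough percentiles, as in Sec. 7 */
lo = yhat0 - Ds[int(B*(1-alpha/2))];
hi = yhat0 - Ds[int(B*(alpha/2))];
print yhat0 lo hi;
quit;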

5.3.4 Bootstrapping the residuals for the heteroscedastic case

Consider the variance model $V(\epsilon_i) = \sigma^2 a_i^2$, where $a_i$ is known. Equivalently, we can view $\epsilon_i$ as $a_i \delta_i$, where the $\delta_i$ are assumed to be independent and identically distributed from some distribution having mean 0 and variance $\sigma^2$. So $\delta_i = (w_i)^{1/2}\epsilon_i$, where $w_i = 1/a_i^2$ is the weight. We can estimate the distribution of the $\delta_i$ by using the modified weighted residuals $\hat\delta_i = (n/(n-p))^{1/2}(w_i)^{1/2}(Y_i - m(x_i,\hat\beta))$. The weighted residuals do not necessarily add to zero, however, since there isn't an intercept in the transformed model. With $r_i = (w_i)^{1/2}(Y_i - m(x_i,\hat\beta))$, in general it is better to resample from $\hat\delta_i = (n/(n-p))^{1/2}(r_i - \bar r)$ rather than from $\hat\delta_i = (n/(n-p))^{1/2} r_i$. All this really does is change the intercept used in the bootstrapping. In practice the mean of these weighted residuals is often small, so it doesn't make much of a difference.

The bootstrap sample is generated by $Y^*_{bi} = m(x_i,\hat\beta) + a_i d^*_{bi}$, where the $d^*_{bi}$ are sampled with replacement from $\hat\delta_1,\dots,\hat\delta_n$. Notice that the $a_i$ is fixed to the $i$th position in the sample, while $d^*_{bi}$ comes from selecting one of the weighted residuals. The $\hat\beta$ could come from either ordinary or weighted least squares, depending on which is being used. If there is a parametric model for the variance, $a_i^2 = v_i(\beta,\lambda)$, then $\lambda$ needs to be estimated in order to do the bootstrapping. This leads to the use of $\hat\delta_i = (n/(n-p))^{1/2}(\hat w_i)^{1/2}(Y_i - m(x_i,\hat\beta))$, where $\hat w_i = 1/v_i(\hat\beta,\hat\lambda)$, and $Y^*_{bi} = m(x_i,\hat\beta) + (v_i(\hat\beta,\hat\lambda))^{1/2} d^*_{bi}$. If the $\hat\beta$ being used in the original analysis is two-stage or iteratively reweighted least squares, then within each bootstrap sample you would carry out this same procedure; that is, do two-stage or iteratively reweighted least squares on each bootstrap sample.

5.3.5 Bootstrapping with replication

Suppose there are $k$ distinct collections of regressors, say $x_1,\dots,x_k$, with $n_j$ observations at $x_j$, so $\sum_j n_j = n$. In this case, with the x's treated as fixed, we can still resample in a way which allows for changing variances. One option is to use the fitted model (which means we believe we have the right function for the mean); then, when we generate an observation at $x_j$, we resample from just the residuals at $x_j$. If the residuals at $x_j$ are denoted $r_{j1},\dots,r_{jn_j}$, then we would create modified residuals

$\tilde r_{jm} = [n_j/(n_j-1)]^{1/2}(r_{jm} - \bar r_j)$,

where $\bar r_j$ is the mean of $r_{j1},\dots,r_{jn_j}$. The resampling is from these modified residuals. Another option is to generate the $n_j$ responses at $x_j$ by resampling $n_j$ times with replacement from the original $n_j$ responses at $x_j$. If there are only a few replicates at each distinct x, this may not work very well.

5.4 Bootstrap Hypothesis Testing

Hypothesis testing is a popular way of carrying out inferences, and there are a couple of ways to carry out bootstrap tests of hypotheses.

Method 1: Try to generate the null distribution. One way is to emulate carrying out a test based on some test statistic. Suppose the test is based on a test statistic $Q$ and rejects $H_0$ if $Q$ is large (almost all tests can be put into this form). The bootstrap approach simulates the null distribution of the test statistic $Q$. With this approach, the bootstrap samples must be generated under the null model (the model incorporating the null hypothesis). Then for each bootstrap sample the test statistic is calculated, and the empirical distribution of these test statistics over the $B$ bootstrap samples is an estimate of the null distribution. Suppose the test statistic has observed value $Q_{obs}$ and the bootstrap values are $Q^*_1,\dots,Q^*_B$ (so $Q^*_b$ is the value of the test statistic from the $b$th bootstrap sample). The bootstrap P-value of the test is

$P_{boot} = (\text{number of times } Q^*_b \ge Q_{obs})/B$.

The null hypothesis is rejected if $P_{boot}$ is less than $\alpha$, the desired level of the test (e.g., .05 for a test at the 5% level). This method of bootstrapping mimics our usual approach to testing based on the null distribution and is a popular way to approach testing. In more complicated situations it can be difficult to figure out how to properly resample under the null. Even when you can do so, this method can have problems. Oftentimes there are parameters involved in the model which must be estimated, and the distribution of the test statistic under the null hypothesis may depend on these parameters; that is, the test statistic is not what is known as a pivotal quantity. For regression problems, in the case of uncorrelated errors and constant variance, where the bootstrap is being used to protect against non-normal errors, this approach does fairly well. More serious problems can arise in more complicated models, including those with heteroscedasticity or correlated errors. This issue has not received the attention it deserves in practice, and further work is needed on the magnitude of the problem.
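A minimal PROC IML sketch of Method 1 for testing $H_0: \beta_1 = 0$ in a straight-line model, taking $Q$ to be the squared t statistic for the slope and resampling modified residuals from the fitted null (mean-only) model; the data are hypothetical.

proc iml;
x = {1, 2, 3, 4, 5, 6, 7, 8};           /* hypothetical data */
y = {2.3, 2.9, 3.1, 2.2, 3.8, 3.0, 4.1, 3.6};
n = nrow(x);
xmat = j(n,1,1) || x;
p = ncol(xmat);
xpxi = inv(t(xmat)*xmat);
bhat = xpxi*(t(xmat)*y);
mse  = ssq(y - xmat*bhat)/(n-p);
Qobs = bhat[2]*bhat[2]/(mse*xpxi[2,2]);  /* squared t statistic for slope */
ybar = sum(y)/n;
r0 = sqrt(n/(n-1))*(y - ybar);           /* modified null-model residuals */
B = 1000;
count = 0;
yb = y;
do b = 1 to B;
  do i = 1 to n;                         /* generate data under H0 */
    k = int(uniform(0)*n + 1);
    yb[i] = ybar + r0[k];
  end;
  bb   = xpxi*(t(xmat)*yb);
  mseb = ssq(yb - xmat*bb)/(n-p);
  Qb   = bb[2]*bb[2]/(mseb*xpxi[2,2]);
  if Qb >= Qobs then count = count + 1;
end;
pboot = count/B;                         /* bootstrap P-value */
print Qobs pboot;
quit;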

Method 2: Invert a confidence interval. For a single parameter, hypothesis testing can also be carried out using the bootstrap confidence interval. We illustrate for the two-sided test of $H_0: \theta = \theta_0$ versus $H_A: \theta \ne \theta_0$. An approximate test of size $\alpha$ is to reject $H_0$ if the $100(1-\alpha)\%$ confidence interval for $\theta$ does not contain $\theta_0$. A test at level .05 would use a 95% confidence interval, a level .10 test a 90% confidence interval, etc. A P-value for the test can be obtained by finding the smallest level at which the null hypothesis is rejected. Equivalently, this means finding the largest value $C$ for which a $100C\%$ confidence interval contains $\theta_0$ and then taking the P-value to be $1 - C$. For one-sided tests, a similar approach can be taken using one-sided confidence intervals. Because of the potential problems mentioned earlier with obtaining a P-value by directly bootstrapping the test statistic under the null, we recommend carrying out the test via confidence intervals when possible.

5.4.1 Summary

Briefly, the main points in this section are:

- Bootstrapping the residuals requires that we have a model for the mean and the variance, except when we have replication.
- If we have a random sample of units, then you can either bootstrap the $(y_i, x_i)$ sets, or bootstrap the residuals if you believe you have the right model for $Y|x$. Bootstrapping the $(y_i, x_i)$ sets does not require a model for the variance, and it allows heteroscedasticity.
- If the x's are fixed, then generally you should bootstrap the residuals, although there are some cases where bootstrapping the sets works okay.
- If you have random sampling of units that is not a single random sample (i.e., it comes from something like a stratified or multi-stage sampling design), then you can bootstrap the residuals or bootstrap the (y, x) sets in a way which reflects the original sampling scheme.

5.5 References

Kutner et al. (Section 11.5).
Efron and Tibshirani (1993). A comprehensive look but not too technical; Chapter 9 handles regression in detail.
Diaconis and Efron (1983), Efron and Tibshirani (1986), Leger et al. (1992, p. 378), Davison and Hinkley (1997), Manly (2000).

7 Univariate bootstrap - food expenditures

This example illustrates the use of the bootstrap in a single sample, where we use the sample mean, variance, standard deviation, coefficient of variation (standard deviation/mean), median and 90th percentile for illustration. The data consist of food expenditures in dollars for 40 randomly sampled households in Ohio. (Data from Exploring Statistics by Kitchens, originally from the Bureau of Labor Statistics.) B = 1000 bootstrap samples were used.

The original data have sample mean $\bar x$ = , standard deviation s = , median 3165 and 90th percentile . The sample mean is known to be an unbiased estimator of the population mean $\mu$, with exact standard error $\sigma/n^{1/2}$, usually estimated by $s/n^{1/2}$ (Std Error Mean in the output). An approximate confidence interval for $\mu$, based on the approximate normality of the sample mean, is found using $\bar x \pm t(1-\alpha/2, n-1)s/n^{1/2}$. (We usually use the t rather than a normal value, even when we think the population is not normal but the sample size is large, in order to be conservative. There is little difference between the t and z values for even moderate n, though.) The 95% interval here is ( , ).

The mean of the 1000 bootstrap sample means is . As we know the sample mean is unbiased, there is no need to estimate bias. If we do, the bootstrap estimate of bias is . This is small relative to the estimate, and the only reason it is not 0 is the use of a finite number (1000) of bootstrap samples; as you increase B it will go to 0. The bootstrap estimate of the standard error of $\bar x$ from the 1000 samples (the standard deviation of the 1000 sample means) is . The theoretical bootstrap estimate of the standard error, what it converges to as $B \to \infty$, is $\hat\sigma/n^{1/2}$ (see the notes for $\hat\sigma$), which can be computed directly and differs only modestly from $s/n^{1/2}$. The bootstrap is not needed for assessing bias or getting the standard error of $\bar x$; it is done here for illustration and to show agreement with the usual results. For a nonparametric confidence interval, the 90% percentile interval uses the 5th and 95th percentiles of the bootstrap distribution as endpoints. This yields (3256.8, 4034.2). This is not that different from the earlier interval, which is not surprising given the normality of $\bar x$ as demonstrated by the empirical bootstrap distribution of the sample mean.

There are analytical procedures (often approximate) for treating the variance, standard deviation or coefficient of variation. Note that the sample variance is unbiased for $\sigma^2$, but s is biased for $\sigma$. The bootstrap estimate of the bias in s as an estimator of $\sigma$ is . (Note that the plug-in estimator here is $\hat\sigma = (39/40)^{1/2}s$.) For the variance, if we assume the population is normal, then an exact confidence interval is available based on the chi-square with $n-1$ degrees of freedom, and taking square roots gives a confidence interval for $\sigma$. The 90% intervals are given by proc univariate under the normality assumption and are ( , ) for the variance and (1276, 1859) for the standard deviation. Without the normality assumption, there are large sample results that can be used, based on the approximate normality of the sample variance or sample standard deviation; these involve an analytical expression for the asymptotic standard error. Note that if you use the approximate normality of the sample variance and then separately work with the standard deviation, the interval for the standard deviation is not just the square root of the interval for the variance. The bootstrap percentile intervals, though, transform directly.
The 90% bootstrap percentile confidence interval for $\sigma$ is ( , ). The sample coefficient of variation is $s/\bar x = .418$. As an estimator, $s/\bar x$ is a ratio, and we cannot get an exact expression for its expected value and hence for its bias (although we can approximate it using our earlier methods). The bootstrap estimate of bias is , indicating bias is not a serious issue. (The plug-in estimator of the CV here is $\hat\sigma/\bar x = (39/40)^{1/2}s/\bar x$.) It is possible, using a multivariate central limit theorem and the delta method, to determine that for large sample sizes the sample coefficient of variation is approximately normal, with mean equal to the population coefficient of variation and a standard deviation that depends on a number of parameters that must be estimated. This is one way to approach the problem, but it relies on approximations and on estimation of the unknowns in the approximate standard error. Using the bootstrap, the 90% bootstrap percentile confidence interval for the population CV is (.32, .48).

Notice from the empirical distributions that the sampling distributions are approximately normal (as large sample theory tells us they will be). In these cases the intervals that come out of the bootstrap percentile method will be close to what is obtained using a normal approximation, i.e., intervals of the form estimate $\pm z(1-\alpha/2)$SE, where SE is the bootstrap standard error. For the standard deviation, which has the least normal looking of the sampling distributions, the resulting interval is (1092.3, ), which is not too different from the percentile method.

Proc univariate gives approximate confidence intervals for population percentiles, both under normality assumptions and without normality. The distribution-free intervals are based on the order statistics; these are described in the SAS online documentation. The median can be addressed in a similar manner. The bootstrap estimate of bias ( ) is relatively small. The 90% bootstrap percentile interval for the median is (2837.5, 3679).

The 90th percentile shows some difficulty with employing the bootstrap. Only certain values can end up as the 90th percentile of a bootstrap sample, leading to a very discrete distribution. While this is not particularly problematic in estimating the bias or the standard deviation (though it could be in small sample sizes), it poses problems with the confidence intervals. Notice that the 95th and 99th percentiles of the empirical distribution are the same. A 90% percentile interval is (4367, 7580), while a 98% interval is (3970, 7580), which is a bit unsatisfactory with the upper point staying the same. One way to deal with this problem, and a general strategy that can be employed in bootstrapping, is to smooth the data before resampling, so the resampling is from a continuous distribution rather than from a set of points.

[SAS output: PROC UNIVARIATE results for the original data (variable expend): moments; basic statistical measures; 90% basic confidence limits assuming normality for the mean, standard deviation and variance; tests for location (Student's t, sign, signed rank).]

[SAS output: PROC UNIVARIATE quantile estimates (Definition 5) for expend, with 90% confidence limits assuming normality and distribution-free limits based on the order statistics, followed by a stem-and-leaf plot and box plot of the data.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap sample means (variable SMEAN): moments, quantiles, histogram and box plot. The corresponding output for the variance was eliminated.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap standard deviations (variable SD): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap coefficients of variation (variable CV): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap medians (variable MEDIAN): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summary of the 1000 bootstrap 90th percentiles (variable P90): moments, quantiles, histogram and box plot; the discreteness discussed in the text is visible in the histogram.]

title 'Bootstrap with a single sample';
options pagesize=60 linesize=80;
/* THIS IS A PROGRAM TO BOOTSTRAP WITH A SINGLE RANDOM SAMPLE USING THE
   MEAN, VARIANCE, STANDARD DEVIATION, COEFFICIENT OF VARIATION, MEDIAN,
   90TH PERCENTILE */
filename bb 'boot.out';
/* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE values */
data values;
  infile 'food.dat';
  input expend;
title 'descriptive statistics on original sample';
proc univariate cibasic cipctlnormal cipctldf plot alpha=.10;
/* START INTO IML WHERE BOOTSTRAPPING WILL BE DONE */
proc iml;
  /* put data into vector x */
  use values;
  read all var{expend} into x;
  close values;
  n=nrow(x);       /* = sample size */
  xb=x;            /* initializes xb to be the same size as x */
  nboot = 1000;    /* specify number of bootstrap replicates */
  do j=1 to nboot;
    /* get the n samples with replacement. i indexes the sampling within
       bootstrap replicate j. The generated k is a discrete uniform over
       1 to n; the function int takes the integer part */
    do i= 1 to n;
      uv=uniform(0);
      k=int(uv*n+1);
      xb[i]=x[k];
    end;
    /* xb contains the n values in the bootstrap sample */
    /* compute statistics of interest; sum and ssq are matrix functions
       that do the sum and sum of squares */
    smean = sum(xb)/n;                       /* sample mean */
    svar=(ssq(xb) - (n*(smean**2)))/(n-1);   /* sample variance */
    sd = sqrt(svar);                         /* sample s.dev. */
    cv = sd/smean;                           /* coefficient of variation */

    /* compute median and 90th percentile */
    b=xb;               /* initializes b */
    xb[rank(xb)] = b;   /* xb now has the ranked (sorted) values */
    c1=int(n/2);
    c2=c1+1;
    median=(xb[c1]+xb[c2])/2;                  /* use if n is even */
    diff = c1 - (n/2);
    if diff < 0 then median = xb[c2];          /* if n is odd */
    d = int(.9*n);
    p90 = xb[d];    /* rough 90th percentile. Can be refined */
    /* the next two commands put the results to file bb, which is aliased
       with the external file boot.out through the filename statement at
       the beginning. The +1 in the put statement says to skip one space. */
    file bb;
    put smean +1 svar +1 sd +1 cv +1 median +1 p90;
  end;
quit;
/* Get descriptive statistics via proc univariate */
data new;
  infile 'boot.out';
  input smean svar sd cv median p90;
proc univariate plot;

8 Bootstrap Regression Sets - Esterase Assay

Here we demonstrate the bootstrap where it is assumed that the n units in the study are a random sample (independent and identically distributed), so we can resample the (Y, x) sets. We demonstrate using the Esterase Assay data in order to compare with our earlier results. This assumes there is a sample of 106 individuals, and for each individual we get the true esterase concentration via some exact method and at the same time get a binding count from running the radioimmunoassay. This would not be the right way to proceed if this were a designed experiment using standards with known concentrations; that would involve bootstrapping residuals. Here we use the bootstrap to get an estimated covariance matrix for the least squares estimates of the coefficients and to get confidence intervals. Using least squares does not require that we model the variance, but the usual estimated covariance is known to be wrong; analytically, one option was to use White's robust estimator. See Example 4.1 for the least squares results and the robust estimate of covariance (labeled "consistent covariance of estimates").

The least squares estimates are unbiased (assuming the linear model is right), so we don't need the bootstrap to assess bias. The bootstrap estimates of standard error are 21.1 for the intercept and 1.33 for the slope; these are the square roots of the diagonal elements of the bootstrap estimate of $\Sigma_{\hat\beta}$, which is labeled "Covariance Matrix" below. Notice the similarity between this and White's robust estimator. The empirical bootstrap distributions demonstrate the approximate normality of the estimators, and intervals based on the normal approximation using the bootstrap standard errors should be reasonable. The 90% percentile intervals are (-53.1, 15.6) for $\beta_0$ and (15, 19.3) for $\beta_1$.

[SAS output: least squares estimates (BHAT) and MSE; bootstrap covariance matrix of the coefficient estimates; PROC UNIVARIATE summary of the bootstrap intercepts (variable B0): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summary of the bootstrap slopes (variable B1): moments, quantiles, histogram and box plot.]

title 'Esterase Hormone data subset - bootstrap';
options pagesize=60 linesize=80;
/* THIS IS A PROGRAM TO BOOTSTRAP IN REGRESSION WITH RESAMPLING OF THE
   X AND Y'S TOGETHER. NEXT LINE SETS UP CORRESPONDENCE BETWEEN FILE bb
   INSIDE SAS AND THE EXTERNAL FILE bhat.out */
filename bb 'bhat.out';
/* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE a */
data a;
  infile 'ester.dat';
  input ester count;
  con=1.0;
proc iml;
  /* put data into y vector and x matrix */
  use a;
  read all var {count} into y;
  read all var {con ester} into x;
  close a;
  /* need to do the next two lines just to define the vector yb and
     matrix xb that will be used in the bootstrap */
  yb=y;
  xb=x;
  /* get the usual least squares estimators and mean squared error using
     matrix forms. bhat is beta(hat) and mse is the mean square error;
     t(x) stands for the transpose of the matrix x; r is the vector of
     residuals and ssq is a function which gets the sum of squares of
     the vector in its argument */
  xpxinv=inv(t(x)*x);
  bhat=xpxinv*(t(x)*y);
  yhat=x*bhat;
  r=y-yhat;
  sse=ssq(r);
  df=nrow(x)-ncol(x);
  mse=sse/df;
  n=nrow(x);
  print bhat mse;
  /* j indexes the number of bootstrap replicates */
  do j=1 to 500;

    /* get the n samples with replacement. i indexes the sampling within
       bootstrap replicate j; the generated k is a discrete uniform over
       1 to n; the function int takes the integer part */
    do i= 1 to n;
      uv=uniform(0);
      k=int(uv*n+1);
      yb[i]=y[k];
      xb[i,1]=x[k,1];
      xb[i,2]=x[k,2];
    end;
    /* now do least squares with xb the new x matrix and yb the new
       response vector; bhatb and mseb hold the results (the b at the
       end stands for bootstrap) */
    xpxinvb=inv(t(xb)*xb);
    bhatb=xpxinvb*(t(xb)*yb);
    yhatb=xb*bhatb;
    rb=yb-yhatb;
    sseb=ssq(rb);
    mseb=sseb/df;
    b0=bhatb[1];
    b1=bhatb[2];
    /* the next two commands put the results to file bb, which is aliased
       with the external file bhat.out through the filename statement at
       the beginning. The +1 in the put statement says to skip one space;
       if you don't do this, there are no blanks between variables. */
    file bb;
    put b0 +1 b1 +1;
  end;
quit;
/* Now get descriptive statistics through proc corr and proc univariate.
   proc corr is run with the cov option so we can get the estimated
   covariance of beta(hat); this is the sample covariance of the bhatb's
   over the bootstrap samples. */
data new;
  infile 'bhat.out';
  input b0 b1;
proc corr cov;
proc univariate plot;

9 Bootstrap Regression: Residuals - Esterase Assay/weighted

Here we demonstrate the bootstrap by resampling the residuals, allowing fixed weights to be specified for weighted least squares. We demonstrate using the Esterase Assay data, where it is assumed that $V(\epsilon_i) = x_i^2\sigma^2$. We fit this model and got estimated standard errors in Example 4.2. There is no need for the bootstrap for those purposes, but the bootstrap is useful for assessing the distribution and getting confidence intervals. In addition to working with the coefficients, we also estimate the mean value at x = 20, called M20 in the output.

NOTE: In the SAS code, the ith component of ynew is $Y_i^* = w_i^{1/2} Y_i$ and the ith row of xnew is $w_i^{1/2} x_i'$. The ith component of yhatw is $w_i^{1/2} x_i'\hat\beta$, and the ith component of rw is $Y_i^* - w_i^{1/2} x_i'\hat\beta = w_i^{1/2}(Y_i - x_i'\hat\beta)$. If we multiply this by $(n/(n-p))^{1/2}$, this is what is called $\hat\delta_i$ in Section 5.3.4 of the notes.

The estimated coefficients and MSE agree with the weighted analysis in Example 4.2. The estimated covariance matrix of the coefficients (and associated standard errors) differs modestly from the covariance matrix and standard errors in Example 4.2, in part due to the use of B = 1000. The estimators are all approximately normal, and the normal-based confidence intervals will be very similar to the percentile intervals.

[SAS output: weighted least squares estimates (BHATW) and MSE; bootstrap covariance matrix of the coefficient estimates (DF = 999); PROC UNIVARIATE summary of the bootstrap intercepts (variable B0): moments, quantiles, histogram and box plot.]

[SAS output: PROC UNIVARIATE summaries of the bootstrap slopes (variable B1) and of the bootstrap estimated means at x = 20 (variable M20): moments, quantiles, histograms and box plots.]

title 'Esterase Hormone data - bootstrap';
options pagesize=60 linesize=80;
/* THIS IS A PROGRAM TO BOOTSTRAP IN REGRESSION WITH BOOTSTRAPPING ON THE
   RESIDUALS.
   - VARIANCE IS ASSUMED OF THE FORM SIGMA^2*a_i^2.
   - USES WEIGHTED LEAST SQUARES.
   - IF EQUAL VARIANCE, SET a = 1 AND LEAVE THE REST THE SAME. */
filename bb 'bhat2.out';
/* DOING ESTERASE ASSAY EXAMPLE WITH VARIANCE PROPORTIONAL TO X SQUARED. */
data a;
  infile 'ester.dat';
  input ester count;
  con=1.0;
  a2 = ester**2;
  wt=1/a2;
  ystar=count*sqrt(wt);
  x1star=1*sqrt(wt);
  x2star=ester*sqrt(wt);
proc iml;
  /* put transformed data into the ynew vector and xnew matrix */
  use a;
  read all var {ystar} into ynew;
  read all var {x1star x2star} into xnew;
  close a;
  yb=ynew;   /* initialize */
  xb=xnew;
  /* get the weighted least squares estimators and MSE */
  xpxinv=inv(t(xnew)*xnew);
  bhatw=xpxinv*(t(xnew)*ynew);
  yhatw=xnew*bhatw;
  rw=ynew-yhatw;
  sse=ssq(rw);
  df=nrow(xnew)-ncol(xnew);
  mse=sse/df;
  n=nrow(xnew);
  print bhatw mse;
  nboot = 1000;
  do j=1 to nboot;
    /* Resample from the residuals and add to the fitted value; note that
       the weighting is already built in and the residual is modified. */
    do i= 1 to n;
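      /* The listing in these notes is cut off at this point. A sketch of
         how the program presumably continues, following Section 5.3.4;
         the recentering and the M20 calculation below are assumptions
         filled in to match the output shown earlier, not the original
         code. */
      uv=uniform(0);
      k=int(uv*n+1);
      /* modified (and recentered) weighted residual, Section 5.3.4 */
      yb[i] = yhatw[i] + sqrt(n/df)*(rw[k] - sum(rw)/n);
    end;
    /* weighted least squares on the bootstrap responses */
    bhatb = xpxinv*(t(xnew)*yb);
    b0 = bhatb[1];
    b1 = bhatb[2];
    m20 = b0 + b1*20;   /* estimated mean at x = 20 (M20 in the output) */
    file bb;
    put b0 +1 b1 +1 m20;
  end;
quit;
data new;
  infile 'bhat2.out';
  input b0 b1 m20;
proc corr cov;
proc univariate plot;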


COPYRIGHTED MATERIAL CONTENTS

COPYRIGHTED MATERIAL CONTENTS PREFACE ACKNOWLEDGMENTS LIST OF TABLES xi xv xvii 1 INTRODUCTION 1 1.1 Historical Background 1 1.2 Definition and Relationship to the Delta Method and Other Resampling Methods 3 1.2.1 Jackknife 6 1.2.2

More information

4.5 The smoothed bootstrap

4.5 The smoothed bootstrap 4.5. THE SMOOTHED BOOTSTRAP 47 F X i X Figure 4.1: Smoothing the empirical distribution function. 4.5 The smoothed bootstrap In the simple nonparametric bootstrap we have assumed that the empirical distribution

More information

Lecture 12. August 23, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

Lecture 12. August 23, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University. Lecture 12 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University August 23, 2007 1 2 3 4 5 1 2 Introduce the bootstrap 3 the bootstrap algorithm 4 Example

More information

Chapter 2: The Normal Distributions

Chapter 2: The Normal Distributions Chapter 2: The Normal Distributions Measures of Relative Standing & Density Curves Z-scores (Measures of Relative Standing) Suppose there is one spot left in the University of Michigan class of 2014 and

More information

Cluster Randomization Create Cluster Means Dataset

Cluster Randomization Create Cluster Means Dataset Chapter 270 Cluster Randomization Create Cluster Means Dataset Introduction A cluster randomization trial occurs when whole groups or clusters of individuals are treated together. Examples of such clusters

More information

Subset Selection in Multiple Regression

Subset Selection in Multiple Regression Chapter 307 Subset Selection in Multiple Regression Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that

More information

Fathom Dynamic Data TM Version 2 Specifications

Fathom Dynamic Data TM Version 2 Specifications Data Sources Fathom Dynamic Data TM Version 2 Specifications Use data from one of the many sample documents that come with Fathom. Enter your own data by typing into a case table. Paste data from other

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup For our analysis goals we would like to do: Y X N (X, 2 I) and then interpret the coefficients

More information

Chapter 6: Linear Model Selection and Regularization

Chapter 6: Linear Model Selection and Regularization Chapter 6: Linear Model Selection and Regularization As p (the number of predictors) comes close to or exceeds n (the sample size) standard linear regression is faced with problems. The variance of the

More information

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 1 2 Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 2. How to construct (in your head!) and interpret confidence intervals.

More information

Chapters 5-6: Statistical Inference Methods

Chapters 5-6: Statistical Inference Methods Chapters 5-6: Statistical Inference Methods Chapter 5: Estimation (of population parameters) Ex. Based on GSS data, we re 95% confident that the population mean of the variable LONELY (no. of days in past

More information

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem. STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,

More information

Week 10: Heteroskedasticity II

Week 10: Heteroskedasticity II Week 10: Heteroskedasticity II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Dealing with heteroskedasticy

More information

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order. Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good

More information

Multivariate Analysis Multivariate Calibration part 2

Multivariate Analysis Multivariate Calibration part 2 Multivariate Analysis Multivariate Calibration part 2 Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Linear Latent Variables An essential concept in multivariate data

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

IQR = number. summary: largest. = 2. Upper half: Q3 =

IQR = number. summary: largest. = 2. Upper half: Q3 = Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number

More information

Statistical Matching using Fractional Imputation

Statistical Matching using Fractional Imputation Statistical Matching using Fractional Imputation Jae-Kwang Kim 1 Iowa State University 1 Joint work with Emily Berg and Taesung Park 1 Introduction 2 Classical Approaches 3 Proposed method 4 Application:

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup Random Variables: Y i =(Y i1,...,y ip ) 0 =(Y i,obs, Y i,miss ) 0 R i =(R i1,...,r ip ) 0 ( 1

More information

1 Methods for Posterior Simulation

1 Methods for Posterior Simulation 1 Methods for Posterior Simulation Let p(θ y) be the posterior. simulation. Koop presents four methods for (posterior) 1. Monte Carlo integration: draw from p(θ y). 2. Gibbs sampler: sequentially drawing

More information

Splines and penalized regression

Splines and penalized regression Splines and penalized regression November 23 Introduction We are discussing ways to estimate the regression function f, where E(y x) = f(x) One approach is of course to assume that f has a certain shape,

More information

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K. GAMs semi-parametric GLMs Simon Wood Mathematical Sciences, University of Bath, U.K. Generalized linear models, GLM 1. A GLM models a univariate response, y i as g{e(y i )} = X i β where y i Exponential

More information

Macros and ODS. SAS Programming November 6, / 89

Macros and ODS. SAS Programming November 6, / 89 Macros and ODS The first part of these slides overlaps with last week a fair bit, but it doesn t hurt to review as this code might be a little harder to follow. SAS Programming November 6, 2014 1 / 89

More information

LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT

LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT NAVAL POSTGRADUATE SCHOOL LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT Statistics (OA3102) Lab #2: Sampling, Sampling Distributions, and the Central Limit Theorem Goal: Use R to demonstrate sampling

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

STATS PAD USER MANUAL

STATS PAD USER MANUAL STATS PAD USER MANUAL For Version 2.0 Manual Version 2.0 1 Table of Contents Basic Navigation! 3 Settings! 7 Entering Data! 7 Sharing Data! 8 Managing Files! 10 Running Tests! 11 Interpreting Output! 11

More information

5.5 Regression Estimation

5.5 Regression Estimation 5.5 Regression Estimation Assume a SRS of n pairs (x, y ),..., (x n, y n ) is selected from a population of N pairs of (x, y) data. The goal of regression estimation is to take advantage of a linear relationship

More information

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will Lectures Recent advances in Metamodel of Optimal Prognosis Thomas Most & Johannes Will presented at the Weimar Optimization and Stochastic Days 2010 Source: www.dynardo.de/en/library Recent advances in

More information

Generalized Additive Model

Generalized Additive Model Generalized Additive Model by Huimin Liu Department of Mathematics and Statistics University of Minnesota Duluth, Duluth, MN 55812 December 2008 Table of Contents Abstract... 2 Chapter 1 Introduction 1.1

More information

Introduction to hypothesis testing

Introduction to hypothesis testing Introduction to hypothesis testing Mark Johnson Macquarie University Sydney, Australia February 27, 2017 1 / 38 Outline Introduction Hypothesis tests and confidence intervals Classical hypothesis tests

More information

Factorial ANOVA with SAS

Factorial ANOVA with SAS Factorial ANOVA with SAS /* potato305.sas */ options linesize=79 noovp formdlim='_' ; title 'Rotten potatoes'; title2 ''; proc format; value tfmt 1 = 'Cool' 2 = 'Warm'; data spud; infile 'potato2.data'

More information

Tree-based methods for classification and regression

Tree-based methods for classification and regression Tree-based methods for classification and regression Ryan Tibshirani Data Mining: 36-462/36-662 April 11 2013 Optional reading: ISL 8.1, ESL 9.2 1 Tree-based methods Tree-based based methods for predicting

More information

Chapter 1. Math review. 1.1 Some sets

Chapter 1. Math review. 1.1 Some sets Chapter 1 Math review This book assumes that you understood precalculus when you took it. So you used to know how to do things like factoring polynomials, solving high school geometry problems, using trigonometric

More information

One way ANOVA when the data are not normally distributed (The Kruskal-Wallis test).

One way ANOVA when the data are not normally distributed (The Kruskal-Wallis test). One way ANOVA when the data are not normally distributed (The Kruskal-Wallis test). Suppose you have a one way design, and want to do an ANOVA, but discover that your data are seriously not normal? Just

More information

Lecture 13: Model selection and regularization

Lecture 13: Model selection and regularization Lecture 13: Model selection and regularization Reading: Sections 6.1-6.2.1 STATS 202: Data mining and analysis October 23, 2017 1 / 17 What do we know so far In linear regression, adding predictors always

More information

Bland-Altman Plot and Analysis

Bland-Altman Plot and Analysis Chapter 04 Bland-Altman Plot and Analysis Introduction The Bland-Altman (mean-difference or limits of agreement) plot and analysis is used to compare two measurements of the same variable. That is, it

More information

Chapter 7: Dual Modeling in the Presence of Constant Variance

Chapter 7: Dual Modeling in the Presence of Constant Variance Chapter 7: Dual Modeling in the Presence of Constant Variance 7.A Introduction An underlying premise of regression analysis is that a given response variable changes systematically and smoothly due to

More information

Introduction to Mixed Models: Multivariate Regression

Introduction to Mixed Models: Multivariate Regression Introduction to Mixed Models: Multivariate Regression EPSY 905: Multivariate Analysis Spring 2016 Lecture #9 March 30, 2016 EPSY 905: Multivariate Regression via Path Analysis Today s Lecture Multivariate

More information

EXST3201 Mousefeed01 Page 1

EXST3201 Mousefeed01 Page 1 EXST3201 Mousefeed01 Page 1 3 /* 4 Examine differences among the following 6 treatments 5 N/N85 fed normally before weaning and 85 kcal/wk after 6 N/R40 fed normally before weaning and 40 kcal/wk after

More information

ON SOME METHODS OF CONSTRUCTION OF BLOCK DESIGNS

ON SOME METHODS OF CONSTRUCTION OF BLOCK DESIGNS ON SOME METHODS OF CONSTRUCTION OF BLOCK DESIGNS NURNABI MEHERUL ALAM M.Sc. (Agricultural Statistics), Roll No. I.A.S.R.I, Library Avenue, New Delhi- Chairperson: Dr. P.K. Batra Abstract: Block designs

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors

More information

Chapter 1. Looking at Data-Distribution

Chapter 1. Looking at Data-Distribution Chapter 1. Looking at Data-Distribution Statistics is the scientific discipline that provides methods to draw right conclusions: 1)Collecting the data 2)Describing the data 3)Drawing the conclusions Raw

More information

STAT 503 Fall Introduction to SAS

STAT 503 Fall Introduction to SAS Getting Started Introduction to SAS 1) Download all of the files, sas programs (.sas) and data files (.dat) into one of your directories. I would suggest using your H: drive if you are using a computer

More information

Statistical Analysis of List Experiments

Statistical Analysis of List Experiments Statistical Analysis of List Experiments Kosuke Imai Princeton University Joint work with Graeme Blair October 29, 2010 Blair and Imai (Princeton) List Experiments NJIT (Mathematics) 1 / 26 Motivation

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

Use of Extreme Value Statistics in Modeling Biometric Systems

Use of Extreme Value Statistics in Modeling Biometric Systems Use of Extreme Value Statistics in Modeling Biometric Systems Similarity Scores Two types of matching: Genuine sample Imposter sample Matching scores Enrolled sample 0.95 0.32 Probability Density Decision

More information

Pair-Wise Multiple Comparisons (Simulation)

Pair-Wise Multiple Comparisons (Simulation) Chapter 580 Pair-Wise Multiple Comparisons (Simulation) Introduction This procedure uses simulation analyze the power and significance level of three pair-wise multiple-comparison procedures: Tukey-Kramer,

More information

Land Cover Stratified Accuracy Assessment For Digital Elevation Model derived from Airborne LIDAR Dade County, Florida

Land Cover Stratified Accuracy Assessment For Digital Elevation Model derived from Airborne LIDAR Dade County, Florida Land Cover Stratified Accuracy Assessment For Digital Elevation Model derived from Airborne LIDAR Dade County, Florida FINAL REPORT Submitted October 2004 Prepared by: Daniel Gann Geographic Information

More information

Two-Stage Least Squares

Two-Stage Least Squares Chapter 316 Two-Stage Least Squares Introduction This procedure calculates the two-stage least squares (2SLS) estimate. This method is used fit models that include instrumental variables. 2SLS includes

More information

range: [1,20] units: 1 unique values: 20 missing.: 0/20 percentiles: 10% 25% 50% 75% 90%

range: [1,20] units: 1 unique values: 20 missing.: 0/20 percentiles: 10% 25% 50% 75% 90% ------------------ log: \Term 2\Lecture_2s\regression1a.log log type: text opened on: 22 Feb 2008, 03:29:09. cmdlog using " \Term 2\Lecture_2s\regression1a.do" (cmdlog \Term 2\Lecture_2s\regression1a.do

More information

NCSS Statistical Software. Robust Regression

NCSS Statistical Software. Robust Regression Chapter 308 Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that chapter for in depth coverage of multiple

More information

Assessing the Quality of the Natural Cubic Spline Approximation

Assessing the Quality of the Natural Cubic Spline Approximation Assessing the Quality of the Natural Cubic Spline Approximation AHMET SEZER ANADOLU UNIVERSITY Department of Statisticss Yunus Emre Kampusu Eskisehir TURKEY ahsst12@yahoo.com Abstract: In large samples,

More information

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,

More information

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business Section 3.4: Diagnostics and Transformations Jared S. Murray The University of Texas at Austin McCombs School of Business 1 Regression Model Assumptions Y i = β 0 + β 1 X i + ɛ Recall the key assumptions

More information

Applied Survey Data Analysis Module 2: Variance Estimation March 30, 2013

Applied Survey Data Analysis Module 2: Variance Estimation March 30, 2013 Applied Statistics Lab Applied Survey Data Analysis Module 2: Variance Estimation March 30, 2013 Approaches to Complex Sample Variance Estimation In simple random samples many estimators are linear estimators

More information

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a Chapter 3 Missing Data 3.1 Types of Missing Data ˆ Missing completely at random (MCAR) ˆ Missing at random (MAR) a ˆ Informative missing (non-ignorable non-response) See 1, 38, 59 for an introduction to

More information

PubHlth 640 Intermediate Biostatistics Unit 2 - Regression and Correlation. Simple Linear Regression Software: Stata v 10.1

PubHlth 640 Intermediate Biostatistics Unit 2 - Regression and Correlation. Simple Linear Regression Software: Stata v 10.1 PubHlth 640 Intermediate Biostatistics Unit 2 - Regression and Correlation Simple Linear Regression Software: Stata v 10.1 Emergency Calls to the New York Auto Club Source: Chatterjee, S; Handcock MS and

More information

Estimation of Item Response Models

Estimation of Item Response Models Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:

More information