# Chapter 3. Bootstrap


## 3.1 Introduction

The estimation of parameters in probability distributions is a basic problem in statistics that one tends to encounter already during the very first course on the subject. Along with the estimate itself we are usually interested in its accuracy, which can be described in terms of the bias and the variance of the estimator, as well as confidence intervals around it. Sometimes, such measures of accuracy can be derived analytically. Often, they cannot. Bootstrap is a technique that can be used to estimate them numerically from a single data set.¹

## 3.2 The general idea

Let $X_1, \ldots, X_n$ be an i.i.d. sample from a distribution $F$, that is $P(X_i \le x) = F(x)$, and let $X_{(1)}, \ldots, X_{(n)}$ be the corresponding ordered sample. We are interested in some parameter $\theta$ associated with this distribution (mean, median, variance etc.). There is also an estimator $\hat\theta = t(\{X_1, \ldots, X_n\})$, with $t$ denoting some function, that we can use to estimate $\theta$ from data. In this setting, it is the deviation of $\hat\theta$ from $\theta$ that is of primary interest.

Ideally, to get an approximation of the estimator's distribution we would like to repeat the data-generating experiment, say, $N$ times, calculating $\hat\theta$ for each of the $N$ data sets. That is, we would draw $N$ samples of size $n$ from the true distribution $F$ (with replacement, if $F$ is discrete). In practice, this is impossible. The bootstrap method is based on the following simple idea: even if we do not know $F$, we can approximate it from data and use this approximation, $\hat F$, instead of $F$ itself. This idea leads to several flavours of bootstrap that differ in how, exactly, the approximation $\hat F$ is obtained. Two broad areas are the parametric and the non-parametric bootstrap.

¹ More generally, bootstrap is a method of approximating the distribution of functions of the data (statistics), which can serve different purposes, among others the construction of CIs and hypothesis testing.
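The "ideal" repeated-experiment idea above can be imitated in simulation whenever the true $F$ is known. A minimal sketch in Python (the course software is R; the standard Normal $F$, the sample median as $\hat\theta$, and the choices $n = 30$, $N = 2000$ are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(7)

# "Ideal" experiment: draw N fresh samples of size n from the true F
# and look at the spread of theta_hat across repetitions.
n, N = 30, 2000
ideal = np.array([np.median(rng.normal(0.0, 1.0, n)) for _ in range(N)])
print(ideal.std(ddof=1))  # Monte Carlo estimate of the sd of the sample median
```

The bootstrap replaces the draws from the true $F$ in this loop with draws from the estimated $\hat F$.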

The non-parametric estimate is the so-called empirical distribution (you will see the corresponding pdf if you simply do a histogram of the data), which can be formally defined as follows:

**Definition 3.1.** With $\#$ denoting the number of members of a set, the empirical distribution function $\hat F$ is given by

$$\hat F(x) = \frac{\#\{i : X_i \le x\}}{n} = \begin{cases} 0 & \text{if } x \in (-\infty, X_{(1)}), \\ i/n & \text{if } x \in [X_{(i)}, X_{(i+1)}) \text{ for } i \in \{1, \ldots, n-1\}, \\ 1 & \text{if } x \in [X_{(n)}, \infty). \end{cases}$$

That is, it is a discrete distribution that puts mass $1/n$ on each data point in your sample.

The parametric estimate assumes that the data comes from a certain distribution family (Normal, Gamma etc.). That is, we say that we know the general functional form of the pdf, but not the exact parameters. Those parameters can then be estimated from the data (typically with maximum likelihood) and plugged into the pdf to get $\hat F$. This estimation method leads to more accurate inference if we guessed the distribution family correctly but, on the other hand, $\hat F$ may be quite far from $F$ if the family assumption is wrong.

### 3.2.1 Algorithms

The two algorithms below describe how the bootstrap can be implemented.

**Non-parametric bootstrap.** Assuming a data set $x = (x_1, \ldots, x_n)$ is available:

1. Fix the number of bootstrap re-samples $N$. Often $N \in [1000, 2000]$.
2. Sample a new data set $x^*$ of size $n$ from $x$ with replacement (this is equivalent to sampling from the empirical cdf $\hat F$).
3. Estimate $\theta$ from $x^*$. Call the estimate $\hat\theta^*_i$, for $i = 1, \ldots, N$. Store it.
4. Repeat steps 2 and 3 $N$ times.
5. Consider the empirical distribution of $(\hat\theta^*_1, \ldots, \hat\theta^*_N)$ as an approximation of the true distribution of $\hat\theta$.
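The non-parametric algorithm fits in a few lines of code. The course uses R (`sample`, `boot`), but here is a minimal Python sketch; the function name `np_bootstrap`, the exponential toy data and $N = 2000$ are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def np_bootstrap(data, stat, N=1000, rng=rng):
    """Non-parametric bootstrap: resample the data with replacement
    N times and evaluate the statistic on each resample."""
    data = np.asarray(data)
    n = len(data)
    return np.array([stat(rng.choice(data, size=n, replace=True))
                     for _ in range(N)])

data = rng.exponential(scale=2.0, size=50)        # toy data set (assumed)
theta_star = np_bootstrap(data, np.median, N=2000)
print(theta_star.mean(), theta_star.std(ddof=1))  # bootstrap mean and sd
```

Resampling with replacement from `data` is exactly sampling from the empirical cdf $\hat F$, since $\hat F$ puts mass $1/n$ on each observed point.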

**Parametric bootstrap.** Assuming a data set $x = (x_1, \ldots, x_n)$ is available:

1. Assume that the data comes from a known distribution family $F_\psi$ described by a set of parameters $\psi$ (for a Normal distribution $\psi = (\mu, \sigma)$, with $\mu$ being the expected value and $\sigma$ the standard deviation).
2. Estimate $\psi$ with, for example, maximum likelihood, obtaining the estimate $\hat\psi$.
3. Fix the number of bootstrap re-samples $N$. Often $N \in [1000, 2000]$.
4. Sample a new data set $x^*$ of size $n$ from $F_{\hat\psi}$.
5. Estimate $\theta$ from $x^*$. Call the estimate $\hat\theta^*_i$, for $i = 1, \ldots, N$. Store it.
6. Repeat steps 4 and 5 $N$ times.
7. Consider the empirical distribution of $(\hat\theta^*_1, \ldots, \hat\theta^*_N)$ as an approximation of the true distribution of $\hat\theta$.

To concretize, let us say that $X_i \sim N(0, 1)$, $\theta$ is the median and it is estimated by $\hat\theta = X_{(n/2)}$. In Figure 3.1 the distribution of $\hat\theta$ approximated with the non-parametric and the parametric bootstrap is plotted. Note that the parametric distribution is smoother than the non-parametric one, since the samples were drawn from a continuous distribution. As stated earlier, the parametric version could be more accurate, but also potentially more difficult to set up. In particular, we have to sample from $F_{\hat\psi}$, which is not easily done if $F$ is not one of the common distributions (such as the Normal) for which a pre-programmed sampling function is usually available.

## 3.3 Bias and variance estimation

The theoretical bias and variance of an estimator $\hat\theta$ are defined as

$$\mathrm{Bias}(\hat\theta) = E[\hat\theta - \theta] = E[\hat\theta] - \theta,$$
$$\mathrm{Var}(\hat\theta) = E\left[(\hat\theta - E[\hat\theta])^2\right].$$

In words, the bias is a measure of systematic error ($\hat\theta$ tends to be either smaller or larger than $\theta$), while the variance is a measure of random error. In order to obtain the bootstrap estimates of bias and variance we plug in the original estimate $\hat\theta$ (which is a constant given the data) in place of $\theta$, and $\hat\theta^*$ (the distribution of which we get from the bootstrap) in place of $\hat\theta$.
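The parametric algorithm differs from the non-parametric one only in step 4: resamples come from the fitted family instead of the data. A sketch for the Normal family (the function name and the toy data are illustrative; the ML estimates of $\mu$ and $\sigma$ for a Normal are the sample mean and the uncentered sample sd):

```python
import numpy as np

rng = np.random.default_rng(1)

def param_bootstrap_normal(data, stat, N=1000, rng=rng):
    """Parametric bootstrap under a Normal family: estimate (mu, sigma)
    by maximum likelihood, then draw the N resamples from N(mu_hat, sigma_hat)."""
    data = np.asarray(data)
    mu_hat = data.mean()          # ML estimate of mu
    sigma_hat = data.std(ddof=0)  # ML estimate of sigma
    n = len(data)
    return np.array([stat(rng.normal(mu_hat, sigma_hat, size=n))
                     for _ in range(N)])

data = rng.normal(0.0, 1.0, size=100)  # toy data set (assumed)
theta_star = param_bootstrap_normal(data, np.median, N=2000)
```

For a family other than the Normal, only the two ML lines and the sampling call change, provided a sampler for $F_{\hat\psi}$ is available.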

*[Figure 3.1: The non-parametric and the parametric bootstrap distribution of the median. Panels: "The non-parametric bootstrap", "The parametric bootstrap"; y-axis: Frequency; x-axis: median.]*

We arrive at the following approximations:

$$\mathrm{Bias}(\hat\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \hat\theta^*_i - \hat\theta = \hat\theta^*_{\cdot} - \hat\theta,$$
$$\mathrm{Var}(\hat\theta) \approx \frac{1}{N-1}\sum_{i} \left(\hat\theta^*_i - \hat\theta^*_{\cdot}\right)^2,$$

where $\hat\theta^*_{\cdot}$ denotes the average of the bootstrap estimates. That is, the variance is, as usual, estimated by the sample variance (but for the bootstrap sample of $\hat\theta^*$), and the bias is estimated by how much the original $\hat\theta$ deviates from the average of the bootstrap sample.

## 3.4 Confidence intervals

There are several methods for CI construction with bootstrap, the most popular being the normal, basic and percentile intervals. Let $\hat\theta^*_{(i)}$ be the ordered bootstrap estimates, with $i = 1, \ldots, N$ indicating the different samples, and let $\alpha$ be the significance level. In all that follows, $\hat\theta^*_\alpha$ will denote the $\alpha$-quantile of the distribution of $\hat\theta^*$. You can approximate this quantile with $\hat\theta^*_{([(N+1)\alpha])}$, with $[x]$ indicating the integer part of $x$, or through interpolation.

- Basic: $[2\hat\theta - \hat\theta^*_{1-\alpha/2},\ 2\hat\theta - \hat\theta^*_{\alpha/2}]$
- Normal: $[\hat\theta - z_{\alpha/2}\,\widehat{se},\ \hat\theta + z_{\alpha/2}\,\widehat{se}]$
- Percentile: $[\hat\theta^*_{\alpha/2},\ \hat\theta^*_{1-\alpha/2}]$
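Given the bootstrap replicates, the bias and variance estimates and all three intervals are simple arithmetic. A Python sketch (the function name `bootstrap_summary` is an assumption; `NormalDist` supplies the $z$-quantile):

```python
import numpy as np
from statistics import NormalDist

def bootstrap_summary(theta_hat, theta_star, alpha=0.05):
    """Bias/variance estimates and the basic, normal and percentile CIs,
    computed from bootstrap replicates theta_star of the estimate theta_hat."""
    theta_star = np.asarray(theta_star)
    bias = theta_star.mean() - theta_hat          # theta*_bar - theta_hat
    var = theta_star.var(ddof=1)                  # sample variance of theta*
    se = np.sqrt(var)
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    z = NormalDist().inv_cdf(1 - alpha / 2)       # upper-quantile convention
    return {
        "bias": bias,
        "var": var,
        "basic": (2 * theta_hat - hi, 2 * theta_hat - lo),
        "normal": (theta_hat - z * se, theta_hat + z * se),
        "percentile": (lo, hi),
    }
```

Note how the basic interval uses the *upper* bootstrap quantile for its lower limit and vice versa, exactly as in the table above.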

with $z_\alpha$ denoting an $\alpha$-quantile of the Normal distribution and $\widehat{se}$ the estimated standard deviation of $\hat\theta$ calculated from the bootstrap sample.

**Basic CI.** To obtain this confidence interval we start with $W = \hat\theta - \theta$ (compare to the classic CI that corresponds to a t-test). If the distribution of $W$ were known, then a two-sided CI could be obtained by considering $P(w_{\alpha/2} \le W \le w_{1-\alpha/2}) = 1 - \alpha$, which leads to $CI = [l_{low}, l_{up}] = [\hat\theta - w_{1-\alpha/2},\ \hat\theta - w_{\alpha/2}]$. However, the distribution of $W$ is not known, and is approximated with the distribution of $W^* = \hat\theta^* - \hat\theta$, with $w^*_\alpha$ denoting the corresponding $\alpha$-quantile. The CI becomes $[\hat\theta - w^*_{1-\alpha/2},\ \hat\theta - w^*_{\alpha/2}]$. Noting that $w^*_\alpha = \hat\theta^*_\alpha - \hat\theta$ and substituting this expression into the CI formulation leads to the definition given earlier.

This interval construction relies on an assumption about the distribution of $\hat\theta - \theta$, namely that it is independent of $\theta$. This assumption, called pivotality, does not necessarily hold in practice. However, the interval gives acceptable results as long as $\hat\theta - \theta$ is close to pivotal.

**Normal CI.** This confidence interval probably looks familiar, since it is almost an exact replica of the commonly used confidence interval for a population mean. Indeed, similarly to that familiar CI, we can get it if we have reason to believe that $Z = (\hat\theta - \theta)/\widehat{se} \sim N(0, 1)$ (often true asymptotically). Alternatively, we can again consider $W = \hat\theta - \theta$, place the restriction that it should be normal, and estimate its variance with the bootstrap. Observe that the normal CI also implicitly assumes pivotality.

**Percentile CI.** This type of CI may seem natural and obvious but actually requires a quite convoluted argument to motivate. Without going into details, you get this CI by assuming that there exists a transformation $h$ such that the distribution of $h(\hat\theta) - h(\theta)$ is pivotal, symmetric and centered around 0. You then construct a basic CI for $h(\theta)$ rather than $\theta$ itself, get the $\alpha$-quantiles, and transform those back to the original scale by applying $h^{-1}$.
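The basic and percentile intervals coincide only when the bootstrap distribution is symmetric around $\hat\theta$. A small sketch of how they diverge under skewness (the shifted-exponential replicates are an artificial stand-in for a skewed bootstrap sample, not a real bootstrap run):

```python
import numpy as np

rng = np.random.default_rng(3)

# Artificial, right-skewed "bootstrap sample" centered at theta_hat.
theta_hat = 1.0
theta_star = theta_hat + rng.exponential(1.0, size=5000) - 1.0

lo, hi = np.quantile(theta_star, [0.025, 0.975])
basic = (2 * theta_hat - hi, 2 * theta_hat - lo)  # reflects the quantiles
percentile = (lo, hi)                             # uses them directly
print(basic, percentile)
```

The basic interval mirrors the skew to the opposite side of $\hat\theta$, while the percentile interval keeps it; which behaviour is appropriate depends on whether $\hat\theta - \theta$ or $h(\hat\theta) - h(\theta)$ is closer to pivotal.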
Observe that although we do not need to know $h$ explicitly, the existence of such a transformation is a must, and that is not always guaranteed.

## 3.5 The limitations of bootstrap

Yes, those do exist. Bootstrap tends to give results easily (too easily, in fact), but it is possible that those results are completely wrong. More than that, they can be completely wrong without being obvious about it. Let us take a look at a few classics.

### 3.5.1 Infinite variance

In the general idea of bootstrapping we plugged in an estimate $\hat F$ of the distribution $F$, and then sampled from it. This works only if $\hat F$ actually is a good estimate of $F$, that is, if it captures the essential features of $F$ despite being based on a finite sample. This may not be the case if $F$ is very heavy-tailed, i.e. has infinite variance. The intuition is that in this case extremely large or small values can occur, and when they do they have a great effect on the bootstrap estimate of the distribution of $\hat\theta$, making it unstable. As a consequence, the measures of accuracy of $\hat\theta$, such as CIs, will be unreliable.

The classic example of this is the mean estimator $\hat\theta = \bar X$ with the data generated from a Cauchy distribution. For it, both the first and the second moments (i.e. mean and variance) are infinite, leading to nonsensical confidence intervals even for large sample sizes. A less obvious example is a non-central Student t distribution with 2 degrees of freedom. This distribution has pdf

$$f_X(x) = \frac{1}{2}\left(1 + (x - \mu)^2\right)^{-3/2}, \quad x \in \mathbb{R},$$

where $\mu$ is the location parameter, defined as $E[X]$. So, the first moment is finite and its estimator $\bar X$ is consistent. The second moment, however, is infinite, and the right tail of the distribution grows heavier with increasing $\mu$. This leads to 95% CI coverage probabilities that are quite far from the supposed 95% even when the sample size is as large as 500.

*[Figure 3.2: Histogram of 1000 non-parametric bootstrap samples of $\hat\theta^*$, with data from a Uni(0, θ) distribution, θ = 1. y-axis: Frequency; x-axis: Theta.]*
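The Cauchy failure mode is easy to reproduce. A hedged sketch (the sizes $n = 500$ and $N = 1000$ are arbitrary choices): re-running this with different seeds gives interval endpoints that jump around wildly, because a few extreme observations dominate the resampled means.

```python
import numpy as np

rng = np.random.default_rng(4)

# Bootstrap of the sample mean under Cauchy data: the replicate
# distribution is dominated by whichever extreme observations happen
# to be in the sample, so the quantiles are unstable across runs.
data = rng.standard_cauchy(500)
n = len(data)
means = np.array([rng.choice(data, n, replace=True).mean()
                  for _ in range(1000)])
print(np.quantile(means, [0.025, 0.975]))  # often extremely wide; varies between runs
```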

### 3.5.2 Parameter on the boundary

The classical example here is the Uni(0, θ) distribution. The maximum likelihood estimate $\hat\theta$ is simply $\max(x_1, \ldots, x_n)$, which will always be biased ($\hat\theta < \theta$). In this case the non-parametric bootstrap leads to a very discrete distribution, with more than half of the bootstrap estimates $\hat\theta^*$ equal to the original $\hat\theta$ (Figure 3.2). Clearly, if the quantiles used in CIs are taken from this distribution, the results will be far from accurate. However, the parametric bootstrap will give a much smoother distribution and more reliable results.

### 3.5.3 Lack of pivotality

In all the CI descriptions above the word pivotality shows up, so we can guess that it is a bad thing not to have (see Lab, task 3). To circumvent this, something called the studentized bootstrap can be used. The idea behind the method is simple and can be seen as an extension of the basic bootstrap CI. There, we looked at $W = \hat\theta - \theta$. Now, we instead consider the standardized version $W = (\hat\theta - \theta)/\sigma$, with $\sigma$ denoting the standard deviation of $\hat\theta$. The confidence interval is then obtained through

$$P(w_{\alpha/2} \le W \le w_{1-\alpha/2}) = P(\hat\theta - w_{1-\alpha/2}\,\sigma \le \theta \le \hat\theta - w_{\alpha/2}\,\sigma) = 1 - \alpha.$$

As with the basic CI, the distribution of $W$ is not known and has to be approximated, this time with the distribution of $W^* = (\hat\theta^* - \hat\theta)/\widehat{se}^*$. Here, $\widehat{se}^*$ denotes the standard deviation corresponding to each bootstrap sample.

This CI is considered to be more reliable than the ones described earlier. However, there is a catch, and this catch is the estimate $\widehat{se}^*$ of the standard deviation of each $\hat\theta^*$. This estimate is not easily obtained. You can get it either parametrically (but then you need a model, which you probably don't have) or through re-sampling. This means that you have to do an extra bootstrap for each of the original bootstrap samples. That is, you will have a loop within a loop, and it can become very heavy computationally.

## 3.6 Software

Will be done in R.
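The loop-within-a-loop can be written out directly. A minimal Python sketch (the function name `studentized_ci` and the sizes $N = 500$, $M = 50$ are illustrative assumptions; the inner loop makes this roughly $M$ times the cost of a plain bootstrap):

```python
import numpy as np

rng = np.random.default_rng(5)

def studentized_ci(data, stat, N=500, M=50, alpha=0.05, rng=rng):
    """Studentized bootstrap CI: an inner bootstrap (M resamples)
    estimates the standard error of each outer replicate."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = stat(data)
    theta_star = np.empty(N)
    t_star = np.empty(N)
    for i in range(N):                               # outer bootstrap
        x_star = rng.choice(data, n, replace=True)
        theta_star[i] = stat(x_star)
        inner = [stat(rng.choice(x_star, n, replace=True))
                 for _ in range(M)]                  # inner bootstrap for se*
        se_star = np.std(inner, ddof=1)
        t_star[i] = (theta_star[i] - theta_hat) / se_star
    se_hat = theta_star.std(ddof=1)
    w_lo, w_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    # Upper w-quantile gives the lower limit, and vice versa.
    return theta_hat - w_hi * se_hat, theta_hat - w_lo * se_hat

data = rng.normal(0.0, 1.0, 40)  # toy data set (assumed)
ci = studentized_ci(data, np.mean)
```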
Functions you may find useful: `sample`, `replicate`, `rgamma`, `boot`, `boot.ci`, `dgamma`, `nlm`.

## 3.7 Laboration: Bootstrap re-sampling

The aim of this laboration is to study the performance of bootstrap estimates, as well as the corresponding variance, bias and CIs, using different bootstrap techniques.

Construct studentized confidence intervals for the variance. Check the coverage as in the above. Any improvement? You can construct the studentized CI either from scratch with your own code or using `boot` and `boot.ci`. If you choose the latter, make sure that the statistic function that you have to specify in `boot` returns two values: $\hat\theta^*$ and $\mathrm{Var}(\hat\theta^*)$.

In order to pass the lab, task 1 has to be completed.
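A coverage check like the one asked for can be organised as an outer simulation loop around the CI construction. A sketch in Python (the Normal data with true variance 4, the percentile interval, and the repetition counts are illustrative assumptions, not the lab's actual setup):

```python
import numpy as np

rng = np.random.default_rng(6)

def percentile_ci_var(data, N=400, alpha=0.05, rng=rng):
    """Percentile bootstrap CI for the variance of the data."""
    n = len(data)
    v_star = np.array([rng.choice(data, n, replace=True).var(ddof=1)
                       for _ in range(N)])
    return np.quantile(v_star, [alpha / 2, 1 - alpha / 2])

# Repeat the whole experiment and count how often the interval
# covers the true variance.
true_var = 4.0
reps, hits = 200, 0
for _ in range(reps):
    data = rng.normal(0.0, 2.0, 30)
    lo, hi = percentile_ci_var(data)
    hits += (lo <= true_var <= hi)
print(hits / reps)  # empirical coverage to compare with the nominal 95%
```

Comparing this empirical coverage against the same loop with a studentized interval is one way to answer the "any improvement?" question.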


Bootstrapped and Means Trimmed One Way ANOVA and Multiple Comparisons in R Another way to do a bootstrapped one-way ANOVA is to use Rand Wilcox s R libraries. Wilcox (2012) states that for one-way ANOVAs,

### Statistics 202: Data Mining. c Jonathan Taylor. Outliers Based in part on slides from textbook, slides of Susan Holmes.

Outliers Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Concepts What is an outlier? The set of data points that are considerably different than the remainder of the

### Monte Carlo Analysis

Monte Carlo Analysis Andrew Q. Philips* February 7, 27 *Ph.D Candidate, Department of Political Science, Texas A&M University, 2 Allen Building, 4348 TAMU, College Station, TX 77843-4348. aphilips@pols.tamu.edu.

### Lecture Notes on Hash Tables

Lecture Notes on Hash Tables 15-122: Principles of Imperative Computation Frank Pfenning Lecture 13 February 24, 2011 1 Introduction In this lecture we introduce so-called associative arrays, that is,

### VARIANCE REDUCTION TECHNIQUES IN MONTE CARLO SIMULATIONS K. Ming Leung

POLYTECHNIC UNIVERSITY Department of Computer and Information Science VARIANCE REDUCTION TECHNIQUES IN MONTE CARLO SIMULATIONS K. Ming Leung Abstract: Techniques for reducing the variance in Monte Carlo

### Chapter 3 - Displaying and Summarizing Quantitative Data

Chapter 3 - Displaying and Summarizing Quantitative Data 3.1 Graphs for Quantitative Data (LABEL GRAPHS) August 25, 2014 Histogram (p. 44) - Graph that uses bars to represent different frequencies or relative

### Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

### Monte Carlo Integration

Lab 18 Monte Carlo Integration Lab Objective: Implement Monte Carlo integration to estimate integrals. Use Monte Carlo Integration to calculate the integral of the joint normal distribution. Some multivariable

### 2) In the formula for the Confidence Interval for the Mean, if the Confidence Coefficient, z(α/2) = 1.65, what is the Confidence Level?

Pg.431 1)The mean of the sampling distribution of means is equal to the mean of the population. T-F, and why or why not? True. If you were to take every possible sample from the population, and calculate

### Expectation-Maximization. Nuno Vasconcelos ECE Department, UCSD

Expectation-Maximization Nuno Vasconcelos ECE Department, UCSD Plan for today last time we started talking about mixture models we introduced the main ideas behind EM to motivate EM, we looked at classification-maximization

### Distribution-free Predictive Approaches

Distribution-free Predictive Approaches The methods discussed in the previous sections are essentially model-based. Model-free approaches such as tree-based classification also exist and are popular for

### Standard Errors in OLS Luke Sonnet

Standard Errors in OLS Luke Sonnet Contents Variance-Covariance of ˆβ 1 Standard Estimation (Spherical Errors) 2 Robust Estimation (Heteroskedasticity Constistent Errors) 4 Cluster Robust Estimation 7

### arxiv: v1 [stat.me] 29 May 2015

MIMCA: Multiple imputation for categorical variables with multiple correspondence analysis Vincent Audigier 1, François Husson 2 and Julie Josse 2 arxiv:1505.08116v1 [stat.me] 29 May 2015 Applied Mathematics

### Continuous Improvement Toolkit. Normal Distribution. Continuous Improvement Toolkit.

Continuous Improvement Toolkit Normal Distribution The Continuous Improvement Map Managing Risk FMEA Understanding Performance** Check Sheets Data Collection PDPC RAID Log* Risk Analysis* Benchmarking***

### MA651 Topology. Lecture 4. Topological spaces 2

MA651 Topology. Lecture 4. Topological spaces 2 This text is based on the following books: Linear Algebra and Analysis by Marc Zamansky Topology by James Dugundgji Fundamental concepts of topology by Peter

### Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys

Unit 7 Statistics AFM Mrs. Valentine 7.1 Samples and Surveys v Obj.: I will understand the different methods of sampling and studying data. I will be able to determine the type used in an example, and

### Lecture 14. December 19, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

geometric Lecture 14 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University December 19, 2007 geometric 1 2 3 4 geometric 5 6 7 geometric 1 Review about logs

### Cryptography and Network Security. Prof. D. Mukhopadhyay. Department of Computer Science and Engineering. Indian Institute of Technology, Kharagpur

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module No. # 01 Lecture No. # 38 A Tutorial on Network Protocols

### What is Process Capability?

6. Process or Product Monitoring and Control 6.1. Introduction 6.1.6. What is Process Capability? Process capability compares the output of an in-control process to the specification limits by using capability

### Model selection and validation 1: Cross-validation

Model selection and validation 1: Cross-validation Ryan Tibshirani Data Mining: 36-462/36-662 March 26 2013 Optional reading: ISL 2.2, 5.1, ESL 7.4, 7.10 1 Reminder: modern regression techniques Over the

### Optimal designs for comparing curves

Optimal designs for comparing curves Holger Dette, Ruhr-Universität Bochum Maria Konstantinou, Ruhr-Universität Bochum Kirsten Schorning, Ruhr-Universität Bochum FP7 HEALTH 2013-602552 Outline 1 Motivation