Lecture 3 - Object-oriented programming and statistical programming examples


Lecture 3 - Object-oriented programming and statistical programming examples Björn Andersson (w/ Ronnie Pingel) Department of Statistics, Uppsala University February 1, 2013

Table of Contents 1 Some notes on object-oriented programming

R objects revisited
- R includes many different classes of objects, each with its own structure and properties
- R's built-in way of handling objects of different types makes it possible to write generic functions that operate differently depending on the class of their arguments; this simplifies usage and makes R flexible
- You can create your own object classes to suit your needs, and write functions tailored for these objects that only work for arguments of the correct class; this ensures that functions are not used improperly

A few important definitions
- A class determines what an object is made of (vectors, matrices, formulas, etc.)
- A generic function directs an object to a method depending on the object's class; the operations in that method are then applied to the object (plot(), summary() and print() are examples of generic functions)
- A method is the set of operations that a generic function dispatches to; you cannot call a method itself the way you use an ordinary function
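The S3 dispatch mechanism behind generics such as plot() and summary() can be sketched in a few lines; the generic mysummary and the class name myclass below are made-up names for illustration:

```r
# A generic function: it only dispatches on the class of its argument
mysummary <- function(x, ...) UseMethod("mysummary")

# The method invoked for objects of class "myclass"
mysummary.myclass <- function(x, ...) {
  cat("An object of class myclass with", length(x$values), "values\n")
}

# A fallback method for all other classes
mysummary.default <- function(x, ...) summary(x)

obj <- structure(list(values = c(1, 2, 3)), class = "myclass")
mysummary(obj)        # dispatches to mysummary.myclass
mysummary(c(1, 2))    # dispatches to mysummary.default
```

Calling the generic, not the method, is the intended usage.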

Why use classes?
- By using classes you can accommodate more types of data than is possible with e.g. data frames or lists
- You can define what each slot of the class must contain, and it then becomes impossible to assign other types of data to these slots
- You maintain a higher level of certainty that the object is what it is supposed to be, so any analysis you make is (in general) more trustworthy
- In e.g. a data frame you can manipulate the data as much as you wish; classes allow for restrictions, which are often useful
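Typed slots of this kind are provided by formal (S4) classes; a minimal sketch, with the class name Sample made up for illustration:

```r
library(methods)

# Each slot is declared with a type; assigning data of any other
# type to a slot is an error, not a silent coercion
setClass("Sample", slots = c(values = "numeric", label = "character"))

s <- new("Sample", values = c(0.3, 1.7, -0.5), label = "demo")
s@label                          # slots are accessed with @
# new("Sample", values = "abc") # error: 'values' must be numeric
```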

Why use methods?
- Methods provide a way to use R more easily
- A generic function allows different things to be executed depending on the object class: plot() does something different for a glm object than for an object of class matrix
- When writing a package you can create your own generic functions, which makes usage and development simpler
- You can extract information from the different objects you have created in a consistent way
- You can ease your own programming by providing methods for functions only used inside the code

How I use classes and methods
- In the kequate package I added a few classes to ease computations within the package
- I also added a new class for the output from the main function; methods for this class were added for the functions plot() and summary()
- Methods are also provided for other functions in the package, which allow for comparisons


The bootstrap The idea of the bootstrap By bootstrapping we mean the act of resampling from a random sample that we have observed and drawing conclusions about an estimator based on these resamples. In a sense you pull yourself up by your bootstraps: you do something which should not be possible. But bootstrapping actually works!

The bootstrap Words from the originator of the bootstrap I also wish to thank the many friends who suggested names more colorful than Bootstrap, including Swiss Army Knife, Meat Axe, Swan-Dive, Jack-Rabbit, and my personal favorite, the Shotgun, which to paraphrase Tukey, can blow the head off any problem if the statistician can stand the resulting mess. - Efron (1979)

The bootstrap The idea of the bootstrap The reasoning behind the bootstrap is largely as follows:
- Your observed sample is a random instantiation from the population of interest
- Therefore, a random sample drawn from your sample can be viewed as a random sample from the population of interest
- As such, the distribution of an estimator of a parameter of interest can be estimated by calculating the estimate for each bootstrap sample

The bootstrap Bootstrap vs MC Bootstrapping and Monte Carlo simulation are both based on repeated sampling. What is the difference?
- Monte Carlo simulation: data are generated with known values of the parameters; used to test-drive estimators.
- Bootstrapping: uses the original, observed sample as the population from which to resample; you can estimate the variability of a statistic and the shape of its sampling distribution.
The bootstrap has had a considerable impact on statistics and offers a new way to find standard errors and confidence intervals.
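The contrast can be made concrete in a few lines of R; the sample size 30 and the N(5, 2^2) population are illustrative choices:

```r
set.seed(2)

# Monte Carlo: draw many fresh samples from a KNOWN distribution
mc <- replicate(1000, mean(rnorm(30, mean = 5, sd = 2)))
sd(mc)   # estimates the SE of the mean; the true value is 2/sqrt(30)

# Bootstrap: resample with replacement from ONE observed sample
x <- rnorm(30, mean = 5, sd = 2)
bs <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(bs)   # estimates the same SE without knowing the population
```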

The bootstrap Estimating the standard error of a statistic using the bootstrap We find the bootstrap estimate from the following steps:
1. We have a random sample X of size n and a statistic s(X).
2. We draw a random sample X* of size n with replacement from X.
3. We repeat step 2 to obtain B independent bootstrap samples and calculate the statistic s(X*_i) for each bootstrap sample i = 1, ..., B.
4. The bootstrap estimate of the standard error of the statistic s(X) is then the standard deviation of the bootstrap sample statistics:

\widehat{SE}_B[s(X)] = \left[ \frac{1}{B-1} \sum_{i=1}^{B} \left( s(X^*_i) - \frac{1}{B} \sum_{i=1}^{B} s(X^*_i) \right)^2 \right]^{1/2}

The bootstrap Asymptotic results We define the ideal bootstrap estimate of the standard error of a statistic as the weighted standard deviation over the m distinct bootstrap values s(z_j), with w_j the probability of bootstrap sample z_j:

\hat{se}_{\hat{F}}(s(X)) = \left[ \sum_{j=1}^{m} w_j \left( s(z_j) - \sum_{j=1}^{m} w_j s(z_j) \right)^2 \right]^{1/2}

This is not tractable to compute. However, it can be shown that

\lim_{B \to \infty} \widehat{SE}_B[s(X)] = \hat{se}_{\hat{F}}(s(X)).

So the bootstrap works!

The bootstrap The bootstrap method using R sample() draws a random sample from a vector, with or without replacement.
> sample(1:10)
 [1] 10  5  8  3  6  7  9  4  1  2
> sample(1:10, replace=TRUE)
 [1]  1  8  5  2  2  3  2  8  2 10
Read help(sample) for details. Remember that you can select rows in a data frame like:
> testdata <- data.frame(A=runif(10), B=rpois(10, 10),
+                        C=rbinom(10, 1, 0.5))
> testdata[c(3,1),]
          A  B C
3 0.2119918 11 0
1 0.3012755 11 0

The bootstrap The bootstrap method using R The basic bootstrap is very easy to implement in R. We write a simple function to calculate the bootstrap estimate of the standard error of the mean:
> bootstrapsemean <- function(x, B){
+   res <- numeric(B)
+   for(i in 1:B)
+     res[i] <- mean(sample(x, replace=TRUE))
+   res <- sd(res)
+   return(res)
+ }

The bootstrap The bootstrap method using R
> ttx <- rnorm(100)
> bootstrapsemean(ttx, 10)
[1] 0.08158119
> bootstrapsemean(ttx, 1000)
[1] 0.1006367
We have X_i ~ N(0, 1) and independent for all i. Hence:

\mathrm{Var}(\bar{X}) = \mathrm{Var}\left( \frac{\sum_{i=1}^{n} X_i}{n} \right) = \frac{\sum_{i=1}^{n} \mathrm{Var}(X_i)}{n^2} = \frac{n}{n^2} = \frac{1}{n}.

We note that \sqrt{1/100} = 0.1.

The bootstrap The bootstrap method using R Of course, in the case of the sample mean we do not need the bootstrap estimate of the variance since it is readily available. However, in many situations we do not have a way of finding the variance of a statistic. In many such cases the bootstrap works.
> bootstrapsemedian <- function(x, B){
+   res <- numeric(B)
+   for(i in 1:B)
+     res[i] <- median(sample(x, replace=TRUE))
+   res <- sd(res)
+   return(res)
+ }
> bootstrapsemedian(ttx, 1000)
[1] 0.191876

The bootstrap The bootstrap method using R In Assignment 1, I will ask you to write a function for the bootstrap of a particular statistic which depends on two variables. The function should: Have input arguments such that you can specify a data frame containing the data and the number of replications to be used Calculate the estimate of the statistic and its bootstrap standard error Provide a suitable output of the estimate of the statistic and the standard error of the statistic I will also ask you to provide plots of the distribution for the bootstrapped statistic.

The bootstrap The bootstrap sometimes fails
- If the support of the random variable X depends on the parameter θ you want to estimate, and s(X) is the estimator, then the bootstrap may fail: for example an R.V. X such that X ~ U(0, θ)
- If certain regularity conditions are violated then the bootstrap fails
- These conditions are however not as strict as those required by e.g. the Delta method (asymptotic approximation using a Taylor expansion)
- The matching estimator used in causal inference is an example where the bootstrap fails.
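The U(0, θ) failure can be seen directly: resampling puts a large, non-vanishing probability on the observed maximum itself, so the bootstrap distribution of the MLE max(x) cannot mimic the true sampling distribution of the estimator. A quick sketch (θ = 1 and n = 100 are illustrative choices):

```r
set.seed(1)
n <- 100
x <- runif(n)          # a sample from U(0, theta) with theta = 1
theta_hat <- max(x)    # the MLE of theta

# How often does a bootstrap replicate reproduce the observed maximum?
B <- 2000
hits <- replicate(B, max(sample(x, replace = TRUE)) == theta_hat)
mean(hits)  # close to 1 - (1 - 1/n)^n, about 0.63, rather than 0
```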

The bootstrap How many bootstrap replications? As many as you have time for!
- Rule of thumb: 50-200
- Use system.time() to check how fast the bootstrap runs and choose a reasonable number
- For many problems you will, however, need more than 1000 replications
- For the statistics used in this presentation we can choose a very large number of replications without any problems
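A sketch of the timing approach (the sample size and replication count below are arbitrary):

```r
x <- rnorm(1000)

# Time B = 2000 bootstrap replications of the mean; scale the
# elapsed time to budget how many replications you can afford
timing <- system.time({
  res <- replicate(2000, mean(sample(x, replace = TRUE)))
})
timing["elapsed"]  # wall-clock seconds for 2000 replications
```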

The bootstrap Kernel equating: a bootstrap example
- Equating is a statistical method used in educational measurement to ensure that the results of standardized testing are comparable
- Kernel equating is a special type of equating which uses a Gaussian kernel to calculate the equating function
- Kernel equating requires the selection of a bandwidth
- Problem: we do not have a way to derive the analytical standard errors of equating under the most commonly used bandwidth selection method
- The bootstrap can be used in this case! The bootstrap shows that the influence of the bandwidth selection is very small; the currently used analytical standard errors are in fact a decent approximation

Some useful statistical functions in R Included functions for common distributions
- Generate random numbers: rnorm(n, mean=0, sd=1); rpois(), rbinom(), rchisq(), etc.
- Density function/probability mass function: dnorm(x, mean=0, sd=1); dpois(), dbinom(), dchisq(), etc.
- Distribution function: pnorm(q, mean=0, sd=1); ppois(), pbinom(), pchisq(), etc.
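The three prefixes, plus the quantile function q* (not listed above but part of the same family), for the normal distribution:

```r
set.seed(3)
rnorm(3, mean = 0, sd = 1)  # three random draws from N(0, 1)
dnorm(0)                    # density at 0: 1/sqrt(2*pi), about 0.399
pnorm(1.96)                 # P(X <= 1.96), about 0.975
qnorm(0.975)                # inverts pnorm: about 1.96
```

The same pattern holds for every built-in distribution: rpois/dpois/ppois/qpois, rbinom/dbinom/pbinom/qbinom, and so on.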

Some useful statistical functions in R Some more plotting functions
- hist() plots a histogram of your data
- qqnorm() plots the sample quantiles of a data vector against the quantiles of a normal distribution
- The function density() calculates a density estimate of your data, which can then be plotted: you can write plot(density(x)), where x is the vector of data points
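The three plots side by side, on simulated data (qqline(), which adds a reference line to the Q-Q plot, is an addition not mentioned above):

```r
set.seed(4)
x <- rnorm(200)

par(mfrow = c(1, 3))   # three panels in one plotting window
hist(x)                # histogram
qqnorm(x); qqline(x)   # normal Q-Q plot with reference line
plot(density(x))       # kernel density estimate
```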

Generalized linear models Fitting generalized linear models in R Using the function glm() in R an array of linear models can be fitted. glm() has many arguments, the most important of which are:
- formula - the form of the model specified, e.g. y~x+z+x:z
- family - the error distribution and link function used, e.g. gaussian, poisson, binomial etc. (defaults to gaussian)
- data - a data frame (not required)
Gaussian linear model:
> x <- rnorm(100)
> y <- 1.2 * x + rnorm(100)
> glmgauss <- glm(y~x)

Generalized linear models Fitting generalized linear models in R
> glmgauss

Call:  glm(formula = y ~ x)

Coefficients:
(Intercept)            x
     0.2812       1.2840

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      233
Residual Deviance: 77.79        AIC: 264.7

Generalized linear models Fitting generalized linear models in R The fitted values are stored in the glm object as fitted.values. The observed values are stored as y. > gaussfitted <- glmgauss$fitted.values > gaussobs <- glmgauss$y You can choose to also save the design matrix (i.e. the explanatory variables) if specifying x=true in the glm() function call.

Generalized linear models Fitting generalized linear models in R: data frames With data frames you can easily specify models with glm(). > z <- rnorm(100) > xyz <- data.frame(x=x, y=y, z=z) > glmxyz <- glm(x~., data=xyz) When you write x~. you use x as the response and the rest of the variables in the data frame as explanatory variables.

Generalized linear models Automatic model selection in R In Assignment 1 I will ask you to write a function which automatically selects the best generalized linear model for an arbitrary response variable according to some criterion. The criteria are

AIC = 2p - 2 log(L)  and  BIC = log(n) p - 2 log(L),

where p is the number of parameters in the model, L is the maximized likelihood and n is the sample size.
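Both criteria can be computed by hand from a fitted glm object and checked against the built-ins AIC() and BIC():

```r
set.seed(42)
x <- rnorm(100)
y <- 1.2 * x + rnorm(100)
fit <- glm(y ~ x)

# p: number of estimated parameters (for the gaussian family this
# includes the dispersion, which is why logLik's df is used)
p <- attr(logLik(fit), "df")
n <- length(y)
ll <- as.numeric(logLik(fit))

aic_manual <- 2 * p - 2 * ll
bic_manual <- log(n) * p - 2 * ll

all.equal(aic_manual, AIC(fit))  # TRUE
all.equal(bic_manual, BIC(fit))  # TRUE
```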

Generalized linear models Automatic model selection in R The function step() in R can be used to stepwise search for the best model with respect to some criterion. If you provide a glm object to step() the function will default to provide the best model using a backward search starting with the full model specified. Read the help file!
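A sketch of a backward search with step(); the data frame and variable names below are made up, and trace = 0 just silences the per-step printout:

```r
set.seed(1)
d <- data.frame(a = rnorm(50), b = rnorm(50))
d$y <- 2 * d$a + rnorm(50)          # y depends on a but not on b

full <- glm(y ~ a + b, data = d)    # start from the full model
best <- step(full, trace = 0)       # backward search by AIC (default)
formula(best)                       # b is a candidate for removal
```

Passing k = log(n) to step() makes the search use BIC instead of AIC.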


Some tips If you get stuck or get an error message you don't understand, read the help files for the function or Google your error message. Use online manuals such as Quick-R (http://www.statmethods.net/).

Group presentation of assignments I decided to generate a random sequence of the integers from 1 to 8, where the first number in the sequence corresponds to presenting Exercise 1, the second to being the discussant for Exercise 1, and so on. I generated the random numbers from http://www.random.org. The site uses atmospheric data as its source of randomness. I retrieved the following sequence of the integers from 1 to 8: 6 1 5 7 2 4 3 8. The R package random has features to detect if a sequence is not random, if you want to check it (though this sequence is likely too short for that). See the schedule of the seminar for the full list!

Next time Today 16.15-18.00 I will not go through any more new material but will rather be available for questions. An opportunity for you to work on the exercises in Assignment 1 and the report for said assignment.