Lecture 3 - Object-oriented programming and statistical programming examples Björn Andersson (w/ Ronnie Pingel) Department of Statistics, Uppsala University February 1, 2013
Table of Contents 1 Some notes on object-oriented programming
R objects revisited R includes many different classes of objects, each with a unique structure and unique properties. Using the built-in way of handling objects of different types, you can write generic functions which operate differently depending on the class of their arguments; this simplifies usage and makes R flexible. You can also create your own object classes which suit your needs, and you can write functions tailored for these objects that only work for arguments of the correct class; this ensures that functions are not used improperly.
A few important definitions A class determines what an object is supposed to be made of (vectors, matrices, formulas etc.) A generic function is a function which directs an object to a method depending on the object's class, and then operations are made on the object using this method (plot(), summary(), print() are examples of generic functions) A method is the set of operations directed to by a generic function. You cannot use a method itself the way you use a function
Why use classes? By using classes you can accommodate more types of data than is possible using e.g. data frames or lists You can define what each slot in the class is to contain, and it will then be impossible to assign other types of data to these slots You maintain a higher level of certainty that the object is what it is supposed to be, and as a result any analysis you make is (in general) more trustworthy In e.g. a data frame you can manipulate the data as much as you wish; classes allow for restrictions, which are often useful
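The points above can be sketched with a minimal S4 class; the class name "scores" and its slots are hypothetical, chosen only for illustration. Each slot is declared with a type, and R refuses to assign data of the wrong type to it:

```r
## A minimal S4 class with typed slots (hypothetical class "scores").
## setClass() declares what each slot must contain.
setClass("scores", representation(values = "numeric", label = "character"))

## Creating a valid object and accessing a slot with @
s <- new("scores", values = c(1.5, 2.0, 3.5), label = "test A")
s@values

## In contrast, new("scores", values = "not numeric") would raise an
## error at creation time, unlike a data frame or list, which would
## silently accept the wrong type.
```

This is the kind of restriction a plain data frame cannot enforce: the check happens when the object is built, not when an analysis later fails.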
Why use methods? Methods provide a way to use R more easily A generic function allows for different things to be executed depending on the object class plot() does something different for a glm object compared to an object of class matrix When writing a package you can create your own generic functions which make usage and development simpler Extract information from different objects you have created in a consistent way Ease your own programming by providing methods for functions only seen inside the code
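A small S3 sketch of the dispatch mechanism described above; the class name "expsummary" and the helper function are hypothetical, used only to illustrate how a generic directs an object to a method:

```r
## A constructor that tags its result with a (hypothetical) class.
makesummary <- function(x) {
  res <- list(mean = mean(x), sd = sd(x), n = length(x))
  class(res) <- "expsummary"
  res
}

## print() is a generic function: when given an object of class
## "expsummary" it dispatches to this method automatically.
print.expsummary <- function(x, ...) {
  cat("n =", x$n, " mean =", round(x$mean, 3),
      " sd =", round(x$sd, 3), "\n")
  invisible(x)
}

makesummary(rnorm(50))   # auto-printing uses print.expsummary
```

The same pattern underlies plot() behaving differently for glm objects and matrices: one generic, many class-specific methods.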
How I use classes and methods In the kequate package I added a few classes to ease computations within the package. I also added a new class for the output from the main function. Methods for this class were added for the functions plot() and summary(). Methods are also provided for other functions in the package which allow for comparisons
The bootstrap The idea of the bootstrap By bootstrapping we mean the act of resampling from a random sample that we have observed and drawing conclusions about an estimator based on these resamples. In a sense you pull yourself up by your bootstraps: you do something which is not possible. But bootstrapping actually works!
The bootstrap Words from the originator of the bootstrap I also wish to thank the many friends who suggested names more colorful than Bootstrap, including Swiss Army Knife, Meat Axe, Swan-Dive, Jack-Rabbit, and my personal favorite, the Shotgun, which to paraphrase Tukey, can blow the head off any problem if the statistician can stand the resulting mess. - Efron (1979)
The bootstrap The idea of the bootstrap The reasoning behind the bootstrap is largely as follows: Your observed sample is a random instantiation from the population of interest Therefore, a random sample from your sample can be viewed as a random sample from the population of interest As such, the distribution of an estimator for a parameter of interest can be estimated by calculating the estimate for each bootstrap sample
The bootstrap Bootstrap vs MC Bootstrapping and Monte Carlo simulation are both based on repetitive sampling. What is the difference? Monte Carlo simulation: data generation with known values of the parameters, used to test-drive estimators. Bootstrapping: uses the original, observed sample as the population from which to resample; you can estimate the variability of the statistic and the shape of its sampling distribution. The bootstrap has had a considerable impact on statistics and offers a new way to find standard errors and confidence intervals.
The bootstrap Estimating the standard error of a statistic using the bootstrap
We find the bootstrap estimate from the following steps:
1. We have a random sample X of size n and a statistic s(x).
2. We draw a random sample X* of size n with replacement from X.
3. We repeat step 2 to obtain B independent bootstrap samples and calculate the statistic s(x*_i) for each bootstrap sample.
4. The bootstrap estimate of the standard error of the statistic s(x) is then the standard deviation of the bootstrap sample statistics:
\[
\widehat{SE}_B[s(x)] = \left[ \frac{1}{B - 1} \sum_{i=1}^{B} \left( s(x^*_i) - \frac{1}{B} \sum_{i=1}^{B} s(x^*_i) \right)^2 \right]^{1/2}
\]
The bootstrap Asymptotic results
We define the ideal bootstrap estimate of the standard error of a statistic as the standard deviation over all m possible bootstrap samples z_j, where w_j is the probability of drawing sample z_j:
\[
\widehat{se}_F(s(x)) = \left[ \sum_{j=1}^{m} w_j \left( s(z_j) - \sum_{j=1}^{m} w_j s(z_j) \right)^2 \right]^{1/2}
\]
This is not tractable to compute. However, it can be shown that
\[
\lim_{B \to \infty} \widehat{SE}_B[s(x)] = \widehat{se}_F(s(x)).
\]
So the bootstrap works!
The bootstrap The bootstrap method using R
sample() draws a random sample from a vector with or without replacement.
> sample(1:10)
 [1] 10  5  8  3  6  7  9  4  1  2
> sample(1:10, replace=TRUE)
 [1]  1  8  5  2  2  3  2  8  2 10
Read help(sample) for details. Remember that you can select rows in a data frame like:
> testdata <- data.frame(A=runif(10), B=rpois(10, 10),
+                        C=rbinom(10, 1, 0.5))
> testdata[c(3,1),]
          A  B C
3 0.2119918 11 0
1 0.3012755 11 0
The bootstrap The bootstrap method using R
The basic bootstrap is very easy to implement in R. We write a simple function to calculate the bootstrap estimate of the standard error of the mean:
> bootstrapsemean <- function(x, B){
+   res <- numeric(B)
+   for(i in 1:B)
+     res[i] <- mean(sample(x, replace=TRUE))
+   res <- sd(res)
+   return(res)
+ }
The bootstrap The bootstrap method using R
> ttx <- rnorm(100)
> bootstrapsemean(ttx, 10)
[1] 0.08158119
> bootstrapsemean(ttx, 1000)
[1] 0.1006367
We have X_i ~ N(0, 1), independent for all i. Hence:
\[
Var(\bar{X}) = Var\left( \frac{\sum_{i=1}^{n} X_i}{n} \right) = \frac{\sum_{i=1}^{n} Var(X_i)}{n^2} = \frac{\sum_{i=1}^{n} 1}{n^2} = \frac{1}{n}.
\]
We note that \(\sqrt{1/100} = 0.1\).
The bootstrap The bootstrap method using R
Of course, in the case of the sample mean we do not need the bootstrap estimate of the variance since it is readily available. However, in many situations we do not have a way of finding the variance of a statistic. In many such cases the bootstrap works.
> bootstrapsemedian <- function(x, B){
+   res <- numeric(B)
+   for(i in 1:B)
+     res[i] <- median(sample(x, replace=TRUE))
+   res <- sd(res)
+   return(res)
+ }
> bootstrapsemedian(ttx, 1000)
[1] 0.191876
The bootstrap The bootstrap method using R In Assignment 1, I will ask you to write a function for the bootstrap of a particular statistic which depends on two variables. The function should: Have input arguments such that you can specify a data frame containing the data and the number of replications to be used Calculate the estimate of the statistic and its bootstrap standard error Provide a suitable output of the estimate of the statistic and the standard error of the statistic I will also ask you to provide plots of the distribution for the bootstrapped statistic.
The bootstrap The bootstrap sometimes fails If the support of the random variable X depends on the parameter θ you want to estimate, and s(x) is the estimator, then the bootstrap may fail, for example an R.V. X such that X ~ U(0, θ) If certain regularity conditions are violated then the bootstrap fails These conditions are however not as strict as those required by e.g. the Delta method (asymptotic approximation using Taylor expansion) The matching estimator used in causal inference is an example of when the bootstrap fails.
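The U(0, θ) case above can be demonstrated directly. A sketch (sample size and seed chosen arbitrarily): the natural estimator of θ is the sample maximum, and a large share of bootstrap replicates reproduce the observed maximum exactly, so the bootstrap distribution is badly discrete and does not mimic the sampling distribution of the maximum:

```r
## Demo: the bootstrap fails for the sample maximum when X ~ U(0, theta).
set.seed(1)
x <- runif(100, 0, 5)            # true theta = 5, n = 100
B <- 2000
bootmax <- numeric(B)
for (i in 1:B)
  bootmax[i] <- max(sample(x, replace = TRUE))

## The probability that a bootstrap sample contains the observed maximum
## is 1 - (1 - 1/n)^n, roughly 0.63, so the bootstrap distribution has a
## large point mass at max(x):
mean(bootmax == max(x))
```

The true sampling distribution of the maximum has no such point mass, which is why the bootstrap standard error and intervals are unreliable here.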
The bootstrap How many bootstrap replications? As many as you have time for! Rule of thumb: 50-200 Use system.time() to check how fast the bootstrap runs and choose a reasonable number For many problems you will however need more than 1000 replications For the statistics used in the presentation we can choose a very large number of replications without any problems
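The system.time() check suggested above can look like the following sketch (the statistic and sample size are arbitrary choices for illustration):

```r
## Timing a bootstrap run to choose a feasible number of replications.
set.seed(1)
x <- rnorm(1000)

timeboot <- function(B) {
  system.time({
    res <- numeric(B)
    for (i in 1:B)
      res[i] <- median(sample(x, replace = TRUE))
  })["elapsed"]
}

timeboot(100)    # cost grows roughly linearly in B, so 10x the
                 # replications costs roughly 10x the time
```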
The bootstrap Kernel equating: a bootstrap example Equating is a statistical method used in educational measurement to ensure that the results of standardized testing are comparable Kernel equating is a special type of equating using a Gaussian kernel to calculate the equating function Kernel equating requires the selection of a bandwidth Problem: we do not have a way to derive the analytical standard errors of equating when considering the most commonly used bandwidth selection The bootstrap can be used in this case! The bootstrap shows that the influence of the bandwidth selection is very small - the currently used analytical standard errors are in fact a decent approximation
Some useful statistical functions in R Included functions for common distributions Generate random numbers rnorm(n, mean=0, sd=1) rpois(), rbinom(), rchisq() etc. Density function/probability mass function dnorm(x, mean=0, sd=1) dpois(), dbinom(), dchisq() etc. Distribution function pnorm(q, mean=0, sd=1) ppois(), pbinom(), pchisq() etc.
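The prefix pattern above is uniform across distributions; a quick sketch for the normal distribution (R also provides a q* quantile function, the inverse of the p* function, not listed on the slide):

```r
dnorm(0)         # density at 0: 1/sqrt(2*pi), about 0.3989
pnorm(1.96)      # distribution function: P(X <= 1.96), about 0.975
qnorm(0.975)     # quantile function, inverse of pnorm: about 1.96
rnorm(3)         # three random draws from N(0, 1)

## Same pattern for other distributions:
dpois(2, lambda = 10)    # P(X = 2) for Poisson(10)
pbinom(4, 10, 0.5)       # P(X <= 4) for Binomial(10, 0.5)
```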
Some useful statistical functions in R Some more plotting functions hist() plots a histogram of your data qqnorm() plots the sample quantiles of a data vector and compares them to the normal case The function density() calculates the density of your data, which can then be plotted You can write: plot(density(x)), where x is the vector of data points
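The three plotting functions above in one short sketch (qqline(), which adds a reference line to the Q-Q plot, is an extra not mentioned on the slide):

```r
set.seed(1)
x <- rnorm(200)

hist(x)                  # histogram of the data
qqnorm(x); qqline(x)     # sample quantiles vs normal quantiles
plot(density(x))         # kernel density estimate of the data
```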
Generalized linear models Fitting generalized linear models in R Using the function glm() in R an array of linear models can be fitted. glm() has many arguments, the most important of which are: formula - the form of the model specified, e.g. y~x+z+x:z family - the link function used, e.g. gaussian, poisson, binomial etc. (defaults to gaussian) data - a data frame (not required) Gaussian linear model: > x <- rnorm(100) > y <- 1.2 * x + rnorm(100) > glmgauss <- glm(y~x)
Generalized linear models Fitting generalized linear models in R
> glmgauss

Call:  glm(formula = y ~ x)

Coefficients:
(Intercept)            x
     0.2812       1.2840

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      233
Residual Deviance: 77.79    AIC: 264.7
Generalized linear models Fitting generalized linear models in R The fitted values are stored in the glm object as fitted.values. The observed values are stored as y. > gaussfitted <- glmgauss$fitted.values > gaussobs <- glmgauss$y You can choose to also save the design matrix (i.e. the explanatory variables) if specifying x=true in the glm() function call.
Generalized linear models Fitting generalized linear models in R: data frames With data frames you can easily specify models with glm(). > z <- rnorm(100) > xyz <- data.frame(x=x, y=y, z=z) > glmxyz <- glm(x~., data=xyz) When you write x~. you use x as the response and the rest of the variables in the data frame as explanatory variables.
Generalized linear models Automatic model selection in R In Assignment 1 I will ask you to write a function which automatically selects the best generalized linear model for an arbitrary response variable according to some criterion. The criteria are AIC = 2p − 2 log(L) and BIC = log(n)p − 2 log(L), where p is the number of parameters in the model and n is the sample size.
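The two criteria can be computed by hand from a fitted glm and checked against R's built-in AIC() and BIC() functions; a sketch with simulated data (the same Gaussian model as earlier in the lecture):

```r
set.seed(1)
x <- rnorm(100)
y <- 1.2 * x + rnorm(100)
fit <- glm(y ~ x)

## p: number of estimated parameters (for a Gaussian glm this
## includes the dispersion parameter), taken from logLik()
p <- attr(logLik(fit), "df")
n <- length(y)
ll <- as.numeric(logLik(fit))

aic <- 2 * p - 2 * ll
bic <- log(n) * p - 2 * ll

c(aic, AIC(fit))    # the hand computation agrees with AIC()
c(bic, BIC(fit))    # and with BIC()
```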
Generalized linear models Automatic model selection in R The function step() in R can be used to stepwise search for the best model with respect to some criterion. If you provide a glm object to step() the function will default to provide the best model using a backward search starting with the full model specified. Read the help file!
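A sketch of how step() can be used, with simulated data where only one of two candidate variables matters; the `k` argument switches the penalty from AIC (the default, k = 2) to BIC (k = log(n)):

```r
set.seed(1)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- 1.2 * d$x + rnorm(100)   # y depends on x but not z

full <- glm(y ~ x + z, data = d)

## Backward search from the full model; trace = 0 suppresses output
best    <- step(full, trace = 0)                  # AIC-based
bestbic <- step(full, trace = 0, k = log(nrow(d)))  # BIC-based

formula(best)   # the informative variable x is retained
```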
Some tips If you get stuck or get an error message you don't understand, read the help files for the function or google your error message Use online manuals such as Quick-R (http://www.statmethods.net/)
Group presentation of assignments I decided to generate a random sequence of integers from 1 to 8 where the first number in the sequence would correspond to presenting Exercise 1, the second to be the discussant for Exercise 1 and so on. I generated random numbers from http://www.random.org. The site uses atmospheric data as its source of randomness. I retrieved the following sequence of integers from 1 to 8: 6 1 5 7 2 4 3 8 The R package random has features to detect if a sequence is not random if you want to check it (it is likely that this is too short of a sequence though). See the schedule of the seminar for the full list!
Next time Today 16.15-18.00 I will not go through any more new material but rather be available for questions An opportunity for you to work on the exercises in Assignment 1 and the report for said assignment