Computational statistics Jamie Griffin Semester B 2018 Lecture 1
Course overview This course is not: Statistical computing Programming This course is: Computational statistics Statistical methods that use computation to replace certain assumptions Programming language R is used to implement them Main focus is on understanding the methods
Main topics covered Estimating probability densities Non-parametric tests, e.g. permutation tests Cross-validation Bootstrap This is all in the frequentist framework. Computational methods for Bayesian inference are not covered.
Lecture notes Slides will be put on QMplus before each lecture. More formal set of notes (single pdf file) will be updated as term progresses. More extensive proofs than covered in the lectures. http://qmplus.qmul.ac.uk/course/view.php?id=3136
Time-table Lectures: 9.00-11.00 Friday, Laws 1.19 Lectures are mainly about methods, not R programming Computer practicals: 11.00-12.00 Tuesday, Engineering W128D.2 These use R Office hours: 12.30-13.30 Friday, Queens CB202 Better to email me beforehand Week 7 No lecture or regular practical. Possibly a revision practical session, if there is demand for this.
Assessment Final exam 70% Entirely written, not computer-based In May or June 3 hours for level 7, 2 hours for level 6 Coursework 30% Weekly exercise sheets
Exercise sheets Some questions are pen and paper, some use R One question to be handed in each week One per week (9 in total) Available before each tutorial, to be handed in at next tutorial None in first week, so first one to be handed in 30th January
Books Gentle, J.E. Elements of Computational Statistics (2002) Covers most of the material on the course Library has book and electronic version Computational Statistics (2009) by the same author - expanded version, extra material is not in this module Efron, B. and Tibshirani, R.J. An Introduction to the Bootstrap (1994) Covers some of the later material (bootstrap, cross-validation) Library only has hard copy, not electronic Davison, A. C., Hinkley, D. V. Bootstrap methods and their application (1997) Library has book and electronic version
What is R? Free software system for data analysis Initially developed by R. Ihaka and R. Gentleman (1996) Currently developed by the R Core Team (around 20 people) Largest collection of tools for statistics and data analysis (1,000s of contributors)
R is free No need to pay Source code is freely available Anyone can re-use, modify and distribute the code
Who uses R? Academics and other researchers Increasingly, in business Companies using R New York Times Lloyds Google
Popularity of R
This webpage has several sources of data on how widely used different statistical environments are: Popularity of data science software
According to Google Scholar hits, R recently (2016) overtook SAS to become the second most widely used for research articles. SPSS was top, but declining rapidly.
R was the most widely used according to a survey of data scientists.
Where can you get it? google "R" The R project: http://www.r-project.org distribution on CRAN: http://cran.r-project.org/mirrors.html available on Windows, MacOSX, Linux
What does it do? Base, core packages (30), and additional packages (> 12,000): probability distributions statistical tests linear/non-linear modelling multivariate analysis time series spatial statistics networks maps... See "task views": http://cran.r-project.org/web/views/
R GUI Part of R Simple graphical interface Not menu-driven
R Studio Free interface to R that looks nicer than R GUI Syntax highlighting Debugging Command completion Also not menu-driven for running code May need to select version of R when first running it
Running code
R is an interpreted programming language, derived from the S language
Interactively in command window
In script file, select text, then CTRL + R
Run a script in batch mode (we're not covering this)
Getting help
Different ways of getting help:
?foo or help("foo"): access the help page of foo (if you know the name of the command)
??foo or help.search("foo"): look for "foo" in help pages
RSiteSearch("foo"): search foo in help pages and forum archives
Cross Validated: general statistical questions; use tag "r" to ask or search for questions about R
Mailing lists
Search main mailing list
Book by R core team
Numbers
Set a equal to 10+2 and then display the result:
    a = 10+2
    a
Variable and function names are case-sensitive.
Operators have the usual meaning: + - * / ^
Assignment
    a = 8
    b = 10
    a == b
    a = b
== checks if a and b are equal. = assigns the value of b to a.
In R, can also use <- for assignment:
    a <- b
Many R users prefer <-, as there are some places in which = does not work, but we will not meet them in this course.
Boolean variables
Boolean variables are known as logicals in R.
    a = TRUE
    b = FALSE
    a/2
And, or, not:
    a & b
    a | b
    !a
When a number is converted to a logical, any non-zero value maps to TRUE and zero maps to FALSE.
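Because logicals are treated as 1 (TRUE) and 0 (FALSE) in arithmetic, summing a logical vector counts how many elements satisfy a condition. A minimal sketch, where the vector x and the threshold 3 are made up for illustration:

```r
# x is an illustrative vector, not data from the course
x = c(3, 8, 1, 9, 4)

big = x > 3            # logical vector: FALSE TRUE FALSE TRUE TRUE
n_big = sum(big)       # TRUE counts as 1, so this is the number of elements > 3
prop_big = mean(big)   # proportion of elements > 3

# Converting numbers to logicals: zero gives FALSE, anything else gives TRUE
z = as.logical(c(0, 2, -1))

n_big
prop_big
z
```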
Vectors
A vector is a sequence of objects of the same kind. In R these can be numeric, logical or character.
There are multiple ways to specify vectors:
    u = vector(length=5, mode="numeric")
    v = c(1, 3, 4)
    w = rep(3, length.out=4)
    x = rep(3, times=4)
    y = seq(from=2, by=4, length.out=5)
    z = seq(from=2, to=10, by=2)
    u = 1:5
    s = c("a", "b", "e")
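It may help to see what some of these calls produce. The sketch below repeats a few of them with the resulting values in comments, and adds one illustrative vector m (not from the slides) to show that c() converts mixed inputs to a single common type:

```r
u = vector(length=5, mode="numeric")   # 0 0 0 0 0
w = rep(3, length.out=4)               # 3 3 3 3
y = seq(from=2, by=4, length.out=5)    # 2 6 10 14 18
z = seq(from=2, to=10, by=2)           # 2 4 6 8 10

# m is an illustrative example: mixing types forces everything to character
m = c(1, "a", TRUE)
class(m)                               # "character"
```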
Functions on vectors
    u = 1:5
    length(u)
    sum(u)
Inbuilt functions like exp, log, sin can act on single numbers or on vectors.
    exp(u)
    sin(u*pi/2)
Operators we saw previously act on each element of a vector.
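As a small sketch of operators acting element by element (the second vector v is made up for illustration):

```r
u = 1:5
v = c(10, 20, 30, 40, 50)   # illustrative second vector

s = u + v          # element-wise sum: 11 22 33 44 55
d = u * 2          # each element doubled: 2 4 6 8 10
q = u^2            # each element squared: 1 4 9 16 25
ss = sum(u^2)      # sum of squares: 55

s
d
q
ss
```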
Matrices
    A = matrix(1:15,nrow=3)
    A
Transpose
    t(A)
Matrix multiplication
    B = matrix(1:15,ncol=3)
    A %*% B
    B %*% A
    u = 1:5
    A %*% u
    u %*% B
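Checking dimensions is a quick way to see what each product does. In this sketch A is 3 x 5 and B is 5 x 3, and R treats the plain vector u as a column or a row depending on which side of %*% it appears:

```r
A = matrix(1:15, nrow=3)   # 3 x 5, filled column by column
B = matrix(1:15, ncol=3)   # 5 x 3
u = 1:5

dim(A)          # 3 5
dim(t(A))       # 5 3
dim(A %*% B)    # 3 3
dim(B %*% A)    # 5 5
dim(A %*% u)    # 3 1: u is treated as a column vector
dim(u %*% B)    # 1 3: u is treated as a row vector

A[2, 3]         # element in row 2, column 3, which is 8
```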
Data frames
A data frame is like a matrix, but different variables (columns) can have different types:
    d = data.frame(age=c(10,54,3), sex=c("m","f","m"))
    summary(d)
Show first few rows of data frame:
    head(d)
$ to access named element:
    d$age
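Columns accessed with $ are ordinary vectors, so they can be used to pick out rows. A short sketch using the same data frame; the names older and mean_age_m are introduced here only for illustration:

```r
d = data.frame(age=c(10,54,3), sex=c("m","f","m"))

str(d)       # shows the type of each column
nrow(d)      # number of rows: 3
ncol(d)      # number of columns: 2

# Rows where age is greater than 5 (note the comma: rows first, then columns)
older = d[d$age > 5, ]

# Mean age among rows with sex "m": (10 + 3) / 2 = 6.5
mean_age_m = mean(d$age[d$sex == "m"])

older
mean_age_m
```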
Subsetting vectors
    v = 1:10
    v[1]
    v[1:3]
    v[c(1, 2, 4)]
    v[c(0, 2, 4)]
Using logicals
    v<5
    v[v<5]
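A few further indexing cases, sketched on a made-up vector w whose values differ from their positions so the two are easy to tell apart. Negative indices and which() are not on the slide but are standard base R:

```r
w = c(5, 12, 7, 20, 3)    # illustrative vector

first_dropped = w[-1]     # negative index drops that position: 12 7 20 3
zero_index = w[c(0, 2, 4)]  # index 0 is silently ignored: 12 20
large = w[w > 6]          # logical subsetting returns the values: 12 7 20
where_large = which(w > 6)  # which() returns the positions: 2 3 4

first_dropped
zero_index
large
where_large
```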
Summary statistics
These have intuitive names: sum mean var sd median min max
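One detail worth knowing is that var and sd use the sample versions, with denominator n - 1 rather than n. A small sketch on made-up data confirms this by computing the variance by hand:

```r
x = c(2, 4, 4, 5, 10)     # illustrative data
n = length(x)

m = mean(x)                                # 5
manual_var = sum((x - m)^2) / (n - 1)      # 36 / 4 = 9
builtin_var = var(x)                       # also 9
builtin_sd = sd(x)                         # square root of the variance: 3

median(x)   # 4
min(x)      # 2
max(x)      # 10
```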
Probability distribution functions
Normal distribution:
dnorm(x, mean=2, sd=3): probability density function (pdf) at x
pnorm: cumulative distribution function (cdf)
qnorm: quantile function (inverse of cdf)
rnorm(n, mean=1, sd=2): generate a vector of n random numbers
Set the random number seed if you may need to reproduce the results:
    set.seed(n)
where n is an integer.
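The relationships between these functions can be checked directly. The sketch below shows that qnorm undoes pnorm, and that resetting the seed reproduces the same random draws; the seed value 1 and the evaluation points are arbitrary choices:

```r
# cdf and quantile function are inverses of each other
p = pnorm(1.96)                      # about 0.975 for the standard Normal
q = qnorm(0.975)                     # about 1.96
back = qnorm(pnorm(0.3, mean=2, sd=3), mean=2, sd=3)   # recovers 0.3

# Density of the standard Normal at 0 is 1/sqrt(2*pi)
d0 = dnorm(0)

# Same seed gives the same random numbers
set.seed(1)
x1 = rnorm(5, mean=1, sd=2)
set.seed(1)
x2 = rnorm(5, mean=1, sd=2)
identical(x1, x2)                    # TRUE
```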
Other distributions
Similar functions exist for a range of distributions:
runif: uniform distribution
rpois: Poisson
rbinom: binomial
rgamma: gamma
rbeta: beta
rexp: exponential
rgeom: geometric
rcauchy: Cauchy
mvrnorm: multivariate Normal (in the MASS package, not base R)
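The same d/p/q/r prefixes seen for the Normal apply to these distributions. A brief sketch with arbitrary parameter values; it assumes the MASS package is installed (it ships with standard R distributions), and the covariance matrix Sigma is made up for illustration:

```r
pb = dbinom(2, size=4, prob=0.5)   # P(X = 2) for Binomial(4, 0.5): 6/16 = 0.375
pp = ppois(0, lambda=1)            # P(X <= 0) for Poisson(1): exp(-1)
qe = qexp(0.5, rate=1)             # median of Exponential(1): log(2)

# Multivariate Normal draws need the MASS package
set.seed(42)
Sigma = matrix(c(1, 0.5, 0.5, 1), nrow=2)   # illustrative covariance matrix
X = MASS::mvrnorm(n=100, mu=c(0, 0), Sigma=Sigma)
dim(X)                             # 100 2: one row per draw
```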