Computational statistics
Jamie Griffin
Semester B 2018, Lecture 1


Course overview
This course is not:
  Statistical computing
  Programming
This course is:
  Computational statistics
  Statistical methods that use computation to replace certain assumptions
  The programming language R is used to implement them
  The main focus is on understanding the methods

Main topics covered
  Estimating probability densities
  Non-parametric tests, e.g. permutation tests
  Cross-validation
  Bootstrap
This is all in the frequentist framework. Computational methods for Bayesian inference are not covered.

Lecture notes
  Slides will be put on QMplus before each lecture.
  A more formal set of notes (single pdf file) will be updated as term progresses. These contain more extensive proofs than are covered in the lectures.
  http://qmplus.qmul.ac.uk/course/view.php?id=3136

Time-table
Lectures: 9.00-11.00 Friday, Laws 1.19
  Lectures are mainly about methods, not R programming
Computer practicals: 11.00-12.00 Tuesday, Engineering W128D.2
  These use R
Office hours: 12.30-13.30 Friday, Queens CB202
  Better to email me beforehand
Week 7: No lecture or regular practical. Possibly a revision practical session, if there is demand for this.

Assessment
Final exam: 70%
  Entirely written, not computer-based
  In May or June
  3 hours for level 7, 2 hours for level 6
Coursework: 30%
  Weekly exercise sheets

Exercise sheets
  Some questions are pen and paper, some use R
  One question to be handed in each week
  One sheet per week (9 in total)
  Available before each tutorial, to be handed in at the next tutorial
  None in the first week, so the first one is to be handed in on 30th January

Books
Gentle, J.E. Elements of Computational Statistics (2002)
  Covers most of the material on the course
  Library has book and electronic version
  Computational Statistics (2009) by the same author is an expanded version; the extra material is not in this module
Efron, B. and Tibshirani, R.J. An Introduction to the Bootstrap (1994)
  Covers some of the later material (bootstrap, cross-validation)
  Library only has hard copy, not electronic
Davison, A.C. and Hinkley, D.V. Bootstrap Methods and their Application (1997)
  Library has book and electronic version

What is R?
  Free software system for data analysis
  Initially developed by R. Ihaka and R. Gentleman (1996)
  Currently developed by the R Core Team (around 20 people)
  Largest collection of tools for statistics and data analysis (1,000s of contributors)

R is free
  No need to pay
  Source code is freely available
  Anyone can re-use, modify and distribute the code

Who uses R?
  Academics and other researchers
  Increasingly, in business
  Companies using R: New York Times, Lloyds, Google

Popularity of R
This webpage has several sources of data on how widely used different statistical environments are: Popularity of data science software
  According to Google Scholar hits, R recently (2016) overtook SAS to become the second most widely used for research articles.
  SPSS was top, but declining rapidly.
  R was the most widely used according to a survey of data scientists.

Where can you get it?
  Google "R"
  The R project: http://www.r-project.org
  Distribution on CRAN: http://cran.r-project.org/mirrors.html
  Available on Windows, MacOSX, Linux

What does it do?
Base, core packages (30), and additional packages (> 12,000):
  probability distributions
  statistical tests
  linear/non-linear modelling
  multivariate analysis
  time series
  spatial statistics
  networks
  maps
  ...
See "task views": http://cran.r-project.org/web/views/

R GUI
  Part of R
  Simple graphical interface
  Not menu-driven

RStudio
  Free interface to R that looks nicer than R GUI
  Syntax highlighting
  Debugging
  Command completion
  Also not menu-driven for running code
  May need to select version of R when first running it

Running code
R is an interpreted programming language, derived from S (the language behind S-PLUS)
  Interactively in command window
  In script file, select text, then CTRL + R
  Run a script in batch mode (we're not covering this)

Getting help
Different ways of getting help:
  ?foo or help("foo"): access the help page of foo (if you know the name of the command)
  ??foo or help.search("foo"): look for "foo" in help pages
  RSiteSearch("foo"): search for foo in help pages and forum archives
  Cross Validated: general statistical questions; use the tag "r" to ask or search for questions about R
  Mailing lists
  Search main mailing list
  Book by R core team

Numbers
Set a equal to 10+2 and then display the result:
    a = 10+2
    a
Variable and function names are case-sensitive.
Operators have the usual meaning: + - * / ^

Assignment
    a = 8
    b = 10
    a == b
    a = b
== checks if a and b are equal. = assigns the value of b to a.
In R, you can also use <- for assignment:
    a <- b
Many R users prefer <-, as there are some places in which = does not work, but we will not meet them in this course.
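As an illustration of the remark that = does not work as assignment everywhere, here is a minimal sketch of one such place: inside a function call, = names an argument rather than creating a variable. The names same, m1, m2 and x_created are chosen here for illustration only.

```r
a = 8
b <- 10            # same effect as b = 10 at top level
same <- (a == b)   # comparison: FALSE, and neither a nor b changes

# Inside a call, = matches the argument named x; no variable x is created.
m1 <- mean(x = 1:10)
x_created <- exists("x", inherits = FALSE)   # FALSE in a fresh session

# With <-, the vector is first assigned to x and then passed to mean().
m2 <- mean(x <- 1:10)
```

After running this, x holds 1:10 only because of the second call.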

Boolean variables
Boolean variables are known as logicals in R.
    a = TRUE
    b = FALSE
    a/2
In arithmetic, TRUE is treated as 1 and FALSE as 0.
And, or, not:
    a & b
    a | b
    !a
When a number is converted to a logical, any non-zero value maps to TRUE and zero maps to FALSE.
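The conversions between logicals and numbers can be checked directly. This is a small sketch; the variable names (half, both, either, not_a, flags, n_true) are made up for the example.

```r
a <- TRUE
b <- FALSE

half   <- a / 2     # TRUE is treated as 1, so this is 0.5
both   <- a & b     # and: FALSE
either <- a | b     # or:  TRUE
not_a  <- !a        # not: FALSE

# Converting numbers to logicals: only zero becomes FALSE.
flags <- as.logical(c(-2, 0, 3.5))   # TRUE FALSE TRUE

# Summing a logical vector counts the TRUE values.
n_true <- sum(flags)                 # 2
```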

Vectors
A vector is an ordered collection of values of the same type. In R these can be numeric, logical or character.
There are multiple ways to specify vectors:
    u = vector(length=5, mode="numeric")
    v = c(1, 3, 4)
    w = rep(3, length.out=4)
    x = rep(3, times=4)
    y = seq(from=2, by=4, length.out=5)
    z = seq(from=2, to=10, by=2)
    u = 1:5
    s = c("a", "b", "e")
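For reference, a short sketch of what some of these constructions produce, plus an example (not on the slide) of what happens when types are mixed: c() converts everything to one common type. The name mixed is illustrative.

```r
u <- vector(length = 5, mode = "numeric")    # 0 0 0 0 0
w <- rep(3, length.out = 4)                  # 3 3 3 3
y <- seq(from = 2, by = 4, length.out = 5)   # 2 6 10 14 18
z <- seq(from = 2, to = 10, by = 2)          # 2 4 6 8 10

# Mixing types in c() converts all elements to a single common type.
mixed <- c(1, "a", TRUE)                     # "1" "a" "TRUE"
class(mixed)                                 # "character"
```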

Functions on vectors
    u = 1:5
    length(u)
    sum(u)
Inbuilt functions like exp, log, sin can act on single numbers or on vectors.
    exp(u)
    sin(u*pi/2)
Operators we saw previously act on each element of a vector.
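A brief sketch of element-wise behaviour, assuming the same u as above. The names doubled, squares, pairsum, e and s are illustrative.

```r
u <- 1:5

doubled <- u * 2        # 2 4 6 8 10
squares <- u^2          # 1 4 9 16 25
pairsum <- u + rev(u)   # 6 6 6 6 6

# Functions such as exp() return one value per element.
e <- exp(u)
length(e)               # 5

# sin(u*pi/2) is 1, 0, -1, 0, 1 up to floating-point rounding error,
# so the result is rounded before inspecting it.
s <- round(sin(u * pi / 2), 10)
```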

Matrices
    A = matrix(1:15,nrow=3)
    A
Transpose
    t(A)
Matrix multiplication
    B = matrix(1:15,ncol=3)
    A %*% B
    B %*% A
    u = 1:5
    A %*% u
    u %*% B
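The dimensions of each product are worth checking. The sketch below uses the same A, B and u; the comparison with element-wise * is an addition for illustration.

```r
A <- matrix(1:15, nrow = 3)   # 3 x 5, filled column by column
B <- matrix(1:15, ncol = 3)   # 5 x 3
u <- 1:5

A[1, ]           # first row: 1 4 7 10 13
dim(t(A))        # 5 3
dim(A %*% B)     # 3 3
dim(B %*% A)     # 5 5
dim(A %*% u)     # 3 1: u is treated as a column vector
dim(u %*% B)     # 1 3: u is treated as a row vector

# * multiplies element by element; %*% is the matrix product.
elementwise <- A * A
```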

Data frames
A data frame is like a matrix, but different variables (columns) can have different types:
    d = data.frame(age=c(10,54,3), sex=c("m","f","m"))
    summary(d)
Show first few rows of data frame:
    head(d)
$ to access named element:
    d$age
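A small sketch of working with the data frame d. The argument stringsAsFactors = FALSE is added here as an assumption so that the result is the same in all R versions; the row-selection step and the name males are illustrative additions.

```r
# stringsAsFactors = FALSE keeps sex as character in every R version
# (before R 4.0.0 the default converted strings to factors).
d <- data.frame(age = c(10, 54, 3), sex = c("m", "f", "m"),
                stringsAsFactors = FALSE)

dim(d)                       # 3 rows, 2 columns
sapply(d, class)             # age is "numeric", sex is "character"
mean(d$age)                  # mean of the age column

males <- d[d$sex == "m", ]   # rows where sex is "m"
males$age                    # 10 3
```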

Subsetting vectors
    v = 1:10
    v[1]
    v[1:3]
    v[c(1, 2, 4)]
    v[c(0, 2, 4)]
Using logicals
    v<5
    v[v<5]
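A sketch of what the less obvious cases return. Negative indices and which() are not on the slide and are included only as related illustrations; the name keep is made up.

```r
v <- 1:10

v[c(0, 2, 4)]      # 2 4: a zero index is silently dropped
v[-1]              # everything except the first element
v[-(1:3)]          # drop the first three elements

keep <- v < 5      # logical vector, TRUE for the values 1 to 4
v[keep]            # 1 2 3 4
which(keep)        # positions of the TRUE values: 1 2 3 4
v[v %% 2 == 0]     # even values: 2 4 6 8 10
```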

Summary statistics
These have intuitive names:
    sum
    mean
    var
    sd
    median
    min
    max
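A worked example on a small made-up data set (obs is hypothetical). It also checks that var uses the n - 1 denominator, i.e. the sample variance.

```r
obs <- c(2, 4, 4, 4, 5, 5, 7, 9)

sum(obs)      # 40
mean(obs)     # 5
median(obs)   # 4.5
min(obs)      # 2
max(obs)      # 9

# var() divides by n - 1 (sample variance); sd() is its square root.
n <- length(obs)
manual_var <- sum((obs - mean(obs))^2) / (n - 1)
all.equal(var(obs), manual_var)        # TRUE
all.equal(sd(obs), sqrt(manual_var))   # TRUE
```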

Probability distribution functions
Normal distribution:
  dnorm(x, mean=2, sd=3): probability density function (pdf) at x
  pnorm: cumulative distribution function (cdf)
  qnorm: quantile function (inverse of cdf)
  rnorm(n, mean=1, sd=2): generate a vector of n random numbers
Set the random number seed if you may need to reproduce the results:
    set.seed(n)
where n is an integer.
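A minimal sketch of how the four functions relate, and of how the seed makes random draws reproducible. The seed value 2018 and the names dens, p, q, first and second are arbitrary choices for this example.

```r
# Density, cdf and quantile function of a Normal with mean 2 and sd 3.
dens <- dnorm(2, mean = 2, sd = 3)   # density at the mean: 1 / (3 * sqrt(2 * pi))
p    <- pnorm(5, mean = 2, sd = 3)   # P(X <= 5), one sd above the mean
q    <- qnorm(p, mean = 2, sd = 3)   # inverts the cdf, so q is 5

# Same seed, same draws.
set.seed(2018)
first <- rnorm(4, mean = 1, sd = 2)
set.seed(2018)
second <- rnorm(4, mean = 1, sd = 2)
identical(first, second)             # TRUE
```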

Other distributions
Similar functions exist for a range of distributions:
  runif: uniform distribution
  rpois: Poisson
  rbinom: binomial
  rgamma: gamma
  rbeta: beta
  rexp: exponential
  rgeom: geometric
  rcauchy: Cauchy
  mvrnorm: multivariate Normal (in the MASS package)
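The d, p, q and r prefixes seen for the Normal distribution carry over to these distributions. The sketch below uses arbitrary parameter values; mvrnorm is called only if the MASS package is installed, which is an assumption about the local setup.

```r
# The d/p/q/r prefixes work the same way for other distributions.
dbinom(2, size = 4, prob = 0.5)   # P(X = 2) = choose(4, 2) / 2^4 = 0.375
ppois(0, lambda = 2)              # P(X <= 0) = exp(-2)
qexp(0.5, rate = 1)               # median of Exp(1) = log(2)

set.seed(1)
draws <- runif(1000)              # all values lie between 0 and 1

# mvrnorm() lives in the MASS package, so it is only called if MASS is installed.
if (requireNamespace("MASS", quietly = TRUE)) {
  xy <- MASS::mvrnorm(n = 3, mu = c(0, 0), Sigma = diag(2))
  dim(xy)                         # 3 2
}
```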