Statistical Programming with R

Similar documents
Data Structures STAT 133. Gaston Sanchez. Department of Statistics, UC Berkeley

MBV4410/9410 Fall Bioinformatics for Molecular Biology. Introduction to R

Chapter 3 Structure of a C Program

These are notes for the third lecture; if statements and loops.

Absolute C++ Walter Savitch

Full file at

Pace University. Fundamental Concepts of CS121 1

STEPHEN WOLFRAM MATHEMATICADO. Fourth Edition WOLFRAM MEDIA CAMBRIDGE UNIVERSITY PRESS

Our Strategy for Learning Fortran 90

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #43. Multidimensional Arrays

Control Flow. COMS W1007 Introduction to Computer Science. Christopher Conway 3 June 2003

C++ Programming: From Problem Analysis to Program Design, Third Edition

Page. No. 1/15 CS201 Introduction to Programmming Solved Subjective Questions From spring 2010 Final Term Papers By vuzs Team

MATLAB = MATrix LABoratory. Interactive system. Basic data element is an array that does not require dimensioning.

Interactive MATLAB use. Often, many steps are needed. Automated data processing is common in Earth science! only good if problem is simple

MATH (CRN 13695) Lab 1: Basics for Linear Algebra and Matlab

C Language Part 1 Digital Computer Concept and Practice Copyright 2012 by Jaejin Lee

ITS Introduction to R course

Mails : ; Document version: 14/09/12

The SPL Programming Language Reference Manual

Programming Fundamentals - A Modular Structured Approach using C++ By: Kenneth Leroy Busbee

Introduction to the workbook and spreadsheet

Introduction to R 21/11/2016

CMSC 330: Organization of Programming Languages. OCaml Imperative Programming

IPCoreL. Phillip Duane Douglas, Jr. 11/3/2010

Objectives. Chapter 2: Basic Elements of C++ Introduction. Objectives (cont d.) A C++ Program (cont d.) A C++ Program

JAVASCRIPT AND JQUERY: AN INTRODUCTION (WEB PROGRAMMING, X452.1)

Chapter 2: Basic Elements of C++

Chapter 2: Basic Elements of C++ Objectives. Objectives (cont d.) A C++ Program. Introduction

The Warhol Language Reference Manual

CSCI 171 Chapter Outlines

JME Language Reference Manual

High Institute of Computer Science & Information Technology Term : 1 st. El-Shorouk Academy Acad. Year : 2013 / Year : 2 nd


CS1114: Matlab Introduction

CITS2401 Computer Analysis & Visualisation

CS 112 Introduction to Computing II. Wayne Snyder Computer Science Department Boston University

CS 112 Introduction to Computing II. Wayne Snyder Computer Science Department Boston University

CS201- Introduction to Programming Latest Solved Mcqs from Midterm Papers May 07,2011. MIDTERM EXAMINATION Spring 2010

The PCAT Programming Language Reference Manual

CSE 431S Type Checking. Washington University Spring 2013

\n is used in a string to indicate the newline character. An expression produces data. The simplest expression

egrapher Language Reference Manual

SECTION 1: INTRODUCTION. ENGR 112 Introduction to Engineering Computing

Twister: Language Reference Manual

A Guide for the Unwilling S User

Chapter 2 Basic Elements of C++

CS1114: Matlab Introduction

R: BASICS. Andrea Passarella. (plus some additions by Salvatore Ruggieri)

Starting with a great calculator... Variables. Comments. Topic 5: Introduction to Programming in Matlab CSSE, UWA

An Introductory Tutorial: Learning R for Quantitative Thinking in the Life Sciences. Scott C Merrill. September 5 th, 2012

Ordinary Differential Equation Solver Language (ODESL) Reference Manual


(Refer Slide Time 01:41 min)

Preview from Notesale.co.uk Page 6 of 52

Features of C. Portable Procedural / Modular Structured Language Statically typed Middle level language

Constraint-based Metabolic Reconstructions & Analysis H. Scott Hinton. Matlab Tutorial. Lesson: Matlab Tutorial

And Parallelism. Parallelism in Prolog. OR Parallelism

R Command Summary. Steve Ambler Département des sciences économiques École des sciences de la gestion. April 2018

A First Book of ANSI C Fourth Edition. Chapter 8 Arrays

Computer Science 322 Operating Systems Mount Holyoke College Spring Topic Notes: C and Unix Overview

LECTURE 17. Expressions and Assignment

Introduction to MATLAB

DATABASE AUTOMATION USING VBA (ADVANCED MICROSOFT ACCESS, X405.6)

CSCE 120: Learning To Code

A brief introduction to R

SPARK-PL: Introduction

UNIT- 3 Introduction to C++

Computer Components. Software{ User Programs. Operating System. Hardware

CSI Lab 02. Tuesday, January 21st

Intro. Scheme Basics. scm> 5 5. scm>

Control Flow Structures

CIS133J. Working with Numbers in Java

Lecture 3: C Programm

Programming Language Basics

Introduction to MATLAB

Introduction to Matlab

C The new standard

Computer Vision. Matlab

9/21/17. Outline. Expression Evaluation and Control Flow. Arithmetic Expressions. Operators. Operators. Notation & Placement

QUIZ. 1. Explain the meaning of the angle brackets in the declaration of v below:

Problem Solving with C++

Expressions and Casting

Grace days can not be used for this assignment

A Basic Guide to Using Matlab in Econ 201FS

CSE 1001 Fundamentals of Software Development 1. Identifiers, Variables, and Data Types Dr. H. Crawford Fall 2018

YOLOP Language Reference Manual

CS109A ML Notes for the Week of 1/16/96. Using ML. ML can be used as an interactive language. We. shall use a version running under UNIX, called

Eigen Tutorial. CS2240 Interactive Computer Graphics

M4.1-R3: PROGRAMMING AND PROBLEM SOLVING THROUGH C LANGUAGE

Computer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring Topic Notes: C and Unix Overview

Week - 01 Lecture - 04 Downloading and installing Python

C-LANGUAGE CURRICULAM

PROBLEM SOLVING WITH FORTRAN 90

Java is an objet-oriented programming language providing features that support

3. Java - Language Constructs I

How to program with Matlab (PART 1/3)

CS1 Lecture 3 Jan. 18, 2019


Statistical Software Camp: Introduction to R

Transcription:

(connorharris@college.harvard.edu) CS50, Harvard University October 27, 2015 If you want to follow along with the demos, download R at cran.r-project.org or from your Linux package manager

What is R? Programming language designed for statistics and data analysis Imperative, weakly typed, interpreted Broadly C-like syntax Extensive library and native facilities for data mining and statistical analysis

Recommended references (and free legal downloads) Venables et al.: An Introduction to R Official R beginner s guide, maintained by core developers https://cran.r-project.org/doc/manuals/r-intro.pdf C. R. Shalizi: Advanced Data Analysis from an Elementary Point of View Statistics textbook with hundreds of R figures and code samples, plus appendix with useful advice on R programming http://www.stat.cmu.edu/ cshalizi/adafaepov/

General caveats This discussion only scratches the surface of R s capabilities. I have simplified my discussion of many things, especially the capabilities and call syntax of certain functions. I have presented enough for the most common, simple use-cases; consult the documentation for the full detail.

Why use R? Consider using R for components of your project that involve: Mining of large data sets Automated or complicated statistical analysis Data visualization or graphing

Type ontology Atomic types: numeric (64-bit floating point), character (strings), logical (Boolean); coercion (and scanf() equivalents) with as.numeric et sim. Vectors, matrices, higher-dimensionalarrays of the above Lists (type of associative array; vectors of lists behave a bit oddly) No real pure atomic types: single values treated as arrays of length 1 No mixed-type arrays; if you try you ll get implicit string conversions

Unusual syntactic features No variable declarations! Assignment with <- Comments with # Modular residues and integer division with %% and %/% Ranges with colon (2:5 is a vector [2 3 4 5]) One-indexing! For-loops: for (value in vector) {... } Function syntax: foo <- function(args) {... } (Forgot to mention this one in the talk: semicolons after statements are optional at the ends of lines)

Vectors Constructed with c(datum 1,..., datum n ) (arguments can also be vectors, though resulting array is flattened) Cannot be of mixed type Behave as if padded infinitely with value NA ( missing value ) Unary functions map over arrays Binary functions are applied entry by entry (cycling shorter array if necessary) Access with square brackets containing (one-indexed!) indices. Can pass vector of indices to get vector of corresponding elements, negative indices to remove elements (NB: differs from Python) Summary statistics with summary()

Matrices Initialized with matrix(data, nrow=rows, ncol=columns); data (a vector) fills matrix first up to down, then left to right Excellent facilities for matrix multiplication (a %*% b), spectral decomposition (eigen(a)), and other common tasks Arrays (higher-dimensional matrices) can be initialized with array(dim=c(dim 1,..., dim n )) Access columns with foo[rownum,], columns with foo[,colnum]

Lists Type of associative array Initialized as list(key 1 =val 1,..., key n =val n ) Access and set values with foo$key Access individual key value pairs with integer indices or as foo["key"] Reading from a nonexistent key returns NULL, not an error; this can trip you up

Data frames Type of list in which every value is a vector of the same length Used for representing data table Initialized with data.frame([column-name 1 =]column-data 1,..., column-name n =]column-data n, [row.names=string-vector]) Access columns with foo["column-name"] or foo[column-index] Access rows with foo[row-index,] (note trailing comma) Row and column names accessible with rownames(foo) and colnames(foo) Print header and first few rows with head(foo)

Functions foo <- function(arg 1 [=default 1 ],..., arg n [=default n ]) {... } Called as foo([arg 1 =]value 1,..., [arg n =]value n ) No need for explicit return() statement (note parentheses): last statement evaluated is return value but explicit return() is often better style Keyword arguments in function calls can be included in any order

Data import and export Read tabular data into data frames with read.table() (text files), read.xls() (Excel spreadsheets), read.csv() (CSV files), et sim. Write data frames out as tables with write.table(), etc. Save arbitrary R objects to binary files and reload them with save(object 1,..., object n, file=file) and load(file) Files are written and read by default into R s working directory, readable and modifiable with getwd() and setwd(dir)

Multilinear regression Syntax: model <- lm(y x 1 [+x 2 [...[+x n ]...]][, dataframe]) y is the dependent variable, x 1,..., x n the independent variables; can be either vectors or column headers of the data frame specified in the optional second argument Possible to specify more complex formulae, e.g. model <- lm(y^2 + 1 log(x) Print summaries of the model (with line-of-best-fit parameters, etc.) with summary For calculating simple correlations, use cor(vec1, vec2[, method=method])

Plotting Workhorse function is plot(x, y,...), plus many variants and specializations Takes two vectors of the same length; precede with attach(dataframe) to use column headers instead of separate vectors Myriad optional arguments for controlling various details of the plot. Some common ones: type ("p" for points [i.e. scatter plot], "l" for lines, et sim.); main (overall title); xlab, ylab (axis labels), col (default color) Can add fit best-fit lines and local regression curves with abline(regression-model) and lines(lowess(x, y)) Default graphical output is a pop-up window; write to files (in, e.g., PNG format) with png(filename); close devices with dev.off() Facilities for 3D and contour plotting, but I won t go into these now Some third-party libraries for making animation, but a better choice is to use R to generate the frames for animations and then combine them with a third-party program like FFmpeg or ImageMagick

Time for a demo Unix and OS X users: open a terminal window and type R at the command prompt Windows users: find R in the list of programs in the start menu

Foreign-function interface I Allows R code to call C functions Why would you want to do this? Higher speed in inner loops Reuse of existing C libraries Interface is slightly arcane and existing tutorials are confusing Following instructions are for Unix-like systems (e.g. Linux, OS X) I don t know about Windows Please do not write your final project on Windows

Foreign-function interface II Must take all arguments as pointers (NB: for arrays, this is a pointer to the first element) Floating-point type is double (64-bit) Must return void; tells result by modifying arguments void dotprod ( double vec1, double vec2, i n t n, double out ) { out = 0 ; f o r ( i n t i = 0 ; i < n ; i ++) { out += vec1 [ i ] vec2 [ i ] ; } }

Foreign-function interface III Good practice to write a function in C that takes arguments without pointers and then a wrapper function that handles the FFI requirements: double dotprod i n t e r n a l ( double vec1, double vec2, i n t n ) { double r e s u l t = 0 ; f o r ( i n t i = 0 ; i < n ; i ++) { r e s u l t += vec1 [ i ] vec2 [ i ] } return r e s u l t ; } void dotprod ( double vec1, double vec2, i n t n, double out ) { out = dotprod i n t e r n a l ( vec1, vec2, n ) ; }

Foreign-function interface IV Compile C code with R CMD SHLIB foo.c (in the OS shell, not R) creates library foo.so Use library in R code with dyn.load("foo.so") Call using.c() function; takes name of C routine and type-coerced arguments (using as.integer, as.double, as.character, as.logical) Returns list (associative array) of parameter names and modified values r e s u l t <.C( dotprod, as. double ( vec1 ), as. double ( vec2 ), as. i n t e g e r ( length ( vec1 ) ), as. double ( 0 ) ) p r o d u ct < r e s u l t $ out

Bad practices: Explicit loops Slow, inelegant Unnecessary because of R s facile vector handling Replace with higher-order functions (Map, Reduce, Find, Filter) or apply functions; see Shalizi for other methods

Bad practices: Appending to vectors Several equivalent syntaxes, e.g. vec [ length ( vec )+1] < newvalue vec < c ( vec, newvalue ) Vectors must be completely reallocated when resized Pre-allocate vectors to the necessary size vec < vector ( length =1000) Changing iterated reallocations to preallocation caused a thousandfold speedup in one of my own projects (numerical differential-equation solver, vectors of length 10 4 10 5 )

Error handling R prefers continuing after possible errors to stopping, which can produce unexpected behavior in hard-to-predict places Two easy mistakes to make: vector values where single numbers are expected, and NULL values cause functions to behave strangely, but don t throw clean errors Sanity-check function inputs with stopifnot() (equivalent to C s assert())

End I am happy to take questions by e-mail: connorharris@college.harvard.edu I am also happy to serve as an unofficial adviser for anyone using R in a final project; talk to your TF and write me an e-mail if so