R Programming Basics - Useful Builtin Functions for Statistics

Similar documents
Chapter 5: Joint Probability Distributions and Random

1 Pencil and Paper stuff

An introduction to R WS 2013/2014

Getting started with simulating data in R: some helpful functions and how to use them Ariel Muldoon August 28, 2018

Fathom Dynamic Data TM Version 2 Specifications

Package uclaboot. June 18, 2003

SPSS Basics for Probability Distributions

The simpleboot Package

Stochastic Models. Introduction to R. Walt Pohl. February 28, Department of Business Administration

STATS PAD USER MANUAL

Biostatistics & SAS programming. Kevin Zhang

Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition

Multivariate Normal Random Numbers

R Primer for Introduction to Mathematical Statistics 8th Edition Joseph W. McKean

Package visualize. April 28, 2017

Stat 302 Statistical Software and Its Applications SAS: Distributions

Markov chain Monte Carlo methods

Package simed. November 27, 2017

Will Monroe July 21, with materials by Mehran Sahami and Chris Piech. Joint Distributions

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Lecture 6: Chapter 6 Summary

Business Statistics: R tutorials

A Quick Introduction to R

Homework set 4 - Solutions

Statistics I 2011/2012 Notes about the third Computer Class: Simulation of samples and goodness of fit; Central Limit Theorem; Confidence intervals.

Unit 8 SUPPLEMENT Normal, T, Chi Square, F, and Sums of Normals

Introduction to the Practice of Statistics using R: Chapter 6

Sections 4.3 and 4.4

Quality Control-2 The ASTA team

R Introduction 1. J.M. Ponciano. August 17, Objects in Splus language include, and. x<-list(ncolors=2,colors=c("red","blue"))

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

AN INTRODUCTION TO R

Computational statistics Jamie Griffin. Semester B 2018 Lecture 1

Computing With R Handout 1

Statistics in GeoGebra Source: The GeoGebra Online Manual (

Introduction to RStudio

3. Data Analysis and Statistics

CHAPTER 2: Describing Location in a Distribution

Generating random samples from user-defined distributions

Learning Objectives. Continuous Random Variables & The Normal Probability Distribution. Continuous Random Variable

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

CHAPTER 2 Modeling Distributions of Data

Lab 3 (80 pts.) - Assessing the Normality of Data Objectives: Creating and Interpreting Normal Quantile Plots

Lecture 3 - Object-oriented programming and statistical programming examples

Introduction to scientific programming in R

Appendix A: A Sample R Session

a. divided by the. 1) Always round!! a) Even if class width comes out to a, go up one.

Ch3 E lement Elemen ar tary Descriptive Descriptiv Statistics

Lab 7 Statistics I LAB 7 QUICK VIEW

YEAR 12 Core 1 & 2 Maths Curriculum (A Level Year 1)

Multistat2 1

Chapter 6 Normal Probability Distributions

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time

STATISTICAL LABORATORY, April 30th, 2010 BIVARIATE PROBABILITY DISTRIBUTIONS

Testing Random- Number Generators

The Normal Distribution

Introduction to the R Language

Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D.

So..to be able to make comparisons possible, we need to compare them with their respective distributions.

StatsMate. User Guide

S CHAPTER return.data S CHAPTER.Data S CHAPTER

Package sspse. August 26, 2018

Use of Extreme Value Statistics in Modeling Biometric Systems

Chapter 2 Modeling Distributions of Data

CREATING THE DISTRIBUTION ANALYSIS

Table Of Contents. Table Of Contents

Tutorial: Using Tina Vision s Quantitative Pattern Recognition Tool.

Voluntary State Curriculum Algebra II

8. MINITAB COMMANDS WEEK-BY-WEEK

Normal Distribution. 6.4 Applications of Normal Distribution

Excel 2010 with XLSTAT

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

Tutorial 3: Probability & Distributions Johannes Karreth RPOS 517, Day 3

SD 372 Pattern Recognition

Linear Modeling with Bayesian Statistics

Computational Methods for Finding Probabilities of Intervals

SAS (Statistical Analysis Software/System)

PEER Report Addendum.

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Chapter 3: Data Description Calculate Mean, Median, Mode, Range, Variation, Standard Deviation, Quartiles, standard scores; construct Boxplots.

Minitab 17 commands Prepared by Jeffrey S. Simonoff

Statistics I Practice 2 Notes Probability and probabilistic models; Introduction of the statistical inference

PART 1 PROGRAMMING WITH MATHLAB

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea

Probability Distributions

Introduction to R, Github and Gitlab

Package visualizationtools

Stat 528 (Autumn 2008) Density Curves and the Normal Distribution. Measures of center and spread. Features of the normal distribution

Interval Estimation. The data set belongs to the MASS package, which has to be pre-loaded into the R workspace prior to use.

Package logitnorm. R topics documented: February 20, 2015

Quantitative - One Population

#a- a vector of 100 random number from a normal distribution a<-rnorm(100, mean= 32, sd=6)

CHAPTER 6. The Normal Probability Distribution

Chapter 6: DESCRIPTIVE STATISTICS

Introduction to Queueing Theory for Computer Scientists

Practice in R. 1 Sivan s practice. 2 Hetroskadasticity. January 28, (pdf version)

Name: Date: Period: Chapter 2. Section 1: Describing Location in a Distribution

Package SHELF. August 16, 2018

Transcription:

R Programming Basics - Useful Builtin Functions for Statistics Vectorized Arithmetic - most arthimetic operations in R work on vectors. Here are a few commonly used summary statistics. testvect = c(1,3,5,2,9,10,7,8,6) min(testvect) # minimum ## [1] 1 max(testvect) # maximum ## [1] 10 mean(testvect) # mean ## [1] 5.666667 median(testvect) # median ## [1] 6 var(testvect) #variance ## [1] 10 sd(testvect) # standard deviation ## [1] 3.162278 vect1 = cars$speed vect2 = cars$dist quantile(vect1) # sample quantiles ## 0% 25% 50% 75% 100% ## 4 12 15 19 25 summary(vect1) # same thing, plus the mean ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 4.0 12.0 15.0 15.4 19.0 25.0 cov(vect1,vect2) # sample covariance between vect1 and vect2 ## [1] 109.9469 cor(vect1,vect2) # sample correlation coefficient ## [1] 0.8068949 1

Calculations on probability distributions - area under curve, percentiles, random sampling distributions - *binom* : Binomial Distribution - *pois* : Poisson Distribution - *unif* : Uniform Distribution - *exp* : Exponential Distribution - *norm* : Normal Distribution - *chisq* : Chi-Squared Distribution - *t* : t Distribution - *f* : F Distribution key prefix letters # *d* : probability density function/probability mass function. # *p* : cumulative distribution function. # *q* : quantile function. Returns percentiles, also known as quantiles. # *r* : random number generation Examples What is the area under the standard normal curve to the left of 1.5? pnorm(1.5) ## [1] 0.9331928 What point cuts off an area.975 to its left under the standard normal curve? this point also cuts off an area.025 to its right under the standard normal curve. qnorm(.975) ## [1] 1.959964 What is the area under the standard normal curve to the left of 1.96? To the left of -1.96? What is the area under the curve betwen -1.96 and 1.96? pnorm(1.96) ## [1] 0.9750021 pnorm(-1.96) ## [1] 0.0249979 pnorm(1.96)-pnorm(-1.96) ## [1] 0.9500042 Simulate 100 observations from the normal distribution with mean µ = 3 and standard deviation σ = 1.5. Calculate some summary statistics for the data, and also show a normal QQ plot, which can be used to assess normality. data=rnorm(100,mean=3,sd=1.5) summary(data) 2

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -0.6916 1.9712 3.0364 2.9851 3.8912 5.9703 qqnorm(data) qqline(data) Normal Q Q Plot Sample Quantiles 0 1 2 3 4 5 6 2 1 0 1 2 Theoretical Quantiles Plot the standard normal density function φ(z) from z = 3.1 to z = 3.1, where φ(z) = 1 2π e.5z2 z=seq(from=-3.5, to=3.5, length.out=300) dz=dnorm(z) plot(z,dz,xlab="z",ylab="phi(z)",main="standard normal curve",type="l") 3

standard normal curve phi(z) 0.0 0.1 0.2 0.3 0.4 3 2 1 0 1 2 3 What is the probability that a normal random variable with mean 4 and standard deviation 1.5 is greater than 2.5? 1-pnorm(2.5,4,1.5) ## [1] 0.8413447 What is the probability that a t random variable with 12 degrees of freedom is less than 2? Greater than 2? pt(2,12) ## [1] 0.9656725 1-pt(2,12) ## [1] 0.03432751 What are the upper and lower 2.5 th percentile of a t random variable with 6 degrees of freedom? qt(c(.025,.975),6) ## [1] -2.446912 2.446912 Simulate 200 observations from the uniform distribution on the interval (0,5). Make a histogram of the data with 10 bars. data=runif(200,0,5) hist(data,nclass=10) z 4

Histogram of data Frequency 0 5 10 15 20 25 0 1 2 3 4 5 data What about with 1000 observations? Does it look more uniform? Make a probability histogram, which means the bars are normalized so that the total area is 1. data=runif(1000,0,5) hist(data,nclass=10,prob=t) Histogram of data Density 0.00 0.05 0.10 0.15 0.20 0 1 2 3 4 5 data 5

What about with 10000? The law of large numbers indicates that the histogram will look close to uniform for a sufficiently large sample size. data=runif(10000,0,5) hist(data,nclass=10,prob=t) Histogram of data Density 0.00 0.05 0.10 0.15 0.20 0 1 2 3 4 5 data What is the probability that a binomial random variable with parameters n and p is equal to x? eg. with n = 10, p =.2, x = 2 dbinom(2, size=10, prob=.2) ## [1] 0.3019899 p(x) = ( nx ) p x (1 p) n x What is the probability that a binomial random variable X with parameters n and p is less than or equal to x? eg. with n = 10, p =.2, x = 2 pbinom(2, size=10, prob=.2) ## [1] 0.6777995 P (X x) = x k=0 ( nx ) p x (1 p) n x Simulate 20 observations from the binomial distribution with n = 10 and p =.2. data=rbinom(2, size=10, prob=.2) data 6

## [1] 1 2 When generating random numbers, use set.seed() to reproduce results. set.seed(100) rnorm(5) ## [1] -0.50219235 0.13153117-0.07891709 0.88678481 0.11697127 rnorm(5) ## [1] 0.3186301-0.5817907 0.7145327-0.8252594-0.3598621 set.seed(100) # reproduce the results rnorm(5) ## [1] -0.50219235 0.13153117-0.07891709 0.88678481 0.11697127 rnorm(5) ## [1] 0.3186301-0.5817907 0.7145327-0.8252594-0.3598621 Useful functions for working on matrices. apply - apply a function over rows or columns of a matrix syntax: apply(matrixname,dimension,fun) example: for the airquality data set, count the number of missing values by row (for each record), and by column (for each variable) nmiss=function(x){sum(is.na(x))} apply(airquality,1,nmiss) ## [1] 0 0 0 0 2 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 0 0 0 0 1 1 1 1 ## [36] 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0 ## [71] 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 ## [106] 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [141] 0 0 0 0 0 0 0 0 0 1 0 0 0 apply(airquality,2,nmiss) ## Ozone Solar.R Wind Temp Month Day ## 37 7 0 0 0 0 mydata=read.csv("http://www-bcf.usc.edu/~gareth/isl/auto.csv") ## %*% -inner product. dot product. matrix multiplication. M1=matrix(1:9,byrow=T,ncol=3) M2=matrix(1:15,byrow=T,ncol=5) M1 ## [,1] [,2] [,3] ## [1,] 1 2 3 ## [2,] 4 5 6 ## [3,] 7 8 9 7

M2 ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1 2 3 4 5 ## [2,] 6 7 8 9 10 ## [3,] 11 12 13 14 15 M1%*%M2 ## [,1] [,2] [,3] [,4] [,5] ## [1,] 46 52 58 64 70 ## [2,] 100 115 130 145 160 ## [3,] 154 178 202 226 250 outer - calculate an outer product between two arrays outer(x, Y, FUN = "*",...) X, Y: First and second arguments for function FUN. Typically a vector or array. FUN: a function to use on the outer products. The function FUN will be called using each element of X and each element of Y. Example: The following calculates a table of values xˆy, for each x in 1:10, and each value of y in -2:2. Row and column names are added to the table. x=1:10 y=-2:2 powertable=outer(x,y,fun="^") rownames(powertable)=x colnames(powertable)=y 8