
Session 1: Reading in Data

Before you begin:
Install RStudio from http://www.rstudio.com/ide/download/ - use the standard installation.
Go to the course website, http://faculty.washington.edu/kenrice/rintro/, and locate the datasets we will be using.

1. Ensure you can read in the datasets used in the lecture, and that you obtain the same data summaries seen in the slides.
2. Read in the crab data, available on the course site. This dataset describes 173 female horseshoe crabs, and contains data on their weight (in kg), width of shell (in cm), color, condition of their spines, and the number of satellites they have (satellites are male crabs, who attach to the females).
   a. What is the mean number of satellites in female crabs whose width is < 26 cm, versus those with width >= 26 cm? (Hint: use the View window to help you)
   b. Tabulate the crabs' color, for
      i. All the crabs
      ii. Those with width < 26 cm
      iii. Those with width >= 26 cm
      (Hint: use the table() command)
3. The othercrab data is also on the course site; it contains the same crab data, but as a comma-separated file, and has a .csv extension. Read in this dataset and check that it has the same values as the version in Q2.
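As a rough sketch of the steps in Q2 and Q3, the code below reads a .csv file, compares group means, and tabulates a categorical variable. The data frame crab_demo, its column names, and its values are made up for illustration; they are not the course dataset, and the real file's column names may differ.

```r
# Hypothetical stand-in for the crab data: column names and values are
# assumptions for illustration, not the course dataset.
crab_demo <- data.frame(
  width      = c(24.5, 27.1, 25.2, 28.3, 26.0, 23.9),
  color      = c("medium", "dark", "medium", "light", "dark", "medium"),
  satellites = c(0, 4, 2, 8, 3, 0)
)

# Round-trip through a temporary .csv file to mimic reading a downloaded copy
tmp <- tempfile(fileext = ".csv")
write.csv(crab_demo, tmp, row.names = FALSE)
crab <- read.csv(tmp)

# Mean number of satellites in narrower versus wider crabs
wide <- crab$width >= 26
mean_narrow <- mean(crab$satellites[!wide])
mean_wide   <- mean(crab$satellites[wide])

# Color tabulated for all crabs, then within each width group
table(crab$color)
table(crab$color[!wide])
table(crab$color[wide])

# One way to check that two copies of a dataset hold the same values
same_values <- isTRUE(all.equal(crab, crab_demo))
```

With the real data, the same logical vector (width >= 26) can be reused for both the means and the tables, which keeps the two groups consistent.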

Session 2: More data summary & using functions

1. Read in the crab data in these two ways:
   a. Directly from the command line, using a downloaded copy of the data.
   b. Directly from the course website. (NB make sure you log on to the wireless network first)
2. Cross-tabulate the color of the crabs and their number of satellites. Which colors of female crabs appear to attract more male horseshoe crabs?
3. In the salary data, we saw that 4 observations had missing $salary variables. Using is.na() and other commands, determine the row numbers of the observations with these missing values.
4. Using the help page for read.csv() to help you, read in the othercrab data again. (NB the .csv format is widely used for small and medium-sized rectangular datasets. It's a good simple choice if you need to transfer data between systems, e.g. from an Excel spreadsheet to R.)
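The ideas behind Q2 and Q3 can be sketched as follows. The data frame salary_demo is a made-up miniature (its columns and values are assumptions), and the cross-tabulation is shown on the built-in mtcars data rather than the crab data.

```r
# Hypothetical miniature of the salary data; values are invented
salary_demo <- data.frame(
  id     = 1:6,
  salary = c(5200, NA, 6100, 4800, NA, 7000)
)

# is.na() gives TRUE/FALSE per row; which() turns that into row numbers
missing_rows <- which(is.na(salary_demo$salary))
missing_rows

# Two-way table (cross-tabulation) of two variables, on built-in data
cyl_by_gear <- with(mtcars, table(cyl, gear))
cyl_by_gear
```

read.csv() accepts a URL in place of a file path, so reading "directly from the course website" uses the same call with the web address as its first argument.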

Session 3: Plotting functions and formulas

1. Using the crab data (for the last time!):
   a. Plot number of satellites versus width
   b. Plot number of satellites versus weight, and color-code the points differently if their width exceeds 26 cm
   c. Illustrate the distribution of number of satellites, for crabs of different colors. (There are several ways to do this)
   For all of these, try re-sizing the graph after it is drawn; what happens to the axes?
2. Save one of your plots from Q1 as a graphics file (e.g. PNG, JPEG, PDF) and then place it in an external document (e.g. a PowerPoint slide). Are the axes big enough?
3. The titanic dataset, on the course site, contains information on who survived the sinking of the Titanic, among Males/Females, 1st/2nd/3rd class passengers and Crew, and Adults/Children. For each category, n denotes the total number, and prop denotes the proportion of them who survived. Using the stripchart() function and the formula syntax, illustrate which categories were most and least likely to survive. (Hint: you will need to use the help page for stripchart(), but its syntax follows the commands seen in the slides)
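A minimal sketch of the formula syntax with stripchart(), combined with saving a plot to a file, is given below. It uses the built-in mtcars data as a stand-in for the titanic data, and writes a PDF to a temporary file; the file name and the text-size settings are arbitrary choices.

```r
# Temporary output file; in practice, choose your own path
out_file <- tempfile(fileext = ".pdf")

# Open a PDF device, draw the plot, then close the device to write the file
pdf(out_file, width = 8, height = 6)
stripchart(mpg ~ cyl, data = mtcars,          # response ~ grouping variable
           method = "jitter", pch = 19,
           xlab = "Miles per gallon", ylab = "Number of cylinders",
           cex.lab = 1.4, cex.axis = 1.4)     # larger axis text for slides
dev.off()

file.exists(out_file)
```

Setting the device size and the cex options when the file is created, rather than resizing afterwards, is one way to keep the axis text legible in an external document.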

Session 4: Adding features to plots

These exercises use the mtcars dataset, which is supplied with R. Use ?mtcars to access a description of it.

1. Create a figure with boxplots to illustrate the distribution of mpg (miles per gallon) for the different cyl (number of cylinders) types, where each boxplot in the figure has a different color. Add a legend to the plot.
2. Create a scatterplot to illustrate the relationship of mpg (miles per gallon) versus wt (car weight).
   a. Using abline(), add a straight line that you think summarizes this relationship. (Hint: experiment with a few straight lines, replotting the graph each time)
   b. Add scatterplot smoothers to the plot using the lowess() and supsmu() functions, where one of the lines is red and the other is blue. (Hint: use the commands in the slides from session 4 on scatterplot smoothers!)
   c. Use legend() to add an appropriate legend to the scatterplot.
3. Do question 2 again, but this time give the two scatterplot smoothers different line types as well as different colors, where one line is dotted and the other is dashed. Include an appropriate legend. (Hint: for plotting different line types, use the lty option with the lines() function, e.g., set lty=2 in the function call).
4. Put your results from Q1/2/3 into one figure, using layout().
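One possible arrangement of these pieces is sketched below: two panels placed with layout(), colored boxplots with a legend, and a scatterplot with lowess() and supsmu() smoothers. The colors, line types, and legend positions are arbitrary choices, not the ones from the slides.

```r
# Two panels side by side: layout() takes a matrix of panel numbers
layout(matrix(1:2, nrow = 1))

# Panel 1: one colored box per cylinder count, plus a legend
box_cols <- c("tomato", "gold", "skyblue")
boxplot(mpg ~ cyl, data = mtcars, col = box_cols,
        xlab = "Number of cylinders", ylab = "Miles per gallon")
legend("topright", fill = box_cols,
       legend = paste(sort(unique(mtcars$cyl)), "cylinders"))

# Panel 2: scatterplot with two smoothers in different colors and line types
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
lo <- with(mtcars, lowess(wt, mpg))
su <- with(mtcars, supsmu(wt, mpg))
lines(lo, col = "red",  lty = 2)   # dashed
lines(su, col = "blue", lty = 3)   # dotted
legend("topright", legend = c("lowess", "supsmu"),
       col = c("red", "blue"), lty = c(2, 3))

layout(1)  # restore a single-panel layout
```

Both smoothers return a list with x and y components, which is why their results can be passed straight to lines().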

Session 5: Over and over

1. Using the code from the slides, obtain the approximate value of:
   a. The expected value of the median of a sample of size 501, from the Exp(1) distribution
   b. The variance of the same random variable
   In both cases, compare your answers to those in the slides, which use samples of size 51.
2. The correlation of two random variables can be obtained using the cor() function, so for example in the built-in mtcars dataset, the correlation of weight and miles per gallon can be obtained from
      with(mtcars, cor(mpg, wt))
   Using the correlation as the test statistic, implement a permutation test of the null hypothesis that mpg and wt are uncorrelated.
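The permutation test in Q2 could be sketched along the following lines. The number of permutations, the seed, and the "+1" adjustment in the p-value (which counts the observed arrangement as one permutation) are choices made here for illustration, not necessarily those in the slides.

```r
set.seed(1)  # arbitrary seed, so the sketch is reproducible

# Observed test statistic
obs_cor <- with(mtcars, cor(mpg, wt))

# Under the null hypothesis, shuffling wt breaks any link with mpg
n_perm <- 2000
perm_cor <- numeric(n_perm)
for (i in 1:n_perm) {
  perm_cor[i] <- cor(mtcars$mpg, sample(mtcars$wt))
}

# Two-sided p-value: share of shuffled correlations at least as extreme
# as the observed one, counting the observed arrangement itself
p_value <- (1 + sum(abs(perm_cor) >= abs(obs_cor))) / (n_perm + 1)
p_value
```

A histogram of perm_cor with a vertical line at obs_cor (via abline(v = obs_cor)) gives a quick visual check of how unusual the observed correlation is.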

Session 6: More loops, control structures, and bootstrapping

Download the lawschool dataset from the course website, and read it into your R session. This dataset is a random sample of 15 law schools, taken from a collection of 82 participating law schools in a large study of admission practices. Two measurements were made on the entering classes of each school: LSAT, the average score for the class on a national law test, and GPA, the average undergraduate grade-point average for the class. We will estimate the correlation of LSAT and GPA, and obtain a 95% confidence interval for this correlation.

1. Illustrate the relationship of LSAT and GPA using a plot (or plots) of your choice.
2. Calculate the correlation of LSAT and GPA with the cor() function.
3. Bootstrap the correlation of LSAT and GPA using a repeat or while loop. Start with 50 bootstrap replicates, and then try using 200, 800, and 3,200 bootstrap replicates. Keep the output from all four versions.
4. In one graphical window, plot histograms of the bootstrap correlation values for the four bootstrap replicate sizes considered in the previous question (50, 200, 800, and 3,200 replicates). Also calculate the standard deviation of the bootstrapped correlation of LSAT and GPA for each of the bootstrap replicate sizes.
5. Calculate a 95% confidence interval for the correlation of LSAT and GPA using the bootstrap quantiles with 3,200 bootstrap replicates.
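A sketch of the bootstrap steps follows. Because the lawschool data must be downloaded, mpg and wt from the built-in mtcars data stand in for LSAT and GPA here; the helper name boot_cor and the seed are hypothetical choices.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Bootstrap the correlation: resample rows with replacement, recompute cor()
boot_cor <- function(x, y, n_boot) {
  out <- numeric(n_boot)
  b <- 1
  while (b <= n_boot) {
    idx <- sample(length(x), replace = TRUE)
    out[b] <- cor(x[idx], y[idx])
    b <- b + 1
  }
  out
}

sizes <- c(50, 200, 800, 3200)
boots <- lapply(sizes, function(B) boot_cor(mtcars$mpg, mtcars$wt, B))

# Four histograms in one graphics window
par(mfrow = c(2, 2))
for (k in seq_along(sizes)) {
  hist(boots[[k]], main = paste(sizes[k], "replicates"),
       xlab = "Bootstrap correlation")
}
par(mfrow = c(1, 1))

# Bootstrap standard deviations, and a 95% percentile interval
# from the largest run
boot_sd <- sapply(boots, sd)
ci_95 <- quantile(boots[[4]], c(0.025, 0.975))
```

The key point is that whole rows are resampled: the same index vector idx is applied to both variables, so each resampled pair stays together.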

Session 7: Fitting models

1. Using the built-in mtcars dataset:
   a. Implement a t-test of the null hypothesis that the average miles per gallon is equal in automatic and manual cars. (The am variable is 1 for manual, 0 for automatic) Check that your output is sensible by plotting a boxplot of the same data.
   b. Implement analysis of variance, to assess whether the mean miles per gallon is equal in cars with different numbers of forward gears. Again, check your output using a boxplot, and be careful to use a factor representation of the gear variable.
2. Again using the mtcars dataset, implement linear regression of miles per gallon on weight. How does this compare to your eyeball estimate in Session 4? Obtain a p-value assessing the hypothesis that the linear trend in this dataset is flat; how does it compare to the permutation p-value from Session 5?
3. Obtain a 95% confidence interval for the linear trend between LSAT and GPA, in the lawschool dataset. Compare this with what you got in Session 6.
4. [For keen people!] The titaniclong dataset on the course site contains individual Titanic survival data; each row of the dataset represents one person. Implement logistic regression of survival (1/0) on class, and produce 95% confidence intervals for the odds ratios comparing survival in 1st and 2nd classes to 3rd class.
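The model-fitting calls involved can be sketched on mtcars as below. The logistic regression part is only an analogue: it regresses am on a three-level factor (cyl, with 8 cylinders as the reference level) to mimic the structure of survival on class, and it uses Wald intervals from confint.default(), which is one of several possible interval methods.

```r
# Welch two-sample t-test of mpg by transmission (am: 1 = manual, 0 = automatic)
tt <- t.test(mpg ~ am, data = mtcars)
boxplot(mpg ~ am, data = mtcars, names = c("automatic", "manual"))

# One-way analysis of variance; factor() makes gear a categorical predictor
gear_aov <- aov(mpg ~ factor(gear), data = mtcars)
summary(gear_aov)

# Linear regression of mpg on weight: slope, its p-value, and a 95% interval
fit <- lm(mpg ~ wt, data = mtcars)
slope    <- unname(coef(fit)["wt"])
slope_p  <- summary(fit)$coefficients["wt", "Pr(>|t|)"]
slope_ci <- confint(fit, "wt", level = 0.95)

# Logistic regression analogue (NOT the Titanic model): binary outcome on a
# three-level factor, with the reference level chosen by relevel()
cars <- mtcars
cars$cyl_f <- relevel(factor(cars$cyl), ref = "8")
logit_fit <- glm(am ~ cyl_f, data = cars, family = binomial)

# Odds ratios and Wald 95% intervals: exponentiate the log-odds scale
odds_ratios <- exp(cbind(OR = coef(logit_fit),
                         confint.default(logit_fit)))[-1, ]
odds_ratios
```

Dropping the first row removes the intercept, leaving one odds ratio per non-reference level, each comparing that level to the reference.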

Session 8: Introduction to R Packages

The ggplot2 package is a widely-used data visualization package for enhanced graphics in R. In this session we will use ggplot2 for plotting data from the salary dataset. The two main generic plotting functions of ggplot2 are qplot() and ggplot(). These functions work by trying to guess a useful form of plot, depending on the user-supplied data and desired geometry. In addition to their help pages, detailed information about these (and other) plotting functions in this package can be found at http://ggplot2.org/.

1. Using either the drop-down menus or the command line (with the install.packages() and library() commands) install this package and load it into your current R session.
2. Some qplot() commands to plot smooth density curves of salary for each gender are given below. Create a similar density plot of salary by rank.
      qplot(salary, geom="density", fill=gender, data=salary, xlab="monthly Salary", ylab="density")
      qplot(salary, geom="density", fill=gender, data=salary, xlab="monthly Salary", ylab="density", alpha=0.5)
3. By adjusting the geom argument described in the documentation for qplot(), use qplot() to plot a histogram of salaries, and a histogram of salaries across rank. Finally, create a histogram of salary by gender within each rank (hint: use the interaction() command).
4. The ggplot() command extends the basic idea of qplot(); the user has to specify aesthetic mappings (via the aes() command) that describe how variables in the data are mapped to visual properties, after which other aspects can be added. For example, for a boxplot of salary by rank and gender, we can use
      ggplot(data=salary, aes(x = rank, y = salary, fill = gender) ) + geom_boxplot()
   Another add-on to the plot is to break it down by another variable, using + facet_wrap(~variablename). Using the help page for facet_wrap() to help you, create boxplots of salary by gender within each rank across fields.
5. [For keen people!] Illustrate the distribution of salary by gender within each rank for the different starting years.
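The ggplot() pattern described in Q4 can be sketched as below. The data frame salary_demo is simulated: its column names follow the exercise text, but the category labels and all values are assumptions, not the course data. The plotting code runs only if ggplot2 is installed.

```r
# Simulated stand-in for the salary data; every value is randomly generated
set.seed(3)
n <- 120
salary_demo <- data.frame(
  salary = round(rnorm(n, mean = 6000, sd = 1200)),
  gender = sample(c("F", "M"), n, replace = TRUE),
  rank   = sample(c("Assist", "Assoc", "Full"), n, replace = TRUE),
  field  = sample(c("Arts", "Other", "Prof"), n, replace = TRUE)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Density of salary by rank; alpha sits outside aes() because it is a
  # fixed setting rather than a mapping from a variable
  p_density <- ggplot(salary_demo, aes(x = salary, fill = rank)) +
    geom_density(alpha = 0.5) +
    labs(x = "Monthly salary", y = "Density")

  # Boxplots of salary by rank and gender, one panel per field
  p_box <- ggplot(salary_demo, aes(x = rank, y = salary, fill = gender)) +
    geom_boxplot() +
    facet_wrap(~ field)

  print(p_density)
  print(p_box)
}
```

Each "+" adds a layer or a modification to the plot object, so the same base call can be reused with different geoms or facets.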

Special Exercise

This exercise is for you to try in the evening. We will review various solutions to it in the final session, so even if you don't get a complete solution, tackling some of it will be useful preparation for the final session.

The goal is to implement Conway's Game of Life in R. This game is a simple evolutionary model, where cells on a grid either live or die according to the following rules:

1. Any live cell with fewer than two live neighbors dies, as if caused by under-population.
2. Any live cell with two or three live neighbors lives on to the next generation.
3. Any live cell with more than three live neighbors dies, as if by overcrowding.
4. Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.

At each generation, we count the neighbors, then do all the updates to status (i.e. live/dead). For example:

[Figure: three 7x7 grids labeled First Generation, Count of Neighbors, and Second Generation. The Count of Neighbors grid reads:]

   1 1 2 2 3 2 1
   1 0 3 2 3 1 1
   2 2 4 3 6 4 2
   2 1 3 3 3 2 1
   3 3 4 3 3 4 2
   2 2 2 2 2 1 1
   1 2 2 1 1 1 1

Your code should plot the grid of alive/dead cells in the R graphics window, for several generations. Some animations showing examples are on the course site.

Some tips (note we'll give more tips in the evening session):

- Start with a small grid, e.g. 10x10, but write your code so that it can be scaled up later
- Use the rect() function to draw the grid
- At each generation, use a double for() loop, i.e.
     for(i in 1:nrows){
       for(j in 1:ncols){
         <do some operations on cell[i,j]>
     }}
- There are two options for what to do with the edges and corners:
  o You may count the neighbors based on the visible grid, so e.g. cells on the edges can have up to 5 neighbors, while those in the corners have up to 3. With this approach, your code for counting neighbors will be different for edge cells, corner cells, and regular internal cells.
  o You may wrap around the definition of neighbors, so e.g. the left edge cells neighbor those on the right edge. With this approach, modular arithmetic will be useful; to do this in R try e.g. 1:20 %% 7, to generate the remainders when you divide 1:20 by 7.
- Once your code works on small examples, try it on larger grids, initiated at random. For example, matrix(rbinom(40*40, 1, 0.3),40,40) generates a 40x40 matrix with entries that have 70% chance of being 0, and 30% chance of being 1.
- [For keen people] Keep track of how long a cell has been alive, and color-code according to age.
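One possible shape for a solution, using the wrap-around option and the double for() loop from the tips, is sketched below. The function names (count_neighbors, next_generation, draw_grid) and the coding of cells as 1 = live, 0 = dead are choices made here for illustration; many other designs work.

```r
# Count live neighbors with wrap-around edges, using %% for the wrap
count_neighbors <- function(cell) {
  nr <- nrow(cell); nc <- ncol(cell)
  counts <- matrix(0, nr, nc)
  for (i in 1:nr) {
    for (j in 1:nc) {
      total <- 0
      for (di in -1:1) {
        for (dj in -1:1) {
          if (di == 0 && dj == 0) next        # skip the cell itself
          ni <- (i + di - 1) %% nr + 1        # maps 0 to nr, and nr + 1 to 1
          nj <- (j + dj - 1) %% nc + 1
          total <- total + cell[ni, nj]
        }
      }
      counts[i, j] <- total
    }
  }
  counts
}

# Apply the four rules to every cell at once, from one set of counts
next_generation <- function(cell) {
  n <- count_neighbors(cell)
  survive <- cell == 1 & (n == 2 | n == 3)
  born    <- cell == 0 & n == 3
  (survive | born) * 1
}

# Draw live cells as filled squares with rect()
draw_grid <- function(cell) {
  nr <- nrow(cell); nc <- ncol(cell)
  plot(0, 0, type = "n", xlim = c(0, nc), ylim = c(0, nr), asp = 1,
       axes = FALSE, xlab = "", ylab = "")
  for (i in 1:nr) for (j in 1:nc) {
    rect(j - 1, nr - i, j, nr - i + 1, border = "grey",
         col = if (cell[i, j] == 1) "black" else "white")
  }
}

# A "blinker": three cells in a row that flip between horizontal and vertical
blinker <- matrix(0, 5, 5)
blinker[3, 2:4] <- 1
gen2 <- next_generation(blinker)
draw_grid(gen2)
```

Computing all the counts before changing any cell matters: updating cells one at a time inside the loop would let early updates affect later neighbor counts within the same generation.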

Session 9: Writing functions

1. The factorial of a non-negative integer n is defined to be $n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1$, so for example, $4! = 4 \times 3 \times 2 \times 1 = 24$. Create a function that takes a non-negative integer as the argument and returns the factorial of the integer. (Hint: you can use a while loop, but there are many ways to do this). What is the value of 10!?
2. The formula for converting a temperature in Fahrenheit (F) to Celsius (C) is:
      $C = \frac{5}{9}(F - 32)$
   Write a function that converts a Fahrenheit temperature to Celsius. Use this function to create a data frame containing Fahrenheit values 30, 31, 32, up to 100, and the corresponding temperatures in Celsius.
3. Obtain a root for the following function with the Newton-Raphson method:
      $f(x) = 5x^3 - 4x^2 + 12x - 7$
   (Hint: Implement the Newton-Raphson function given in the session 9 slides).
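The three functions might be sketched as follows, taking the polynomial in Q3 as f(x) = 5x^3 - 4x^2 + 12x - 7. The Newton-Raphson routine here is a generic version written for illustration, not the one from the session 9 slides; its name, starting value, tolerance, and iteration limit are all assumptions.

```r
# Factorial via a while loop
my_factorial <- function(n) {
  stopifnot(n >= 0, n == round(n))
  result <- 1
  while (n > 1) {
    result <- result * n
    n <- n - 1
  }
  result
}
my_factorial(10)  # 3628800

# Fahrenheit to Celsius, vectorized so it works on a whole column
f_to_c <- function(f) (5 / 9) * (f - 32)
temps <- data.frame(fahrenheit = 30:100)
temps$celsius <- f_to_c(temps$fahrenheit)

# Generic Newton-Raphson iteration: x <- x - f(x) / f'(x)
newton_raphson <- function(f, fprime, x0, tol = 1e-10, max_iter = 100) {
  x <- x0
  for (iter in 1:max_iter) {
    step <- f(x) / fprime(x)
    x <- x - step
    if (abs(step) < tol) return(x)
  }
  stop("no convergence within max_iter iterations")
}

f      <- function(x) 5 * x^3 - 4 * x^2 + 12 * x - 7
fprime <- function(x) 15 * x^2 - 8 * x + 12   # derivative of f
root <- newton_raphson(f, fprime, x0 = 1)
root
```

Plugging the result back in, f(root) should be very close to zero, which is a simple check that the iteration has found a root.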