Lab 5, part b: Scatterplots and Correlation


Toews, Math 160, Fall 2014
November 21, 2014

Objectives:
1. Get more practice working with data frames
2. Start looking at relationships between two variables

Introduction

Most of the techniques we've learned thus far in this class have pertained to a single variable. The truly interesting questions in statistics generally involve two variables, however. For example, we might be interested in whether or not there is a relationship between smoking and lung cancer, or between exercise and longevity. Our goal in two-variable statistics is generally to investigate how different variables influence one another. This lab introduces you to some techniques in R that you can use to start exploring two-variable questions.

The most basic technique at our disposal is a scatterplot: we plot the value of one variable against another, and then examine the plot for revealing patterns. A pattern that looks like a line is particularly compelling, for it suggests that as one variable increases (e.g., smoking levels), the other increases proportionally (e.g., cancer rates). We can calculate a number called the correlation that gives us the strength of the linear relationship between two variables. You'll get some practice calculating this number in this lab.

Due by Monday, December 1
1. Your <YourFirstName>_lab5.R script, in your Dropbox.
2. Turn in your Lab Notebook in class.

Activities

Getting Organized

In Lab 5, Part A, you created a lab5 folder on your laptop. Navigate to that folder and set it as the working directory. Also open up the file lab5.r that you started in the last lab; you'll add the commands you use today to the same file.

Getting the data

Download the files sgpdata.rdata and freelunchdata.rdata to your computer. If you downloaded freelunchdata.rdata last week, redownload it and save over the old file. I've made a few changes to the file and would rather you had the new one.

Load free lunch data

In the file browser in the lower right pane of RStudio, browse to your lab5 directory and then click More -> Set as working directory. Double click on the files freelunchdata.rdata and sgpdata.rdata and import them into your workspace. Alternatively, use the load command:

    load('freelunchdata.rdata')   # load the data
    fld = freelunchdata           # rename the data variable with a short, easy name

Refine your data

In the Environment tab (upper right pane), click on freelunchdata and take a look at the schools. Some schools stand out as not like the others. For example, if we're doing a sociological analysis, we might not want to include the Remann Juvenile Hall Detention Center. We might also like to drop Special Services. Note that these items are on rows 44 and 50, respectively. Here's how we drop them from our data:

    fld = fld[-c(44,50),]

1. Note that square brackets are used for indexing our data frame.
2. Note the minus sign in front of the c(44,50) vector: the minus sign means drop.
3. Note the comma after the -c(44,50) expression: elements before the comma refer to rows, elements after refer to columns.
4. Note that I store the modified free lunch data in the variable fld. I don't modify the original data set freelunchdata, which is still in the workspace, so if I make a mistake, I can always go back and get the original data. This is good practice: keep a pristine copy of the data on hand at all times.

Pause for reflection #1: Are there other schools that you might drop from your analysis? Make some comments in your lab book about which ones, and then go ahead and drop them.

Load test score data

We'll be interested in exploring whether or not there is a relationship between the level of free lunch assistance at a school and the results of standardized testing. To do this, we'll need to load up some standardized test scores:

    load('mathsgpdata.rdata')
    sgp = mathsgpdata             # give data a name that is easy to work with

Take a look at this data set by clicking it in the Environment tab. Note that there are fewer schools represented here, and I've already weeded out a bunch of schools that might not fit within the scope of our analysis.
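
The row-dropping step described under "Refine your data" can be tried out on a small made-up data frame before touching the real one. The sketch below is only an illustration: the data frame schools, its column names, and its values are all hypothetical and are not taken from freelunchdata. It drops rows by position, as in the lab, and also by name, which gives the same result here and does not depend on knowing the row numbers.

```r
# Toy data frame standing in for freelunchdata (hypothetical names and values)
schools <- data.frame(
  School.Name = c("Adams", "Baker", "Detention Center", "Clark", "Special Services"),
  Percent.Eligible = c(42, 67, 100, 55, 80),
  stringsAsFactors = FALSE
)

# Work on a copy so the original stays untouched
s <- schools

# Drop rows by position: the minus sign means "drop",
# and the comma says the numbers refer to rows
by_position <- s[-c(3, 5), ]

# Drop rows by name: this still works if the row numbers change
drop_names <- c("Detention Center", "Special Services")
by_name <- s[!(s$School.Name %in% drop_names), ]

print(by_position)
print(by_name)
print(nrow(schools))  # the original copy still has every row
```

Dropping by name is a design choice rather than a requirement of the lab: it is a little more typing, but it keeps working if the data file is updated and the rows shift.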

Prepare to make a scatter plot: how to get a common set of schools

We're going to focus on the variable MedianSGP. We'd like to form a scatterplot of the percentage of free-lunch eligible students against the median SGP score. To do this, we need to make sure that we have exactly the same set of schools in both data sets. Here's how we can do this:

    idx = fld$school.name %in% sgp$schoolname   # which names from fld are in sgp?
    fld = fld[idx,]                             # restrict fld data to include just these names
    idx = sgp$schoolname %in% fld$school.name   # which names from sgp are in fld?
    sgp = sgp[idx,]                             # restrict sgp data to include just these names

1. The a %in% b command checks which names from a are in b. It returns one TRUE or FALSE value for each name in a, and these values can be used as a row index to pick out the matching rows.
2. Running this command the other way, i.e. b %in% a, checks which names from b are in a, and returns one TRUE or FALSE value for each name in b.
3. By running the command both ways, and restricting the appropriate data set after each one, we limit the data to just those rows that correspond to school names in both data sets.

Make the scatterplot

Now that we've reduced both data sets to consist of just the same school names, making a scatterplot is easy. You do it like this:

    plot(fld$percent.eligible.for.free..reduced.lunch, sgp$mediansgp)

[Figure: scatterplot with fld$percent.eligible.for.free..reduced.lunch on the horizontal axis and sgp$mediansgp on the vertical axis]
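
The matching step can be checked on two tiny data frames. The sketch below uses made-up data frames a and b with hypothetical columns name, lunch, and score; none of these come from the lab's data files. It shows what %in% returns, and it adds one step the lab does not use: match(), which reorders the second data frame so that row i refers to the same school in both. That extra step only matters if the two files list the schools in different orders, which is an assumption here and may not apply to the real data.

```r
# Two toy data frames with overlapping school names (hypothetical values)
a <- data.frame(name = c("Adams", "Baker", "Clark", "Davis"),
                lunch = c(40, 65, 55, 30),
                stringsAsFactors = FALSE)
b <- data.frame(name = c("Clark", "Evans", "Adams"),
                score = c(48, 52, 57),
                stringsAsFactors = FALSE)

# %in% returns one TRUE/FALSE per element of the left-hand vector
in_b <- a$name %in% b$name
print(in_b)

# Keep only the schools present in both data frames
a2 <- a[a$name %in% b$name, ]
b2 <- b[b$name %in% a2$name, ]

# The same schools remain, but not necessarily in the same row order
print(a2$name)
print(b2$name)

# match() gives, for each name in a2, its row position in b2,
# so b2 can be reordered to line up with a2 row by row
b2 <- b2[match(a2$name, b2$name), ]
print(all(a2$name == b2$name))
```

A quick check such as all(a2$name == b2$name) before plotting is a cheap way to confirm that each point on the scatterplot pairs two values from the same school.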

Pause for reflection #2: Take a look at the data. Does it look like there is a relationship? In your lab book, comment on the form, direction, and strength of that relation.

Calculate the correlation

Remember that the correlation is a number between -1 and 1 that characterizes the strength of the linear relationship between two variables. You can calculate the correlation between free lunch eligibility and SGP scores with the following command:

    correl = cor(fld$percent.eligible.for.free..reduced.lunch, sgp$mediansgp)
    correl
    ## [1] -0.1016

Pause for reflection #3: Is the sign of this correlation (positive or negative) what you would expect, based purely on sociological grounds? Is it what you would expect, based on looking at the scatterplot you just generated?

Fit a line to the data

Finally, we'd like to fit a line to the data that shows a rough theoretical relationship between free lunch data and SGP scores. There's a lot of mathematical machinery that goes into making such a line, but it's easy to do in R:

    res = lm(sgp$mediansgp ~ fld$percent.eligible.for.free..reduced.lunch)
    plot(fld$percent.eligible.for.free..reduced.lunch, sgp$mediansgp)
    abline(res)
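
To see what cor and lm are doing, both can be run on a handful of made-up numbers. In the sketch below, x and y are hypothetical values chosen only for illustration; they are not the lab's data. The correlation is computed twice, once with cor and once from its definition as the sum of products of z-scores divided by n - 1, and the two should agree. The fitted line's intercept and slope are then read off with coef.

```r
# Made-up data: x plays the role of percent eligible, y the median score
x <- c(20, 35, 50, 65, 80)
y <- c(58, 52, 49, 47, 40)

# Correlation from the built-in function
r_builtin <- cor(x, y)

# The same number from the definition: sum of products of z-scores,
# divided by n - 1 to match how sd() is computed
n <- length(x)
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
r_by_hand <- sum(zx * zy) / (n - 1)

print(r_builtin)
print(r_by_hand)

# Fit the line y = intercept + slope * x and look at its two coefficients
fit <- lm(y ~ x)
print(coef(fit))
```

With these toy numbers y falls as x rises, so both the correlation and the slope come out negative.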

[Figure: the same scatterplot, fld$percent.eligible.for.free..reduced.lunch on the horizontal axis and sgp$mediansgp on the vertical axis, with the fitted line added]

1. The function lm calculates parameters for a linear model between the two variables. Note the use of the tilde. We store the results of this function in a variable called res.
2. The function abline simply adds a best-fit line to an existing scatterplot. The only thing it needs to form this line is the output of the lm function.
3. CAUTION: ORDER IS IMPORTANT! Note that the plot command you issued above had the fld data first, and then the sgp data; this produces a plot with fld data on the horizontal axis and SGP data on the vertical. In the lm command above, we need to switch the order, putting the vertical-axis variable before the tilde (if you don't, your line won't fit the data!).

Pause for reflection #4: Use your correlation coefficient and your best-fit line to summarize in plain language what you feel the relation might be between free-lunch eligibility and test scores.
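
The caution about order can be checked numerically. The sketch below reuses the same made-up x and y as a stand-in for the lab's variables (hypothetical values, not real data). It fits the model both ways round to show that the two formulas give different slopes, and it checks a standard fact about simple linear regression: the slope of y on x equals the correlation times sd(y) / sd(x), so the slope and the correlation always share a sign.

```r
# Made-up data: x on the horizontal axis, y on the vertical axis
x <- c(20, 35, 50, 65, 80)
y <- c(58, 52, 49, 47, 40)

# plot(x, y) puts x on the horizontal axis, so the matching model is y ~ x
fit_yx <- lm(y ~ x)

# Reversing the formula fits a different line (x predicted from y)
fit_xy <- lm(x ~ y)

slope_yx <- coef(fit_yx)[["x"]]
slope_xy <- coef(fit_xy)[["y"]]
print(slope_yx)
print(slope_xy)

# For a simple linear fit, slope = correlation * sd(y) / sd(x),
# so the slope always has the same sign as the correlation
print(all.equal(slope_yx, cor(x, y) * sd(y) / sd(x)))
```

This link between the slope and the correlation may be useful for Pause for reflection #4: the correlation describes how tightly the points follow a line, while the slope describes how much the vertical variable changes per unit of the horizontal one.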