Lab 5, part b: Scatterplots and Correlation
Toews, Math 160, Fall 2014
November 21, 2014

Objectives:
1. Get more practice working with data frames
2. Start looking at relationships between two variables

Introduction

Most of the techniques we've learned thus far in this class have pertained to a single variable. The truly interesting questions in statistics, however, generally involve two variables. For example, we might be interested in whether or not there is a relationship between smoking and lung cancer, or between exercise and longevity. Our goal in two-variable statistics is generally to investigate how different variables influence one another.

This lab introduces you to some techniques in R that you can use to start exploring two-variable questions. The most basic technique at our disposal is a scatterplot: we plot the value of one variable against another, and then examine the plot for revealing patterns. A pattern that looks like a line is particularly compelling, for it suggests that as one variable increases (e.g. smoking levels), the other increases proportionally (e.g. cancer rates). We can calculate a number called the correlation that measures the strength of the linear relationship between two variables. You'll get some practice calculating this number in this lab.

Due by Monday, December 1
1. Your <YourFirstName>_lab5.R script, in your Dropbox.
2. Turn in your Lab Notebook in class.

Activities

Getting Organized

In Lab 5, Part A, you created a lab5 folder on your laptop. Navigate to that folder and set it as the working directory. Also open up the file lab5.r that you started in the last lab; you'll add the commands you use today to the same file.

Getting the data

Download the files sgpdata.rdata and freelunchdata.rdata to your computer. If you downloaded freelunchdata.rdata last week, redownload it and save over the old file; I've made a few changes to the file and would rather you had the new one.
Load free lunch data

In the file browser in the lower right pane of RStudio, browse to your lab5 directory and then click More -> Set as working directory. Double click on the files freelunchdata.rdata and sgpdata.rdata and import them into your workspace. Alternatively, use the load command:

    load('freelunchdata.rdata')   # load the data
    fld = freelunchdata           # rename the data variable with a short, easy name

Refine your data

In the Environment tab (upper right pane), click on freelunchdata and take a look at the schools. Some schools stand out as not like the others. For example, if we're doing a sociological analysis, we might not want to include the Remann Juvenile Hall Detention Center. We might also like to drop Special Services. Note that these items are on rows 44 and 50, respectively. Here's how we drop them from the data:

    fld = fld[-c(44,50),]

1. Note that square brackets are used for indexing our data frame.
2. Note the minus sign in front of the c(44,50) vector: the minus sign means drop.
3. Note the comma after the -c(44,50) expression: elements before the comma refer to rows, elements after refer to columns.
4. Note that I store the modified free lunch data in the variable fld and leave the original data set freelunchdata untouched. It is still in the workspace, so if I make a mistake, I can always go back and get the original data. This is good practice: keep a pristine copy of the data on hand at all times.

Pause for reflection #1: Are there other schools that you might drop from your analysis? Make some comments in your lab book about which ones, and then go ahead and drop them.

Load test score data

We'll be interested in exploring whether or not there is a relationship between the level of free lunch assistance at a school and the results of standardized testing.
To do this, we'll need to load up some standardized test scores:

    load('mathsgpdata.rdata')
    sgp = mathsgpdata   # give data a name that is easy to work with

Take a look at this data set by clicking it in the Environment tab. Note that there are fewer schools represented here, and I've already weeded out a bunch of schools that might not fit within the scope of our analysis.
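Before matching the two data sets, it may help to try the row-dropping step from "Refine your data" on a small made-up data frame. The sketch below is illustrative only: the school names and the percent.eligible column are invented, and only the column name school.name is taken from the lab. It shows dropping rows by position, as in the lab, alongside dropping them by name, which is one possible way to avoid depending on row numbers if a data file is later updated.

```r
# Made-up stand-in for the free lunch data (hypothetical names and values)
toy = data.frame(
  school.name = c("Adams", "Baker", "Detention Center", "Clark", "Special Services"),
  percent.eligible = c(62, 35, 90, 48, 75),
  stringsAsFactors = FALSE
)

# Drop by row number, as in the lab: the minus sign removes rows 3 and 5
by_position = toy[-c(3, 5), ]

# Drop by name instead: keep the rows whose name is NOT in the drop list
drop_names = c("Detention Center", "Special Services")
by_name = toy[!(toy$school.name %in% drop_names), ]

# Check how many rows are left and which schools they are
nrow(toy)
nrow(by_name)
by_name$school.name

# Both approaches keep the same schools in this toy example
all(by_position$school.name == by_name$school.name)
```

The original toy data frame is never overwritten, mirroring the lab's advice to keep a pristine copy of the data.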
Prepare to make a scatter plot: how to get a common set of schools

We're going to focus on the variable MedianSGP (the column mediansgp in the code below). We'd like to form a scatterplot of the percentage of free-lunch eligible students against the median SGP score. To do this, we need to make sure that we have exactly the same set of schools in both data sets. Here's how we can do this:

    idx = fld$school.name %in% sgp$schoolname   # which names from fld are in sgp?
    fld = fld[idx,]                             # restrict fld data to include just these names
    idx = sgp$schoolname %in% fld$school.name   # which names from sgp are in fld?
    sgp = sgp[idx,]                             # restrict sgp data to include just these names

1. The a %in% b command checks which names from a are in b, and returns a TRUE/FALSE value for each element of a. We use these values to select rows.
2. Running this command the other way, i.e. b %in% a, checks which names from b are in a, and returns a TRUE/FALSE value for each element of b.
3. By running the command both ways, and restricting the appropriate data set after each one, we limit the data to just those rows that correspond to school names in both data sets. Each data set keeps its own row order, so the plot below assumes the schools are listed in the same order in both.

Make the scatterplot

Now that we've reduced both data sets to the same school names, making a scatterplot is easy. You do it like this:

    plot(fld$percent.eligible.for.free..reduced.lunch, sgp$mediansgp)

[Figure: scatterplot of sgp$mediansgp (vertical axis) against fld$percent.eligible.for.free..reduced.lunch (horizontal axis)]
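The matching step used to build the common set of schools can be tried on a small made-up example. The sketch below uses invented school names and values; only the column names school.name, schoolname, and mediansgp follow the lab. It also adds a step that is not part of the lab: using match() to put the two data sets in the same row order, which matters whenever the schools might be listed in different orders.

```r
# Made-up data (hypothetical names and values)
fld_toy = data.frame(school.name = c("Adams", "Baker", "Clark", "Davis"),
                     pct = c(62, 35, 48, 80), stringsAsFactors = FALSE)
sgp_toy = data.frame(schoolname = c("Clark", "Adams", "Evans"),
                     mediansgp = c(51, 44, 58), stringsAsFactors = FALSE)

# %in% returns one TRUE/FALSE value per element of the left-hand vector
idx = fld_toy$school.name %in% sgp_toy$schoolname
idx                          # TRUE FALSE TRUE FALSE
fld_toy = fld_toy[idx, ]     # keep only schools that also appear in sgp_toy

idx = sgp_toy$schoolname %in% fld_toy$school.name
sgp_toy = sgp_toy[idx, ]     # keep only schools that also appear in fld_toy

# Same schools now, but not in the same order
fld_toy$school.name          # "Adams" "Clark"
sgp_toy$schoolname           # "Clark" "Adams"

# match() gives, for each name in fld_toy, its row position in sgp_toy;
# indexing by those positions reorders sgp_toy to line up with fld_toy
pos = match(fld_toy$school.name, sgp_toy$schoolname)
sgp_toy = sgp_toy[pos, ]

all(fld_toy$school.name == sgp_toy$schoolname)   # TRUE: rows now correspond
```

A quick check like the last line, run on the real fld and sgp data, would confirm whether the rows already line up before plotting one column against the other.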
Pause for reflection #2: Take a look at the data. Does it look like there is a relationship? In your lab book, comment on the form, direction, and strength of that relationship.

Calculate the correlation

Remember that the correlation is a number between -1 and 1 that characterizes the strength of the linear relationship between two variables. You can calculate the correlation between free lunch eligibility and SGP scores with the following command:

    correl = cor(fld$percent.eligible.for.free..reduced.lunch, sgp$mediansgp)
    correl
    ## [1] -0.1016

Pause for reflection #3: Is the sign of this correlation (positive or negative) what you would expect, based purely on sociological grounds? Is it what you would expect, based on looking at the scatterplot you just generated?

Fit a line to the data

Finally, we'd like to fit a line to the data that shows a rough theoretical relationship between free lunch data and SGP scores. There's a lot of mathematical machinery that goes into making such a line, but it's easy to do in R:

    res = lm(sgp$mediansgp ~ fld$percent.eligible.for.free..reduced.lunch)
    plot(fld$percent.eligible.for.free..reduced.lunch, sgp$mediansgp)
    abline(res)
[Figure: scatterplot of sgp$mediansgp (vertical axis) against fld$percent.eligible.for.free..reduced.lunch (horizontal axis), with the best-fit line added]

1. The function lm calculates parameters for a linear model between the two variables. Note the use of the tilde. We store the results of this function in a variable called res.
2. The function abline simply adds a best-fit line to an existing scatterplot. The only thing it needs to form this line is the output of the lm function.
3. CAUTION: ORDER IS IMPORTANT! Note that the plot command you issued above had the fld data first, and then the sgp data; this produces a plot with fld data on the horizontal axis and SGP data on the vertical. In the lm command, we need to switch the order: the vertical-axis variable (sgp) goes before the tilde and the horizontal-axis variable (fld) goes after it. If you don't, your line won't fit the data!

Pause for reflection #4: Use your correlation coefficient and your best-fit line to summarize in plain language what you feel the relationship might be between free-lunch eligibility and test scores.
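To see how the correlation and the best-fit line relate, here is a minimal sketch on made-up numbers, not the lab's data. The vectors x and y are invented stand-ins for free lunch eligibility and median SGP. The sketch computes the correlation from standardized values, compares it with cor(), and checks the standard relation between the fitted slope and the correlation, slope = r * sd(y) / sd(x).

```r
# Made-up data: x stands in for free lunch eligibility, y for median SGP
x = c(20, 35, 50, 65, 80)
y = c(55, 48, 52, 45, 47)

# Correlation from standardized values: sum the products of the
# standardized x and y values, then divide by n - 1 (to match sd())
n = length(x)
r_by_hand = sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (n - 1)
r = cor(x, y)
all.equal(r_by_hand, r)      # the two calculations agree

# The response (vertical-axis variable) goes to the left of the tilde
res = lm(y ~ x)
slope = unname(coef(res)[2])

# The fitted slope equals the correlation rescaled by the two spreads
all.equal(slope, r * sd(y) / sd(x))

# Swapping the order fits a different line (x predicted from y),
# so its slope is not the one that belongs on a plot of y against x
res_swapped = lm(x ~ y)
slope_swapped = unname(coef(res_swapped)[2])
slope
slope_swapped
```

Running plot(x, y) followed by abline(res) draws the toy data with its fitted line, in the same way as the lab's commands. The slope and the correlation always share the same sign, which is one way to sanity-check the line against the value of cor.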