Statistical Analysis in R Guest Lecturer: Maja Milosavljevic January 28, 2015


Data Exploration

Import Relevant Packages:

    library(grDevices)
    library(graphics)
    library(plyr)
    library(hexbin)
    library(base)
    library(stats)
    library(mosaic)
    library(datasets)

The Lahman package contains Sean Lahman's Baseball Database stored as a set of R data.frames. Today we will be using the Teams data.frame.

    library(Lahman)
    data(Teams)

What variables are in the Teams data.frame? For variable descriptions, see packages/lahman/lahman.pdf.

    names(Teams)

We can easily view the first few rows of a dataset with the head() command. This gives us an idea of what the data looks like without having to open the entire data.frame.

    head(Teams)

If we want information on the class (data type or structure) of each variable and the kinds of values it takes on, we can use the str() command.

    str(Teams)

Which teams have Boston in their name?

    unique(Teams[grep("Boston", Teams$name), c("name")])

Subset on Boston Red Sox:

    redsox = subset(Teams, name == "Boston Red Sox")

How would we find the unique values that the variable W (wins) takes on in the Red Sox dataset?

    sort(unique(redsox$W))

What about the unique values of W in the Teams dataset? Try on your own!

Summary Statistics

Let's get a little more information on the Wins variable in both the Red Sox and Teams datasets. We can easily find the minimum, maximum, median, mean, and 1st and 3rd quartiles using the summary() command. Going a little deeper, we can draw boxplots of Wins for both datasets and visually compare the summary statistics. Note that the boxplots show all of these summary statistics except the mean.

    summary(redsox$W)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##
    summary(Teams$W)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##

    boxplot(redsox$W, Teams$W, range = 0, names = c("Red Sox", "Teams"), ylab = "Wins")

[Figure: side-by-side boxplots of Wins for the Red Sox and Teams datasets]

What conclusions can we make?

Correlation Coefficient

Now let's say we're interested in discovering whether Wins is positively correlated with Home Runs (the number of wins increases as the number of home runs increases). We can start by plotting the two variables against each other:

    plot(Teams$HR, Teams$W, xlab = "Home Runs", ylab = "Wins")

Before we go on to calculate the correlation coefficient, let's make a plot that is a little more visually pleasing than the one we just produced. We are Visual Analysts after all.

    hbin <- hexbin(Teams$HR, Teams$W, xbins = 20, xlab = "HR", ylab = "W")
    plot(hbin)

[Figure: hexagonal-bin plot of W against HR, shaded by counts]

Wins appears to increase somewhat as Home Runs increases. Let's calculate the correlation coefficient to be sure. The correlation coefficient can take on any value between -1 and 1, inclusive. A value of -1 represents a perfect negative linear relationship (one variable decreases as the other increases), a value of 0 represents no linear correlation, and a value of 1 represents a perfect positive linear relationship.

Calculate the correlation coefficient:

    cor(Teams$W, Teams$HR)
    ## [1]

Thus, we have a moderate positive relationship between Home Runs and Wins.

Data Enrichment

The Teams dataset has a lot of valuable information in it that we can use to perform various explorations. But let's say we're interested in exploring Winning Proportion and Runs Scored versus Runs Allowed, two variables that do not exist in our dataset. What do we do? We create the new variables ourselves! Don't assume that you're stuck with the variables in your dataset. You can always create and add new variables.

Winning Proportion is the number of Wins divided by Wins plus Losses. Runs Scored versus Runs Allowed is the number of Runs minus the number of Runs Allowed.

    df <- mutate(Teams, WP = W / (W + L), RunDiff = R - RA)

Now we have a new data.frame that contains the two additional variables we're interested in.
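As a small check of the two derived variables described above, the sketch below builds them with base R's transform() on a made-up three-row data frame. The object names `toy` and `toy2` and all of the values are hypothetical; only the column names (W, L, R, RA) follow the Teams data.frame.

```r
# Hypothetical toy data; only the column names mirror Teams.
toy <- data.frame(
  W  = c(90, 81, 60),     # wins
  L  = c(72, 81, 102),    # losses
  R  = c(800, 700, 600),  # runs scored
  RA = c(650, 700, 780)   # runs allowed
)

# Base-R counterpart of the mutate() call: add WP and RunDiff columns.
toy2 <- transform(toy, WP = W / (W + L), RunDiff = R - RA)
print(toy2)
```

A team with as many wins as losses should come out with WP equal to 0.5, and a team that scores exactly as many runs as it allows should have RunDiff equal to 0, which makes the second row a convenient sanity check.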

Missing Values

R stores a missing value as NA. Use is.na() to check for missing values in your data. Use the na.rm = TRUE argument when computing statistics, such as the mean, on vectors with missing values. Use na.omit() to remove all rows with missing values from a dataset.

    head(Teams$DP)
    is.na(Teams$DP)
    sum(is.na(Teams$DP))
    mean(Teams$DP)                # NA if any value of DP is missing
    mean(Teams$DP, na.rm = TRUE)
    nomissing = na.omit(Teams)
    sum(is.na(nomissing))

Quantify Uncertainties

The Teams dataset is a full, complete dataset that contains ALL of the data for the population of Baseball Statistics. This is a rarity; we usually have only a random sample of the population that we are trying to learn about. In these cases, uncertainty quantification is key, so here is how we do it.

Disclaimer: The only reason we're creating a random sample is so we can learn how the bootstrap technique works. Since we already have ALL of the data in the Teams dataset, we know with certainty that the true population mean of DP is simply the mean DP from the Teams dataset.

    # create our random sample to justify the usage of the bootstrap technique
    sample = Teams[sample(nrow(Teams), 1000), ]

Using our random sample of Baseball Statistics throughout history, we can provide information on the true population mean of Double Plays (DP). To do this we calculate a confidence interval using the sample mean DP. A confidence interval allows us to quantify our uncertainty in the sample estimate. If we end up with a narrow confidence interval, then we know that our sample estimate is a good estimate of the true DP mean. However, if the interval is wide, then we should use caution when using the sample estimate.

Today, we will use the bootstrap technique to find the confidence interval. The bootstrap technique works as follows:

Let n = the number of observations you have in your dataset.

1. Create a new dataset of n observations by resampling from your original dataset with replacement.
2. Compute the statistic of interest on this sample. Here we are interested in the mean.
3. Repeat steps 1 and 2 many times and collect the results.

In step 1, why do we need to sample with replacement?

Question: How confident are we that the mean DP of this sample represents the true population mean?

Answer:

    mean(sample$DP, na.rm = TRUE)  # sample DP mean
    bstrap <- do(10000) * mean(resample(sample$DP), na.rm = TRUE)
    densityplot(~result, data = bstrap, plot.points = FALSE, xlab = "Mean Double Plays")
    qdata(c(0.025, 0.975), vals = result, data = bstrap)

We are 95% confident that the true population mean for Double Plays (DP) lies between the 2.5% and 97.5% quantiles returned by qdata(). We have quantified our uncertainty in the sample mean estimate.
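The three bootstrap steps can also be written out in base R, without the mosaic helpers. The following is a minimal sketch under stated assumptions: the vector `x` is simulated, made-up data standing in for the DP column, and the seed and the number of replicates `B` are arbitrary choices for illustration.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical sample standing in for the DP values (made-up numbers).
x <- rnorm(200, mean = 140, sd = 25)

n <- length(x)
B <- 2000  # number of bootstrap replicates (arbitrary choice)

# Steps 1-3: resample n values with replacement, take the mean, repeat B times.
boot_means <- replicate(B, mean(sample(x, size = n, replace = TRUE)))

# Percentile 95% confidence interval for the mean.
ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(ci)
```

Sampling without replacement would return the same n values every time, so every replicate mean would be identical and the interval would collapse to a single point. Resampling with replacement is what produces the spread in `boot_means`.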

Statistical Models

"All statistical models are wrong, but some are useful." - George Box

Simple Linear Regression: Let's find a model for predicting Winning Proportion (WP) using Runs Scored vs. Runs Allowed (RunDiff) in our new df dataset. Note that both variables are quantitative.

There are three key assumptions for using linear regression:

1. The relationship between your response and predictor variables is linear
2. Normality of the residuals
3. Constant variance of the residuals

    df <- df[df$G > 158, ]  # let's filter on the Teams that played a full season
    plot(df$RunDiff, df$WP, col = "blue", xlab = "RunDiff", ylab = "WP")  # assumption 1

[Figure: scatterplot of WP against RunDiff]

    mdl <- lm(df$WP ~ df$RunDiff)
    summary(mdl)

    ##
    ## Call:
    ## lm(formula = df$WP ~ df$RunDiff)
    ##
    ## Residuals:
    ##  Min 1Q Median 3Q Max
    ##
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) 5.001e e <2e-16 ***
    ## df$RunDiff  6.460e e <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: on 1251 degrees of freedom
    ## Multiple R-squared: , Adjusted R-squared:
    ## F-statistic: 9169 on 1 and 1251 DF, p-value: < 2.2e-16

    resi <- residuals(mdl)
    fit <- predict(mdl)
    hist(resi)  # assumption 2

[Figure: histogram of resi]

    plot(fit, resi, ylab = "Residual", xlab = "Fitted Value")  # assumption 3

[Figure: residuals plotted against fitted values]

Have we met our assumptions? Can we conclude that using a linear model to predict WP from RunDiff was appropriate?

Now that we have shown that using a linear model was appropriate, how do we quantify how good a model this is?

Answer: Use R^2 to assess the goodness-of-fit.

Since our response variable, WP, is quantitative, we can use the R^2 (Coefficient of Determination) value, or the adjusted R^2 value, to assess the goodness-of-fit. R^2 takes on a value between 0 and 1, inclusive, and represents the proportion of the variability in the response explained by the model. Thus, a good model will have a value close to 1.

The R^2 value never decreases when we add more predictors to the model, even uninformative ones. Why do you think this might be? The adjusted R^2 value is a modified version of R^2 that penalizes the model for adding additional variables. It ensures that the information added to the model makes up for the added complexity of the model. Since our model has only one variable, the two values are nearly identical.

A good rule of thumb: Simpler is better. If your adjusted R^2 value only increases a little bit when you add a variable, you probably don't need it.

What is the goodness-of-fit of this model? Is it a good model?

    plot(df$RunDiff, df$WP, col = "blue", xlab = "RunDiff", ylab = "WP")
    abline(mdl)

Automated Feature Selection

In the previous section we found a good simple linear model for predicting WP using a single variable. But now we want to go a little deeper and add more explanatory variables to our model. How can we easily find the best set of explanatory variables to predict WP?

Answer: Automated Feature Selection!

R has a few automated feature selection techniques that you can explore on your own, such as stepwise, forward, and best subset (warning: computationally intensive for datasets with many variables). Today we'll use Backward Elimination. Backward Elimination throws all of the predictors into the model, and then removes the ones that contribute the least. Note that step() compares candidate models by AIC rather than by p-values. Make sure that your data has no missing values.

    rel = df[, c("WP", "Rank", "H", "RA", "HR", "SV", "IPouts", "HRA", "BB", "SO", "SB", "CS", "HBP", "SF", "RunDiff")]
    nomissing = na.omit(rel)
    mdl.full = lm(WP ~ ., data = nomissing)
    bck = step(mdl.full, direction = "backward", trace = 0)
    summary(bck)

Don't forget to check your diagnostic plots!
    resi = residuals(bck)
    fit = predict(bck)
    hist(resi)
    plot(fit, resi, ylab = "Residual", xlab = "Fitted Value")

What is the goodness-of-fit of this model? Is it a good model?

Our model captures a significant portion of the variability in the response. However, let's look back at the quote by George Box stated earlier. Is our model useful?

For starters, our model violates the "simpler is better" rule of thumb. If you were presenting this model to a Baseball Coach who was interested in understanding their Team's WP, would you tell them that the value is based on some fraction of their Home Runs (HR) value and some fraction of their Rank value, in addition to many other factors? No, you wouldn't, because that doesn't make any sense contextually.

If we want to try narrowing down our model to fewer predictors, we can pick the variables that are the most significant (i.e., the variables with the most asterisks to their right in the summary output).

Just because R claims a model is the most significant statistically doesn't mean it's the most significant contextually.
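To see the trade-off between model size and adjusted R^2 on data that needs no download, here is a minimal sketch that runs backward elimination on the built-in mtcars dataset. The dataset is a stand-in for the baseball data and is not part of the lecture; the helper `adj_r2` is a hypothetical name introduced only for this illustration. It applies the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of predictors.

```r
# mtcars ships with R's datasets package; it stands in for the baseball data.
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination; step() compares models by AIC, trace = 0 silences the log.
reduced <- step(full, direction = "backward", trace = 0)

# Adjusted R^2 by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where p counts the predictors (intercept excluded).
adj_r2 <- function(model) {
  r2 <- summary(model)$r.squared
  n  <- nobs(model)
  p  <- length(coef(model)) - 1
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

cat("predictors kept:", length(coef(reduced)) - 1,
    "of", length(coef(full)) - 1, "\n")
cat("adjusted R^2, full model:   ", adj_r2(full), "\n")
cat("adjusted R^2, reduced model:", adj_r2(reduced), "\n")
```

Comparing the two printed values alongside the number of predictors kept is one way to apply the "simpler is better" rule of thumb: a smaller model that gives up little or no adjusted R^2 is usually easier to explain.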

Presenting your Data with Context

Many different tools exist for analyzing your data and presenting your findings. However, if you don't present your results with context, the meaning is lost. A great example of presenting data with context overlays a baseball diamond on the plot, which lets the author show the locations of Carlos Gomez's catches on the field with context.

Additional Topics: Time Series Analysis

A big assumption in linear regression is that each data point is independent of the others (the value of one does not affect the value of others). However, when time is involved, this assumption breaks down. For example, if it is below freezing today, it is likely to be below freezing tomorrow. Intuitively, we expect data points measured closer together in time to have response values that are similar. When we have data collected over time (time series data), we have separate statistical tools to work with.

R has a predefined dataset, co2, of time series data. The dataset consists of monthly measurements of Carbon Dioxide from 1959 to 1997.

If the data you are working with is time series data, you can use stl() to get the estimated seasonal, trend, and remainder components of the original data. monthplot() will give you a plot of the estimated monthly means for the seasonal, trend, or remainder component.

    fit = stl(co2, s.window = "periodic")
    plot(fit)

[Figure: stl decomposition of co2 with data, seasonal, trend, and remainder panels plotted against time]

    monthplot(fit, choice = "seasonal")
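Two of the points above can be checked numerically on co2: that the seasonal, trend, and remainder components add back up to the original series, and that neighbouring observations are far from independent. This is a rough sketch; the names `parts`, `rebuilt`, and `lag1` are introduced here for illustration only.

```r
# co2 ships with R's datasets package (monthly series, frequency 12).
fit <- stl(co2, s.window = "periodic")

# The decomposition is additive: seasonal + trend + remainder rebuilds the data.
parts <- fit$time.series
rebuilt <- parts[, "seasonal"] + parts[, "trend"] + parts[, "remainder"]
print(max(abs(rebuilt - co2)))  # should be zero up to rounding error

# Lag-1 autocorrelation: correlation between each month and the following month.
x <- as.numeric(co2)
lag1 <- cor(x[-length(x)], x[-1])
print(lag1)
```

A lag-1 correlation close to 1 means that this month's value carries a great deal of information about next month's, which is exactly the dependence that the independence assumption of linear regression rules out.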

[Figure: monthplot of the seasonal component of co2, by month (J through D)]

Further Reading

If you're interested in using a Statistical Model in your final project and want to learn more about how to handle categorical variables, logical variables, higher order terms, and the various models that exist for your data, check out:
Cannon, A. R. (2013). STAT2: Building models for a world of data. New York: W.H. Freeman.

If you're interested in presenting your R code and final report in an R Markdown file like this one, check out:
Udwin, D. & Baumer, B. (2015). R Markdown.

For more information on Confidence Intervals:

For more information on Diagnostic Plots for Regression:

For more information on Time Series Analysis:
Cryer, J. D. (2009). Time Series Analysis with Applications in R. Blackwell Publishing Ltd.
