Stats 50: Linear Regression Analysis of NCAA Basketball Data
April 8, 2016


Today we will analyze a data set containing the outcomes of every game in the 2012-2013 regular season and the postseason NCAA tournament. Our goals will be to:

- Estimate the quality of each team via linear regression, to obtain objective rankings
- Predict the winner and margin of victory in future games
- Get better at using R to manipulate and analyze data!

1. Loading in the data

First, read the files games.csv and teams.csv into the data frames games and teams. You don't have to download these data files, as they are hosted on the Internet; just read them in using the URLs below. We will load them as is (i.e., strings not converted to factors).

games <- read.csv("http://statweb.stanford.edu/~jgorham/games.csv", as.is = TRUE)
teams <- read.csv("http://statweb.stanford.edu/~jgorham/teams.csv", as.is = TRUE)

The games data has an entry for each game played, and the teams data has an entry for each Division 1 team (a few non-D1 teams are also represented in the games data). First, let's make one vector containing all of the team names, because the three name columns do not perfectly agree. This will be useful later.

all.teams <- sort(unique(c(teams$team, games$home, games$away)))

Take some time to explore these two data files; for example, glance at the first few rows using head. Then try to answer the following questions.

Problems

1. How many games were played? How many teams are there?
2. Did Stanford make the NCAA tournament? What were its final AP and USA Today rankings?
3. How many games did Stanford play?
4. What was Stanford's win-loss record?

Solutions

1. We can answer these questions by calling nrow on both datasets:

nrow(teams); nrow(games)

## [1] 347
## [1] 5541

There are 347 teams and 5541 games played.

2. Let's take a look at the Stanford row in the teams dataset:

teams[teams$team == "stanford-cardinal", ]

##                  team conference inTourney apRank usaTodayRank
## 249 stanford-cardinal     pac-12         0     NA           NA

It appears Stanford failed to make the tournament in 2012-2013, and was unranked in both polls at the end of the season. :(

3. Let's pull out all the games in which Stanford played and store them in a new data frame. Then we can count its rows.

games.stanford <- games[(games$home == "stanford-cardinal") |
                        (games$away == "stanford-cardinal"), ]
nrow(games.stanford)

## [1] 34

4. We need to identify the games that Stanford won. The with function is a useful trick so that we don't have to keep typing games.stanford$ in front of every variable name.

won <- with(games.stanford,
            ((home == "stanford-cardinal") & (homeScore > awayScore)) |
            ((away == "stanford-cardinal") & (homeScore < awayScore)))
nrow(games.stanford[won, ])

## [1] 19

Stanford won 19 of its 34 games, so its win-loss record was 19-15.

2. A Linear Regression Model for Ranking Teams

Now let's try to use a linear regression model to determine which teams are better than others. The general strategy is to define a statistical model whose parameters correspond to whatever quantities we want to estimate; in this case, we care about estimating the quality of each basketball team.

Our response variable for today is the margin of victory (or defeat) for the home team in a particular game. That is, define

    y_i = (home score - away score) in game i                       (1)

Now we want to define a linear regression model that explains the response y_i in terms of both teams' merits. The simplest such model looks something like

    y_i = quality of home(i) - quality of away(i) + noise           (2)

where home(i) and away(i) are the home and away teams for game i.
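As a minimal sketch of the model in Equation (2), we can simulate a few margins from made-up team qualities (the team names and quality values here are hypothetical, not part of the real dataset):

```r
## Simulate margins from y_i = quality(home_i) - quality(away_i) + noise,
## using three hypothetical teams with assumed qualities.
set.seed(1)
quality.toy <- c(good = 10, mid = 0, bad = -10)  # assumed true qualities
home.toy <- c("good", "mid", "good")
away.toy <- c("mid", "bad", "bad")
margin.toy <- quality.toy[home.toy] - quality.toy[away.toy] + rnorm(3, sd = 5)
margin.toy  # noisy margins centered at the quality gaps (10, 10, 20)
```

The noise term captures everything the quality gap doesn't: shooting luck, injuries, and so on.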
Keeping in mind the general strategy, this means that we want to define the coefficients β such that β_j represents the quality of team j. Now it just remains to define the predictors X. To formulate this model as a linear regression in standard form, we need a definition for x_ij such that

    y_i = Σ_j x_ij β_j + ε_i                                        (3)

How can we do this? A little clever thinking shows that we can define one predictor variable for each team, which is a sort of signed dummy variable. In particular, for game i and team j, let

    x_ij = +1 if j is home(i),
           -1 if j is away(i),
            0 if j didn't play.                                     (4)

For example, if game i consists of team 1 visiting team 2, then x_i = (-1, +1, 0, 0, ..., 0). Now we can check that

    Σ_j x_ij β_j = β_home(i) - β_away(i)                            (5)

as desired, so the coefficient β_j corresponds exactly to the quality of team j in our model. Great! Now let's try to code this up.

Let's initialize a data frame X0 with nrow(games) rows and length(all.teams) columns, filled with zeros, and label its columns with the team names. We call it X0 for now, because we're going to need to make a slight modification in the next section before we can run the regression.

X0 <- as.data.frame(matrix(0, nrow(games), length(all.teams)))
names(X0) <- all.teams

Right now the matrix is just filled with zeros. You'll fill in the actual entries yourself below.

Problems

1. Create a vector y corresponding to the response variable as described in Equation (1) above.
2. Fill in X0 column by column, according to Equation (4).

Solutions

1. We define the response vector y:

y <- games$homeScore - games$awayScore

2. Next we fill in X0:

## Fill in the columns, one by one
for (team in all.teams) {
  X0[, team] <- 1*(games$home == team) - 1*(games$away == team)
}

Now we are in good shape: our problem is formulated as a standard linear regression, and we can use all of the usual tools to estimate β, make predictions, etc.
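A tiny check of Equation (5) on made-up numbers (the team names and β values below are hypothetical): the signed dummy row for a game, multiplied against β, gives exactly β_home - β_away.

```r
## One row of the design matrix for a toy game where teamB visits teamA.
x.row <- c(teamA = +1, teamB = -1, teamC = 0)   # teamA home, teamB away
beta.toy <- c(teamA = 3, teamB = -2, teamC = 7) # hypothetical qualities
sum(x.row * beta.toy)  # equals beta.toy["teamA"] - beta.toy["teamB"] = 5
```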

3. An Identifiability Problem

When we fit our model, we will ask R to find the best-fitting β vector. There is a small problem, however: for any candidate value of β, there are infinitely many other values β~ that make exactly the same predictions, so the "best" β is not uniquely defined. For any constant c, suppose that we redefine β~_j = β_j + c. Then for every game i,

    β~_home(i) - β~_away(i) = β_home(i) - β_away(i)                 (6)

so the distribution of y is identical for parameters β and β~, no matter what c is. We can never distinguish these two models from each other, because they make identical predictions. In statistical lingo, this is called an identifiability problem, and it very often arises with dummy variables.

Intuitively, this problem occurs because our response is a difference of team performances (the margin of scores in each game), so the outcome depends only on the relative quality of the two teams, not their absolute qualities. If two very bad teams play each other, we might expect the score differential to be 2 points, but if two very good teams play each other, we might also expect the score differential to be 2 points.

To fix this problem, we can pick a special baseline team j and require that β_j = 0. We will take Stanford as the baseline; this has the additional benefit of allowing us to compare Stanford to other teams directly using the estimated coefficients.

Problem

Modify the X0 matrix from the previous section to implement the above restriction. Name the modified matrix X. (Actually, lm in R is smart enough to fix this automatically for you by arbitrarily picking one team as the baseline. But let's not blindly rely on that; instead we'll do it ourselves, so we can better understand what R is actually doing.)

Solution

We can effectively force β_j = 0 for the j corresponding to Stanford by eliminating that column from the predictor matrix.
X <- X0[, names(X0) != "stanford-cardinal"]

[Bonus Questions] If you have time, consider the following questions:

- Suppose that we had chosen a different team as our baseline. How would the estimates be different? Would we obtain identical rankings? Would we obtain identical standard errors?
- Under what circumstances would we still have an identifiability problem even after constraining β_Stanford = 0?

4. Fitting the model

Now let's fit our model.
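Before fitting the real model, here is a toy check of the first bonus question (synthetic games among three hypothetical teams, not the basketball data): changing which team is the baseline shifts every coefficient by a constant, so the gaps between teams, and hence the rankings, are unchanged.

```r
## Simulate 60 games among teams A, B, C with assumed qualities, then fit
## the signed-dummy regression twice with different baseline teams.
set.seed(7)
teams.toy <- c("A", "B", "C")
home.toy <- sample(teams.toy, 60, replace = TRUE)
away.toy <- sapply(home.toy, function(h) sample(setdiff(teams.toy, h), 1))
truth <- c(A = 6, B = 0, C = -6)  # assumed true qualities
y.toy <- truth[home.toy] - truth[away.toy] + rnorm(60, sd = 4)
make.X <- function(baseline) {  # drop the baseline team's column
  keep <- setdiff(teams.toy, baseline)
  as.data.frame(sapply(keep, function(tm) 1*(home.toy == tm) - 1*(away.toy == tm)))
}
fit.C <- lm(y.toy ~ 0 + ., data = make.X("C"))  # constrains beta_C = 0
fit.A <- lm(y.toy ~ 0 + ., data = make.X("A"))  # constrains beta_A = 0
coef(fit.C); coef(fit.A)  # same gaps between teams; levels shifted
```

Because each row of the full design matrix sums to zero, the two fits span the same column space, so the fitted values agree exactly and the coefficients differ only by the shift.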

Problem

Fit the model using the lm function, regressing y on X with the conditions below. Recall that if you don't know exactly how to use the lm function, you should look at the documentation by calling ?lm.

1. Use all the columns in the X matrix, and make sure to fit the model without an intercept. (Hint: The formula should look like y ~ 0 + . What does each of these parts mean?)
2. Only include regular-season games in the model.

Explore the estimated coefficients using summary. What is the R^2 value?

Solution

reg.season.games <- which(games$gameType == "reg")
fit <- lm(y ~ 0 + ., data = X, subset = reg.season.games)
head(coef(summary(fit)))

##                           Estimate Std. Error     t value     Pr(>|t|)
## `air-force-falcons`      -6.687934   2.897768  -2.3079603 2.104205e-02
## `akron-zips`             -2.557678   2.841221  -0.9002039 3.680552e-01
## `alabama-a&m-bulldogs`  -30.590013   2.908702 -10.5167213 1.338001e-25
## `alabama-crimson-tide`   -2.851010   2.802443  -1.0173304 3.090456e-01
## `alabama-state-hornets` -29.958427   2.849296 -10.5143267 1.371671e-25
## `albany-great-danes`    -13.334555   2.818343  -4.7313458 2.291935e-06

summary(fit)$r.squared

## [1] 0.551235

It looks like our model explains about half of the variability in basketball scores.

5. Interpreting the Model

Now let's try to interpret the model that we just fit.

Problems

1. Based on this model, what would be a reasonable point spread if Stanford played Berkeley (california-golden-bears)? What if Stanford played Louisville (louisville-cardinals), that year's national champion?
2. What would be a reasonable point spread if Duke (duke-blue-devils) played North Carolina (north-carolina-tar-heels)? If North Carolina played NC State (north-carolina-state-wolfpack)?
3. Does the dataset and model support the notion of home-court advantage? How many points per game is it worth?

Solutions

1. Since β_Stanford = 0, the estimated coefficient β̂_Berkeley is exactly the estimated score difference for Berkeley against Stanford (and similarly for Louisville).

coef(fit)["`california-golden-bears`"]; coef(fit)["`louisville-cardinals`"]

## `california-golden-bears`
##                 -1.575168
## `louisville-cardinals`
##               11.94171

So we might predict Stanford to beat Berkeley by about 2 points but lose to Louisville by about 12 points.

2. The expected score difference is

    β̂_Duke - β̂_UNC                                                 (7)

so we can answer the question by comparing their coefficients:

coef(fit)["`duke-blue-devils`"] - coef(fit)["`north-carolina-tar-heels`"]

## `duke-blue-devils`
##            7.00085

(Ignore the name printed above; when subtracting, R just keeps the name of the first value.) Similarly,

coef(fit)["`north-carolina-tar-heels`"] - coef(fit)["`north-carolina-state-wolfpack`"]

## `north-carolina-tar-heels`
##                 0.05843945

So we might expect Duke to beat UNC by about 7 points, and UNC to have a very slight edge over NC State.

3. We can assess home-court advantage by adding another column, which is 1 if and only if the game was played at a non-neutral location:

homeadv <- 1 - games$neutralLocation
Xh <- cbind(homeadv = homeadv, X)
homeadv.mod <- lm(y ~ 0 + ., data = Xh, subset = reg.season.games)
head(coef(summary(homeadv.mod)), 1)

##          Estimate Std. Error  t value      Pr(>|t|)
## homeadv  3.528499  0.1568782 22.49197 8.654579e-107

It looks like home-court advantage is worth about 3.5 points per game and is clearly significant.
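The interpretation logic above can be packaged into a small helper. This is a sketch, not part of the handout's code: the function name, the coefficient values, and the home-advantage value below are all made up for illustration; in practice you would pass coef(fit) and the fitted homeadv coefficient.

```r
## Hypothetical point-spread helper. A team missing from the coefficient
## vector (the baseline, Stanford) is treated as having quality 0.
beta.hat.toy <- c("duke-blue-devils" = 10,
                  "north-carolina-tar-heels" = 3)  # made-up values
home.adv.toy <- 3.5                                # made-up value
predict.margin <- function(home, away, beta, home.adv = 0, neutral = TRUE) {
  b <- function(tm) if (tm %in% names(beta)) beta[[tm]] else 0  # baseline -> 0
  b(home) - b(away) + (if (neutral) 0 else home.adv)
}
predict.margin("duke-blue-devils", "north-carolina-tar-heels",
               beta.hat.toy, home.adv.toy, neutral = FALSE)  # 10 - 3 + 3.5 = 10.5
```

A positive return value predicts the home team to win by that many points; at a neutral site the home-advantage term is dropped.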