Section 2.1: Intro to Simple Linear Regression & Least Squares

Section 2.1: Intro to Simple Linear Regression & Least Squares
Jared S. Murray, The University of Texas at Austin, McCombs School of Business
Suggested reading: OpenIntro Statistics, Sections 7.1 and 7.2

Regression: General Introduction

Regression analysis is the most widely used statistical tool for understanding relationships among variables. It provides a conceptually simple method for investigating functional relationships between one or more factors and an outcome of interest. The relationship is expressed in the form of an equation, or model, connecting the response (or dependent) variable to one or more explanatory (or predictor) variables.

Why?

Straight prediction questions:
- For how much will my house sell?
- How many runs per game will the Red Sox score this year?
- Will this person like that movie? (e.g., Netflix)

Explanation and understanding:
- What is the impact of getting an MBA on lifetime income?
- How do the returns of a mutual fund relate to the market?
- Does Walmart discriminate against women when setting salaries?

1st Example: Predicting House Prices

Problem: Predict market price based on observed characteristics.

Solution: Look at property sales data where we know the price and some observed characteristics. Build a decision rule that predicts price as a function of the observed characteristics.

Predicting House Prices

What characteristics do we use? We have to define the variables of interest and develop a specific quantitative measure of these variables.

Many factors or variables affect the price of a house:
- size
- number of baths
- garage
- neighborhood
- ...

Predicting House Prices

To keep things super simple, let's focus only on size.

The value that we seek to predict is called the dependent (or output) variable, and we denote this Y, e.g., the price of the house (in thousands of dollars).

The variable that we use to aid in prediction is the independent, explanatory, or input variable, and this is labelled X, e.g., the size of the house (in thousands of square feet).

Predicting House Prices

What does this data look like?

[The slide shows the raw housing data: the price and size of each house.]

Predicting House Prices

It is much more useful to look at a scatterplot:

plot(Price ~ Size, data = housing)

[Scatterplot of Price (about 60 to 160) against Size (about 1.0 to 3.5).]

In other words, view the data as points in the X-Y plane.

Linear Prediction

There appears to be a linear relationship between price and size: as size goes up, price goes up. The line shown was fit by the eyeball method.

Linear Prediction

Recall that the equation of a line is

Y = b_0 + b_1 X

where b_0 is the intercept and b_1 is the slope. The intercept value is in units of Y ($1,000). The slope is in units of Y per unit of X ($1,000 per 1,000 sq ft).

Linear Prediction

[Diagram: a line Y = b_0 + b_1 X, with the intercept b_0 where the line crosses the Y axis and the slope b_1 as the rise for a one-unit increase in X.]

Our eyeball line has b_0 = 35, b_1 = 40.
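
For example, with these values a 2,000-square-foot house (X = 2) has predicted price 35 + 40 * 2 = 115, i.e., $115,000.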

Linear Prediction

Can we do better than the eyeball method? We desire a strategy for estimating the slope and intercept parameters in the model

Ŷ = b_0 + b_1 X

A reasonable way to fit a line is to minimize the amount by which the fitted value differs from the actual value. This amount is called the residual.

Linear Prediction

What is the fitted value?

[Diagram: at X_i, the observed value Y_i and the corresponding fitted value Ŷ_i on the line.]

The dots are the observed values and the line represents our fitted values given by Ŷ = b_0 + b_1 X.

Linear Prediction

What is the residual for the i-th observation?

e_i = Y_i - Ŷ_i = Residual_i

[Diagram: the residual e_i is the vertical distance from the observed point Y_i to the fitted value Ŷ_i at X_i.]

We can write Y_i = Ŷ_i + (Y_i - Ŷ_i) = Ŷ_i + e_i.
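
To make these definitions concrete, here is a minimal R sketch that computes fitted values and residuals by hand for the eyeball line, assuming the housing data frame from the earlier slides (columns Price and Size):

b0 <- 35  # eyeball intercept
b1 <- 40  # eyeball slope

yhat <- b0 + b1 * housing$Size  # fitted values Ŷ_i
e <- housing$Price - yhat       # residuals e_i = Y_i - Ŷ_i

all.equal(housing$Price, yhat + e)  # TRUE: Y_i = Ŷ_i + e_i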

Least Squares

Ideally we want to minimize the size of all residuals:
- If they were all zero we would have a perfect line.
- There is a trade-off between moving closer to some points and, at the same time, moving away from other points.

The line-fitting process:
- Take each residual e_i and assign it a weight e_i^2. (Bigger residuals = bigger mistakes = higher weights.)
- Minimize the total of these weights to get the best possible fit.

Least Squares chooses b_0 and b_1 to minimize

sum_{i=1}^N e_i^2 = e_1^2 + e_2^2 + ... + e_N^2 = (Y_1 - Ŷ_1)^2 + (Y_2 - Ŷ_2)^2 + ... + (Y_N - Ŷ_N)^2
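
For reference, this minimization has a closed-form solution: b_1 = cov(X, Y) / var(X) and b_0 = mean(Y) - b_1 * mean(X) (the connection to covariance and correlation is the subject of Section 2.2). A minimal R sketch, again assuming the housing data frame:

x <- housing$Size
y <- housing$Price

b1 <- cov(x, y) / var(x)      # slope minimizing the sum of squared residuals
b0 <- mean(y) - b1 * mean(x)  # intercept: puts the line through (mean(x), mean(y))

c(intercept = b0, slope = b1) # should match the LS line on the next slide: 38.88, 35.39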

Least Squares

LS chooses a different line from ours: b_0 = 38.88 and b_1 = 35.39.

What do b_0 and b_1 mean again?

[Scatterplot showing both fits: our eyeball line and the LS line.]

Least Squares in R

The lm command fits linear (regression) models:

fit = lm(Price ~ Size, data = housing)
print(fit)

##
## Call:
## lm(formula = Price ~ Size, data = housing)
##
## Coefficients:
## (Intercept)         Size
##       38.88        35.39
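
As a usage note (not shown on the slide), the estimates can also be extracted from the fitted model object directly:

coef(fit)          # named vector with the (Intercept) and Size estimates
coef(fit)["Size"]  # just the slope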

fit = lm(Price ~ Size, data = housing)
summary(fit)

##
## Call:
## lm(formula = Price ~ Size, data = housing)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -30.425  -8.618   0.575  10.766  18.498
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   38.885      9.094   4.276 0.000903 ***
## Size          35.386      4.494   7.874 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.14 on 13 degrees of freedom
## Multiple R-squared: 0.8267, Adjusted R-squared: 0.8133
## F-statistic: 62 on 1 and 13 DF, p-value: 2.66e-06

2nd Example: Offensive Performance in Baseball

Problems:
- Evaluate/compare traditional measures of offensive performance.
- Help evaluate the worth of a player.

Solutions:
- Compare prediction rules that forecast runs as a function of either AVG (batting average), SLG (slugging percentage), or OBP (on-base percentage).

Baseball Data: Using AVG

Each observation corresponds to a team in MLB. Each quantity is the average over a season.

Y = runs per game; X = AVG (batting average)

LS fit: Runs/Game = -3.93 + 33.57 AVG
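
For example, plugging in an illustrative team batting average of 0.260 gives a predicted -3.93 + 33.57 * 0.260 ≈ 4.80 runs per game.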

Baseball Data: Using SLG

Y = runs per game; X = SLG (slugging percentage)

LS fit: Runs/Game = -2.52 + 17.54 SLG

Baseball Data: Using OBP

Y = runs per game; X = OBP (on-base percentage)

LS fit: Runs/Game = -7.78 + 37.46 OBP

Baseball Data

What is the best prediction rule? Let's compare the predictive ability of each model using the average squared error:

(1/n) sum_{i=1}^n e_i^2 = (1/n) sum_{i=1}^n (Ŷ_i - Y_i)^2

Place your Money on OBP!

       Average Squared Error
AVG    0.083
SLG    0.055
OBP    0.026
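
A minimal sketch of how this comparison could be computed in R, assuming a data frame named baseball (a hypothetical name; the slides do not show this code) with columns RPG (runs per game), AVG, SLG, and OBP:

# average squared error of a simple regression, given its formula
ase <- function(f) mean(resid(lm(f, data = baseball))^2)

c(AVG = ase(RPG ~ AVG),
  SLG = ase(RPG ~ SLG),
  OBP = ase(RPG ~ OBP))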

Linear Prediction

Ŷ_{n+1} = b_0 + b_1 x_{n+1}

- b_0 is the intercept and b_1 is the slope.
- We find b_0 and b_1 using Least Squares.
- For a new value of the independent variable OBP (say x_{n+1}) we can predict the response Y_{n+1} using the fitted line.
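
Concretely, a sketch of this prediction in R, again assuming the hypothetical baseball data frame (the OBP value 0.350 is only an illustration):

fit_obp <- lm(RPG ~ OBP, data = baseball)

# predicted runs per game for a new team with OBP = 0.350
predict(fit_obp, newdata = data.frame(OBP = 0.350))
# by hand: -7.78 + 37.46 * 0.350 ≈ 5.33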

More on Least Squares

From now on, the terms fitted values (Ŷ_i) and residuals (e_i) refer to those obtained from the least squares line.

The fitted values and residuals have some special properties...

The Fitted Values and X

plot(predict(fit) ~ Size, data = housing, ylab = "fitted values")

[Scatterplot of the fitted values (about 80 to 160) against Size (about 1.0 to 3.5); the points fall exactly on a line.]

cor(predict(fit), housing$Size)

## [1] 1

The Residuals and X

plot(resid(fit) ~ Size, data = housing, ylab = "residuals")

[Scatterplot of the residuals (about -30 to 20) against Size (about 1.0 to 3.5).]

mean(resid(fit)); cor(resid(fit), housing$Size)

## [1] -9.633498e-17
## [1] 2.120636e-17

(i.e., both are zero). What's going on here? Next time...