Chapter 7: Linear regression

Similar documents
Multiple Regression White paper

9.1. How do the nerve cells in your brain communicate with each other? Signals have. Like a Glove. Least Squares Regression

/4 Directions: Graph the functions, then answer the following question.

9.8 Rockin the Residuals

MDM 4UI: Unit 8 Day 2: Regression and Correlation

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Appendix 2: PREPARATION & INTERPRETATION OF GRAPHS

ACTIVITY: Representing Data by a Linear Equation

Ex.1 constructing tables. a) find the joint relative frequency of males who have a bachelors degree.

Year 10 General Mathematics Unit 2

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Regression Analysis and Linear Regression Models

Example 1 of panel data : Data for 6 airlines (groups) over 15 years (time periods) Example 1

CDAA No. 4 - Part Two - Multiple Regression - Initial Data Screening

SLStats.notebook. January 12, Statistics:

Two-Stage Least Squares

Find the Relationship: An Exercise in Graphical Analysis

Algebra 1, 4th 4.5 weeks

1. Solve the following system of equations below. What does the solution represent? 5x + 2y = 10 3x + 5y = 2

EXST 7014, Lab 1: Review of R Programming Basics and Simple Linear Regression

Chapter 4: Analyzing Bivariate Data with Fathom

Data Analysis Multiple Regression

Bivariate (Simple) Regression Analysis

Homework set 4 - Solutions

LESSON Constructing and Analyzing Scatter Plots

STA 570 Spring Lecture 5 Tuesday, Feb 1

Connecting the Dots Making Connections Between Arithmetic Sequences and Linear Functions

26, 2016 TODAY'S AGENDA QUIZ

GRAPHS AND STATISTICS Residuals Common Core Standard

Graphical Analysis of Data using Microsoft Excel [2016 Version]

Section 2.1: Intro to Simple Linear Regression & Least Squares

PSY 9556B (Feb 5) Latent Growth Modeling

Chapter 5: Polynomial Functions

Appendix 3: PREPARATION & INTERPRETATION OF GRAPHS

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Algebra 2 Chapter Relations and Functions

Algebra 1B Assignments Chapter 6: Linear Equations (All graphs must be drawn on GRAPH PAPER!)

Robust Linear Regression (Passing- Bablok Median-Slope)

Ready to Go On? Skills Intervention 1-1. Exploring Transformations. 2 Holt McDougal Algebra 2. Name Date Class

Chapter 1 Linear Equations

Learning Packet THIS BOX FOR INSTRUCTOR GRADING USE ONLY. Mini-Lesson is complete and information presented is as found on media links (0 5 pts)

Name Per. a. Make a scatterplot to the right. d. Find the equation of the line if you used the data points from week 1 and 3.

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.

Assignment 2. with (a) (10 pts) naive Gauss elimination, (b) (10 pts) Gauss with partial pivoting

Vocabulary Unit 2-3: Linear Functions & Healthy Lifestyles. Scale model a three dimensional model that is similar to a three dimensional object.

WESTMORELAND COUNTY PUBLIC SCHOOLS Integrated Instructional Pacing Guide and Checklist Algebra, Functions & Data Analysis

Scatterplot: The Bridge from Correlation to Regression

Further Differentiation

Section 1.1: Functions and Models

An Example of Using inter5.exe to Obtain the Graph of an Interaction

Make a new column in your table, zbeers, which will be the z-score of Beers To do that:

Linear Regression in two variables (2-3 students per group)

2.2. Changing One Dimension

Applied Regression Modeling: A Business Approach

14.2 The Regression Equation

Example 1: Given below is the graph of the quadratic function f. Use the function and its graph to find the following: Outputs

Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding

8.5 Quadratic Functions and Their Graphs

1. Assumptions. 1. Introduction. 2. Terminology

Here is the data collected.

Voluntary State Curriculum Algebra II

Correlation. January 12, 2019

W7 DATA ANALYSIS 2. Your graph should look something like that in Figure W7-2. It shows the expected bell shape of the Gaussian distribution.

Secondary 1 Vocabulary Cards and Word Walls Revised: June 27, 2012

STANDARDS OF LEARNING CONTENT REVIEW NOTES. ALGEBRA I Part II. 3 rd Nine Weeks,

Graphing Functions. 0, < x < 0 1, 0 x < is defined everywhere on R but has a jump discontinuity at x = 0. h(x) =

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Standards Level by Objective Hits Goals Objs # of objs by % w/in std Title Level Mean S.D. Concurr.

Instructor: Barry McQuarrie Page 1 of 6

1. What specialist uses information obtained from bones to help police solve crimes?

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

Integrated Mathematics I Performance Level Descriptors

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques

STANDARDS OF LEARNING CONTENT REVIEW NOTES ALGEBRA I. 4 th Nine Weeks,

Building Better Parametric Cost Models

Using Excel for Graphical Analysis of Data

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

Catholic Regional College Sydenham

Minnesota Academic Standards for Mathematics 2007

Mathematics. Algebra, Functions, and Data Analysis Curriculum Guide. Revised 2010

ES-2 Lecture: Fitting models to data

Exploratory model analysis

Mathematics Scope & Sequence Grade 8 Revised: June 2015

Error Analysis, Statistics and Graphing

The problem we have now is called variable selection or perhaps model selection. There are several objectives.

How to use FSBforecast Excel add in for regression analysis

Solving General Linear Equations w/ Excel

LINEAR FUNCTIONS, SLOPE, AND CCSS-M

Chemistry 1A Graphing Tutorial CSUS Department of Chemistry

DoE with Visual-XSel 13.0

1. How would you describe the relationship between the x- and y-values in the scatter plot?

Things to Know for the Algebra I Regents

Graphing Linear Equations

3.6 Graphing Piecewise-Defined Functions and Shifting and Reflecting Graphs of Functions

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

Exam January? 9:30 11:30

Exam 4. In the above, label each of the following with the problem number. 1. The population Least Squares line. 2. The population distribution of x.

The Graph of an Equation

Transcription:

Chapter 7: Linear regression Objective (1) Learn how to model association bet. 2 variables using a straight line (called "linear regression"). (2) Learn to assess the quality of regression models. (3) Understand & avoid common mistakes made in regression modeling. Concept briefs: * General straight line equation (Y = m X + b) * Slope of a straight line (m) = How much Y changes when X increases by 1. * Intercept of straight line (b) = Value of Y when X=0. * Linear regression = Process of finding the "best fit" straight line to approimately model linear associations. * Predicted response = Y-value obtained by plugging in any X into regr. model. * Residual = Error between predicted Y-value & true Y-value (if known). * Least-squares best fit = The particular straight line that minimizes the sum of the square of the residuals.

"Best-fit" straight line model Problem statement * You have a scatterplot that shows linear association bet. 2 variables. [e.g., Starting salary vs GPA] * You want "to model" the association with a straight line equation. For eample: Predicted_starting_salary = m * GPA + b where "m" and "b" are to be determined so that we get the best line. How to find best fit? * Develop concept using the Z - Zy scatterplot. * Then rescale & apply it in any other setting Zy r 1 Z Fact: Best-fit line has Slope = r, Intercept = 0 (r = correlation coefficient) salary (y) Best Fit line: Slope = r GPA () S S y Intercept = solve for it using mean values of, y variables. S & Sy denote the standard deviations E.g., Predicted_starting_salary = 1000 * GPA + 28,800 ($)

What is an error or residual? Errors, residuals and R-squared * Recap: The goal is to predict the response (y-value) for any given -value. * We have some known data values that can be used to judge quality. * Residual = (error) = True y - Predicted y e = y - ^y X True Y ($) Pred. Y ($) Error ($) 2.6 33,400 31,400 2,000 2.7 31,800 31,500 300 2.8 29,200 31,600-2,400 salary (y) GPA () How to assess quality of regression model? There are 3 things one should always look at: (1) Graphical: Scatterplot of e vs -data value (Or, y-data value). (2) Numerical: Calculate standard deviation, or variance, in e. (3) Numerical: Calculate R 2. * Scatterplot should look random, with no patterns or outliers. * Variance (or SD) of e should be small relative to variance of y.

What is R-squared? This is a key numerical measure of quality: * It is related to both r and to the variance in residuals (e). * Simplest interpretation: R 2 = r 2 100 (epressed as %) E.g., If r = -0.8, then R 2 = 100 (-0.8) 2 = 64% Very important connection to residuals and 'e': * R 2 tells how well the regression model accounts for the variance in the data. Large R 2 says most of the data's variance is modeled by the regression equation. * 100 - R 2 is the percent of data variance not accounted for by the model. This is equal to the variance in e. In what way is our line the "best-fit"? * This is based on the least-squares principle: This particular line has the minimum sum of the square of the residuals. In other words, it gives the least possible variance in e.

E.69, Pg. 210 (a) Use the given R2 to find r: Since R 2 = r 2 100, r = 84/100 = 0.84 = 0.9165 (b) 91.65% of the variability in mean temperature is accounted for by the variability in CO 2 levels. (c) Must learn to interpret typical regression table (comes from software) * "Intercept" gives the "b" value in the straight line equation. * Variable CO 2 's coefficient is the slope of the straight line. Thus, the regression equation here would be Predicted_mean_temp = 11.0276 + 0.0089*CO 2 (in degrees C) (d, e) Reword the question to: "Interpret the slope and intercept in this contet." Slope: "The model predicts that for each 1 ppm increase in CO 2 level, the mean temperature will increase by 0.0089 o C." Intercept: [Makes no sense here, but you can still blindly interpret!] "According to the model, when there is no CO 2 in the atmosphere the mean temperature will be 11.0276 o C." Not very meaningful in this contet, since there is always some CO 2 in the atmosphere. (f) The scatterplot of the residuals looks fairly random, with no specific pattern. Thus it seems appropriate to use a linear model here. (g) Predicted_mean_temp = 11.0276 + 0.0089*CO 2 = 11.0276 + 0.0089*(400) = 14.59 o C The model predicts the mean temperature will be 14.59 o C when the CO 2 level in the atmosphere is 400 ppm.

E.60, Pg. 208 (a) Yes, because the scatterplot shows an association that looks fairly strong and linear. (b) R2 = 87.3% says: 87.3% of variability in the use of other drugs (Yvariable) is accounted for by variability in the use of Marijuana (X-variable). (c) Regression equation is of the form: Y = m X + b. * First, figure out what our Y and X are: Y = Other drug use (predicted) * Net, find slope, m = r S S y. X = Marijuana use % The eercise has given us Sy = 10.2%, S = 15.6% We can calculate r from R2: r = 87. 3/ 100 = 0. 873 = 0. 934 Thus, slope m = 0.934 *(10.2 / 15.6) = 0.611 So our equation is: Other drug use = 0.611 Marijuana_use + b (%) * Net, find b by plugging in the mean values and solving for it: 11.6 (%) = 0.611 23.9 (%) + b b = -3.0 % * Final model: Other drug use = 0.611 Marijuana_use - 3 (%) (d) Slope means: "For each 1% increase in teens using Marijuana, the model predicts there is a 0.611% increase in teens using other drugs." (e) No, this doesn't confirm that Marijuana use leads to the use of other drugs. It only shows there is an association between use of these two drugs.

Common regression misuses & errors (1) Scatterplot of residuals * Look for inadvertent shape or patterns or significant outliers. * There is a problem if you see any of these. (2) Etrapolation * This is the use of a model to predict outside the X-range for which it was designed. * It is dangerous to do this! The prediction may be completely off. (3) Outliers & influence points * Look at scatterplot of original variables to identify outliers. * Outliers can severely messup the analysis & the conclusions drawn. (4) Be aware when working with "averaged" datasets E.g., Average GPA vs. average starting salary (for different years, say) * Averaged data, invariably, has much lower scatter than full/raw data. This creates the illusion of stronger association than there really is.