OLS Assumptions and Goodness of Fit

A little warm-up Assume I am a poor free-throw shooter. To win a contest I can choose to attempt one of the two following challenges: A. Make three out of four free throws B. Make six out of eight Which should I choose? Why?

Gauss-Markov Assumptions These are the full ideal conditions. If these are met, OLS is BLUE, i.e. efficient and unbiased. Your data will rarely meet these conditions; this class helps you understand what to do about that.

Pop Quiz Take out a sheet of paper and write down all the Gauss-Markov assumptions.

Assumptions of Classical Linear Regression A1: The regression model is linear in parameters. It may not be linear in variables. Y = B1 + B2 Xi

Assumptions of Classical Linear Regression A1: The regression model is linear in parameters. It may not be linear in variables. Y = B1 + B2 X + B3 X^2
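The quadratic model above is still an OLS model because it is linear in the parameters: just treat X and X^2 as two separate regressors. A minimal sketch (assumes NumPy; the true coefficient values 1.0, 2.0, 0.5 are made up for illustration):

```python
import numpy as np

# Y = B1 + B2*X + B3*X^2 is nonlinear in X but linear in B1, B2, B3,
# so ordinary least squares still applies.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.1, 200)

X = np.column_stack([np.ones_like(x), x, x**2])  # design matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS estimates
print(beta)  # close to the true values [1.0, 2.0, 0.5]
```

The estimates land near the true parameters because the model, though curved in X, is a straight-line problem in parameter space.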

Assumptions of Classical Linear Regression A2: X values are fixed in repeated sampling. Think about an experiment with different dosages assigned to different groups. We can also do this if X values vary in repeated sampling, as long as cov(Xi, ui) = 0. See chapter 13 if you're curious about the details.

What if we violate linearity? If you have a non-linear relationship between X and Y and you don't include an X-squared or X-cubed term, what is the problem? A true relationship may exist between X and Y that you fail to detect.

A2: X values are fixed in repeated sampling. Think about an experiment with different dosages assigned to different groups. We can also do this if X values vary in repeated sampling, as long as cov(Xi, ui) = 0. Think about this as requiring random sampling.

Expected Value of Errors is Zero A3: The mean value of ui is 0: E[ui | Xi] = 0, and E[ui] = 0 if X is fixed (non-stochastic). It's OK to have big errors, but we can't be wrong systematically. We call that bias.

What if the expected value of the errors is not zero? This would indicate specification error: omitted variable bias, for example.

Assumptions of Classical Linear Regression A4: Homoskedasticity, or constant variance of ui.

What happens if we violate homoskedasticity? This is called heteroskedasticity. Model uncertainty varies from observation to observation. Often true in cross-sectional data due to omitted variable bias. See chapter 13 if you're curious about the details of heteroskedasticity.
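Heteroskedasticity is easy to see in a simulation. A quick sketch (assumes NumPy; the data-generating process, with error spread proportional to X, is invented for illustration):

```python
import numpy as np

# Simulate heteroskedasticity: the error's standard deviation grows with X,
# so the residual spread is not constant across observations.
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 5000)
u = rng.normal(0, x)          # error sd proportional to x -> heteroskedastic
y = 2.0 + 3.0 * x + u

resid = y - (2.0 + 3.0 * x)   # residuals around the true line
low = resid[x < 5].std()      # spread where x is small
high = resid[x >= 5].std()    # spread where x is large
print(low, high)              # the high-x residuals are far more dispersed
```

The slope estimate itself stays unbiased under heteroskedasticity; it is the usual standard-error formula that becomes unreliable.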

No Autocorrelation A5: No autocorrelation between disturbances: cov(ui, uj | Xi, Xj) = 0. The observations are sampled independently.

What if we have autocorrelation? It is more or less always the case in panel data, so we have panel-corrected standard errors, etc. It is also sometimes the case if we sample multiple children from the same family, or multiple regions from the same country, etc.: clustered standard errors.

Degrees of Freedom A6: The number of observations n must be greater than the number of parameters to be estimated. The difference, n minus the number of parameters, is the degrees of freedom.

Not Enough Degrees of Freedom If you don't have enough degrees of freedom, you can't estimate your parameters. The smaller your sample size, the less precise your estimates (i.e. the larger your standard errors). You may be unable to reject the null hypothesis of no difference even if the true effect is large.

Variation but no Outliers A7: X must vary, but there must not be any outliers

What if there are outliers? Our model works too hard to fit these values, effectively giving them too much weight. This is a consequence of minimizing squared errors: a large residual, once squared, dominates the fit.
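One extreme point is enough to drag the whole OLS line. A small sketch (assumes NumPy; the data are invented: ten points on a perfect line plus one wild observation):

```python
import numpy as np

# Ten points exactly on the line y = 2x.
x = np.arange(10.0)
y = 2.0 * x
slope_clean = np.polyfit(x, y, 1)[0]   # recovers slope 2 exactly

# Add a single outlier at (10, 100); the line there "should" predict 20.
x_out = np.append(x, 10.0)
y_out = np.append(y, 100.0)
slope_out = np.polyfit(x_out, y_out, 1)[0]
print(slope_clean, slope_out)          # the fitted slope is pulled well above 2
```

Because the outlier's residual enters the objective squared, the fit sacrifices accuracy on the other ten points to shrink that one error.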

Correct specification A8: The regression model is correctly specified. The correct variables are included, we have the correct functional form, and we make correct assumptions about the probability distributions of Yi, Xi, and ui.

No perfect multicollinearity A9: With multiple regression, we add the assumption of no perfect multicollinearity. The correlation between any two X variables must be less than 1 in absolute value.

No perfect multicollinearity With perfect collinearity, we have to drop one X variable to even estimate our betas. With near-perfect collinearity, variance is inflated, but estimates are not biased.
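The "inflated variance, no bias" point can be checked by Monte Carlo. A sketch (assumes NumPy; the design, sample size, and noise levels are invented for illustration):

```python
import numpy as np

# Compare the sampling variability of an OLS coefficient when the two
# regressors are moderately vs. near-perfectly correlated.
rng = np.random.default_rng(2)

def draw_b1(collinearity_noise, n=100):
    """Simulate one sample and return the OLS estimate of b1 (true b1 = 1)."""
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(0, collinearity_noise, n)  # tiny noise -> near collinear
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

sd_ok = np.std([draw_b1(1.0) for _ in range(500)])         # moderate correlation
sd_collinear = np.std([draw_b1(0.05) for _ in range(500)]) # near-perfect correlation
print(sd_ok, sd_collinear)  # the near-collinear design is far noisier
```

Across repeated samples both designs center on the true b1 = 1; the near-collinear design just scatters its estimates much more widely.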

Gauss-Markov Theorem When all those assumptions hold, OLS is BLUE: Best Linear Unbiased Estimator. Best means least variance (most efficient). Unbiased means the expected value of the estimate equals the true parameter: E[β̂] = β.

How good does it fit? To measure reduction in errors we need a benchmark for comparison. The mean of the dependent variable is a relevant and tractable benchmark for comparing predictions. The mean of Y represents our best guess at the value of Y i absent other information.

Sums of Squares This gives us the following 'sum-of-squares' measures: Total Variation = Explained Variation + Unexplained Variation

How well does our model perform? R-squared statistic = (TSS − USS)/TSS = ESS/TSS. Bounded between 0 and 1; higher values indicate a better fit.
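The sum-of-squares identity and R-squared are easy to verify on simulated data. A sketch (assumes NumPy; the data-generating process is invented, chosen so the true R-squared is about 0.8):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

b, a = np.polyfit(x, y, 1)             # OLS slope and intercept
y_hat = a + b * x                      # fitted values

tss = np.sum((y - y.mean())**2)        # total variation
ess = np.sum((y_hat - y.mean())**2)    # explained variation
uss = np.sum((y - y_hat)**2)           # unexplained variation

r2 = ess / tss                         # equals 1 - uss/tss as well
print(round(r2, 3))
```

The benchmark role of the mean is visible in TSS: it measures how badly we would do if our prediction for every Yi were simply the mean of Y.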

Questions How do the fitted values of Y change if we multiply X by a constant? What if we add a constant to X?

Why do we have an error term? The error term includes the effect of all X variables not in our model that still affect Y. Parsimony, intrinsic randomness of human behavior, vague theory, measurement error, wrong functional form.

What does an error term imply? If we draw a new sample and rerun our regression, we will estimate a slightly different regression line every time.

How do we know if our test statistic is any good? OLS is an estimator. It calculates the slope of the sample regression line (i.e. the SRF). It gives us a test statistic and an associated p-value. What does that mean? IF AND ONLY IF the assumptions of OLS are met, and the true slope of the population regression line is 0, there is an x percent chance we would estimate a slope this large in our sample regression.

Can we test that? YES! First, estimate our regression line and calculate the critical value (p = .05). Second, let's make there be no relationship: shuffle the data. Third, re-estimate the regression line. Is the slope steeper than our critical value? Repeat steps 2 and 3 10,000 times. How often should the slope of the regression line be greater than the critical value?
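The shuffle procedure above can be sketched in a few lines of Python. This is only an illustration with invented data and 2,000 rather than 10,000 shuffles; the critical value here is taken empirically as the 95th percentile of the shuffled slopes, standing in for the p = .05 critical value from the t-distribution:

```python
import random

random.seed(4)

def slope(xs, ys):
    """OLS slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

x = [random.gauss(0, 1) for _ in range(50)]
y = [random.gauss(0, 1) for _ in range(50)]   # no true relationship

# Shuffle y to destroy any X-Y link, refit, and record |slope| each time.
shuffled = []
for _ in range(2000):
    ys = y[:]
    random.shuffle(ys)
    shuffled.append(abs(slope(x, ys)))
shuffled.sort()

crit = shuffled[int(0.95 * len(shuffled))]    # empirical p = .05 critical value
hits = sum(s > crit for s in shuffled) / len(shuffled)
print(crit, hits)                             # hits is about 0.05
```

With a valid critical value, about 5 percent of the no-relationship slopes clear the bar, which is exactly the Type I error rate discussed next.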

What does that tell us? It tells us our Type I error rate: how often we would reject the null when we shouldn't (i.e. when the null is true). What about Type II errors? How often would we fail to reject the null when the true value of beta is actually B1? To calculate that, we need the sample size, the variances of X and Y, and all the OLS assumptions. Or we can simulate it. Validity and power.
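Power can be simulated the same way: get a critical value from the null world (true slope 0), then count how often samples from an effect world clear it. A sketch with invented values (true slope 0.5, n = 30, unit-variance noise):

```python
import random

random.seed(5)

def sim_slope(beta, n=30):
    """Draw one sample with true slope beta and return the OLS slope estimate."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [beta * a + random.gauss(0, 1) for a in xs]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Step 1: the null world (beta = 0) gives the p = .05 critical value.
null = sorted(abs(sim_slope(0.0)) for _ in range(2000))
crit = null[int(0.95 * len(null))]

# Step 2: power = share of effect-world samples (beta = 0.5) that clear it.
power = sum(abs(sim_slope(0.5)) > crit for _ in range(2000)) / 2000
print(crit, power)
```

Rerunning with a larger n or a bigger true slope raises the power, which is the simulation version of the formula-based power calculation mentioned above.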