Stat 5100 Handout #19 SAS: Influential Observations and Outliers

Similar documents
Stat 5100 Handout #6 SAS: Linear Regression Remedial Measures

Stat 5100 Handout #11.a SAS: Variations on Ordinary Least Squares

Stat 500 lab notes c Philip M. Dixon, Week 10: Autocorrelated errors

CH5: CORR & SIMPLE LINEAR REFRESSION =======================================

Stat 5100 Handout #14.a SAS: Logistic Regression

THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)

CDAA No. 4 - Part Two - Multiple Regression - Initial Data Screening

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

Data Management - 50%

CREATING THE ANALYSIS

1 Downloading files and accessing SAS. 2 Sorting, scatterplots, correlation and regression

STAT 5200 Handout #25. R-Square & Design Matrix in Mixed Models

Bivariate (Simple) Regression Analysis

Simulating Multivariate Normal Data

Getting Correct Results from PROC REG

CSC 328/428 Summer Session I 2002 Data Analysis for the Experimenter FINAL EXAM

STAT:5201 Applied Statistic II

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

Cell means coding and effect coding

SAS Workshop. Introduction to SAS Programming. Iowa State University DAY 2 SESSION IV

Centering and Interactions: The Training Data

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach

PubHlth 640 Intermediate Biostatistics Unit 2 - Regression and Correlation. Simple Linear Regression Software: Stata v 10.1

Factorial ANOVA. Skipping... Page 1 of 18

ST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false.

Detecting and Circumventing Collinearity or Ill-Conditioning Problems

Lab #9: ANOVA and TUKEY tests

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.

Baruch College STA Senem Acet Coskun

* Sample SAS program * Data set is from Dean and Voss (1999) Design and Analysis of * Experiments. Problem 3, page 129.

Factorial ANOVA with SAS

Stat 5100 Handout #15 SAS: Alternative Predictor Variable Types

Conditional and Unconditional Regression with No Measurement Error

Stat 5100 Handout #12.f SAS: Time Series Case Study (Unit 7)

SASEG 9B Regression Assumptions

Here is Kellogg s custom menu for their core statistics class, which can be loaded by typing the do statement shown in the command window at the very

SLStats.notebook. January 12, Statistics:

Introduction to Hierarchical Linear Model. Hsueh-Sheng Wu CFDR Workshop Series January 30, 2017

SAS/STAT 13.1 User s Guide. The REG Procedure

MS in Applied Statistics: Study Guide for the Data Science concentration Comprehensive Examination. 1. MAT 456 Applied Regression Analysis

ST Lab 1 - The basics of SAS

Within-Cases: Multivariate approach part one

Regression Analysis and Linear Regression Models

Generalized Least Squares (GLS) and Estimated Generalized Least Squares (EGLS)

Introduction to Statistical Analyses in SAS

SAS/STAT 13.1 User s Guide. The SCORE Procedure

Two-Stage Least Squares

Introduction to Stata: An In-class Tutorial

STA 570 Spring Lecture 5 Tuesday, Feb 1

Multiple Regression White paper

Product Catalog. AcaStat. Software

Lab 07: Multiple Linear Regression: Variable Selection

Instruction on JMP IN of Chapter 19

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

Repeated Measures Part 4: Blood Flow data

SOCY7706: Longitudinal Data Analysis Instructor: Natasha Sarkisian. Panel Data Analysis: Fixed Effects Models

Variable selection is intended to select the best subset of predictors. But why bother?

Applied Regression Modeling: A Business Approach

5.5 Regression Estimation

An Econometric Study: The Cost of Mobile Broadband

Panel Data 4: Fixed Effects vs Random Effects Models

I. MODEL. Q3i: Check my . Q29s: I like to see films and TV programs from other countries. Q28e: I like to watch TV shows on a laptop/tablet/phone

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment

An introduction to SPSS

Withdrawn Equity Offerings: Event Study and Cross-Sectional Regression Analysis Using Eventus Software

Week 4: Simple Linear Regression III

Lecture 7: Linear Regression (continued)

CH9.Generalized Additive Model

range: [1,20] units: 1 unique values: 20 missing.: 0/20 percentiles: 10% 25% 50% 75% 90%

Week 4: Simple Linear Regression II

Introduction to SAS. I. Understanding the basics In this section, we introduce a few basic but very helpful commands.

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Subset Selection in Multiple Regression

- 1 - Fig. A5.1 Missing value analysis dialog box

Study Guide. Module 1. Key Terms

STAT 503 Fall Introduction to SAS

Minitab 17 commands Prepared by Jeffrey S. Simonoff

LAMPIRAN. Sampel Penelitian

Chapter 6: Linear Model Selection and Regularization

STAT 5200 Handout #21. Split Split Plot Example (Ch. 16)

Outlier Detection Using the Forward Search in SAS/IML Studio

Week 5: Multiple Linear Regression II

JMASM 46: Algorithm for Comparison of Robust Regression Methods In Multiple Linear Regression By Weighting Least Square Regression (SAS)

INTRODUCTION TO PANEL DATA ANALYSIS

Regression on the trees data with R

Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA

Model Diagnostic tests

Salary 9 mo : 9 month salary for faculty member for 2004

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

EXST3201 Mousefeed01 Page 1

Chapter 16 The SIMLIN Procedure. Chapter Table of Contents

6:1 LAB RESULTS -WITHIN-S ANOVA

Example 1 of panel data : Data for 6 airlines (groups) over 15 years (time periods) Example 1

THE LINEAR PROBABILITY MODEL: USING LEAST SQUARES TO ESTIMATE A REGRESSION EQUATION WITH A DICHOTOMOUS DEPENDENT VARIABLE

. predict mod1. graph mod1 ed, connect(l) xlabel ylabel l1(model1 predicted income) b1(years of education)

Independent Variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Two useful macros to nudge SAS to serve you

Lab 2: OLS regression

Transcription:

Stat 5100 Handout #19 SAS: Influential Observations and Outliers Example: Data collected on 50 countries relevant to a cross-sectional study of a lifecycle savings hypothesis, which states that the response variable SavRatio: aggregate personal saving divided by disposable income can be explained by the following four predictor variables: AvIncome: per-capita disposable income, in USD (yearly average over decade) GrowRate: percentage growth rate in per-capita disposable income (over decade) PopU15: percentage of the population less than 15 years old (yearly average over decade) PopO75: percentage of the population over 75 years old (yearly average over decade) The decade is 1960-1970. These data are published in section 2.2 of Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (1980) by Belsley, Kuh, and Welsch (limited excerpt available through Google books). /* Define options */ ods html image_dpi=300 style=journal; /* Read in the data */ proc import out=work.savings dbms=csv replace datafile = "C:\jrstevens\Teaching\Stat5100\Data\savings.csv"; getnames=yes; datarow=2; /* Look at a regression model to predict SavRatio, with diagnostics for influential obs. and outliers */ proc reg data = savings plots(label)=(cooksd RStudentByLeverage DFFITS DFBETAS); id Country; model SavRatio = PopU15 PopO75 AvIncome GrowRate / data; output out=out1 r=resid p=pred; title1 'Predict SavRatio'; 1

Predict SavRatio Number of Observations Read 50 Number of Observations Used 50 Analysis of Variance Source DF Sum of Squares Mean Square F Value Pr > F Model 4 332.91525 83.22881 5.76 0.0008 Error 45 650.71300 14.46029 Corrected Total 49 983.62825 Root MSE 3.80267 R-Square 0.3385 Dependent Mean 9.67100 Adj R-Sq 0.2797 Coeff Var 39.32033 Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > t Intercept 1 28.56609 7.35452 3.88 0.0003 PopU15 1-0.46119 0.14464-3.19 0.0026 PopO75 1-1.69150 1.08360-1.56 0.1255 AvIncome 1-0.00033690 0.00093111-0.36 0.7192 GrowRate 1 0.40969 0.19620 2.09 0.0425 2

3

4

5

Obs Country PopU15 SavRatio PopU15 PopO75 SavRatio PopO75 AvIncome SavRatio AvIncome GrowRate SavRatio GrowRate 1 Australia -1.13831 1.38856-0.38628 1.51696 768.25943 0.60475-0.01280 0.85833 21 Ireland 6.87903 0.21857 1.49268 0.86626-928.27477 3.70387-1.59754 2.73663 22 Italy -3.46588 3.52520 0.02001 1.89291-530.90583 2.10562-1.03136 1.50421 23 Japan -9.48247 9.65473-1.30002 7.48046 328.00826 5.17098 4.14032 6.97775 24 Korea -2.03533-5.16830-0.44035-5.36212-11.96560-6.10295 1.88266-5.33567 44 United States 4.41082-3.14583-0.30252-0.59987 2191.50614-1.84991 1.46020-0.51335 45 Venezuela 2.53497 2.46341-0.11300 3.82366 444.00821 3.48293-2.30789 2.68698 46 Zambia -0.70557 10.07632-0.40325 10.43302 130.21746 9.70704 1.49950 10.36525 49 Libya 7.98336-6.51140 0.83739-4.24597 49.71961-2.84628 12.47740 2.28240 50 Malaysia 1.93033-3.86112-0.13636-2.74022 243.12949-3.05278 1.68312-2.28130 6

/* Check other assumptions */ /* Define shortcut macro, using line copied from www.stat.usu.edu/jrstevens/stat5100/resid_num_diag_1line.sas */ %macro resid_num_diag(dataset,... /* Call shortcut macro */ %resid_num_diag(dataset=out1, datavar=resid, label='residual', predvar=pred, predlabel='predicted'); P-value for Brown-Forsythe test of constant variance in Residual vs. Predicted Value Obs t_bf BF_pvalue 1 2.40263 0.020193 Output for correlation test of normality of Residual (Check text Table B.6 for threshold) Pearson Correlation Coefficients, N = 50 Prob > r under H0: Rho=0 resid expectnorm resid Residual 1.00000 0.99252 <.0001 expectnorm 0.99252 1.00000 <.0001 7

/* Alternative thresholds for influential obs. and outlier diagnostics */ data temp; p=5; /* p = # beta's (incl. intercept */ n = 50; /* n = sample size */ CooksDsimple = 4/n; CooksD10 = finv(.10,p,n-p); CooksD20 = finv(.20,p,n-p); CooksD50 = finv(.50,p,n-p); RStudent95 = tinv((1-.05/2),(n-p)); RStudent95Bonf = tinv((1-.05/2/n),(n-p)); Leverage2 = 2*p/n; Leverage3 = 3*p/n; DFBETAS = 2/n**0.5; if (n <= 30) then DFBETAS = 1; DFFITS = 2*(p/n)**0.5; if (n <= 30) then DFFITS = 1; ; proc print data=temp; var CooksDsimple CooksD10 CooksD20 CooksD50 RStudent95 RStudent95Bonf Leverage2 Leverage3 DFBETAS DFFITS; title1 'Alternative thresholds'; Alternative thresholds Obs CooksDsimple CooksD10 CooksD20 CooksD50 RStudent95 RStudent95Bonf Leverage2 Leverage3 DFBETAS DFFITS 1 0.08 0.31729 0.46527 0.88349 2.01410 3.52025 0.2 0.3 0.28284 0.63246 8

/* Now look more closely at distribution of predictors, and suspect observations */ proc univariate data=savings noprint; var PopU15 PopO75 AvIncome GrowRate; histogram PopU15 PopO75 AvIncome GrowRate; title1; proc print data=savings; where country = 'Ireland' country = 'Japan' country = 'United States' country = 'Libya' country = 'Zambia'; var country SavRatio PopU15 PopO75 AvIncome GrowRate; title1 'Suspect observations'; 9

Suspect observations Obs Country SavRatio PopU15 PopO75 AvIncome GrowRate 21 Ireland 11.34 31.16 4.19 1139.95 2.99 23 Japan 21.1 27.01 1.91 1257.28 8.21 44 United States 7.56 29.81 3.43 4001.89 2.45 46 Zambia 18.56 45.25 0.56 138.33 5.14 49 Libya 8.89 43.69 2.07 123.58 16.71 /*************************************************** Possible Remedial Measures for these data: Drop Japan -- PopU15 and PopO75 don't match profile (influential but not outliers) Take log of AvIncome and log of GrowRate -- their distributions are skew right -- the extreme observation in each is a suspect obs. (United States for AvIncome, Libya for GrowRate) ***************************************************/ /* Create new data set and fit regression model; check assumptions */ data newsavings; set savings; if country ne 'Japan'; logavincome = log(avincome); loggrowrate = log(growrate); proc reg data = newsavings plots(label)=(cooksd RStudentByLeverage DFFITS DFBETAS); id Country; model SavRatio = PopU15 PopO75 logavincome loggrowrate / ; output out=out2 r=resid p=pred; title1 'Predict SavRatio'; title2 '(after remedial measures)'; 10

Predict SavRatio (after remedial measures) Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > t Intercept 1 26.25118 10.52632 2.49 0.0165 PopU15 1-0.33837 0.15791-2.14 0.0377 PopO75 1-0.68558 1.13571-0.60 0.5492 logavincome 1-0.71860 0.97492-0.74 0.4650 loggrowrate 1 1.33042 0.72528 1.83 0.0734 11

/* Check model assumptions */ %resid_num_diag(dataset=out2, datavar=resid, label='new Residual', predvar=pred, predlabel='new Predicted Value'); 12

P-value for Brown-Forsythe test of constant variance in New Residual vs. New Predicted Value Obs t_bf BF_pvalue 1 2.43339 0.018815 Output for correlation test of normality of New Residual (Check text Table B.6 for threshold) Pearson Correlation Coefficients, N = 49 Prob > r under H0: Rho=0 resid expectnorm resid New Residual 1.00000 0.99516 <.0001 expectnorm 0.99516 1.00000 <.0001 /* Look at final model */ proc reg data = newsavings; model SavRatio = PopU15 loggrowrate; title1 'Final Model'; Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > t Intercept 1 14.27955 2.40166 5.95 <.0001 PopU15 1-0.18046 0.05915-3.05 0.0038 loggrowrate 1 1.45209 0.71058 2.04 0.0468 13