Graphics before and after model fitting. Nicholas J. Cox University of Durham.

Nicholas J. Cox, University of Durham, n.j.cox@durham.ac.uk

It is commonplace to compute various flavours of residual and predicted values after fitting many different kinds of model. This allows production of a great variety of diagnostic graphics, used to examine the general and specific fit between data and model and to seek possible means of improving the model. Several different graphs may be inspected in many modelling exercises, partly because each kind may be best for particular purposes, and partly because in many analyses a variety of models in terms of functional form, choice of predictors, and so forth may be entertained, at least briefly. It is therefore helpful to be able to produce such graphs very rapidly.

Official Stata supplies as built-ins a bundle of commands originally written for use after regress: avplot and avplots; cprplot and acprplot; lvr2plot; rvfplot and rvpplot. These were introduced in Stata 3.0 in 1992 and are documented at [R] regdiag. More recently, in an update to Stata 7.0 on 6 September 2001, all but the first two were modified so that they may be used after anova. Despite their many uses, this suite omits some very useful kinds of plot, and none of the commands may be used after other modelling commands.
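As a quick reminder of what the official suite does, here is a minimal sketch to be run as a do-file. The auto dataset and the choice of predictors are assumptions made only for illustration; they are not taken from the talk.

```stata
* Official diagnostic plots after regress, using the auto data shipped with Stata
sysuse auto, clear
regress mpg weight foreign

avplot weight        // added-variable plot for one predictor
avplots              // added-variable plots for all predictors
cprplot weight       // component-plus-residual plot
acprplot weight      // augmented component-plus-residual plot
lvr2plot             // leverage versus squared normalized residuals
rvfplot              // residuals versus fitted values
rvpplot weight       // residuals versus one predictor
```

Each command draws on the results left behind by the last regress, which is why no model needs to be named.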

Here the focus is on a new set of commands, biased towards graphics useful for models predicting continuous response variables. Note also other work on models for categorical dependent variables by J. Scott Long and Jeremy Freese. The ideal followed in this project is to make minimal assumptions about which modelling command has been issued previously. The downside for users is that if the data and the previous model results do not match the assumptions, it is possible to get either bizarre results or an error message. Programs (other than those in progress) may be downloaded from SSC.

The principles followed include:

as far as possible, the command name by itself should produce a useful plot;
predict is used to produce temporary variables for residuals, fitted values, etc.;
each graph refers to the last model fitted;
each graph has reasonably smart default axis titles, etc.;
options are provided for key needs, especially separate() to produce separate predicted curves given dummy variables.
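These principles can be illustrated with a small program. What follows is a hypothetical sketch, not the author's code: the name ovfsketch is invented here, the graphics syntax assumes Stata 8 or later, and the auto data are used only as an example. It takes the response from e(depvar), uses predict to fill a temporary variable, restricts itself to the estimation sample of the last model, and supplies default axis titles.

```stata
* Hypothetical sketch (not the author's ovfplot): observed versus fitted
* after any estimation command that supports predict.
capture program drop ovfsketch
program define ovfsketch
    version 8
    syntax [, *]                          // any other options go to twoway
    if "`e(depvar)'" == "" {
        display as error "last estimates not found"
        exit 301
    }
    local y : word 1 of `e(depvar)'
    tempvar fit                           // temporary variable, dropped on exit
    quietly predict double `fit' if e(sample)
    quietly summarize `fit' if e(sample), meanonly
    twoway (scatter `y' `fit' if e(sample))                   ///
           (function y = x, range(`r(min)' `r(max)')),        ///
           ytitle("Observed `y'") xtitle("Fitted values")     ///
           legend(off) `options'
end

sysuse auto, clear
regress mpg weight
ovfsketch                                 // command name alone gives a plot
```

Because the fitted values live in a temporary variable, the dataset is left unchanged after the graph is drawn.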

The commands which have been written include:

1. anovaplot shows fitted or predicted values from an immediately previous one-, two- or three-way anova. By default the data for the response are also plotted. In particular, anovaplot can show interaction plots.

2. indexplot plots estimation results (by default whatever predict produces by default) from an immediately previous regress or similar command versus a numeric index or identifier variable, if that is supplied, or observation number, if that is not supplied. Values are shown, by default, as vertical spikes starting at 0.
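The idea behind an index plot can be reproduced by hand with official commands. This is a sketch under assumptions: the auto data and the variable names res and obsno are chosen here for illustration, and residuals are requested explicitly, whereas the slide says indexplot defaults to whatever predict produces by default.

```stata
* Index plot by hand: residuals as vertical spikes from 0
sysuse auto, clear
regress mpg weight foreign
predict double res, residuals
generate long obsno = _n                  // observation number as the index
twoway spike res obsno, yline(0)          ///
    ytitle("Residuals") xtitle("Observation number")
```

twoway spike starts its spikes at 0 by default, which matches the default described for indexplot.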

3. ovfplot plots observed versus fitted or predicted values for the response from an immediately previous regress or similar command, with by default a line of equality superimposed.

4. qfrplot shows quantile plots of fitted values, minus their mean, and of residuals from the previous estimation command. Fitted values are whatever predict produces by default and residuals are whatever predict, res produces. Comparing the distributions gives an overview of their variability and some idea of their fine structure. By default plots are side-by-side. Quantile plots may instead be of observed versus normal (Gaussian) quantiles. (Work in progress: separate plots.)
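The comparison that qfrplot makes can be approximated with official commands. The following is a sketch under assumptions (auto data, invented variable and graph names), not the command's own implementation.

```stata
* Quantile plots of centred fitted values and of residuals, side by side
sysuse auto, clear
regress mpg weight foreign
predict double fit                        // default after regress: fitted values
predict double res, residuals
quietly summarize fit, meanonly
generate double fitdev = fit - r(mean)    // fitted values minus their mean
quantile fitdev, name(qfit, replace) ytitle("Fitted - mean")
quantile res, name(qres, replace) ytitle("Residuals")
graph combine qfit qres, ycommon          // common y axis eases comparison
```

Replacing quantile with qnorm would give the observed versus normal variant.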

5. rdplot graphs residual distributions. The residuals are, by default, those calculated by predict, residuals or (if the previous estimation command was glm) by predict, response. The graph by default is a single or multiple dotplot, as produced by dotplot; histograms or box plots may be selected by specifying either the histogram or the box option.
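The three displays that rdplot offers correspond to official commands applied to a residual variable. This sketch assumes the auto data and groups residuals by foreign purely for illustration; the graph names are invented.

```stata
* Residual distributions by hand: dotplot, histogram, box plot
sysuse auto, clear
regress mpg weight
predict double res, residuals
dotplot res, over(foreign) name(rd_dot, replace)
histogram res, name(rd_hist, replace)
graph box res, over(foreign) name(rd_box, replace)
```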

6. regplot plots fitted or predicted values from an immediately previous regress or similar command. By default the data for the response are also plotted. With one syntax, no varname is specified, and regplot shows the response and predicted values on the y axis and the covariate named first in the regress or similar command on the x axis. Thus with this syntax the plot shown is sensitive to the order in which covariates are specified in the estimation command. With another syntax, a varname is supplied, which may name any numeric variable. This is used as the variable on the x axis.

Thus in practice regplot is most useful when the fitted values are a smooth function of the variable shown on the x axis, or a set of such functions given also one or more dummy variables as covariates. However, other applications also arise, such as plotting observed and predicted values from a time series model versus time.
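The case of smooth fitted curves plus a dummy variable can be sketched with official commands. Everything specific here is an assumption for illustration: the auto data, the quadratic in weight, and the variable names weight2 and fit. The official separate command stands in for the separate() option mentioned earlier.

```stata
* Data plus fitted curves, one curve per level of a dummy variable
sysuse auto, clear
generate double weight2 = weight^2
regress mpg weight weight2 foreign
predict double fit
separate fit, by(foreign)                 // creates fit0 and fit1
twoway (scatter mpg weight)                                   ///
       (line fit0 fit1 weight, sort),                         ///
       ytitle("Mileage (mpg)")                                ///
       legend(order(2 "Domestic" 3 "Foreign"))
```

The sort option matters: without it the lines would be joined in dataset order and would zigzag.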

7. rvfplot2 graphs a residual-versus-fitted plot, that is, a graph of the residuals versus the fitted values. The residuals are, by default, those calculated by predict, residuals or (if the previous estimation command was glm) by predict, response. The fitted values are those produced by predict by default after each estimation command. rvfplot2 is offered as a generalisation of rvfplot in official Stata.

(Work in progress: rac graphs residual autocorrelations.)
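To show what a residual-versus-fitted plot looks like outside regress, here is a sketch after glm built from official commands. The gamma family with log link and the auto data are assumptions chosen only so that the example runs; they are not from the talk.

```stata
* Residual-versus-fitted plot by hand after glm
sysuse auto, clear
glm mpg weight foreign, family(gamma) link(log)
predict double fit                        // default after glm: predicted mean
predict double res, response              // observed minus fitted
twoway scatter res fit, yline(0)          ///
    ytitle("Response residuals") xtitle("Fitted values")
```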

In earlier work, two commands were produced with the main aim of producing graphs useful before model fitting.

sparl, the name being a strained acronym for scatter plot and regression line, was designed to display basic regression results, including nicely formatted summary statistics, log transformations on the fly, quadratic fit and confidence interval options, and so forth. Originally designed to be very simple, this command is currently suffering from featuritis. (A much simpler and loosely similar program is included in StataQuest.)

tsgraph (with Kit Baum) shows time series plots. It depends on data having been tsset, and correspondingly is smart about the time variable (and where relevant the panel variable).
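Loosely similar graphs can be drawn with official commands. This sketch is not a substitute for sparl or tsgraph: the first part assumes the auto data, and the second part uses a simulated series (the names t and y, the seed, and the trend are all invented for illustration, and rnormal() assumes a reasonably recent Stata).

```stata
* Scatter plot with regression line and confidence interval
sysuse auto, clear
twoway (lfitci mpg weight) (scatter mpg weight), ytitle("Mileage (mpg)")

* Time series plot of simulated data, relying on tsset for the time axis
clear
set obs 48
generate int t = _n
set seed 12345
generate double y = 10 + 0.5*t + rnormal()
tsset t
tsline y
```

Because the data have been tsset, tsline needs no time variable to be named.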