SAS Workshop. Introduction to SAS Programming. Iowa State University DAY 2 SESSION IV

SAS Workshop Introduction to SAS Programming DAY 2 SESSION IV Iowa State University May 10, 2016

Controlling ODS graphical output from a procedure

Many SAS procedures produce default plots in ODS Graphics format. One way to select the graphs (and tables) to be output is the ODS SELECT statement, as we have seen in previous examples. However, ODS SELECT only controls which pieces are sent to the ODS destination; it does not necessarily stop the plots from being produced. To selectively generate only the required plots, use the PLOTS= option available in statistical procedures that support ODS Graphics. The simplest PLOTS= specification has the form PLOTS=plot-request or PLOTS=(plot-requests); this alone does not suppress the procedure's default plots. To suppress them, add the ONLY global option, e.g., plots(only)=residuals. Specifying plots=none disables all ODS Graphics for the current PROC step.
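As a minimal sketch of the three PLOTS= forms described above (using the muscle data set with response y and predictor x, as in Program C4 below):

```sas
/* ONLY suppresses the default plots; just the residual plot is produced */
proc reg data=muscle plots(only)=residuals;
   model y = x;
run;

/* Without ONLY, the requested plots are produced in addition to the defaults */
proc reg data=muscle plots=(diagnostics fit);
   model y = x;
run;

/* Disable all ODS Graphics for this PROC step */
proc reg data=muscle plots=none;
   model y = x;
run;
```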

Sample SAS Program C4

data muscle;
   input x y;
   label x='Age' y='Muscle Mass';
   datalines;
71 82
64 91
53 100
49 105
78 77
;
ods pdf file="u:\documents\sas_workshop_spring2016\c4_out.pdf";
proc reg data=muscle plots(only)=(diagnostics qq residualbypredicted fit residuals);
   model y = x / r;
   title 'Simple Linear Regression Analysis of Muscle Mass Data';
run;
ods pdf close;

Sample SAS Program C5

data lead;
   input Sample $ x y;
   label x='Traffic Flow' y='Lead Content';
   datalines;
A  8.3  227
B  8.3  312
C 12.1  362
D 12.1  521
E 17.0  640
F 17.0  539
G 17.0  728
H 24.3  945
I 24.3  738
J 24.3  759
K 33.6 1263
L 10.0    .
M 15.0    .
;
proc reg data=lead plots(only label)=(fit rstudentbyleverage cooksd);
   model y = x / clm cli;
   id Sample;
   title 'Prediction Intervals: Lead Content Data';
run;

Sample SAS Program C6

In the following example we use the PLOTS= option to request the diagnostics panel and the regression fit plot from a regression of Weight on Height in the biology data. The CLB option computes confidence intervals for the regression coefficients, and the R and INFLUENCE options print all residual and influence statistics.

libname libc "U:\Documents\SAS_Workshop_Spring2016\Data\";
ods select ANOVA ParameterEstimates OutputStatistics FitPlot;
proc reg data=libc.biology plots(only)=(fit diagnostics);
   model Weight = Height / clb r influence;
   title 'Regression of Weight on Height: Biology Class';
run;

Small SAS Project

- Import an Excel data set as a SAS data set using PROC IMPORT. The data consist of air pollution and related values for 41 U.S. cities; SO2 in the air (in mcg/m3) is the response variable.
- Use PROC SGSCATTER to obtain a scatterplot matrix and PROC REG to do a preliminary multiple regression analysis.
- From the plots alone, Obs #31 looks like an influential y-outlier; diagnostic statistics must be examined to confirm this.
- Read a file containing city names indexed by the same city number used in the Excel file, combine it with the first SAS data set using a merge, and save the resulting SAS data set.
- In a second SAS program, access this SAS data set and perform a variable subset selection procedure using PROC REG.

Sample SAS Program C7

libname mylib "U:\Documents\SAS_Workshop_Spring2016\Data\";
proc import out=work.air
      datafile="U:\Documents\SAS_Workshop_Spring2016\Data\air_pollution.xls"
      dbms=xls replace;
   getnames=yes;
run;
proc print data=air;
run;
ods rtf file="u:\documents\sas_workshop_spring2016\c7_out.rtf" style=statistical;
proc sgscatter data=air;
   title "Scatterplot Matrix for Air Pollution Data";
   matrix SO2--PrecipDays;
run;
proc reg corr data=air plots(only label)=diagnostics;
   model SO2 = AvTemp--PrecipDays / r influence clb vif;
   id City;
   title 'Model fitted with all explanatory variables';
run;
ods rtf close;

Sample SAS Program C7 (Continued)

data names;
   infile "U:\Documents\SAS_Workshop_Spring2016\Data\city_names.txt" truncover;
   input City CityName $14.;
run;
proc print data=names;
   title "List of City Names";
run;
data mylib.pollution;
   merge air names;
   by City;
run;
proc print data=mylib.pollution;
   title "Listing of Air Pollution Data Set Merged with City Names";
run;

Sample SAS Program C8

libname mylib "U:\Documents\SAS_Workshop_Spring2016\Data\";
data cleaned;
   set mylib.pollution;
   if _N_ = 31 then delete;
run;
ods pdf file="u:\documents\sas_workshop_spring2016\c8_out.pdf";
proc reg data=cleaned plots(only)=(criteria cp(label));
   model SO2 = AvTemp--PrecipDays / selection=rsquare start=2 stop=4 best=4 cp sse mse;
   title "Models fitted with all explanatory variables (with Obs #31 deleted)";
run;
ods pdf close;

Model Building: Variable Selection in Regression

The aim of variable selection methods is to identify a subset of the k predictors (i.e., x-variables) that has good predictive power. Classical methods are based on entering (forward selection) or deleting (backward elimination) a single variable at a time from the current model, based on p-values. The significance of a candidate variable is assessed with an F-statistic comparing the current model to the model with that variable added. If the variable is significant at the significance level for entry (called SLE in SAS), that is, if its p-value is smaller than SLE, the variable is entered. The same process is used for deleting variables: a variable is deleted if it fails to be significant at the significance level for stay (SLS in SAS). The stepwise selection method combines these two methods: in each iteration, a forward selection step is followed by a backward elimination step.
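The three selection methods above map directly onto the SELECTION= option of PROC REG. A sketch using the air-pollution data from Program C7 (the SLE/SLS thresholds shown are illustrative choices, not defaults):

```sas
/* Forward selection: a variable enters if its p-value < SLENTRY (SLE) */
proc reg data=air;
   model SO2 = AvTemp--PrecipDays / selection=forward slentry=0.10;
run;

/* Backward elimination: a variable stays only if its p-value < SLSTAY (SLS) */
proc reg data=air;
   model SO2 = AvTemp--PrecipDays / selection=backward slstay=0.10;
run;

/* Stepwise: forward entry followed by a backward check at each iteration */
proc reg data=air;
   model SO2 = AvTemp--PrecipDays / selection=stepwise slentry=0.15 slstay=0.15;
run;
```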

All Subsets Selection Method

Suppose we start with k predictor variables x1, x2, ..., xk. Fit all models of size p, for p = 1, ..., k (i.e., all 1-variable models, all 2-variable models, etc.). Pick the best among the models of each size, where "best" is defined as having the largest R². Then select a single best model from among these using a criterion such as Cp (or AIC), BIC, or adjusted R². For the selected model to be unbiased, we would like Cp to be close to p or smaller. AIC and Cp are equivalent for models with normal errors. Generally we select the model that has the lowest BIC value. There is no guarantee that the selected models will perform well when accuracy of predicting new observations is of interest. A standard approach for assessing the predictive ability of different regression models is to evaluate their performance on a hold-out data set (often called the validation data set). When a sufficiently large data set is available, this is usually achieved by randomly splitting the data into a training data set and a validation data set.
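The random training/validation split described above can be carried out directly in PROC GLMSELECT. A sketch using the air-pollution data from Program C7; the 30% validation fraction and the seed are arbitrary illustrative choices:

```sas
/* Hold out a random 30% of observations for validation, select by
   significance level, and choose the model with the smallest
   validation-set prediction error */
proc glmselect data=air seed=1234;
   partition fraction(validate=0.3);
   model SO2 = AvTemp--PrecipDays
         / selection=stepwise(select=sl choose=validate);
run;
```

Because the split is random, rerunning with a different SEED= value can select a different model; with small data sets, cross-validation (CHOOSE=CV) is a common alternative.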