GLMSELECT for Model Selection

Similar documents
Data Management - 50%

Model Inference and Averaging. Baging, Stacking, Random Forest, Boosting

Statistical Consulting Topics Using cross-validation for model selection. Cross-validation is a technique that can be used for model evaluation.

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

SAS Workshop. Introduction to SAS Programming. Iowa State University DAY 3 SESSION I

SAS/STAT 15.1 User s Guide The GLMSELECT Procedure

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100.

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

SAS/STAT 12.3 User s Guide. The QUANTSELECT Procedure (Chapter)

An Animated Guide: Penalized variable selection techniques in SAS and Quantile Regression

From Building Better Models with JMP Pro. Full book available for purchase here.

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

SAS/STAT 15.1 User s Guide The HPREG Procedure

MS in Applied Statistics: Study Guide for the Data Science concentration Comprehensive Examination. 1. MAT 456 Applied Regression Analysis

SAS/STAT 15.1 User s Guide The QUANTSELECT Procedure

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Variable Selection 6.783, Biomedical Decision Support

Enterprise Miner Tutorial Notes 2 1

Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers

Didacticiel - Études de cas

2017 ITRON EFG Meeting. Abdul Razack. Specialist, Load Forecasting NV Energy

SAS Workshop. Introduction to SAS Programming. Iowa State University DAY 2 SESSION IV

Leveling Up as a Data Scientist. ds/2014/10/level-up-ds.jpg

Resampling methods (Ch. 5 Intro)

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Moving Beyond Linearity

Introduction to Automated Text Analysis. bit.ly/poir599

Stat 342 Exam 3 Fall 2014

What s new in SAS 9.2

Using Machine Learning to Optimize Storage Systems

Predictive modelling / Machine Learning Course on Big Data Analytics

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13

Package FWDselect. December 19, 2015

Multivariate Analysis Multivariate Calibration part 2

Evaluating generalization (validation) Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support

A Systematic Overview of Data Mining Algorithms

Introduction to Classification & Regression Trees

Nonparametric Approaches to Regression

Nonparametric Classification Methods

Evolution of Regression II: From OLS to GPS to MARS Hands-on with SPM

Mapping of Hierarchical Activation in the Visual Cortex Suman Chakravartula, Denise Jones, Guillaume Leseur CS229 Final Project Report. Autumn 2008.

Predict Outcomes and Reveal Relationships in Categorical Data

Evaluating Classifiers

Gradient LASSO algoithm

Machine Learning Techniques for Data Mining

ST Lab 1 - The basics of SAS

Generalized Additive Models

Knowledge Discovery and Data Mining

Slides for Data Mining by I. H. Witten and E. Frank

SAS/STAT 15.1 User s Guide The HPQUANTSELECT Procedure

Ivy s Business Analytics Foundation Certification Details (Module I + II+ III + IV + V)

Random Forests and Boosting

Cross-validation. Cross-validation is a resampling method.

The SAS Denomininator Degrees of Freedom Option

CS249: ADVANCED DATA MINING

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Lecture 13: Model selection and regularization

Species distribution modelling for combined data sources

Machine Learning Duncan Anderson Managing Director, Willis Towers Watson

Lab 07: Multiple Linear Regression: Variable Selection

Logistic Model Selection with SAS PROC s LOGISTIC, HPLOGISTIC, HPGENSELECT

Linear Methods for Regression and Shrinkage Methods

Multiresponse Sparse Regression with Application to Multidimensional Scaling

SAS Visual Analytics 8.2: Working with SAS Visual Statistics

MR-2010I %MktRuns Macro %MktRuns Macro

Just for the Kids School Reports: A data-oriented web site powered by SAS

Discussion Notes 3 Stepwise Regression and Model Selection

LECTURE 6: CROSS VALIDATION

Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Regression Model Building for Large, Complex Data with SAS Viya Procedures

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

22s:152 Applied Linear Regression

Classification and Regression Trees

Development of Prediction Model for Linked Data based on the Decision Tree for Track A, Task A1

Nonparametric Methods Recap

Cross-validation for detecting and preventing overfitting

Stat 5100 Handout #14.a SAS: Logistic Regression

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

22s:152 Applied Linear Regression

Excel to R and back 1

Conquering Massive Clinical Models with GPU. GPU Parallelized Logistic Regression

Evaluating Classifiers

Package EBglmnet. January 30, 2016

Evolution of Regression III:

Model Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer

CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology

Lasso Regression: Regularization for feature selection

Lecture 20: Bagging, Random Forests, Boosting

An Overview MARS is an adaptive procedure for regression, and is well suited for high dimensional problems.

Nina Zumel and John Mount Win-Vector LLC

Linear Mixed Models: Methodology and Algorithms

Information Criteria Methods in SAS for Multiple Linear Regression Models

14. League: A factor with levels A and N indicating player s league at the end of 1986

Data-Splitting Models for O3 Data

Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)

Overview and Practical Application of Machine Learning in Pricing

Random Forest in Genomic Selection

Transcription:

Winnipeg SAS User Group Meeting May 11, 2012 GLMSELECT for Model Selection Sylvain Tremblay SAS Canada Education Copyright 2010 SAS Institute Inc. All rights reserved.

Proc GLM Proc REG Class Statement Contrasts Proc GLMSELECT Variable selection Diagnostics Graphics 2

Agenda Overview of GLMSELECT Basic Syntax Variable Selection Using Partitionning References Conclusion / Questions 3

Overview of Proc GLMSELECT Performs effect selection in the framework of general linear models A variety of model selection methods are available extensive capabilities for customizing the selection with a wide variety of selection and stopping criteria The procedure also provides graphical summaries of the selection search It produces output data sets and supports the SCORE statement 4

Model Specification supports different parameterizations for classification effects supports any degree of interaction (crossed effects) and nested effects supports hierarchy among effects supports partitioning of data into training, validation, and testing roles supports constructed effects including spline and multimember effects 5

Selection Control provides multiple effect selection methods enables selection from a very large number of effects (tens of thousands) provides effect selection based on a variety of selection criteria provides stopping rules based on a variety of model evaluation criteria provides leave-one-out and k-fold cross validation supports data resampling and model averaging 6

Display and Output produces graphical representation of selection process produces output data sets containing predicted values and residuals produces macro variables containing selected models supports parallel processing of BY groups supports multiple SCORE statements 7

Agenda Overview of GLMSELECT Basic Syntax Variable Selection Using Partitionning References Conclusion / Questions 8

Basic Syntax 9

Model Selection Methods SELECTION=method <(method options)> None Forward Backward Stepwise LAR Least Angle Regression LASSO Least Absolute Shrinkage and Selection Operator 10

Model Selection Methods Method Options SELECTION=method <(method options)> 11

Agenda Overview of GLMSELECT Basic Syntax Variable Selection Using Partitionning References Conclusion / Questions 12

Signal versus Noise Predictive Modeling Target = Signal + Noise Signal = Systematic Variation = Predictable Noise = Random Variation = Unpredictable 13

Effect selection by partitioning the data Very useful in the context of predictive modeling Split the data in two partitions: training & validation You fit (train) the model on the training partition The model is evaluated on the validation data The best model is the simplest model that has the best performance on the validation data The goal is to avoid overfitting 14

Agenda Overview of GLMSELECT Basic Syntax Variable Selection Using Partitionning References Conclusion / Questions 15

Reference Materials Proc GLMSELECT Documentation http://support.sas.com/documentation/cdl/en/statug/63962/html/defau lt/viewer.htm#glmselect_toc.htm Introducing the GLMSELECT PROCEDURE for Model Selection http://www2.sas.com/proceedings/sugi31/207-31.pdf 16

Conclusion Proc GLMSELECT is a powerful tool for automatic variable selection for linear models Very flexible with many selection methods and stopping criterion to choose from: LARS and LASSO Resampling techniques: k-fold validation and bootstrap Data partitionnning 17

Questions? THANK YOU! Sylvain.Tremblay@sas.com Copyright 2010 SAS Institute Inc. All rights reserved.