Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers

Size: px
Start display at page:

Download "Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers"

Transcription

1 Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers

2 Why enhance GLM? Shortcomings of the linear modelling approach. GLM being a linear technique shares the usual shortcomings of the linear modelling approach: operate under the assumption that data is distributed according to a distribution in the exponential family are affected by multi-collinearity, outliers and missing values in the data, often troublesome to use for selecting important predictors and their interactions. categorical predictors with large numbers of categories (for example, postcode, occupation code etc) can lead to unreliable results due to sparsityrelated issues. Linear models take longer to build because of the need to address the issues above by transforming both numeric and categorical predictors and choosing predictors and their interactions by hand which can prove to be a lengthy task. Page 2

3 Data mining techniques in contrast, are typically fast, easily select predictors and their interactions, are minimally affected with missing values, outliers or collinearity and effectively process high-level categorical predictors. This suggests that combining a linear approach with data mining tools can expedite the modelling process, allowing the modeller to attain equal or better model accuracy in less time with the same level of interpretability. Such models, usually combining decision trees, multivariate adaptive regression splines and GLM have been used by our team in a number of projects (see Kolyshkina and Brookes, 2002). Page 3

4 Strengths of GLM can be effectively combined with the computational power of data mining methods The advance of the new methodologies does not mean that the proven, effective techniques such as GLM should be wholly replaced by them. This talk: 1. Discusses how the advantages and strengths of GLM can be combined with the computational power of data mining methods presenting an example of the combining (MARS ) and GLM approaches 2. Provides comparison of the results of this combined model to those achievable by hand-fitted GLM in terms of: time taken predictive power selection of predictors and their interactions interpretability of the model precision and model fit Page 4

5 MARS methodology overview.

6 MARS can be described as Generalization of stepwise linear regression or Modification of the CART method for the regression setting (ie for a continuous dependent variable) Page 6

7 MARS model - knots for continuous predictors For each continuous predictor X MARS forms pairs of piecewise linear functions with knots at each observed value t of the predictor: X1 is equal to X2 is equal to (X-t) if X >t 0 otherwise (t- X) if X <=t 0 otherwise It then builds a model using only these functions and their products. Value t is then called a knot Page 7

8 MARS model - knots. Example. For example, for the variable AGE such piecewise linear functions may be: AGE1 is equal to AGE2 is equal to (AGE-20) if AGE>20 0 otherwise (20-AGE) if AGE<=20 0 otherwise Value 20 is then called a knot Page 8

9 MARS model - piecewise constant functions for categorical predictors For categorical predictors MARS considers all possible binary partitions of the categories. Each such partition generates a pair of piecewise constant basis functions that are indicator functions for the two sets of categories. Page 9

10 MARS model - basis functions We will call these functions and their products basis functions. MARS model is a linear combination of basis functions: y= a 0 + a 1 *BF1+ a 2 *BF2 + a 3 *BF3+ The coefficients a 0, a 1, a 3, are then estimated by minimising the residual sum of squares (like in linear regression). Page 10

11 MARS model building. The MARS model is built in a two-phase process. In the first phase, a model is grown by adding basis functions in the manner similar to forward selection process in linear regression until an overly large model is found. In the second phase, basis functions are deleted in order of least contribution to the model in the manner similar to backward deletion process in linear regression Page 11

12 Optimal MARS model selection. Degrees of Freedom penalty. The optimal MARS model is the one with the lowest GCV (generalized crossvalidation) measure. The GCV criterion actually does not involve cross validation: C(M) is the cost-complexity measure of a model containing M basis functions. Note that C(M)=M is the usual measure used in linear regression. The MSE is calculated by dividing the sum of squared errors by N-M instead of by N. The GCV formula enables C(M) > M; in other words, enables charging each basis function with more than one degree of freedom. Page 12

13 MARS model building. Why this model strategy? The basis functions and their products are only non-zero over a part of their range, as a result the regression surface is built up parsimoniously, non-zero coefficients are only used locally, only where they are needed (unlike polynomials that produce non-zero product everywhere). By allowing for any arbitrary shape for the response function as well as for interactions, and by using the two-phase model selection method, MARS is capable of reliably tracking very complex data structures that often hide in high-dimensional data. Page 13

14 How the Use of MARS Can Expedite GLM Building. Most of the shortcomings of linear models outlined above can be overcome by using MARS output basis functions as the input for GLM. This makes sense because MARS : quickly selects important predictors and their interactions and transforms numeric and categorical predictors in such a way that the resulting variables are linearly independent and easy to interpret Is minimally affected by multicollinearity, outliers and missing values in the data, easily handles categorical predictors with large numbers of categories and therefore requires less data preparation than linear methods The modeller though needs to make sure that the transformed predictors make business sense, and that the MARS model is stable. Page 14

15 Are MARS output functions OK to use as GLM inputs? We have seen that although MARS output functions are not created specifically to be used as the input for a linear model, in practice about 90% of them turn out to be significant predictors in a GLM. Another feature of the MARS output functions that makes them useful is that they are linearly independent as stated in Friedman (1991) which means that the multi-collinearity issues do not arise in the GLM that uses them as explanatory variables. Page 15

16 Combined models are easily implemented in SAS MARS package is easily available, inexpensive and can work with data in most formats (SAS, SPSS, dbf etc). The output MARS produces can be combined with any GLM software with minimal effort as it is easy to code in any program language such as SAS which is the main data analysis software package used by many actuaries. Page 16

17 Case Study. Application of combined models to Queensland Industry CTP data provided by Motor Accident Insurance Commission (MAIC).

18 ackground. The methodology described above was applied in order to model the ultimate incurred number of claims based on reported claim data. The data used was industry-wide auto liability data from Queensland (commonly called Compulsory Third Party or CTP in Australia). Page 18

19 Data Description. Individual claim data was aggregated into the number of claims reported for each accident month and development month for input to the GLM. The variables used for the analysis were: accident month accident quarter number of casualties development month development quarter number of vehicles in the calendar year number of vehicles exposed in the month Page 19

20 Hand-fitted GLM An initial GLM was created without using MARS. This was a Poisson model with the log link, using the number of vehicles exposed in the month as the offset. The transformations and interactions of the input variables were created manually for the purposes of both best model fit and interpretability. The model fit was assessed by usual methods such as ratio of deviance to the degrees of freedom, predictor significance, link test and residual analysis. All the assessments showed adequacy of the model fit. Page 20

21 MARS - enhanced GLM A second model was created by pre-processing the variables in MARS as and then including them in a GLM as the inputs: built a MARS model with the ratio of incurred number of claims to number of vehicles exposed in the month as the dependent variable. we then used basis functions produced by the MARS model as the inputs in the Poisson model with log link and the number of vehicles exposed in the month as the offset. the GLM output showed that most ofthese variables were significant. We used backward elimination to refine the model by excluding the inputs that were not significant. the resulting model fit was assessed in the same way as for the handfitted model. Page 21

22 The model output included explanatory variables preprocessed by MARS in the form of basis functions: BF1 = max(0, LGDV ); BF2 = max(0, LGDV ); BF4 = max(0, DEVMTHB ); BF5 = max(0, EXPMONTH ) * BF4; BF6 = max(0, EXPMONTH ) * BF4; BF59 = max(0, DEVMTHB ) * BF17; Page 22

23 As you can see, predictors transformed by MARS can be copied and pasted from MARS to SAS directly. They are easy to interpret. We copy and paste them into SAS and run the following SAS code to add them to the data: data data1; set glmcours.data; BF1 = max(0, LGDV ); BF2 = max(0, LGDV ); BF59 = max(0, DEVMTHB ) * BF17; run ; Page 23

24 SAS code for MARS-enhanced GLM. PROC GENMOD DATA=DATA1; model Claims = BF1 BF2 BF4 BF5 BF6 BF7 BF8 BF9 BF11 BF12 BF13 BF14 BF17 BF19 BF20 BF21 BF23 BF24 BF26 BF27 BF30 BF31 BF32 BF34 BF35 BF38 BF40 BF42 BF46 BF49 BF50 BF53 BF55 BF59 / DIST=POISSON OFFSET= LGVH LINK=LOG TYPE3 ; scwgt WEIGHT; OUTPUT OUT= residuals STDRESDEV=strdev_CLAIMS XBETA=xbeta_CLAIMS PREDICTED=pred_CLAIMS; RUN; Page 24

25 Comparison of Models

26 Goodness of Fit Comparison. The fit of both GLMs was assessed by usual methods such as ratio of deviance to the degrees of freedom, predictor significance, link test and residual analysis. Additionally we assessed gains charts for both models. The MARS - enhanced GLM has shown a similar if not slightly better fit to the hand-fitted GLM. Page 26

27 MARS-enhanced GLM. Significance of MARS basis functions. The GLM output showed that most of these variables were significant. LR Statistics For Type 3 Analysis Chi- Source DF Square Pr > ChiSq BF <.0001 BF <.0001 BF <.0001 We then used backward elimination to refine the model by excluding the inputs that were not significant. Page 27

28 Analysis of actual versus expected values of the number of claims. The records were ranked from highest to lowest in terms of predicted number of claims for each model, and then segmented into 20 equally sized groups. The average predicted and actual values of the number of claims for each group were then calculated and graphed. Claim frequency. Hand-fitted GLM Claim frequency. MARS-enhanced GLM 30,000 30,000 number of claims 20,000 10,000 number of claims 20,000 10, % of data % of data 0.95 Page 28

29 Comparison Of the Fit The models fit equally well, with the MARS - enhanced GLM having a marginally better fit. The hand-fitted GLM slightly over-predicts for the fifth group and under-predicts for the third group the MARS - enhanced GLM predicts well for the higher expected numbers of claims but slightly overpredicts for the groups with lower numbers of claims. the scale of these errors is of little business importance. Page 29

30 Gains chart for number of incurred claims for both models. The gains chart graph shows that the models perform equally well. 1.2 Gains Chart Detailed analysis of the actual tables that gains charts are based on suggests that the MARS - enhanced GLM performs marginally better than the hand-fitted model Hand-fitted GLM MARS GLM base line Page 30

31 ARS -created vs hand-transformed variables and predictor interactions: imilarities and differences. MARS -created predictor variables and those which were manually created showed a great degree of similarity. For example, if we compare the predictors based on the variable development month, MARS placed knots mostly at the same points that were found important by the hand-fitted model. The differences included the fact that the hand-fitted model included the variable minimum (development month, 10) which is equal to 10 if development month is greater than 10 and is equal to development month otherwise, while MARS selected 9 rather than 10 as the knot point. MARS also selected interactions of predictors that were not picked up by the hand-fitted model such as the interaction of development month and experience month. Page 31

32 Interpretability of the models. Interpretability of the models was similar. The hand-fitted model was easier to interpret because it included less predictors and less predictor interactions than the MARS - enhanced model. Page 32

33 Findings and results. The fit and precision of the models was similar with the MARS -enhanced GLM showing a slightly better fit than the hand-fitted GLM. The MARS -enhanced GLM included predictor interactions not picked up by the hand-fitted model. This effect would be more pronounced in raw data modelling than in modelling summarised data, as the trends observed are likely to be smoother and easier to identify as random variation cancels out. The hand-fitted model was easier to interpret because it included less predictors and less predictor interactions than the other model. Building of the MARS -enhanced GLM was considerably faster and more efficient. These findings suggest that MARS is a useful tool to enhance and expedite GLM modelling. Page 33

34 Future directions. For large data sets, our team has found that combining decision trees (CART ) with MARS and GLM proves quite effective as described in Kolyshkina & Brookes, Also as an additional check of fit of a GLM model, a parallel model built in MARS can be used. The model will be built as a part of the stage described previously and will not require additional time. The model equation can be copied and pasted from MARS to SAS or another package directly. Comparison of the predicted values of this model to the GLM model graphically and numerically can suggest some insights and inform the choice of the best of the models. Page 34

35 The results described above as well as a number of projects completed by our team for large insurer clients demonstrate that combining GLM with data mining methodologies such as MARS can be very useful for actuarial analysis.

An Overview MARS is an adaptive procedure for regression, and is well suited for high dimensional problems.

An Overview MARS is an adaptive procedure for regression, and is well suited for high dimensional problems. Data mining and Machine Learning, Mar 24, 2008 MARS: Multivariate Adaptive Regression Splines (Friedman 1991, Friedman and Silverman 1989) An Overview MARS is an adaptive procedure for regression, and

More information

GLM II. Basic Modeling Strategy CAS Ratemaking and Product Management Seminar by Paul Bailey. March 10, 2015

GLM II. Basic Modeling Strategy CAS Ratemaking and Product Management Seminar by Paul Bailey. March 10, 2015 GLM II Basic Modeling Strategy 2015 CAS Ratemaking and Product Management Seminar by Paul Bailey March 10, 2015 Building predictive models is a multi-step process Set project goals and review background

More information

Generalized Additive Model

Generalized Additive Model Generalized Additive Model by Huimin Liu Department of Mathematics and Statistics University of Minnesota Duluth, Duluth, MN 55812 December 2008 Table of Contents Abstract... 2 Chapter 1 Introduction 1.1

More information

Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010

Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010 Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010 Tim Hanson, Ph.D. University of South Carolina T. Hanson (USC) Stat 704: Data Analysis I, Fall 2010 1 / 26 Additive predictors

More information

Regression. Dr. G. Bharadwaja Kumar VIT Chennai

Regression. Dr. G. Bharadwaja Kumar VIT Chennai Regression Dr. G. Bharadwaja Kumar VIT Chennai Introduction Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

An introduction to SPSS

An introduction to SPSS An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible

More information

Generalized Additive Models

Generalized Additive Models Generalized Additive Models Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Additive Models GAMs are one approach to non-parametric regression in the multiple predictor setting.

More information

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered

More information

Nonparametric Approaches to Regression

Nonparametric Approaches to Regression Nonparametric Approaches to Regression In traditional nonparametric regression, we assume very little about the functional form of the mean response function. In particular, we assume the model where m(xi)

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information

Goals of the Lecture. SOC6078 Advanced Statistics: 9. Generalized Additive Models. Limitations of the Multiple Nonparametric Models (2)

Goals of the Lecture. SOC6078 Advanced Statistics: 9. Generalized Additive Models. Limitations of the Multiple Nonparametric Models (2) SOC6078 Advanced Statistics: 9. Generalized Additive Models Robert Andersen Department of Sociology University of Toronto Goals of the Lecture Introduce Additive Models Explain how they extend from simple

More information

CH9.Generalized Additive Model

CH9.Generalized Additive Model CH9.Generalized Additive Model Regression Model For a response variable and predictor variables can be modeled using a mean function as follows: would be a parametric / nonparametric regression or a smoothing

More information

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes. Resources for statistical assistance Quantitative covariates and regression analysis Carolyn Taylor Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC January 24, 2017 Department

More information

Chapter 10: Extensions to the GLM

Chapter 10: Extensions to the GLM Chapter 10: Extensions to the GLM 10.1 Implement a GAM for the Swedish mortality data, for males, using smooth functions for age and year. Age and year are standardized as described in Section 4.11, for

More information

Discussion Notes 3 Stepwise Regression and Model Selection

Discussion Notes 3 Stepwise Regression and Model Selection Discussion Notes 3 Stepwise Regression and Model Selection Stepwise Regression There are many different commands for doing stepwise regression. Here we introduce the command step. There are many arguments

More information

SPM Users Guide. This guide elaborates on powerful ways to combine the TreeNet and GPS engines to achieve model compression and more.

SPM Users Guide. This guide elaborates on powerful ways to combine the TreeNet and GPS engines to achieve model compression and more. SPM Users Guide Model Compression via ISLE and RuleLearner This guide elaborates on powerful ways to combine the TreeNet and GPS engines to achieve model compression and more. Title: Model Compression

More information

Last time... Bias-Variance decomposition. This week

Last time... Bias-Variance decomposition. This week Machine learning, pattern recognition and statistical data modelling Lecture 4. Going nonlinear: basis expansions and splines Last time... Coryn Bailer-Jones linear regression methods for high dimensional

More information

Lecture 16: High-dimensional regression, non-linear regression

Lecture 16: High-dimensional regression, non-linear regression Lecture 16: High-dimensional regression, non-linear regression Reading: Sections 6.4, 7.1 STATS 202: Data mining and analysis November 3, 2017 1 / 17 High-dimensional regression Most of the methods we

More information

STAT 705 Introduction to generalized additive models

STAT 705 Introduction to generalized additive models STAT 705 Introduction to generalized additive models Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 22 Generalized additive models Consider a linear

More information

Data Management - 50%

Data Management - 50% Exam 1: SAS Big Data Preparation, Statistics, and Visual Exploration Data Management - 50% Navigate within the Data Management Studio Interface Register a new QKB Create and connect to a repository Define

More information

3 Ways to Improve Your Regression

3 Ways to Improve Your Regression 3 Ways to Improve Your Regression Introduction This tutorial will take you through the steps demonstrated in the 3 Ways to Improve Your Regression webinar. First, you will be introduced to a dataset about

More information

Statistics & Analysis. Fitting Generalized Additive Models with the GAM Procedure in SAS 9.2

Statistics & Analysis. Fitting Generalized Additive Models with the GAM Procedure in SAS 9.2 Fitting Generalized Additive Models with the GAM Procedure in SAS 9.2 Weijie Cai, SAS Institute Inc., Cary NC July 1, 2008 ABSTRACT Generalized additive models are useful in finding predictor-response

More information

Generalized Additive Models

Generalized Additive Models :p Texts in Statistical Science Generalized Additive Models An Introduction with R Simon N. Wood Contents Preface XV 1 Linear Models 1 1.1 A simple linear model 2 Simple least squares estimation 3 1.1.1

More information

Geology Geomath Estimating the coefficients of various Mathematical relationships in Geology

Geology Geomath Estimating the coefficients of various Mathematical relationships in Geology Geology 351 - Geomath Estimating the coefficients of various Mathematical relationships in Geology Throughout the semester you ve encountered a variety of mathematical relationships between various geologic

More information

STA121: Applied Regression Analysis

STA121: Applied Regression Analysis STA121: Applied Regression Analysis Variable Selection - Chapters 8 in Dielman Artin Department of Statistical Science October 23, 2009 Outline Introduction 1 Introduction 2 3 4 Variable Selection Model

More information

REPLACING MLE WITH BAYESIAN SHRINKAGE CAS ANNUAL MEETING NOVEMBER 2018 GARY G. VENTER

REPLACING MLE WITH BAYESIAN SHRINKAGE CAS ANNUAL MEETING NOVEMBER 2018 GARY G. VENTER REPLACING MLE WITH BAYESIAN SHRINKAGE CAS ANNUAL MEETING NOVEMBER 2018 GARY G. VENTER ESTIMATION Problems with MLE known since Charles Stein 1956 paper He showed that when estimating 3 or more means, shrinking

More information

GLMSELECT for Model Selection

GLMSELECT for Model Selection Winnipeg SAS User Group Meeting May 11, 2012 GLMSELECT for Model Selection Sylvain Tremblay SAS Canada Education Copyright 2010 SAS Institute Inc. All rights reserved. Proc GLM Proc REG Class Statement

More information

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value. Calibration OVERVIEW... 2 INTRODUCTION... 2 CALIBRATION... 3 ANOTHER REASON FOR CALIBRATION... 4 CHECKING THE CALIBRATION OF A REGRESSION... 5 CALIBRATION IN SIMPLE REGRESSION (DISPLAY.JMP)... 5 TESTING

More information

Moving Beyond Linearity

Moving Beyond Linearity Moving Beyond Linearity Basic non-linear models one input feature: polynomial regression step functions splines smoothing splines local regression. more features: generalized additive models. Polynomial

More information

Variable selection is intended to select the best subset of predictors. But why bother?

Variable selection is intended to select the best subset of predictors. But why bother? Chapter 10 Variable Selection Variable selection is intended to select the best subset of predictors. But why bother? 1. We want to explain the data in the simplest way redundant predictors should be removed.

More information

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100.

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100. Linear Model Selection and Regularization especially usefull in high dimensions p>>100. 1 Why Linear Model Regularization? Linear models are simple, BUT consider p>>n, we have more features than data records

More information

Regression Model Building for Large, Complex Data with SAS Viya Procedures

Regression Model Building for Large, Complex Data with SAS Viya Procedures Paper SAS2033-2018 Regression Model Building for Large, Complex Data with SAS Viya Procedures Robert N. Rodriguez and Weijie Cai, SAS Institute Inc. Abstract Analysts who do statistical modeling, data

More information

CREATING THE ANALYSIS

CREATING THE ANALYSIS Chapter 14 Multiple Regression Chapter Table of Contents CREATING THE ANALYSIS...214 ModelInformation...217 SummaryofFit...217 AnalysisofVariance...217 TypeIIITests...218 ParameterEstimates...218 Residuals-by-PredictedPlot...219

More information

Machine Learning Duncan Anderson Managing Director, Willis Towers Watson

Machine Learning Duncan Anderson Managing Director, Willis Towers Watson Machine Learning Duncan Anderson Managing Director, Willis Towers Watson 21 March 2018 GIRO 2016, Dublin - Response to machine learning Don t panic! We re doomed! 2 This is not all new Actuaries adopt

More information

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Abstract In this project, we study 374 public high schools in New York City. The project seeks to use regression techniques

More information

TERTIARY INSTITUTIONS SERVICE CENTRE (Incorporated in Western Australia)

TERTIARY INSTITUTIONS SERVICE CENTRE (Incorporated in Western Australia) TERTIARY INSTITUTIONS SERVICE CENTRE (Incorporated in Western Australia) Royal Street East Perth, Western Australia 6004 Telephone (08) 9318 8000 Facsimile (08) 9225 7050 http://www.tisc.edu.au/ THE AUSTRALIAN

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

Random Forest A. Fornaser

Random Forest A. Fornaser Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University

More information

Quantitative Methods in Management

Quantitative Methods in Management Quantitative Methods in Management MBA Glasgow University March 20-23, 2009 Luiz Moutinho, University of Glasgow Graeme Hutcheson, University of Manchester Exploratory Regression The lecture notes, exercises

More information

Assessing the Quality of the Natural Cubic Spline Approximation

Assessing the Quality of the Natural Cubic Spline Approximation Assessing the Quality of the Natural Cubic Spline Approximation AHMET SEZER ANADOLU UNIVERSITY Department of Statisticss Yunus Emre Kampusu Eskisehir TURKEY ahsst12@yahoo.com Abstract: In large samples,

More information

Monitoring of Manufacturing Process Conditions with Baseline Changes Using Generalized Regression

Monitoring of Manufacturing Process Conditions with Baseline Changes Using Generalized Regression Proceedings of the 10th International Conference on Frontiers of Design and Manufacturing June 10~1, 01 Monitoring of Manufacturing Process Conditions with Baseline Changes Using Generalized Regression

More information

AN INTRODUCTION TO MULTIVARIATE ADAPTIVE REGRESSION SPLINES FOR THE CANE INDUSTRY YL EVERINGHAM, J SEXTON

AN INTRODUCTION TO MULTIVARIATE ADAPTIVE REGRESSION SPLINES FOR THE CANE INDUSTRY YL EVERINGHAM, J SEXTON AN INTRODUCTION TO MULTIVARIATE ADAPTIVE REGRESSION SPLINES FOR THE CANE INDUSTRY By YL EVERINGHAM, J SEXTON School of Engineering and Physical Sciences, James Cook University yvette.everingham@jcu.edu.au

More information

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

CS 521 Data Mining Techniques Instructor: Abdullah Mueen CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

Nonparametric Risk Attribution for Factor Models of Portfolios. October 3, 2017 Kellie Ottoboni

Nonparametric Risk Attribution for Factor Models of Portfolios. October 3, 2017 Kellie Ottoboni Nonparametric Risk Attribution for Factor Models of Portfolios October 3, 2017 Kellie Ottoboni Outline The problem Page 3 Additive model of returns Page 7 Euler s formula for risk decomposition Page 11

More information

Statistical Modelling for Social Scientists. Manchester University. January 20, 21 and 24, Exploratory regression and model selection

Statistical Modelling for Social Scientists. Manchester University. January 20, 21 and 24, Exploratory regression and model selection Statistical Modelling for Social Scientists Manchester University January 20, 21 and 24, 2011 Graeme Hutcheson, University of Manchester Exploratory regression and model selection The lecture notes, exercises

More information

BIOL 458 BIOMETRY Lab 10 - Multiple Regression

BIOL 458 BIOMETRY Lab 10 - Multiple Regression BIOL 458 BIOMETRY Lab 0 - Multiple Regression Many problems in biology science involve the analysis of multivariate data sets. For data sets in which there is a single continuous dependent variable, but

More information

Stat 8053, Fall 2013: Additive Models

Stat 8053, Fall 2013: Additive Models Stat 853, Fall 213: Additive Models We will only use the package mgcv for fitting additive and later generalized additive models. The best reference is S. N. Wood (26), Generalized Additive Models, An

More information

CSE 158 Lecture 2. Web Mining and Recommender Systems. Supervised learning Regression

CSE 158 Lecture 2. Web Mining and Recommender Systems. Supervised learning Regression CSE 158 Lecture 2 Web Mining and Recommender Systems Supervised learning Regression Supervised versus unsupervised learning Learning approaches attempt to model data in order to solve a problem Unsupervised

More information

Evolution of Regression II: From OLS to GPS to MARS Hands-on with SPM

Evolution of Regression II: From OLS to GPS to MARS Hands-on with SPM Evolution of Regression II: From OLS to GPS to MARS Hands-on with SPM March 2013 Dan Steinberg Mikhail Golovnya Salford Systems Salford Systems 2013 1 Course Outline Today s Webinar: Hands-on companion

More information

REGULARIZED REGRESSION FOR RESERVING AND MORTALITY MODELS GARY G. VENTER

REGULARIZED REGRESSION FOR RESERVING AND MORTALITY MODELS GARY G. VENTER REGULARIZED REGRESSION FOR RESERVING AND MORTALITY MODELS GARY G. VENTER TODAY Advances in model estimation methodology Application to data that comes in rectangles Examples ESTIMATION Problems with MLE

More information

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY

CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series

More information

Splines and penalized regression

Splines and penalized regression Splines and penalized regression November 23 Introduction We are discussing ways to estimate the regression function f, where E(y x) = f(x) One approach is of course to assume that f has a certain shape,

More information

IBM SPSS Categories 23

IBM SPSS Categories 23 IBM SPSS Categories 23 Note Before using this information and the product it supports, read the information in Notices on page 55. Product Information This edition applies to version 23, release 0, modification

More information

Lab 07: Multiple Linear Regression: Variable Selection

Lab 07: Multiple Linear Regression: Variable Selection Lab 07: Multiple Linear Regression: Variable Selection OBJECTIVES 1.Use PROC REG to fit multiple regression models. 2.Learn how to find the best reduced model. 3.Variable diagnostics and influential statistics

More information

Lecture 17: Smoothing splines, Local Regression, and GAMs

Lecture 17: Smoothing splines, Local Regression, and GAMs Lecture 17: Smoothing splines, Local Regression, and GAMs Reading: Sections 7.5-7 STATS 202: Data mining and analysis November 6, 2017 1 / 24 Cubic splines Define a set of knots ξ 1 < ξ 2 < < ξ K. We want

More information

Introduction to Statistical Analyses in SAS

Introduction to Statistical Analyses in SAS Introduction to Statistical Analyses in SAS Programming Workshop Presented by the Applied Statistics Lab Sarah Janse April 5, 2017 1 Introduction Today we will go over some basic statistical analyses in

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Basis Functions Tom Kelsey School of Computer Science University of St Andrews http://www.cs.st-andrews.ac.uk/~tom/ tom@cs.st-andrews.ac.uk Tom Kelsey ID5059-02-BF 2015-02-04

More information

Using R for Analyzing Delay Discounting Choice Data. analysis of discounting choice data requires the use of tools that allow for repeated measures

Using R for Analyzing Delay Discounting Choice Data. analysis of discounting choice data requires the use of tools that allow for repeated measures Using R for Analyzing Delay Discounting Choice Data Logistic regression is available in a wide range of statistical software packages, but the analysis of discounting choice data requires the use of tools

More information

One Factor Experiments

One Factor Experiments One Factor Experiments 20-1 Overview Computation of Effects Estimating Experimental Errors Allocation of Variation ANOVA Table and F-Test Visual Diagnostic Tests Confidence Intervals For Effects Unequal

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

Paper SDA-11. Logistic regression will be used for estimation of net error for the 2010 Census as outlined in Griffin (2005).

Paper SDA-11. Logistic regression will be used for estimation of net error for the 2010 Census as outlined in Griffin (2005). Paper SDA-11 Developing a Model for Person Estimation in Puerto Rico for the 2010 Census Coverage Measurement Program Colt S. Viehdorfer, U.S. Census Bureau, Washington, DC This report is released to inform

More information

Generalized least squares (GLS) estimates of the level-2 coefficients,

Generalized least squares (GLS) estimates of the level-2 coefficients, Contents 1 Conceptual and Statistical Background for Two-Level Models...7 1.1 The general two-level model... 7 1.1.1 Level-1 model... 8 1.1.2 Level-2 model... 8 1.2 Parameter estimation... 9 1.3 Empirical

More information

Statistical Models for Management. Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE) Lisbon. February 24 26, 2010

Statistical Models for Management. Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE) Lisbon. February 24 26, 2010 Statistical Models for Management Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE) Lisbon February 24 26, 2010 Graeme Hutcheson, University of Manchester Exploratory regression and model

More information

UNIT 1: NUMBER LINES, INTERVALS, AND SETS

UNIT 1: NUMBER LINES, INTERVALS, AND SETS ALGEBRA II CURRICULUM OUTLINE 2011-2012 OVERVIEW: 1. Numbers, Lines, Intervals and Sets 2. Algebraic Manipulation: Rational Expressions and Exponents 3. Radicals and Radical Equations 4. Function Basics

More information

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

More information

An Optimal Search Process of Eigen Knots. for Spline Logistic Regression. Research Department, Point Right

An Optimal Search Process of Eigen Knots. for Spline Logistic Regression. Research Department, Point Right An Optimal Search Process of Eigen Knots for Spline Logistic Regression John Gao and Cheryl Caswell Research Department, Point Right Abstract The spline regression method usually defines a series piecewise

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Cognalysis TM Reserving System User Manual

Cognalysis TM Reserving System User Manual Cognalysis TM Reserving System User Manual Return to Table of Contents 1 Table of Contents 1.0 Starting an Analysis 3 1.1 Opening a Data File....3 1.2 Open an Analysis File.9 1.3 Create Triangles.10 2.0

More information

8.11 Multivariate regression trees (MRT)

8.11 Multivariate regression trees (MRT) Multivariate regression trees (MRT) 375 8.11 Multivariate regression trees (MRT) Univariate classification tree analysis (CT) refers to problems where a qualitative response variable is to be predicted

More information

Two-Stage Least Squares

Two-Stage Least Squares Chapter 316 Two-Stage Least Squares Introduction This procedure calculates the two-stage least squares (2SLS) estimate. This method is used fit models that include instrumental variables. 2SLS includes

More information

Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002

Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Recitation Supplement: Creating a Neural Network for Classification SAS EM December 2, 2002 Introduction Neural networks are flexible nonlinear models that can be used for regression and classification

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors

More information

High dimensional data analysis

High dimensional data analysis High dimensional data analysis Cavan Reilly October 24, 2018 Table of contents Data mining Random forests Missing data Logic regression Multivariate adaptive regression splines Data mining Data mining

More information

GENREG DID THAT? Clay Barker Research Statistician Developer JMP Division, SAS Institute

GENREG DID THAT? Clay Barker Research Statistician Developer JMP Division, SAS Institute GENREG DID THAT? Clay Barker Research Statistician Developer JMP Division, SAS Institute GENREG WHAT IS IT? The Generalized Regression platform was introduced in JMP Pro 11 and got much better in version

More information

THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)

THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions) THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533 MIDTERM EXAMINATION: October 14, 2005 Instructor: Val LeMay Time: 50 minutes 40 Marks FRST 430 50 Marks FRST 533 (extra questions) This examination

More information

Data mining techniques for actuaries: an overview

Data mining techniques for actuaries: an overview Data mining techniques for actuaries: an overview Emiliano A. Valdez joint work with Banghee So and Guojun Gan University of Connecticut Advances in Predictive Analytics (APA) Conference University of

More information

INTRO TO RANDOM FOREST BY ANTHONY ANH QUOC DOAN

INTRO TO RANDOM FOREST BY ANTHONY ANH QUOC DOAN INTRO TO RANDOM FOREST BY ANTHONY ANH QUOC DOAN MOTIVATION FOR RANDOM FOREST Random forest is a great statistical learning model. It works well with small to medium data. Unlike Neural Network which requires

More information

A popular method for moving beyond linearity. 2. Basis expansion and regularization 1. Examples of transformations. Piecewise-polynomials and splines

A popular method for moving beyond linearity. 2. Basis expansion and regularization 1. Examples of transformations. Piecewise-polynomials and splines A popular method for moving beyond linearity 2. Basis expansion and regularization 1 Idea: Augment the vector inputs x with additional variables which are transformation of x use linear models in this

More information

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem. STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,

More information

Model Inference and Averaging. Baging, Stacking, Random Forest, Boosting

Model Inference and Averaging. Baging, Stacking, Random Forest, Boosting Model Inference and Averaging Baging, Stacking, Random Forest, Boosting Bagging Bootstrap Aggregating Bootstrap Repeatedly select n data samples with replacement Each dataset b=1:b is slightly different

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

ALGEBRA II A CURRICULUM OUTLINE

ALGEBRA II A CURRICULUM OUTLINE ALGEBRA II A CURRICULUM OUTLINE 2013-2014 OVERVIEW: 1. Linear Equations and Inequalities 2. Polynomial Expressions and Equations 3. Rational Expressions and Equations 4. Radical Expressions and Equations

More information

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS ABSTRACT Paper 1938-2018 Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS Robert M. Lucas, Robert M. Lucas Consulting, Fort Collins, CO, USA There is confusion

More information

5.5 Regression Estimation

5.5 Regression Estimation 5.5 Regression Estimation Assume a SRS of n pairs (x, y ),..., (x n, y n ) is selected from a population of N pairs of (x, y) data. The goal of regression estimation is to take advantage of a linear relationship

More information

Stat 5100 Handout #14.a SAS: Logistic Regression

Stat 5100 Handout #14.a SAS: Logistic Regression Stat 5100 Handout #14.a SAS: Logistic Regression Example: (Text Table 14.3) Individuals were randomly sampled within two sectors of a city, and checked for presence of disease (here, spread by mosquitoes).

More information

Nonparametric Regression and Generalized Additive Models Part I

Nonparametric Regression and Generalized Additive Models Part I SPIDA, June 2004 Nonparametric Regression and Generalized Additive Models Part I Robert Andersen McMaster University Plan of the Lecture 1. Detecting nonlinearity Fitting a linear model to a nonlinear

More information

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects Patralekha Bhattacharya Thinkalytics The PDLREG procedure in SAS is used to fit a finite distributed lagged model to time series data

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Stat 500 lab notes c Philip M. Dixon, Week 10: Autocorrelated errors

Stat 500 lab notes c Philip M. Dixon, Week 10: Autocorrelated errors Week 10: Autocorrelated errors This week, I have done one possible analysis and provided lots of output for you to consider. Case study: predicting body fat Body fat is an important health measure, but

More information

Nonlinearity and Generalized Additive Models Lecture 2

Nonlinearity and Generalized Additive Models Lecture 2 University of Texas at Dallas, March 2007 Nonlinearity and Generalized Additive Models Lecture 2 Robert Andersen McMaster University http://socserv.mcmaster.ca/andersen Definition of a Smoother A smoother

More information

22s:152 Applied Linear Regression

22s:152 Applied Linear Regression 22s:152 Applied Linear Regression Chapter 22: Model Selection In model selection, the idea is to find the smallest set of variables which provides an adequate description of the data. We will consider

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

Additive hedonic regression models for the Austrian housing market ERES Conference, Edinburgh, June

Additive hedonic regression models for the Austrian housing market ERES Conference, Edinburgh, June for the Austrian housing market, June 14 2012 Ao. Univ. Prof. Dr. Fachbereich Stadt- und Regionalforschung Technische Universität Wien Dr. Strategic Risk Management Bank Austria UniCredit, Wien Inhalt

More information

A toolbox of smooths. Simon Wood Mathematical Sciences, University of Bath, U.K.

A toolbox of smooths. Simon Wood Mathematical Sciences, University of Bath, U.K. A toolbo of smooths Simon Wood Mathematical Sciences, University of Bath, U.K. Smooths for semi-parametric GLMs To build adequate semi-parametric GLMs requires that we use functions with appropriate properties.

More information