Using Multivariate Adaptive Regression Splines (MARS) to enhance Generalised Linear Models. Inna Kolyshkina, PricewaterhouseCoopers


Why enhance GLM? Shortcomings of the linear modelling approach. GLM, being a linear technique, shares the usual shortcomings of the linear modelling approach:
- it operates under the assumption that the data follow a distribution in the exponential family;
- it is affected by multi-collinearity, outliers and missing values in the data;
- it is often troublesome to use for selecting important predictors and their interactions;
- categorical predictors with large numbers of categories (for example, postcode or occupation code) can lead to unreliable results due to sparsity-related issues.
Linear models also take longer to build, because addressing these issues means transforming both numeric and categorical predictors and choosing predictors and their interactions by hand, which can prove to be a lengthy task.

Data mining techniques, in contrast, are typically fast, easily select predictors and their interactions, are minimally affected by missing values, outliers or collinearity, and effectively process high-level categorical predictors. This suggests that combining a linear approach with data mining tools can expedite the modelling process, allowing the modeller to attain equal or better model accuracy in less time with the same level of interpretability. Such models, usually combining decision trees, multivariate adaptive regression splines and GLM, have been used by our team in a number of projects (see Kolyshkina and Brookes, 2002).

Strengths of GLM can be effectively combined with the computational power of data mining methods. The advance of the new methodologies does not mean that proven, effective techniques such as GLM should be wholly replaced by them. This talk:
1. Discusses how the advantages and strengths of GLM can be combined with the computational power of data mining methods, presenting an example of combining the MARS and GLM approaches.
2. Compares the results of this combined model to those achievable by a hand-fitted GLM in terms of: time taken, predictive power, selection of predictors and their interactions, interpretability of the model, and precision and model fit.

MARS methodology overview.

MARS can be described as a generalisation of stepwise linear regression, or as a modification of the CART method for the regression setting (i.e. for a continuous dependent variable).

MARS model - knots for continuous predictors. For each continuous predictor X, MARS forms pairs of piecewise linear functions with a knot at each observed value t of the predictor:
X1 = (X - t) if X > t, and 0 otherwise
X2 = (t - X) if X <= t, and 0 otherwise
It then builds a model using only these functions and their products. The value t is called a knot.

MARS model - knots. Example. For the variable AGE, such a pair of piecewise linear functions may be:
AGE1 = (AGE - 20) if AGE > 20, and 0 otherwise
AGE2 = (20 - AGE) if AGE <= 20, and 0 otherwise
The value 20 is then called a knot.
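In SAS terms, this pair is simply the two "hinge" transforms max(0, AGE - 20) and max(0, 20 - AGE). A minimal data-step sketch follows; the data set and source variable names are hypothetical:

data hinge_demo;
   set claims_data;               /* hypothetical input data set containing AGE */
   AGE1 = max(0, AGE - 20);       /* non-zero only when AGE > 20 */
   AGE2 = max(0, 20 - AGE);       /* non-zero only when AGE < 20 */
run;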

MARS model - piecewise constant functions for categorical predictors For categorical predictors MARS considers all possible binary partitions of the categories. Each such partition generates a pair of piecewise constant basis functions that are indicator functions for the two sets of categories. Page 9
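As a hypothetical sketch, for a categorical predictor REGION with categories A, B and C, the partition {A} versus {B, C} would generate the following indicator pair (a SAS logical expression evaluates to 1 or 0):

data cat_demo;
   set claims_data;                  /* hypothetical input data set containing REGION */
   BF_A  = (REGION in ('A'));        /* 1 if REGION is A, else 0      */
   BF_BC = (REGION in ('B', 'C'));   /* 1 if REGION is B or C, else 0 */
run;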

MARS model - basis functions. We will call these functions and their products basis functions. The MARS model is a linear combination of basis functions:
y = a0 + a1*BF1 + a2*BF2 + a3*BF3 + ...
The coefficients a0, a1, a2, ... are then estimated by minimising the residual sum of squares (as in linear regression).
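Restating the line above in standard notation, for a model with M basis functions fitted to N observations:

\hat{y}(x) = a_0 + \sum_{m=1}^{M} a_m \, BF_m(x), \qquad
(\hat{a}_0, \dots, \hat{a}_M) = \arg\min_{a_0, \dots, a_M} \sum_{i=1}^{N} \Bigl( y_i - a_0 - \sum_{m=1}^{M} a_m \, BF_m(x_i) \Bigr)^2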

MARS model building. The MARS model is built in a two-phase process. In the first phase, the model is grown by adding basis functions, in a manner similar to the forward selection process in linear regression, until an overly large model is found. In the second phase, basis functions are deleted in order of least contribution to the model, in a manner similar to the backward deletion process in linear regression.

Optimal MARS model selection. Degrees of freedom penalty. The optimal MARS model is the one with the lowest GCV (generalised cross-validation) measure. Despite its name, the GCV criterion does not actually involve cross-validation. C(M) is the cost-complexity measure of a model containing M basis functions; note that C(M) = M is the usual measure used in linear regression. The MSE is calculated by dividing the sum of squared errors by N - M instead of by N. The GCV formula allows C(M) > M; in other words, it allows charging each basis function with more than one degree of freedom.
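In this notation, the GCV criterion of Friedman (1991) takes the form below, where N is the number of observations and \hat{f}_M is the fitted model with M basis functions:

GCV(M) = \frac{\frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - \hat{f}_M(x_i) \bigr)^2}{\bigl( 1 - C(M)/N \bigr)^2}

Minimising GCV(M) over the candidate models from the backward-deletion phase selects the optimal model; a larger C(M) charges each basis function more degrees of freedom and so favours smaller models.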

MARS model building. Why this model strategy? The basis functions and their products are non-zero over only part of their range; as a result, the regression surface is built up parsimoniously, with non-zero coefficients used locally, only where they are needed (unlike polynomials, whose products are non-zero everywhere). By allowing an arbitrary shape for the response function as well as interactions, and by using the two-phase model selection method, MARS is capable of reliably tracking the very complex data structures that often hide in high-dimensional data.

How the Use of MARS Can Expedite GLM Building. Most of the shortcomings of linear models outlined above can be overcome by using the MARS output basis functions as the input for GLM. This makes sense because MARS:
- quickly selects important predictors and their interactions, and transforms numeric and categorical predictors in such a way that the resulting variables are linearly independent and easy to interpret;
- is minimally affected by multi-collinearity, outliers and missing values in the data;
- easily handles categorical predictors with large numbers of categories, and therefore requires less data preparation than linear methods.
The modeller, though, needs to make sure that the transformed predictors make business sense and that the MARS model is stable.

Are MARS output functions OK to use as GLM inputs? We have seen that, although MARS output functions are not created specifically to be used as the input for a linear model, in practice about 90% of them turn out to be significant predictors in a GLM. Another feature of the MARS output functions that makes them useful is that they are linearly independent, as stated in Friedman (1991), which means that multi-collinearity issues do not arise in a GLM that uses them as explanatory variables.

Combined models are easily implemented in SAS. The MARS package is readily available, inexpensive, and can work with data in most formats (SAS, SPSS, dbf etc.). The output MARS produces can be combined with any GLM software with minimal effort, as it is easy to code in any programming language such as SAS, the main data analysis software package used by many actuaries.

Case Study. Application of combined models to Queensland industry CTP data provided by the Motor Accident Insurance Commission (MAIC).

Background. The methodology described above was applied to model the ultimate incurred number of claims based on reported claim data. The data used was industry-wide auto liability data from Queensland (commonly called Compulsory Third Party, or CTP, in Australia).

Data Description. Individual claim data was aggregated into the number of claims reported for each accident month and development month for input to the GLM. The variables used for the analysis were:
- accident month
- accident quarter
- number of casualties
- development month
- development quarter
- number of vehicles in the calendar year
- number of vehicles exposed in the month

Hand-fitted GLM. An initial GLM was created without using MARS. This was a Poisson model with the log link, using the number of vehicles exposed in the month as the offset. The transformations and interactions of the input variables were created manually, for the purposes of both best model fit and interpretability. The model fit was assessed by the usual methods, such as the ratio of deviance to degrees of freedom, predictor significance, the link test and residual analysis. All the assessments showed adequacy of the model fit.
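A minimal PROC GENMOD sketch of such a hand-fitted model is given below; the predictor list and variable names (ACCQTR, DEVMTH, CASUALTIES, and LGVH for the log of vehicles exposed in the month) are illustrative assumptions, not the exact specification used:

proc genmod data=ctp_data;              /* hypothetical aggregated claims data set */
   class ACCQTR;                        /* accident quarter as a factor */
   model CLAIMS = DEVMTH ACCQTR CASUALTIES   /* hand-chosen transforms and interactions go here */
         / dist=poisson link=log offset=LGVH;  /* LGVH = log(vehicles exposed in the month) */
run;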

MARS-enhanced GLM. A second model was created by pre-processing the variables in MARS and then including them in a GLM as the inputs:
- we built a MARS model with the ratio of the incurred number of claims to the number of vehicles exposed in the month as the dependent variable;
- we then used the basis functions produced by the MARS model as the inputs to a Poisson model with log link and the number of vehicles exposed in the month as the offset;
- the GLM output showed that most of these variables were significant; we used backward elimination to refine the model by excluding the inputs that were not significant;
- the resulting model fit was assessed in the same way as for the hand-fitted model.

The model output included explanatory variables pre-processed by MARS in the form of basis functions:
BF1 = max(0, LGDV - 0.405);
BF2 = max(0, 0.405 - LGDV);
BF4 = max(0, 13.000 - DEVMTHB);
BF5 = max(0, EXPMONTH - 51.000) * BF4;
BF6 = max(0, 51.000 - EXPMONTH) * BF4;
BF59 = max(0, 16.000 - DEVMTHB) * BF17;

As you can see, predictors transformed by MARS can be copied and pasted from MARS into SAS directly, and they are easy to interpret. We copy and paste them into SAS and run the following SAS code to add them to the data:
data data1;
   set glmcours.data;                        /* source data set */
   BF1 = max(0, LGDV - 0.405);               /* pasted MARS basis functions */
   BF2 = max(0, 0.405 - LGDV);
   BF59 = max(0, 16.000 - DEVMTHB) * BF17;
run;

SAS code for the MARS-enhanced GLM:
PROC GENMOD DATA=DATA1;
   MODEL Claims = BF1 BF2 BF4 BF5 BF6 BF7 BF8 BF9 BF11 BF12 BF13 BF14
                  BF17 BF19 BF20 BF21 BF23 BF24 BF26 BF27 BF30 BF31 BF32
                  BF34 BF35 BF38 BF40 BF42 BF46 BF49 BF50 BF53 BF55 BF59
         / DIST=POISSON OFFSET=LGVH LINK=LOG TYPE3;
   SCWGT WEIGHT;
   OUTPUT OUT=residuals STDRESDEV=strdev_CLAIMS XBETA=xbeta_CLAIMS
          PREDICTED=pred_CLAIMS;
RUN;

Comparison of Models

Goodness of Fit Comparison. The fit of both GLMs was assessed by the usual methods, such as the ratio of deviance to degrees of freedom, predictor significance, the link test and residual analysis. Additionally, we assessed gains charts for both models. The MARS-enhanced GLM showed a similar, if not slightly better, fit than the hand-fitted GLM.

MARS-enhanced GLM. Significance of MARS basis functions. The GLM output showed that most of these variables were significant.

LR Statistics For Type 3 Analysis
Source   DF   Chi-Square   Pr > ChiSq
BF1       1       166.43       <.0001
BF2       1       167.61       <.0001
BF59      1        43.67       <.0001

We then used backward elimination to refine the model by excluding the inputs that were not significant.

Analysis of actual versus expected values of the number of claims. The records were ranked from highest to lowest in terms of the predicted number of claims for each model, and then segmented into 20 equally sized groups. The average predicted and actual values of the number of claims for each group were then calculated and graphed.
[Charts: claim frequency for the hand-fitted GLM and for the MARS-enhanced GLM; average number of claims (y-axis) against % of data (x-axis) for the 20 groups.]
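This ranking-and-grouping step can be reproduced with PROC RANK and PROC MEANS. The sketch below uses the residuals data set and the pred_CLAIMS variable created by the OUTPUT statement earlier; the group-mean step is an assumption about how the graphed values were computed:

proc rank data=residuals out=ranked groups=20 descending;
   var pred_CLAIMS;           /* rank records by predicted number of claims */
   ranks grp;                 /* grp = 0..19, highest predictions in group 0 */
run;

proc means data=ranked mean;
   class grp;
   var pred_CLAIMS Claims;    /* average predicted and actual claims per group */
run;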

Comparison of the Fit. The models fit equally well, with the MARS-enhanced GLM having a marginally better fit. The hand-fitted GLM slightly over-predicts for the fifth group and under-predicts for the third group; the MARS-enhanced GLM predicts well for the higher expected numbers of claims but slightly over-predicts for the groups with lower numbers of claims. The scale of these errors is of little business importance.

Gains chart for the number of incurred claims for both models. The gains chart shows that the models perform equally well. Detailed analysis of the tables that the gains charts are based on suggests that the MARS-enhanced GLM performs marginally better than the hand-fitted model.
[Chart: gains curves for the hand-fitted GLM, the MARS GLM and the base line.]

MARS-created vs hand-transformed variables and predictor interactions: similarities and differences. The MARS-created predictor variables and those which were manually created showed a great degree of similarity. For example, comparing the predictors based on the variable development month, MARS placed knots mostly at the same points that were found important by the hand-fitted model. The differences included the fact that the hand-fitted model included the variable minimum(development month, 10), which is equal to 10 if development month is greater than 10 and equal to development month otherwise, while MARS selected 9 rather than 10 as the knot point. MARS also selected interactions of predictors that were not picked up by the hand-fitted model, such as the interaction of development month and experience month.

Interpretability of the models. Interpretability of the models was similar. The hand-fitted model was easier to interpret because it included fewer predictors and fewer predictor interactions than the MARS-enhanced model.

Findings and results. The fit and precision of the models were similar, with the MARS-enhanced GLM showing a slightly better fit than the hand-fitted GLM. The MARS-enhanced GLM included predictor interactions not picked up by the hand-fitted model. This effect would be more pronounced when modelling raw data than when modelling summarised data, where the trends observed are likely to be smoother and easier to identify because random variation cancels out. The hand-fitted model was easier to interpret because it included fewer predictors and fewer predictor interactions than the other model. Building the MARS-enhanced GLM was considerably faster and more efficient. These findings suggest that MARS is a useful tool to enhance and expedite GLM modelling.

Future directions. For large data sets, our team has found that combining decision trees (CART) with MARS and GLM proves quite effective, as described in Kolyshkina & Brookes, 2002. Also, a parallel model built in MARS can be used as an additional check of the fit of a GLM model. This model is built as part of the stage described previously and does not require additional time; its equation can be copied and pasted from MARS into SAS or another package directly. Comparing the predicted values of this model to those of the GLM model, graphically and numerically, can suggest insights and inform the choice of the best model.

The results described above, as well as a number of projects completed by our team for large insurer clients, demonstrate that combining GLM with data mining methodologies such as MARS can be very useful for actuarial analysis.