Generalized Additive Model and Applications in Direct Marketing

Sandeep Kharidhi and WenSui Liu, ChoicePoint Precision Marketing

Abstract

Logistic regression [1] has been widely used in direct marketing applications to develop response and conversion models. Although attractively simple, logistic regression has been criticized for failing to capture nonlinearity, and therefore possibly not leading to satisfactory results. Introduced by Hastie and Tibshirani, the Generalized Additive Model (GAM) [2] provides the ability to detect nonlinear relationships between the response behavior and the predictors. In this paper, we present an application of GAM in the direct marketing environment, and also demonstrate how to improve the interpretability and scoring scheme of GAM using other statistical techniques. Model performance is evaluated and compared among the discussed models using the area under the Receiver Operating Characteristic (ROC) curve. We find that our proposed method yields superior results in practice.

Introduction

Assessing a prospect's likelihood of responding to a marketing campaign has been a key interest in direct marketing. In most marketing companies, modelers use logistic regression to develop response and conversion models that boost campaign response rates and reduce expenses. While logistic regression is easy to understand and implement, it assumes a linear relationship between the response and the predictors, and may fail to capture the more complex nonlinear relationships that often exist in real-life data. While neural networks have reportedly been successful in nonlinear modeling [3], this success comes at the price of interpretability. As another alternative, GAM has recently shown promising success in modeling default risk [4] by combining the strength of flexible modeling with ease of interpretation. In this paper, we demonstrate the superiority of GAM over logistic regression through the development of a conversion model. Given the nonparametric nature of GAM, we also discuss how to use other techniques, such as the Classification and Regression Tree (CART) [5] and Multivariate Adaptive Regression Splines (MARS) [6], as further improvements in the model development.

Modeling Methodology

In direct marketing, statistical models are often used to evaluate the probability that an individual will respond to a marketing campaign, with the aim of improving campaign efficiency and reducing marketing costs. Traditionally, discriminant analysis and logistic regression have been commonly used to model binary outcomes such as response, under the assumption that the predictors are linearly related to the outcome. A potential risk of this assumption is model misspecification: because the effects of predictors are often neither linear nor monotonic in the real world, it is always challenging to find an appropriate functional form between the response and a predictor. Consequently, logistic regression may not always be able to provide an adequate fit to a complex data structure.
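To make the contrast concrete, the two model forms can be written side by side. These are the standard textbook definitions, not formulas reproduced from this paper:

```latex
% Logistic regression: one fixed coefficient per predictor
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p

% GAM: one unspecified smooth function per predictor, estimated from the data
\log\frac{p}{1-p} = \beta_0 + f_1(x_1) + \cdots + f_p(x_p)
```

The additive structure is what preserves interpretability: each f_j can still be plotted and read one predictor at a time.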

As an alternative to logistic regression, GAM relaxes the linearity assumption and allows the response to depend on the predictors in a flexible manner, which can be either linear or nonlinear. The nonlinear relationship is largely driven by the data, and is estimated nonparametrically with a univariate B-spline or local regression smoother. In other words, instead of having a single coefficient for each predictor, GAM uses an unspecified nonparametric function to describe the relationship between each predictor and the response, for the purpose of maximizing predictive performance. Such a nonparametric function is analogous to a coefficient in logistic regression, and can be used to visualize the relationship between the response and the predictor. This ability to visualize the nonparametric function for each predictor is an important feature of GAM, and gives modelers an intuitive way to explore a complex data structure and interpret the model's results.

To illustrate the application of GAM in direct marketing, we apply it to real-world data and compare the results with a logistic regression baseline model. The data analyzed in this paper were used to develop a response-to-conversion model. The dataset consists of 6,180 responders with a 6-percent conversion rate and 24 variables. The response variable Y reflects the status of the prospect's conversion and is therefore binary. The 23 predictors include eight numeric variables and 15 categorical variables with levels ranging from 2 to 11. Before model development, we randomly divided the entire dataset into two parts: one for model development and the other for model validation, as shown in Table 1.

Table 1: Conversion Summary

A logistic regression is estimated for the conversion model on the development data, with all predictors included and a focus on the seven numeric variables. Partial output of the model is shown in Table 2, in which predictors significant at the 10% level are flagged. Of the seven numeric variables, only three are shown to be statistically significant under the linear assumption. Beyond the illustrated output, all statistics indicate that the logistic regression achieves an adequate fit to the development data.

Table 2: Output of Logistic Regression

A Receiver Operating Characteristic (ROC) curve is a graphical representation of the tradeoff between sensitivity (the true positive rate) and specificity (the true negative rate) across all possible cutoffs, and is often used to compare the predictive performance of different classification models. In a ROC curve, sensitivity is placed on the Y-axis and specificity on the X-axis (expressed as 1 - specificity). The area under the ROC curve, abbreviated AUC, is a statistical measure often used to summarize the information in the ROC curve derived from a predictive model. In brief, AUC can be interpreted as the probability that the model scores a randomly selected positive response higher than a randomly selected negative response. A model with perfect performance has an AUC of 1, whereas an AUC of 0.5 corresponds to a model no better than random guessing; in practice, the AUC falls between 0.5 and 1. In Figure 1, the ROC curves of the logistic regression for both the development and validation data are plotted.

Figure 1: ROC Curves for Development and Validation Data
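As an illustration of this baseline step, a minimal sketch in Python with scikit-learn is shown below. This is not the authors' code: the file name, the response column Y, and the split proportions are hypothetical stand-ins.

```python
# Baseline logistic regression with ROC/AUC evaluation (illustrative sketch).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("conversion.csv")           # hypothetical input file
X = pd.get_dummies(df.drop(columns="Y"))     # one-hot encode the categoricals
y = df["Y"]                                  # binary conversion indicator

# Random split into development and validation samples
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

# AUC on both samples; the paper reports roughly 0.78 (development)
# and 0.76 (validation) on its own data
for name, Xs, ys in [("development", X_dev, y_dev), ("validation", X_val, y_val)]:
    p = model.predict_proba(Xs)[:, 1]
    print(name, "AUC =", round(roc_auc_score(ys, p), 2))
```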
The areas under the ROC curves are 0.78 and 0.76 for the development and validation data, respectively, suggesting reasonable predictiveness for the logistic regression. All statistical evidence thus far indicates a good fit for the conversion model. However, whether the assumed linearity is the correct functional form for the analyzed data remains in question and still needs to be addressed. Since the logistic regression indicates that X4, X5, X6, and X7 are statistically insignificant, we will pay extra attention to these four variables.

After establishing the logistic regression model as the benchmark, we fit a GAM on the same development data, applying flexible nonparametric estimation to the four predictors that were insignificant in the logistic regression. With the flexibility provided by the GAM, there is a strong temptation to over-fit the model to the development data by spending excess degrees of freedom. In our experience, the benefit of using conservative degrees of freedom in a GAM is twofold: first, low degrees of freedom help prevent over-fitting; second, computation time can be reduced dramatically, which matters given the large volume of data commonly modeled in direct marketing. Table 4 shows the partial output of the best GAM developed, after trial and error, with all predictors included. Note that X5 and X7 become significant once they are estimated nonparametrically under the nonlinear assumption.

Table 4: Partial Output of Generalized Additive Model

The nonlinear effect of each predictor is shown in Figure 2. For instance, the relationship between X5 and Y is clearly neither linear nor monotonic, in violation of the linearity assumption of logistic regression. Instead, the conversion rate rises as X5 increases, starts decreasing once X5 reaches 0, and picks up again after X5 exceeds 1.5.

Figure 2: Partial Prediction Plots of X5 and X7

For comparison purposes, the ROC curves of the GAM for both development and validation data are plotted in Figure 3 to evaluate the predictive performance in terms of AUC. While the AUC for the development data is marginally higher than that of the logistic regression, the two models perform comparably on the validation data.

Figure 3: ROC Curves for Development and Validation Data
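A minimal sketch of this step in Python follows, using the third-party pygam package rather than the authors' original tooling. The column layout is hypothetical: columns 3 through 6 are assumed to hold X4 through X7, with the remaining numeric predictors in columns 0 through 2 and categorical handling omitted for brevity.

```python
# GAM with conservative degrees of freedom (illustrative sketch, pygam).
import matplotlib.pyplot as plt
from pygam import LogisticGAM, l, s

# Spline terms for the four predictors that were insignificant in the
# linear model; plain linear terms for the others. Few splines per term
# keeps the effective degrees of freedom low, as recommended above.
terms = (l(0) + l(1) + l(2)
         + s(3, n_splines=6) + s(4, n_splines=6)
         + s(5, n_splines=6) + s(6, n_splines=6))
gam = LogisticGAM(terms).fit(X_dev.values, y_dev.values)

# Partial prediction plots, analogous to Figure 2: one estimated
# nonparametric function per smoothed predictor
for i in [3, 4, 5, 6]:
    XX = gam.generate_X_grid(term=i)
    plt.plot(XX[:, i], gam.partial_dependence(term=i, X=XX), label=f"term {i}")
plt.legend()
plt.show()
```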

However, it is important to note that the GAM provides better insight into the relationship between the response and the predictors without imposing a strict linearity assumption or a pre-specified functional form on the model. This feature is particularly helpful in database marketing, where the data structure is always complex and little domain knowledge about thousands of variables is available before model development.

Although conceptually attractive, GAM is not beyond criticism. The lack of a parametric functional form makes it difficult to score new data directly from the database in a direct marketing production environment. In addition, a nonlinear effect without an estimated parameter may not be easily adopted by non-technical audiences such as business directors and campaign managers. A possible workaround is to estimate a parametric approximation to each nonlinear term derived from the GAM, such as a piecewise constant approximation. While there are multiple ways to construct such an approximation, based upon either percentiles or experience, we propose a model-based method using a classification and regression tree. In this case, the tree is grown with a single independent variable and a single dependent variable: the predictor and its corresponding nonlinear term, respectively. Such a tree-based model is constructed through recursive binary partitioning, where each partition generates an if-then splitting rule that divides the nonlinear term into several homogeneous groups based upon the value of the predictor. Figure 4 shows the diagrams of the classification and regression trees for X5 and X7.

Figure 4: Classification and Regression Trees of X5 and X7

After developing the classification and regression tree, we use the resulting piecewise constant approximation as a categorical variable to replace the nonlinear term from the GAM. As a result, the GAM collapses into a familiar logistic regression model. Figure 5 shows the piecewise constant approximation for X5 and X7.

Figure 5: Piecewise Constant Approximation for X5 and X7

Table 5 displays the statistical output of the revised logistic regression with the piecewise constant approximation included. While X5 shows statistical significance, X7 sits at the border of the 10% significance level. Parameter estimates and statistical significances of the other numeric variables are very close to those in the GAM.

Table 5: Partial Output of Revised Logistic Regression with Piecewise Constant Approximation

In Figure 6, the ROC curves of the logistic regression with piecewise constant approximation are plotted for both development and validation data. For the development data, this new hybrid model performs similarly to the GAM discussed earlier. However, it generalizes better than both the logistic regression and the GAM on the validation data, with an AUC of 0.77.

Figure 6: ROC Curves for Development and Validation Data
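The tree-based binning step just described can be sketched in a few lines of Python with scikit-learn's DecisionTreeRegressor. This continues the earlier sketches (gam, X_dev) and assumes, hypothetically, that column 4 holds X5; the tree size limits are illustrative choices, not values from the paper.

```python
# Piecewise constant approximation of one GAM smoother via a regression tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x5 = X_dev.values[:, 4].reshape(-1, 1)                 # raw predictor X5
f5 = gam.partial_dependence(term=4, X=X_dev.values)    # its fitted nonlinear term

# One independent variable in, one nonlinear term out; a shallow tree
# yields a handful of homogeneous, constant-valued segments
tree = DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=100)
tree.fit(x5, f5)

# Leaf membership becomes the categorical variable that replaces the
# nonlinear term in the revised logistic regression
bins = tree.apply(x5)
cutpoints = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaves
print("if-then split points:", cutpoints)
```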

Since one of the advantages of the classification and regression tree is its resistance to outliers, our understanding is that the improvement on the validation data may come from a reduction in over-fitting.

While the classification and regression tree provides a satisfactory piecewise constant approximation of a nonlinear effect, multivariate adaptive regression splines can address the same problem with a piecewise linear approximation. Similar to the classification and regression tree, multivariate adaptive regression splines partition the entire range of a predictor into multiple sub-regions, known as bases. Within each sub-region, a regression with a different coefficient defines the relationship between the response and the predictor. Such a divide-and-conquer strategy provides the flexibility to approximate any nonlinear pattern, given a sufficient number of basis functions. It is interesting to note that when the coefficient in each sub-region equals zero and only the intercept remains, the multivariate adaptive regression spline simply reduces to a classification and regression tree. As the building block of a multivariate adaptive regression spline, the basis function can be thought of as a hockey-stick-shaped function of the form BF = max(x - k, 0), where k is the breaking point of the hockey stick. During model development, a forward selection method is employed to include statistically significant basis functions, followed by a backward pruning process, much like variable selection in logistic regression. Figure 7 shows the piecewise linear approximation of the nonlinear terms for X5 and X7, indicating that each nonlinear pattern can be adequately approximated by three basis functions.

Figure 7: Piecewise Linear Approximation for X5 and X7
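The hinge construction is simple enough to write out directly. The sketch below builds MARS-style basis functions for X5 and re-estimates the logistic regression with them; the knot locations are hypothetical placeholders, loosely motivated by the turning points near 0 and 1.5 noted earlier, not knots reported by the paper.

```python
# MARS-style hinge bases, BF = max(x - k, 0), replacing a GAM nonlinear term.
import numpy as np
from sklearn.linear_model import LogisticRegression

def hinge(x, k):
    """Hockey-stick basis function with breaking point k."""
    return np.maximum(x - k, 0.0)

knots = [-1.0, 0.0, 1.5]                      # hypothetical breaking points
x5 = X_dev.values[:, 4]
basis = np.column_stack([hinge(x5, k) for k in knots])

# Replace the raw predictor with its three basis functions and refit;
# forward selection / backward pruning of knots is omitted in this sketch
design = np.hstack([np.delete(X_dev.values, 4, axis=1), basis])
refit = LogisticRegression(max_iter=1000).fit(design, y_dev)
```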

After developing the multivariate adaptive regression splines, we can use the derived basis functions to replace the nonlinear terms from the GAM, and re-estimate a logistic regression that includes these basis functions. Table 6 illustrates the output of this revised logistic regression, and shows that almost all basis functions are statistically significant in the new model.

Table 6: Partial Output of Revised Logistic Regression with Piecewise Linear Approximation

Predictiveness is summarized in Figure 8, again using ROC curves and AUC. While there is no noticeable difference from the logistic regression and the GAM on the development data, this new model does not generalize well to the validation data, with an AUC of 0.75. In our view, this underperformance likely arises from over-fitting, one of the known drawbacks of multivariate adaptive regression splines.

Figure 8: ROC Curves for Development and Validation Data

Conclusion

We have demonstrated a new modeling technique and its application in direct marketing. In our experience, the GAM outperforms logistic regression in two respects. First, the GAM relaxes the linearity assumption between the response and the predictors, and therefore avoids the model misspecification that often occurs in linearly specified logistic regression. Second, by incorporating nonlinear effects, the GAM helps discover hidden patterns between the response and the predictors, and consequently improves predictive performance, provided that over-fitting is carefully guarded against. Beyond the original concept and implementation of GAM, we have also discussed two hybrid GAMs drawn from our daily modeling work: the piecewise constant and piecewise linear approximation models. Our findings show that the hybrid model combining the flexibility of the GAM with the resistance to over-fitting of the classification and regression tree yields the best result.

References

1. McCullagh, P. and Nelder, J., Generalized Linear Models, Chapman and Hall (1989).
2. Hastie, T. and Tibshirani, R., Generalized Additive Models, Chapman and Hall (1990).
3. Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning, Springer (2001).
4. Franke, J., Hardle, W., and Stahl, G., Measuring Risk in Complex Stochastic Systems, Springer Verlag (2000).
5. Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees, Chapman & Hall (1984).
6. Friedman, J., "Multivariate Adaptive Regression Splines," The Annals of Statistics, Vol. 19, No. 1, 1-67 (1991).