Generalized Additive Model and Applications in Direct Marketing

Sandeep Kharidhi and WenSui Liu
ChoicePoint Precision Marketing

Abstract

Logistic regression [1] has been widely used in direct marketing applications to develop response and conversion models. Although attractively simple, logistic regression has been criticized for failing to capture nonlinearity, and therefore may not lead to satisfactory results. Introduced by Hastie and Tibshirani, the Generalized Additive Model [2] (GAM) provides the ability to detect nonlinear relationships between the response behavior and the predictors. In this paper, we present an application of GAM in the direct marketing environment, and demonstrate how to improve the interpretability and scoring scheme of GAM using other statistical techniques. Model performance is evaluated and compared among the discussed models using the area under the Receiver Operating Characteristic (ROC) curve. We find that our proposed method yields superior results in practice.

Introduction

How to assess a prospect's likelihood of responding to a marketing campaign has been a key interest in direct marketing. In most marketing companies, modelers use logistic regression to develop response and conversion models that boost campaign response rates and reduce expenses. While logistic regression is easy to understand and implement, it assumes a linear relationship between the response and the predictors, and may fail to capture the more complex nonlinear relationships that often exist in real-life data. While neural networks have reportedly been successful in nonlinear modeling [3], this success comes at the price of interpretability. As another alternative, GAM has recently shown promising success in modeling default risk [4] by combining the strength of flexible modeling with ease of interpretation. In this paper, we demonstrate the superiority of GAM over logistic regression through the development of a conversion model.
Due to the nonparametric nature of GAM, we also discuss how to use other techniques, such as the Classification and Regression Tree [5] (CART) and Multivariate Adaptive Regression Splines [6] (MARS), as further improvements in the model development.

Modeling Methodology

In direct marketing, statistical models are often used to evaluate the probability that an individual will respond to a marketing campaign, with the aim of improving campaign efficiency and reducing marketing costs. Traditionally, discriminant analysis and logistic regression have been commonly used to model binary outcomes such as response to a campaign, under the assumption that the predictors of these outcomes are linearly related to the response. A potential risk of this assumption is model misspecification. Because the effects of predictors are often neither linear nor monotonic in the real world, it is challenging to find an appropriate functional form between the response and each predictor. Consequently, logistic regression may not
always be able to provide an adequate fit to a complex data structure. As an alternative to logistic regression, GAM relaxes the linearity assumption and allows the response to depend on the predictors in a flexible manner, either linear or nonlinear. The nonlinear relationship is largely driven by the data, and is estimated nonparametrically with a univariate B-spline or local regression smoother. In other words, instead of assigning a single coefficient to each predictor, GAM uses an unspecified nonparametric function to describe the relationship between each predictor and the response, with the goal of maximizing predictive performance. Such a nonparametric function is analogous to a coefficient in logistic regression, and can be used to visualize the relationship between the response and the predictor. This ability to visualize the nonparametric function for each predictor is an important feature of GAM, and provides an intuitive way for modelers to explore a complex data structure and interpret the model's results.

In order to illustrate the application of GAM in direct marketing, we apply it to real-world data and compare the results with a logistic regression baseline model. The data analyzed in this paper were used to develop a response-to-conversion model. The dataset consists of 6,180 responders with a 6-percent conversion rate and 24 variables. The response variable Y reflects the status of the prospect's conversion and is therefore binary. The 23 predictors include eight numeric variables and 15 categorical variables with levels ranging from 2 to 11. Before model development, we randomly divided the entire dataset into two parts: one for model development and the other for model validation, as shown in Table 1.

Table 1: Conversion Summary
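The development/validation split described above can be sketched as follows. This is a minimal illustration on synthetic data: the predictor values and the 50/50 split ratio are assumptions (the paper does not state its split ratio), and only the dataset size and the roughly 6-percent conversion rate come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the conversion dataset: 6,180 responders with
# a ~6% conversion rate. The real predictors (8 numeric, 15 categorical)
# are not reproduced here; placeholder numeric columns are used instead.
rng = np.random.default_rng(42)
X = rng.normal(size=(6180, 8))                # hypothetical numeric predictors
y = (rng.random(6180) < 0.06).astype(int)     # binary conversion flag Y

# Stratifying on y keeps the conversion rate comparable in both samples,
# so development and validation performance can be compared fairly.
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
```

With stratification, both halves retain nearly identical conversion rates, which matters when the positive class is as rare as 6 percent.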
A logistic regression is estimated for the conversion model using the development data, with all predictors included and a focus on the seven numeric variables. Partial output of the model is shown in Table 2 below, in which predictors significant at the 10% level are flagged. Of the seven numeric variables, four fail to reach statistical significance under the linear assumption. Beyond the illustrated output, all statistics indicate that the logistic regression achieves an adequate fit to the development data.

Table 2: Output of Logistic Regression

A Receiver Operating Characteristic curve, also known as a ROC curve, is a graphical representation of the tradeoff between sensitivity (the true positive rate, one minus the Type-II error rate) and specificity (the true negative rate, one minus the Type-I error rate) across possible cutoffs, and is often used to compare the predictive performance of different classification models. In a ROC curve, sensitivity is placed on the Y-axis and specificity on the X-axis (expressed as 1 − specificity). The area under the ROC curve, abbreviated AUC, is a statistical measure often used to summarize the information in the ROC curve of a predictive model. In brief, AUC can be interpreted as the probability that the model scores a randomly selected positive response higher than a randomly selected negative response. A model with perfect performance has an AUC equal to 1, whereas an AUC of 0.5 corresponds to a model no better than random guessing; in practice, the AUC falls between 0.5 and 1. In Figure 1 (pg. 24), ROC curves of the logistic regression are plotted for both development and validation data. The areas under the ROC curves are 0.78 and 0.76 for the development and validation data, respectively, suggesting reasonable predictiveness for the logistic regression. All statistical evidence thus far indicates a good fit for the conversion model using logistic regression.
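The probabilistic interpretation of AUC given above can be verified directly: computing the fraction of positive/negative pairs in which the positive response outscores the negative one reproduces the area under the ROC curve. The scores below are synthetic illustrations, not the paper's model output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_pairs(y_true, score):
    """AUC from its probabilistic definition: the chance that a randomly
    chosen positive outscores a randomly chosen negative (ties count half)."""
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Illustrative labels and scores: informative but noisy.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
score = y * 0.8 + rng.normal(0, 1, 200)

# The pairwise probability matches the trapezoidal area under the ROC curve.
assert np.isclose(auc_by_pairs(y, score), roc_auc_score(y, score))
```

This equivalence (AUC as a rescaled Mann-Whitney U statistic) is why AUC is a natural summary for comparing the models in this paper.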
However, whether the assumed linearity is the correct functional form for the analyzed data remains in question, and still needs to be addressed. Since the logistic regression indicates that X4, X5, X6, and X7 are
statistically insignificant, we pay extra attention to these four variables.

Figure 1: ROC Curves for Development and Validation Data

After establishing the logistic regression model as the benchmark, we fit a GAM to the same development data, applying flexible nonparametric estimates to the four predictors that were insignificant in the logistic regression. With the flexibility provided by the GAM comes a strong temptation to over-fit the development data by allowing excess degrees of freedom in the model. In our experience, the benefit of using conservative degrees of freedom in a GAM is twofold: first, low degrees of freedom help prevent over-fitting; and second, computation time can be reduced dramatically, given the large volume of data commonly modeled in direct marketing.

Table 4: Partial Output of Generalized Additive Model
Figure 2: Partial Prediction Plots of X5 and X7

Table 4 (left) shows the partial output of the best GAM developed, after trial and error, with all predictors included. Note that X5 and X7 become significant once they are estimated nonparametrically under the nonlinear assumption. The nonlinear effect of each predictor is shown in Figure 2 (left). For instance, the relationship between X5 and Y is clearly neither linear nor monotonic, a violation of the linearity assumption in logistic regression. Instead, the conversion rate rises as X5 increases, starts decreasing once X5 reaches 0, and then picks up again after X5 exceeds 1.5. For comparison, the ROC curves of the GAM for both development and validation data are plotted in Figure 3 (right) to evaluate predictive performance based upon the Area Under Curve (AUC). While the AUC for the development data is marginally higher than that of the logistic regression, both models perform comparably on the validation data.
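The additive estimation idea behind the GAM can be sketched with a toy backfitting loop. This is a minimal illustration on synthetic data, not the paper's model: each f_j here is a low-degree polynomial fit to partial residuals, standing in for the B-spline or local regression smoothers a real GAM would use, and an identity link is assumed for brevity (a binary-response GAM wraps backfitting inside local scoring).

```python
import numpy as np

def backfit_additive(X, y, degree=3, n_iter=20):
    """Fit y ~ alpha + f_1(x_1) + ... + f_p(x_p) by backfitting.

    Each f_j is repeatedly re-estimated on the partial residuals that
    exclude its own current fit, then centered so the intercept stays
    identifiable."""
    n, p = X.shape
    alpha = y.mean()
    fitted = np.zeros((n, p))          # current f_j evaluated at the data
    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - fitted.sum(axis=1) + fitted[:, j]
            coef = np.polyfit(X[:, j], partial, degree)
            fitted[:, j] = np.polyval(coef, X[:, j])
            fitted[:, j] -= fitted[:, j].mean()
    return alpha, fitted

# Synthetic check: one smooth sinusoidal effect, one quadratic effect.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 500)
alpha, fitted = backfit_additive(X, y)
```

Plotting `fitted[:, j]` against `X[:, j]` gives exactly the kind of partial prediction plot the paper uses to visualize the nonlinear effects of X5 and X7.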
However, it is important to note that the GAM provides better insight into the relationship between the response and each predictor without imposing a strict linearity assumption or a pre-specified functional form on the model. This feature is particularly helpful in database marketing, where the data structure is always complex and little domain knowledge is available about the thousands of variables before model development. Although conceptually attractive, GAM is not beyond criticism. The lack of a parametric functional form makes it difficult to score new data directly from the database in a direct marketing production environment. Also, a nonlinear effect without an estimated parameter might not be easily adopted by non-technical audiences such as business directors and campaign managers. A possible workaround is to estimate a parametric approximation for each nonlinear term derived from the GAM, such as a piecewise constant approximation. While there are multiple ways to construct such an approximation, based upon either percentiles or experience, we propose a model-based method using a Classification and Regression Tree. In this case, the classification and regression tree is developed with only one independent variable and one dependent variable: the predictor and the corresponding nonlinear term, respectively. Such a tree-based model is constructed through recursive binary partitioning. At each partition, an if-then splitting rule is generated to divide the nonlinear term into several homogeneous groups based upon the value of the predictor. Figure 4 (top) shows the resulting classification and regression trees for X5 and X7.
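The tree-based approximation step can be sketched as follows. The nonlinear term below is a made-up curve mimicking the X5 pattern described earlier (rising, dipping after 0, recovering past 1.5); in practice the input would be the nonparametric term extracted from the fitted GAM, and the choice of four leaves is an assumption for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical nonlinear GAM term f(X5), shaped like the pattern in the
# text; continuous at the breakpoints 0 and 1.5.
x = np.linspace(-2, 3, 500)
f_x = np.where(x < 0, 0.8 * x, np.where(x < 1.5, -0.5 * x, x - 2.25))

# One-predictor, one-response regression tree: recursive binary
# partitioning collapses the smooth term into a handful of if-then bins.
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0)
tree.fit(x.reshape(-1, 1), f_x)
approx = tree.predict(x.reshape(-1, 1))   # piecewise constant approximation
```

Each leaf mean becomes one level of a categorical variable, so the GAM term can be scored in a production database with a plain if-then (or SQL CASE WHEN) rule.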
Figure 3: ROC Curves for Development and Validation Data
Figure 4: Classification and Regression Trees of X5 and X7

After developing the classification and regression tree, we use the piecewise constant approximation as a categorical variable to replace the nonlinear term from the GAM. As a result, the GAM collapses into a familiar logistic regression model. Figure 4 (top) shows the piecewise constant approximations for X5 and X7. Table 5 (pg. 26) displays the statistical output of the revised logistic regression with the piecewise constant approximations included. While X5 shows statistical significance, X7 sits at the border of the 10% significance level. The parameter estimates and statistical significance of the other numeric variables are very close to those in the GAM. In Figure 5 (pg. 26), ROC curves of the logistic regression with piecewise constant approximation are plotted for both development and validation data to evaluate predictiveness. On the development data, this new hybrid model performs similarly to the GAM discussed earlier. However, it generalizes better than both the logistic regression and the GAM
for the validation data, with AUC = 0.77. Since one of the advantages of the classification and regression tree is its resistance to outliers, our understanding is that this improvement comes from a reduction in over-fitting.

Figure 5: Piecewise Constant Approximation for X5 and X7
Table 5: Partial Output of Revised Logistic Regression with Piecewise Constant Approximation
Figure 6: ROC Curves for Development and Validation Data

While the classification and regression tree provides a satisfactory piecewise constant approximation of a nonlinear effect, multivariate adaptive regression splines can address the same problem with a piecewise linear approximation. Similar to the classification and regression tree, multivariate adaptive regression splines partition the entire range of a predictor into multiple sub-regions. Within each sub-region, a regression with a different coefficient defines the relationship between the response and the predictor. This divide-and-conquer strategy provides the flexibility to approximate any nonlinear pattern, given a sufficient number of basis functions. It is interesting to note that when the coefficient in each sub-region equals zero and only the intercept remains, the multivariate adaptive regression spline reduces to a classification and regression tree (see Figure 6, below). As the building block of a multivariate adaptive regression spline, the basis function can be thought of as a hockey-stick-shaped function of the form BF = max(x − k, 0), where the knot k is the breaking point of the hockey stick. During model development, a forward selection method is employed to include statistically significant basis functions, followed by a backward pruning process, a technique similar to stepwise variable selection in logistic regression. Figure 7 (right) shows the piecewise linear approximations of the nonlinear terms for X5 and X7, indicating
that each nonlinear pattern can be adequately approximated by three basis functions.

Figure 7: Piecewise Linear Approximation for X5 and X7

After developing the multivariate adaptive regression splines, we can use the derived basis functions to replace the nonlinear terms from the GAM, and re-estimate a logistic regression with these basis functions included. Table 6 (right) illustrates the output of this revised logistic regression, and shows that almost all basis functions are statistically significant in the new model. Predictive efficiency is again summarized in Figure 8 (right) using ROC curves and AUC. While there is no noticeable difference from the logistic regression and the GAM on the development data, this new model does not generalize well to the validation data, with AUC = 0.75. In our view, this underperformance likely arises from over-fitting, one of the known drawbacks of multivariate adaptive regression splines.

Table 6: Partial Output of Revised Logistic Regression with Piecewise Linear Approximation
Figure 8: ROC Curves for Development and Validation Data

Conclusion

We have demonstrated a new modeling technique and its application in direct marketing. In our experience, the GAM outperforms logistic regression in two respects. First, the GAM relaxes the linearity assumption between the response and the predictors, and therefore avoids the model mis-specification that often occurs in linear-based logistic regression. Second, by incorporating nonlinear effects, the GAM helps discover hidden patterns between the response and the predictors, and consequently improves predictive performance, provided over-fitting is carefully guarded against. Besides the original concept and implementation, we have also discussed two
hybrid GAMs based upon our learning from daily modeling work: the piecewise constant and piecewise linear approximation models. Our findings show that a hybrid model combining the flexibility of GAM with the resistance to over-fitting of the classification and regression tree is able to yield the best result.

References

1. McCullagh, P. and Nelder, J., Generalized Linear Models, Chapman and Hall, (1989).
2. Hastie, T. and Tibshirani, R., Generalized Additive Models, Chapman and Hall, (1990).
3. Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning, Springer, (2001).
4. Franke, J., Härdle, W., and Stahl, G., Measuring Risk in Complex Stochastic Systems, Springer Verlag, (2000).
5. Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees, Chapman & Hall, (1984).
6. Friedman, J., Multivariate Adaptive Regression Splines, The Annals of Statistics, Vol. 19, No. 1, 1-67, (1991).