Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects

Patralekha Bhattacharya, Thinkalytics

The PDLREG procedure in SAS is used to fit a finite distributed lag model to time series data where the coefficients of the lagged terms are assumed to lie on a polynomial curve. This method was suggested by Almon (1965) as a relatively flexible and yet parsimonious method of estimating distributed lags. The polynomial assumption increases the flexibility in the shape of the distributed lag function, while the degree of the polynomial is usually chosen to be less than the number of lags, thereby reducing the number of parameters to be estimated as well. The GAM procedure in SAS is used to fit generalized additive models as outlined by Hastie and Tibshirani (1990). In this paper, we outline the advantages and disadvantages of the two procedures and compare their performance in estimating the duration of lags by conducting some simulations.

In any field involving research and analytics, one often encounters data that is spaced over time. This is known as time series data. One of the inherent characteristics of time series data is that effects are spread out over time, so that the outcome in this period may be affected not only by the events that take place in the current period but also by those that occurred in the past. For instance, this month's Sales may be affected by the advertising that takes place in the current month as well as by all the marketing and advertising the firm did in previous months. Another example is that an individual's consumption in this period may depend on his disposable income in this period as well as on his disposable income in previous periods. Both of the above instances are examples of lagged effects, where the value of the dependent variable depends on lagged values of the independent variable. There might be lead effects as well, so that the outcome variable is affected by perceptions of what may happen in the future. For example, if housing prices are expected to go down in the future, one might defer the purchase of a new house by a few months. Also, consumers can react in anticipation of a marketing stimulus.

Practitioners have always struggled to understand and correctly estimate these dynamic effects that are an inherent part of time series data. One way of modeling delayed effects is to introduce lagged terms as independent variables in the model. Taking the example of Sales and advertising, Sales in period t depends on advertising in period t as well as in previous periods, as shown in equation (1), where S_t represents Sales and A_t represents advertising in period t, and β_0 through β_(s-1) are the lag coefficients:

S_t = α + β_0*A_t + β_1*A_(t-1) + ... + β_(s-1)*A_(t-s+1) + ε_t        (1)

Equation (1) indicates that advertising has an effect up to s-1 periods into the future. However, estimation of equation (1) can become difficult for several reasons. First of all, it is difficult to decide how many lagged terms to include in equation (1): without prior knowledge about how long the effects of advertising last, we cannot choose a value of s. Secondly, with s-1 lagged terms the number of parameters turns out to be s + 1. For large values of s, this can require estimation of a large number of parameters, which may cause problems because of loss of degrees of freedom. Besides, these lagged independent variables may be correlated with the original variable, thus adding to the collinearity in the model. This has led researchers to look into other methods of modeling the carryover effects of advertising.
One of the methods has been to postulate relationships between the different lag parameters in order to reduce the number of parameters in the model. For instance, the geometric distributed lag model assumes that the impact of the lagged terms declines geometrically over time. Therefore, if β_0 is the impact of advertising in period 1, then in period 2 the impact of that advertising would be λ*β_0, where λ is a fraction, i.e. 0 < λ < 1. Therefore, in equation (1), β_1 = λ*β_0, β_2 = λ^2*β_0, and so on. This assumption greatly reduces the number of parameters that have to be estimated in this model. The geometric lag model, however, assumes a monotonically declining lag structure, which may not always be realistic.

For instance, the effect of advertising may be small in the first few periods, then increase, and eventually decline. In order to accommodate this type of lag structure, some authors have used the negative binomial (or Pascal) distribution to model advertising lags. A more flexible method of estimating the effect of advertising lags in equation (1) was postulated by Almon. She suggested a method in which the coefficients of the model are expressed in terms of some function f(k) which can be approximated by a polynomial in k. The PDLREG procedure in SAS is based on this method, and the distribution of the lagged effects is modeled by Almon lag polynomials. This means that the coefficients of the lagged values of the independent variables are assumed to lie on a polynomial curve.

Apart from dynamic effects, another factor that complicates matters is that the relationship between the dependent and independent variable may often be nonlinear. Going back to our Sales and advertising example, it is a well-known fact that the relationship between Sales and advertising is concave for high levels of advertising because of diminishing returns to advertising. In order to model this relationship, marketing practitioners often transform the dependent and independent variables using different functional forms such as log, square root, etc. For more details about the types of functional forms that are usually used, please refer to the NESUG 2010 paper by the same author. However, rarely in the real world does the relationship between Sales and advertising follow a specific mathematical functional form. Moreover, predictor variables usually do not show much variation in the sample, so we may only observe values of advertising within a small range. Sometimes, with only small variation in the sample, several models can be a good fit for the data.

In the next section of the paper we outline the PDLREG procedure in SAS, illustrate its syntax, and explain some of its outputs. The following section is devoted to the GAM procedure. We then move on to a comparison between the two procedures, where we outline the advantages and disadvantages of both. Finally, in the last section of the paper, we compare the performance of the two procedures through some simulation exercises.

THE PDLREG PROCEDURE:

The PDLREG procedure can be invoked by using the following syntax:

proc pdlreg data=test;
   model y = x( n, l );
run;

where y is the dependent variable and x is the independent variable. The parameter n specifies the length of the lag distribution, i.e. the number of lags of the regressor to use in the model, and the parameter l denotes the degree of the distribution polynomial. The above model statement assumes that the relationship between y and x is linear in parameters. However, if there is reason to assume a nonlinear relationship between those variables, either of them can be transformed in any manner and the transformed variable can be used in the equation instead. In other words, if the relationship between the dependent and independent variables is nonlinear, one can specify the nature of that relationship through appropriate transformations of the dependent and independent variables. Suppose we believe that there is a linear relationship between y and log(x).
Then the model statement in the above syntax can be changed to:

model y = z( n, l );

where z = log(x). The PDLREG procedure also allows other covariates to be entered in the model, and distributed lags can be specified for more than one regressor. The procedure prints a table containing the parameter estimates for the polynomial distribution, as shown in Figure 1.1.

This table can be used to determine the correct degree of the distribution polynomial. For instance, if we start off with the assumption of a polynomial of degree 5 and the parameter estimates of the coefficients are significant for the first four terms but insignificant for the fifth term, that may indicate that the true degree of the polynomial should be 4.

Figure 1.1. The PDLREG Procedure, Parameter Estimates: estimates, standard errors, t values, and approximate p-values for the Intercept and the polynomial distribution terms logx**0 through logx**5.

As shown in Figure 1.2, the PDLREG procedure also prints the parameter estimates of the lag distribution coefficients, which are the coefficients of the lagged values of z. The significance of these coefficients can be used to determine the duration of the lags. The PDLREG procedure can support any number of lags.

Figure 1.2. The PDLREG Procedure, Estimate of Lag Distribution: estimates, standard errors, t values, and approximate p-values for logx(0) through logx(10).
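To make the above workflow concrete, the following sketch first computes the log transformation in a DATA step and then fits the Almon distributed lag model with PROC PDLREG. It is only an illustrative sketch: the data set names (test, test2), the variable names (y, x, z), the 10-period lag length, and the degree-5 polynomial are assumptions chosen to mirror Figures 1.1 and 1.2, not values taken from the paper's data.

/* Transform the regressor, then fit an Almon polynomial distributed lag. */
data test2;
   set test;
   z = log(x);            /* semi-log transformation of the regressor */
run;

proc pdlreg data=test2;
   model y = z(10, 5);    /* 10 lags of z, coefficients on a degree-5 polynomial */
run;

The significance of the z(i) rows in the resulting lag distribution table would then be read in the same way as Figure 1.2.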

THE GAM PROCEDURE:

The following statements invoke the GAM procedure:

proc gam data=diabetes;
   model y = spline(x) spline(lag1x) spline(lag2x) ... spline(lagix);
run;

The GAM procedure fits generalized additive models as those models are defined by Hastie and Tibshirani (1990). The procedure is based on nonparametric regression and smoothing techniques, which relax the assumption of linearity and enable us to uncover structure in the relationship between the independent variables and the dependent variable that might otherwise be missed. Multiple lag terms and/or other covariates can be entered in the model by using additional spline functions in the syntax shown above. If multiple lag terms are entered into the model, the number of lag terms that remain significant can be used to understand the duration of the lags.

The procedure prints a table containing parameter estimates for the parametric part of the model, as shown in Figure 2.1. This table looks at the linear relationship between y and each of the independent variables in the model. If the t value is high for an independent variable in this table, that indicates that the linear trend for that specific independent variable is significant.

Figure 2.1. The GAM Procedure, Regression Model Analysis (dependent variable Y; smoothing model components spline(X) through spline(lag10X)): parameter estimates, standard errors, t values, and approximate p-values for the Intercept and Linear(X) through Linear(lag10X).

Another table, the Analysis of Deviance table, is printed for the nonparametric component of the model. This table looks at the significance of the non-linear relationships between y and each of the independent variables. A high chi-square value for one of the independent variables in this table implies that there is a significant non-linear trend for that specific variable. This table can therefore be used to determine the significance of the non-linear trends for the independent variables in the model.

Figure 2.2. The GAM Procedure, Smoothing Model Analysis, Analysis of Deviance: degrees of freedom, sums of squares, chi-square values, and Pr > ChiSq for Spline(X) through Spline(lag10X).

Since we have assumed nonlinear relationships between y and each lag of the independent variable throughout, we focus on this second table to estimate the duration of lags. We use a method of iteration, using a SAS DO loop, to determine the number of lags that remain in the model. We start by using the independent variable along with 20 of its lags as predictor variables in the model. If any of the predictors is insignificant in the Analysis of Deviance table, that term is deleted in the next round of model iteration. This method is continued until all the predictors that remain in the model are significant. The lagged terms that remain in the final model are then used to determine the duration of lagged effects. A sketch of one possible implementation of this backward-elimination scheme is shown below.
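The following macro is a minimal sketch of such an iterative scheme and is not the paper's actual code. The macro name, the data set name (test), the dependent variable name (y), the predictor names (m and lag1m through lag20m), the 0.05 significance cutoff, and the column names of the ANODEV output data set (Source, ProbChiSq) are all assumptions made for illustration.

%macro gam_backward(ds=test, depvar=y, alpha=0.05, maxiter=20);
   %local preds iter i splines n_insig;

   /* Start with the media variable and 20 of its lags. */
   %let preds = m lag1m lag2m lag3m lag4m lag5m lag6m lag7m lag8m lag9m lag10m
                lag11m lag12m lag13m lag14m lag15m lag16m lag17m lag18m lag19m lag20m;

   %do iter = 1 %to &maxiter;

      /* Build the spline() terms for the predictors still in the model. */
      %let splines = ;
      %let i = 1;
      %do %while(%length(%scan(&preds, &i)) > 0);
         %let splines = &splines spline(%scan(&preds, &i));
         %let i = %eval(&i + 1);
      %end;

      /* Fit the GAM and capture the Analysis of Deviance table. */
      proc gam data=&ds;
         model &depvar = &splines / dist=normal;
         ods output ANODEV=anodev_out;
      run;

      /* Count the insignificant smoothing components and keep only the significant ones. */
      proc sql noprint;
         select count(*) into :n_insig trimmed
            from anodev_out where ProbChiSq >= &alpha;
         select compress(tranwrd(upcase(Source), 'SPLINE(', ''), ')')
            into :preds separated by ' '
            from anodev_out where ProbChiSq < &alpha;
      quit;

      /* Stop once every remaining term is significant. */
      %if &n_insig = 0 %then %goto done;
   %end;

%done:
   %put NOTE: Lag terms retained in the final model: &preds;
%mend gam_backward;

%gam_backward();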

COMPARING THE TWO PROCEDURES:

The art of modeling almost always calls for assumptions about the relationships between the dependent and independent variables to simplify the estimation process. For instance, when we know the relationship between the dependent and independent variable is non-linear, we may use a specific functional form (such as logarithmic, square root, or reciprocal) to model that relationship. When it comes to the estimation of dynamic lags, assumptions are also made about the relationships between the parameter estimates of the lagged independent variables. For instance, in the geometric lag model, the parameters of the lagged variables are assumed to decline geometrically over time. Similarly, in the Pascal model, the coefficients of the lagged terms are assumed to follow a negative binomial distribution. To summarize, there are two types of restrictions that can be imposed on a dynamic model:

1. Restrictions on the coefficients of the lagged terms.
2. Restrictions on the shape of the functional form: e.g., fitting a model that is linear in parameters when the correct functional form should be a non-linear model.

Compared to other models such as the geometric lag model and the Pascal model, the Almon lag structure somewhat relaxes the first restriction and allows some degree of flexibility in determining the coefficients of the lagged terms. In spite of that, certain restrictions are still imposed on the lag parameters, and if these conditions are not correct, then the model will be somewhat mis-specified. In that case, incorrect lags may show up as significant in the model. The generalized additive model, however, imposes no restrictions on the coefficients of the lagged terms and allows those coefficients to be determined from the data.

When working with the PDLREG procedure, we also have to make assumptions about the relationship between the dependent and independent variables and use specific transformations of variables such as log, square root, etc. to represent those relationships. The GAM procedure, on the other hand, does not require us to make any presumptions about these relationships or compute any transformations of variables, nor does it impose specific functional forms to model the relationships between these variables. The advantage of the GAM procedure over the PDLREG procedure is that the nature of the non-linear relationship has to be specified in the latter, whereas in the former the relationship is uncovered from the data. Therefore, in the PDLREG procedure, the non-linear relationship between the dependent and independent variable is restricted to a specific functional form such as log, square root, reciprocal, etc., while in the GAM procedure the relationship can follow any pattern that is found in the data. GAM allows complete flexibility in the (non-linear) functional form of the model and imposes no restriction on the parameters. Therefore, when the true model is non-linear, GAM does a better job of fitting the model and estimating the true duration of the lags.

This complete flexibility in choosing the functional form and parameters, however, also comes with its own disadvantages. With a large number of predictor variables, including lagged terms for each of them can lead to a very large number of independent variables in the model, which might cause difficulties in estimation. The PDLREG procedure also has some other advantages compared to the GAM procedure. First, the PDLREG procedure is less computationally intensive, uses fewer resources, and is much faster to run than the GAM procedure. Second, the PDLREG procedure allows for tests of autocorrelation such as the Durbin-Watson test, whereas the GAM procedure does not. The PDLREG procedure also allows autoregressive error terms to be included in the model by using the nlag= option (a brief sketch is shown below); there is no such option in the GAM procedure.

In the next section we describe the method of simulation that we used in the paper. Our method here is very similar to the method used in the NESUG 2010 paper by the same author. At the risk of repetition, we also describe the methodology here for the convenience of the reader.
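As a brief illustration of the autocorrelation-related options mentioned above, the following statement sketch adds a first-order autoregressive error term to the distributed lag model. The data set and variable names (test2, y, z) carry over from the earlier illustrative sketch, and the lag length, polynomial degree, and AR order are assumptions rather than values used in the paper.

proc pdlreg data=test2;
   /* 10 lags of z on a degree-5 Almon polynomial, with AR(1) errors via nlag=. */
   /* As noted above, the Durbin-Watson statistic for the residuals is part of  */
   /* the procedure's standard output.                                          */
   model y = z(10, 5) / nlag=1;
run;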
SIMULATION METHOD

We use a dataset that contains media impressions for magazines. The data are simulated to be as close to real-world data as possible. Actual magazine accumulation curves were obtained from the former MRI website (now GfK MRI), and those were used to create a variable with magazine impressions. For the sake of simplification, throughout this paper we assume that Sales are affected by only one media variable. We assume that carryover effects exist, so that the media variable influences Sales not only in the period in which it is aired but also in future periods. Therefore, Sales in any period is determined by the value of the media variable in that period as well as by lagged values of the media variable. In this paper we assume that there are 3 significant lags, so that in the true model, Sales in the current period is affected by the media variable in the current and 3 preceding periods.

The way we conduct our experiment is as follows. We postulate the true relationship between Sales and the media variable by specifying the model and the values of the parameters. Next, we use Monte Carlo simulation methods to fit several different models (including the true model) to estimate the relationship between Sales and the current and lagged media variables, and see which lags come up as significant in the model.

For example, suppose we assume that the true relationship between Sales and the media variable (M) can be represented by a semi-log model as follows:

S_t = β_1*log(M_t) + β_2*log(M_(t-1)) + β_3*log(M_(t-2)) + β_4*log(M_(t-3))        (2)

Using the current and lagged values of the media variable that we have in our dataset, and a randomly chosen set of parameters (β_1, β_2, β_3, β_4), we calculate the value of Sales using equation (2). This is assumed to be the true relationship between Sales and the media variable, M. Taking this as the true model, we next simulate a number of data sets, each with different random scatter. In order to do this we first create a new variable (called δ, say) whose values are drawn from the standard normal distribution with replacement. We then add this new variable δ to our dependent variable Sales to create a new Sales variable, New_S_t:

New_S_t = S_t + δ_t,   where δ_t ~ N(0,1)        (3)

This new Sales variable, New_S_t, is used to run a regression model using the true (semi-logarithmic) specification, and that exercise gives us an estimate of the standard deviation of the residuals, S(yx). This is an estimate of the variance in Sales that we can observe when the true relationship is given by equation (2).

Next we use Monte Carlo simulations to determine other plausible values that the dependent variable can take assuming that the true relationship is (2). These are the values of Sales that may be observed in practice when the true Sales stream is S_t in equation (2). To obtain these possible values for the Sales stream, we proceed as follows. To each ideal point we add random scatter drawn from a Gaussian distribution with a mean of 0 and a standard deviation equal to the value of S(yx) reported from the regression of our experimental data. This gives us the probable values that the Sales stream can take when the true values are given by equation (2). We repeat this step 50 times to obtain 50 different data sets, each containing a different Sales stream. With each data set and each new Sales stream, we try to fit the simulated data using different model specifications, including the true model specification. For instance, since the true model specification is semi-log, each simulated data set is fitted with the PDLREG procedure using a semi-log model, a reciprocal model, and a square root model, as well as with the generalized additive model.

In the above example, we used the semi-logarithmic model to obtain the ideal data set and then tried to fit other types of models to the simulated data derived from this ideal data. We repeat this exercise for other types of models as well. More specifically, apart from the semi-log model, the above simulations are also performed using the reciprocal model and the square root model as the ideal models. Therefore, in the second phase of the experiment we use the reciprocal model as the true relationship between Sales and advertising and derive a set of simulated data sets from this ideal data set. These simulated data sets are then fitted with the PDLREG procedure assuming the reciprocal model, the semi-log model, and the square root model, as well as with the generalized additive model. In the third phase of the experiment, we assume that the true relationship between Sales and advertising is represented by the square root model, and all of the above steps are repeated assuming that the square root model is the true model.
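The noise-addition step described above can be sketched in a short DATA step. This is an illustrative sketch only: the data set names (ideal, simulated), the variable names (s, sim_s, rep), the macro variable &s_yx holding the estimated residual standard deviation S(yx), and the seed are assumptions, not the paper's actual code.

/* Assumes data set IDEAL holds the ideal Sales stream S computed from equation (2), */
/* and that macro variable &s_yx holds the residual standard deviation S(yx).        */
%let n_reps = 50;

data simulated;
   set ideal;
   do rep = 1 to &n_reps;
      /* Add Gaussian scatter with mean 0 and SD equal to S(yx) to each ideal point. */
      sim_s = s + &s_yx * rannor(12345);
      output;
   end;
run;

Each value of rep then indexes one of the 50 simulated Sales streams, which can be fitted with PROC PDLREG or PROC GAM under the different model specifications.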
Notice that to obtain the ideal relationship between Sales and the media variable (and its lagged values) as shown in equation (2), we need to come up with values for the parameters β_1, β_2, β_3, β_4. This parameter combination is chosen randomly (with certain restrictions) in order to make sure that the choice of parameters does not influence any of the results. In fact, for each model type, 100 different parameter combinations are used to obtain the dependent variable and create 100 ideal data sets. Therefore, for each model type, the method of simulating 50 datasets outlined in the previous paragraph was repeated for each of the 100 different parameter combinations. In total, 15,000 model simulations were run: 50 simulations for each of 3 model types and 100 parameter combinations. The GAM procedure was invoked using the following code:

proc gam data=test;
   model Y = spline(m) spline(lag1m) spline(lag2m) spline(lag3m) spline(lag4m)
             spline(lag5m) spline(lag6m) spline(lag7m) spline(lag8m) spline(lag9m)
             spline(lag10m) spline(lag11m) spline(lag12m) spline(lag13m) spline(lag14m)
             spline(lag15m) spline(lag16m) spline(lag17m) spline(lag18m) spline(lag19m)
             spline(lag20m) / dist = normal;
   ods output ANODEV = Anodev_out;
run;

where lagim represents the variable obtained by taking the ith lag of M.

RESULTS

The tables in this section illustrate the results obtained from the simulation exercises. Tables 1a through 1d show the results when the true model is semi-logarithmic. Recall that we have assumed that the current media stream as well as the 3 lagged terms are significant in the ideal model. Table 1a shows a typical result for one of the parameter combinations when the true model is semi-logarithmic. If a generalized additive model is fitted to these data, then almost 100% of the simulated models show the correct lags as significant. Lag 4 also shows up as significant sometimes, but only in 34% of the models. However, if a PDLREG model is used to fit the data, some irrelevant lag terms show up as significant in the model. Obviously, when a reciprocal or square root transformation is used for the media variable, this result can be expected to occur because of the incorrect model specification (since the true model is semi-logarithmic). Surprisingly though, incorrect lags show up as significant even when the correct transformation of the independent variable is used in the PDLREG procedure. In other words, even if we compute the logarithmic transformation of the media variable and then use the transformed variable on the right-hand side of the PDLREG model equation, we still do not get the correct lags in most of the model simulations.

TRUE MODEL = SEMI-LOGARITHMIC MODEL

                      lag0   lag1   lag2   lag3   lag4   lag5   lag6   lag7   lag8   lag9   lag10
GAM                   100%   100%   100%   100%    34%     2%     6%     0%     0%     0%     0%
PDLREG (SEMI-LOG)     100%   100%   100%   100%   100%   100%   100%   100%     0%   100%   100%
PDLREG (SQR_ROOT)     100%   100%   100%   100%   100%   100%   100%   100%     0%   100%   100%
PDLREG (RECIPROCAL)   100%   100%   100%   100%   100%   100%   100%     0%   100%   100%   100%

Table 1a

Table 1b summarizes the results for all the parameter combinations when the true model is semi-log and the fitted model is the PDLREG model with the square root transformation for the media variable. Recall that the simulation exercise is repeated for 100 different parameter combinations. The leftmost column in Table 1b shows the percentage of simulations for which the corresponding lag shows up as significant. For each of the 100 parameter combinations, 100% of the simulated models pick up lag 0 through lag 7 as significant. For 99 of the 100 parameter combinations, at least one of the far-out lags (lag 8-11) is also found to be significant in 100% of the models.

TRUE MODEL = SEMI-LOGARITHMIC, FITTED MODEL = PDLREG WITH SQUARE ROOT TRANSFORMATION

Table 1b. Rows are bands of the percentage of simulations in which a lag is significant (0%-20%, 20%-40%, 40%-60%, 60%-80%, 80%-100%, 100%); columns are lag 0 through lag 7 and lag 8-11; the cells report the number of parameter combinations falling in each band.

Table 1c summarizes the results for all parameter combinations when the true model is semi-log and the fitted model is the PDLREG model with the semi-log transformation for the media variable. The transformation of the media variable in this case is therefore the true model transformation. In spite of that, we still find that for all 100 of the parameter combinations, 100% of the model simulations pick out lags 4, 5, and 6 as significant along with the relevant lags 0 through 3. Besides, at least one of the lags 8 through 11 always shows up as significant for 99 of the parameter combinations.

TRUE MODEL = SEMI-LOGARITHMIC, FITTED MODEL = PDLREG WITH SEMI-LOG TRANSFORMATION

Table 1c. Rows are bands of the percentage of simulations in which a lag is significant (0%, 0%-20%, 20%-40%, 40%-60%, 60%-80%, 80%-100%, 100%); columns are lag 0 through lag 7 and lag 8-11; the cells report the number of parameter combinations falling in each band.

The results look very similar if the true model is semi-logarithmic and the fitted model is a PDLREG with a reciprocal transformation for the independent variable, and will not be repeated here. If the fitted model is a generalized additive model, the results are strikingly different, as shown in Table 1d. In this case, for most of the parameter combinations, lags 0 through 3 show up as significant in 100% of the model simulations. For about 39 parameter combinations, lag 4 shows up as significant in 80% to 100% of the model simulations. For almost none of the parameter combinations do far-out lags show up as significant in any of the model simulations.

TRUE MODEL = SEMI-LOGARITHMIC, FITTED MODEL = GAM

Table 1d. Rows are bands of the percentage of simulations in which a lag is significant; columns are lag 0 through lag 8 and dum9 through dum11; the cells report the number of parameter combinations falling in each band.

Therefore, when the true model is semi-logarithmic, the generalized additive model does a better job of picking out the true duration of lags than any of the PDLREG models. Based on our simulation exercises, we reach the same conclusions when the true model is reciprocal or square root.
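The percentages reported in these tables can be computed by summarizing the output data sets saved from the 50 fitted models for each parameter combination. The following sketch shows one way to do this for the GAM runs; the stacked data set name (anodev_all), the replicate identifier (rep), the ANODEV column names (Source, ProbChiSq), and the 0.05 cutoff are assumptions made for illustration, not the paper's actual code.

/* ANODEV_ALL is assumed to stack the Analysis of Deviance tables from the 50 */
/* simulated GAM fits for one parameter combination, with REP identifying the */
/* simulation replicate.                                                       */
proc sql;
   create table lag_significance as
   select Source,
          100 * mean(ProbChiSq < 0.05) as pct_significant format=5.1
   from anodev_all
   group by Source
   order by Source;
quit;

proc print data=lag_significance noobs;
   title 'Percentage of simulations in which each spline term is significant';
run;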

CONCLUSION

In this paper, we compare the PDLREG and GAM procedures and look at their effectiveness in estimating dynamic effects. We show that model specification may play an important role in determining which lags show up as significant in the model. There are two ways in which an incorrect model may be specified: assumptions about the specific functional form as well as restrictions on the parameter estimates can both result in a mis-specification of the model. We propose that using a generalized additive model (PROC GAM in SAS) instead of the PDLREG procedure may help to more accurately identify the significant lags in the model. Since the true relationship between Sales and advertising rarely follows a precise functional form, using an explicit function to model the relationship may lead to incorrect estimation of the lagged effects of advertising. A generalized additive model allows greater flexibility in the functional form and helps to obtain more accurate results. Also, the PDLREG model restricts the parameters to lie on a polynomial curve. While this might be a reasonable assumption for some parameter values, it may not be true for other values of the parameters. GAM does not impose any restriction on the parameters of the model and may therefore be more accurate.

Having said that, we would like to emphasize that the PDLREG procedure takes less time to run and may be able to handle a larger number of independent variables than the GAM procedure. Besides, the PDLREG procedure has options available for autoregressive terms to be included in the model and allows for tests of autocorrelation of residuals, whereas none of these options are available in the GAM procedure.

In conclusion, we would like to point out that in this paper we have used a very simplistic model to show the accuracy of GAM vis-a-vis other functional forms. We also restricted our analysis and simulation exercises to one dataset. More research is needed to investigate how well GAM performs with different datasets, as well as when we use more complicated models with multiple media variables.

REFERENCES

Almon, S. (1965), "The Distributed Lag Between Capital Appropriations and Expenditures," Econometrica, Vol. 33, No. 1, pp. 178-196.

Bhattacharya, P. (2010), "Using Generalized Additive Models in Marketing Mix Modeling," NESUG 2010.

Hastie, T. and Tibshirani, R. (1990), Generalized Additive Models, New York: Chapman and Hall.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Patralekha Bhattacharya
Thinkalytics


More information

1 The Permanent Income Hypothesis

1 The Permanent Income Hypothesis The Permanent Income Hypothesis. A two-period model Consider a two-period model where households choose consumption ( 2 ) to solve + 2 max log + log 2 { 2 } µ + + where isthediscountfactor, theinterestrate.

More information

Digital Image Processing. Prof. P. K. Biswas. Department of Electronic & Electrical Communication Engineering

Digital Image Processing. Prof. P. K. Biswas. Department of Electronic & Electrical Communication Engineering Digital Image Processing Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 21 Image Enhancement Frequency Domain Processing

More information

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Tool 1: Standards for Mathematical ent: Interpreting Functions CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Name of Reviewer School/District Date Name of Curriculum Materials:

More information

GLM II. Basic Modeling Strategy CAS Ratemaking and Product Management Seminar by Paul Bailey. March 10, 2015

GLM II. Basic Modeling Strategy CAS Ratemaking and Product Management Seminar by Paul Bailey. March 10, 2015 GLM II Basic Modeling Strategy 2015 CAS Ratemaking and Product Management Seminar by Paul Bailey March 10, 2015 Building predictive models is a multi-step process Set project goals and review background

More information

Tips and Guidance for Analyzing Data. Executive Summary

Tips and Guidance for Analyzing Data. Executive Summary Tips and Guidance for Analyzing Data Executive Summary This document has information and suggestions about three things: 1) how to quickly do a preliminary analysis of time-series data; 2) key things to

More information

Non-Linearity of Scorecard Log-Odds

Non-Linearity of Scorecard Log-Odds Non-Linearity of Scorecard Log-Odds Ross McDonald, Keith Smith, Matthew Sturgess, Edward Huang Retail Decision Science, Lloyds Banking Group Edinburgh Credit Scoring Conference 6 th August 9 Lloyds Banking

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS ABSTRACT Paper 1938-2018 Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS Robert M. Lucas, Robert M. Lucas Consulting, Fort Collins, CO, USA There is confusion

More information

Reference

Reference Leaning diary: research methodology 30.11.2017 Name: Juriaan Zandvliet Student number: 291380 (1) a short description of each topic of the course, (2) desciption of possible examples or exercises done

More information

Adaptive osculatory rational interpolation for image processing

Adaptive osculatory rational interpolation for image processing Journal of Computational and Applied Mathematics 195 (2006) 46 53 www.elsevier.com/locate/cam Adaptive osculatory rational interpolation for image processing Min Hu a, Jieqing Tan b, a College of Computer

More information

SAS (Statistical Analysis Software/System)

SAS (Statistical Analysis Software/System) SAS (Statistical Analysis Software/System) SAS Adv. Analytics or Predictive Modelling:- Class Room: Training Fee & Duration : 30K & 3 Months Online Training Fee & Duration : 33K & 3 Months Learning SAS:

More information

Error Analysis, Statistics and Graphing

Error Analysis, Statistics and Graphing Error Analysis, Statistics and Graphing This semester, most of labs we require us to calculate a numerical answer based on the data we obtain. A hard question to answer in most cases is how good is your

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

ADMS 3330 FALL 2008 EXAM All Multiple choice Exam (See Answer Key on last page)

ADMS 3330 FALL 2008 EXAM All Multiple choice Exam (See Answer Key on last page) MULTIPLE CHOICE. Choose the letter corresponding to the one alternative that best completes the statement or answers the question. 1. Which of the following are assumptions or requirements of the transportation

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA Predictive Analytics: Demystifying Current and Emerging Methodologies Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA May 18, 2017 About the Presenters Tom Kolde, FCAS, MAAA Consulting Actuary Chicago,

More information