SOC6078 Advanced Statistics: 9. Generalized Additive Models

Robert Andersen
Department of Sociology, University of Toronto

Goals of the Lecture
Introduce Additive Models
Explain how they extend simple nonparametric regression (i.e., smoothing splines and local polynomial regression)
Discuss estimation using backfitting
Explain how to interpret their results
Conclude with some examples of Additive Models applied to real social science data

Limitations of the Multiple Nonparametric Models
The general nonparametric model (both the lowess smooth and the smoothing spline) takes the following form:
$$ y_i = f(x_{i1}, x_{i2}, \ldots, x_{ik}) + \varepsilon_i $$
As we see here, the multiple nonparametric model allows all possible interactions between the independent variables in their effects on Y: we specify a jointly conditional functional form.
This model is ideal under the following circumstances:
1. There are no more than two predictors
2. The pattern of nonlinearity is complicated and thus cannot be easily modelled with a simple transformation, polynomial regression or cubic spline
3. The sample size is sufficiently large

Limitations of the Multiple Nonparametric Models (2)
The general nonparametric model becomes impossible to interpret and unstable as we add more explanatory variables, however:
1. For example, in the lowess case, as the number of variables increases, the window span must become wider in order to ensure that each local regression has enough cases (the general idea is the same for smoothing splines). This process can create significant bias (the curve becomes too smooth).
2. It is impossible to interpret general nonparametric regression when there are more than two variables: there are no coefficients, and we cannot graph effects in more than three dimensions.
These limitations lead us to Additive Models.

Additive Regression Models
Additive regression models essentially apply local regression to low-dimensional projections of the data.
That is, they estimate the regression surface by a combination of a collection of one-dimensional functions.
The nonparametric additive regression model is
$$ y_i = \alpha + \sum_{j=1}^{k} f_j(x_{ij}) + \varepsilon_i $$
where the $f_j$ are arbitrary functions estimated from the data; the errors $\varepsilon$ are assumed to have constant variance and a mean of 0.
The estimated functions $f_j$ are the analogues of the coefficients in linear regression.

Additive Regression Models (2)
The assumption that the contribution of each covariate is additive is analogous to the assumption in linear regression that each component is estimated separately.
Recall that the linear regression model is
$$ y_i = \alpha + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i $$
where the $\beta_j$ represent linear effects.
For the additive model we model Y as an additive combination of arbitrary functions of the Xs:
$$ y_i = \alpha + \sum_{j=1}^{k} f_j(x_{ij}) + \varepsilon_i $$
The $f_j$ represent arbitrary trends that can be estimated by lowess or smoothing splines.

Additive Regression Models (3)
Now comes the question: How do we find these arbitrary trends?
If the Xs were completely independent (which will not be the case), we could simply estimate each functional form using a nonparametric regression of Y on each of the Xs separately.
Similarly, in linear regression, when the Xs are completely uncorrelated the partial regression slopes are identical to the marginal regression slopes.
Since the Xs are related, however, we need to proceed in another way, in effect removing the effects of the other predictors, which are unknown before we begin.
We use a procedure called backfitting to find each curve, controlling for the effects of the others.

Estimation and Backfitting
Suppose that we have a two-predictor additive model:
$$ y_i = \alpha + f_1(x_{i1}) + f_2(x_{i2}) + \varepsilon_i $$
If we (unrealistically) knew the partial regression function $f_2$ but not $f_1$, we could rearrange the equation in order to solve for $f_1$:
$$ y_i - f_2(x_{i2}) = \alpha + f_1(x_{i1}) + \varepsilon_i $$
In other words, smoothing $y_i - f_2(x_{i2})$ against $x_{i1}$ produces an estimate of $\alpha + f_1(x_{i1})$.
Simply put, knowing one function allows us to find the other. In the real world, however, we don't know either, so we must proceed initially with preliminary estimates.
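The idea that knowing one partial function lets us recover the other can be checked on simulated data. The sketch below is only an illustration: the functions `f1` and `f2`, the constant `alpha`, the noise level and the use of `smooth.spline` as the smoother are all assumptions made for the example, not part of the lecture.

```r
# Simulated two-predictor additive data (all names and functions are illustrative).
set.seed(1)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
f1 <- function(x) sin(x)              # the "unknown" partial function to recover
f2 <- function(x) 0.05 * (x - 5)^2    # the partial function treated as known
alpha <- 2
y  <- alpha + f1(x1) + f2(x2) + rnorm(n, sd = 0.3)

# Smooth y - f2(x2) against x1: the smooth estimates alpha + f1(x1).
fit <- smooth.spline(x1, y - f2(x2))
est <- predict(fit, x1)$y

# Agreement between the estimate and the true alpha + f1(x1).
cor(est, alpha + f1(x1))
```

In practice neither function is known, which is why backfitting starts from preliminary estimates and iterates.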

Estimation and Backfitting (2)
1. Start by expressing the variables in mean-deviation form so that the partial regressions sum to zero, thus eliminating the individual intercepts.
2. Take preliminary estimates of each function from a least-squares regression of Y on the Xs.
3. The preliminary estimates are used as step (0) in an iterative estimation process.
4. Find the partial residuals for $X_1$. Recall that partial residuals remove Y from its linear relationship to $X_2$ while retaining the relationship between Y and $X_1$.

Estimation and Backfitting (3)
The partial residuals for $X_1$ are then
$$ e_{i[1]}^{(1)} = y_i - \bar{y} - b_2 (x_{i2} - \bar{x}_2) $$
where $b_2$ is the least-squares slope for $X_2$ from step 2.
5. The same procedure as in step 4 is done for $X_2$.
6. Next we smooth these partial residuals against their respective Xs, providing a new estimate of $f$:
$$ \hat{f}_j^{(1)} = S_j e_{[j]}^{(1)} $$
where $S_j$ is the (n × n) smoother transformation matrix for $X_j$, which depends only on the configuration of the $x_{ij}$ for the jth predictor.

Estimation and Backfitting (4)
Either loess or smoothing splines can be used to find the regression curves.
If local polynomial regression is used, a decision must be made about the span.
If a smoothing spline is used, the degrees of freedom can be specified ahead of time or chosen by cross-validation, with the goal of minimizing the penalized residual sum of squares:
$$ SS^*(\lambda) = \sum_{i=1}^{n} [y_i - f(x_i)]^2 + \lambda \int [f''(x)]^2 \, dx $$
Recall that the first term measures the closeness to the data; the second term penalizes curvature in the function.

Estimation and Backfitting (5)
This process of finding new estimates of the functions by smoothing the partial residuals is repeated until the partial functions converge.
That is, when the estimates of the smooth functions stabilize from one iteration to the next, we stop.
When this process is done, we obtain estimates of $s_j(X_{ij})$ for every value of $X_j$.
More importantly, we will have reduced a multiple regression to a series of two-dimensional partial regression problems, making interpretation easy:
Since each partial regression is only two-dimensional, the functional forms can be plotted on two-dimensional plots showing the partial effects of each $X_j$ on Y.
In other words, perspective plots are no longer necessary unless we include an interaction between two smoother terms.
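The backfitting steps described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the algorithm as implemented in any package: the data are simulated, the smoother is `smooth.spline` with a fixed `df = 8`, the helper `smooth_centered` is hypothetical, and each function is updated using the most recent estimate of the other (a common variant of the procedure).

```r
# Backfitting sketch for y = alpha + f1(x1) + f2(x2) + e (simulated data).
set.seed(2)
n  <- 300
x1 <- runif(n, 0, 10)
x2 <- 0.5 * x1 + runif(n, 0, 5)            # correlated predictors
y  <- 3 + sin(x1) + 0.1 * (x2 - 5)^2 + rnorm(n, sd = 0.3)

# Hypothetical helper: smooth r against x, then center the result at zero.
smooth_centered <- function(x, r, df = 8) {
  s <- predict(smooth.spline(x, r, df = df), x)$y
  s - mean(s)                              # each partial function sums to zero
}

alpha <- mean(y)                           # mean-deviation form: constant is mean(y)
b  <- coef(lm(y ~ x1 + x2))                # step (0): least-squares starting values
f1 <- unname(b["x1"]) * (x1 - mean(x1))
f2 <- unname(b["x2"]) * (x2 - mean(x2))

for (iter in 1:50) {
  f1_old <- f1
  f2_old <- f2
  f1 <- smooth_centered(x1, y - alpha - f2)   # smooth partial residuals for x1
  f2 <- smooth_centered(x2, y - alpha - f1)   # smooth partial residuals for x2
  change <- max(abs(f1 - f1_old), abs(f2 - f2_old))
  if (change < 1e-6) break                    # stop when the functions stabilize
}

fitted_y <- alpha + f1 + f2
```

Plotting `f1` against `x1` and `f2` against `x2` gives the two partial-regression curves.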

GAMs in R
There are two excellent packages for fitting generalized additive models in R.
The gam (for generalized additive model) function in the mgcv package (multiple smoothing parameter estimation with generalized cross-validation) fits generalized additive models using smoothing splines. The smoothing parameter can be chosen automatically using cross-validation or manually by specifying the degrees of freedom.
The gam function in the gam package allows either lowess (lo(x)) or smoothing splines (s(x)) to be specified.
The anova function can be used for both functions, allowing different models to be easily compared.

Additive Regression Models in R: Example: Canadian prestige data
Here we use the Canadian Prestige data to fit an additive model for prestige regressed on income and education.
For this example I use the gam function in the mgcv package.
The formula takes the same form as for the glm function, except that now we have the option of having both parametric terms and smoothed estimates.
The R-script specifying a smooth trend for both income and education is as follows:
[R script not preserved]

Additive Regression Models in R: Example: Canadian prestige data (2)
The summary function returns tests for each smooth, the degrees of freedom for each smooth, and an adjusted R-squared for the model.
The deviance can be obtained with the deviance(model) command.

Additive Regression Models in R: Example: Canadian prestige data (3)
Again, as with other nonparametric models, we have no slope parameters to investigate (we do have an intercept, however).
A plot of the regression surface is necessary.
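A call of the kind described above might look like the following. This is a hedged sketch rather than the script from the lecture: it assumes the Prestige data are taken from the carData package (with columns `prestige`, `income` and `education`), and the object name `mod.gam` is arbitrary.

```r
library(mgcv)                               # provides gam() and s()
data(Prestige, package = "carData")         # assumption: carData is installed

# Smooth trends for both income and education; smoothing parameters chosen by GCV.
mod.gam <- gam(prestige ~ s(income) + s(education), data = Prestige)

summary(mod.gam)     # tests and effective df for each smooth, adjusted R-squared
deviance(mod.gam)    # residual deviance
```

A parametric (linear) term is specified by entering the variable without `s()`, exactly as in `lm` or `glm`.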

Additive Regression Models in R: Example: Canadian prestige data (4)
Additive Model: We can see the nonlinear relationship of both education and income with prestige, but there is no interaction between them, i.e., the slope for income is the same at every value of education.
We can compare this model to the general nonparametric regression model.
[Figure: perspective plot of the additive model fit; axes: Income, Education, Prestige]

Additive Regression Models in R: Example: Canadian prestige data (5)
General Nonparametric Model: This model is quite similar to the additive model, but there are some nuances, particularly in the midrange of income, that are not picked up by the additive model because the Xs do not interact.
[Figure: perspective plot of the general nonparametric fit; axes: Income, Education, Prestige]

Additive Regression Models in R: Example: Canadian prestige data (6)
Perspective plots can also be made automatically using the persp.gam function.
These graphs include a 95% confidence region.
[Figure: perspective plot with confidence region; axes: income, education]

Additive Regression Models in R: Example: Canadian prestige data (7)
Since the slices of the additive regression in the direction of one predictor (holding the other constant) are parallel, we can graph each partial-regression function separately.
This is the benefit of the additive model: we can graph as many plots as there are variables, allowing us to easily visualize the relationships.
In other words, a multidimensional regression has been reduced to a series of two-dimensional partial-regression plots.
To get these in R (the red/green lines are +/- 2 standard errors):
[R script not preserved]
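The perspective and partial-regression plots discussed above can be produced along these lines. This is a sketch, not the lecture's script: it assumes the Prestige data from carData, and it uses `vis.gam` and the `plot` method from mgcv, which are the functions I know for this purpose in current versions of the package (whether they match the `persp.gam` output shown on the slide is an assumption).

```r
library(mgcv)
data(Prestige, package = "carData")         # assumption: carData is installed
mod.gam <- gam(prestige ~ s(income) + s(education), data = Prestige)

# Perspective plot of the fitted additive surface.
vis.gam(mod.gam, view = c("income", "education"),
        theta = 35, phi = 20, ticktype = "detailed")

# One partial-regression plot per smooth term, with standard-error bands.
plot(mod.gam, pages = 1, se = TRUE)
```

The viewing angles `theta` and `phi` are arbitrary choices; adjusting them can make the shape of the surface easier to read.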

Additive Regression Models in R: Example: Canadian prestige data (8)
[Figure: partial-regression plots of s(income, 3.12) against income and s(education, 3.18) against education]

Interpreting the Effects
A plot of $X_j$ versus $s_j(X_j)$ shows the relationship between $X_j$ and Y holding constant the other variables in the model.
Since Y is expressed in mean-deviation form, the smooth term $s_j(X_j)$ is also centered, and thus each plot represents how Y changes relative to its mean with changes in X.
Interpreting the scale of the graphs then becomes easy:
The value of 0 on the Y-axis is the mean of Y.
As the line moves away from 0 in a negative direction we subtract the distance from the mean when determining the fitted value; as it moves in a positive direction we add the distance.
For example, if the mean is 45, and for a particular X-value (say x = 15) the curve is at $s_j(X_j) = 4$, the fitted value of Y controlling for all other explanatory variables is 45 + 4 = 49.
If there are several nonparametric relationships, we can add together the effects on the graphs for any particular observation to find its fitted value of Y.

Interpreting the Effects (2)
R-script for previous slide: [R script not preserved]
[Figure: partial-regression plots marking Income = 10000, where s(income, 3.08) = 6, and Education = 10, where s(education, 3) = -5]
The mean of prestige is 47.3. Therefore the fitted value for an occupation with an average income of $10,000/year and 10 years of education on average is 47.3 + 6 - 5 = 48.3.
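The arithmetic above (constant plus the heights of the partial curves) can be reproduced from a fitted model. The sketch below assumes the Prestige data from carData and a default mgcv fit, so the individual contributions it prints need not match the rounded values read off the slide; the data frame `new_occ` is a hypothetical occupation.

```r
library(mgcv)
data(Prestige, package = "carData")         # assumption: carData is installed
mod.gam <- gam(prestige ~ s(income) + s(education), data = Prestige)

new_occ <- data.frame(income = 10000, education = 10)   # hypothetical occupation
contrib <- predict(mod.gam, newdata = new_occ, type = "terms")
const   <- attr(contrib, "constant")        # the model intercept

contrib                                     # centered contributions of each smooth
by_hand <- as.numeric(const + sum(contrib)) # constant plus the partial effects
direct  <- as.numeric(predict(mod.gam, newdata = new_occ))
by_hand - direct                            # essentially zero
```

Because the smooths are centered, the constant plays the role of the mean of prestige in the hand calculation.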

Residual Sum of Squares
As was the case for smoothing splines and lowess smooths, statistical inference and hypothesis testing are based on the residual sum of squares (or deviance in the case of generalized additive models) and the degrees of freedom.
The RSS for an additive model is easily defined in the usual manner:
$$ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
The approximate degrees of freedom, however, need to be adjusted from the regular nonparametric case because we are no longer specifying a jointly conditional functional form.

Degrees of Freedom
Recall that for nonparametric regression, the approximate degrees of freedom are equal to the trace of the smoother matrix (the matrix that projects Y onto Y-hat).
We extend this to the additive model:
$$ df_j = \mathrm{trace}(S_j) - 1 $$
1 is subtracted from each df, reflecting the constraint that each partial regression function sums to zero (the individual intercepts have been removed).
Parametric terms entered in the model each occupy a single degree of freedom, as in the linear regression case.
The individual degrees of freedom are then combined into a single measure:
$$ df = \sum_{j=1}^{k} df_j + 1 $$
1 is added to the final degrees of freedom to account for the overall constant in the model.

Specifying Degrees of Freedom
As was the case for smoothing splines, the degrees of freedom, or alternatively the smoothing parameter λ, can be specified by the researcher.
Also like smoothing splines, however, generalized cross-validation can be used to specify the degrees of freedom.
Recall that this finds the smoothing parameter that gives the lowest average mean squared error from the cross-validation samples.
Cross-validation is implemented in the mgcv package in R.

Cautions about Statistical Tests when the λ are chosen using GCV
If the smoothing parameters λ (or equivalently, the degrees of freedom) are chosen using generalized cross-validation (GCV), caution must be used when carrying out an analysis of deviance.
If a variable is added or removed from the model, the smoothing parameter λ that yields the smallest mean squared error will also change.
By implication, the degrees of freedom also change, implying that the equivalent number of parameters used for the model is different.
In other words, the test will only be approximate because the otherwise nested models have different degrees of freedom associated with λ.
As a result, it is advisable to fix the degrees of freedom when comparing models.
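The difference between degrees of freedom chosen by GCV and degrees of freedom fixed in advance can be illustrated as follows. This sketch assumes the Prestige data from carData; the choice of `k = 4` is arbitrary and made only for the example. In mgcv, `fx = TRUE` requests an unpenalized regression spline, so each smooth uses exactly `k - 1` degrees of freedom.

```r
library(mgcv)
data(Prestige, package = "carData")         # assumption: carData is installed

# Smoothing parameters chosen by GCV: effective df are estimated from the data.
mod.gcv <- gam(prestige ~ s(income) + s(education), data = Prestige)
summary(mod.gcv)$edf      # effective df of each smooth
sum(mod.gcv$edf)          # total df, including 1 for the constant

# Fixed df: with fx = TRUE each smooth uses exactly k - 1 = 3 df.
mod.fix <- gam(prestige ~ s(income, k = 4, fx = TRUE) +
                 s(education, k = 4, fx = TRUE), data = Prestige)
summary(mod.fix)$edf      # 3 and 3
sum(mod.fix$edf)          # 3 + 3 + 1 = 7
```

Fixing the degrees of freedom in this way keeps them unchanged when other terms are added to or dropped from the model, which is the situation the caution above recommends for model comparisons.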

Testing for Linearity
We can compare the linear model of prestige regressed on income and education with the additive model by carrying out an incremental F-test.
[R output not preserved]
The difference between the models is highly statistically significant: the additive model describes the relationship between prestige and education and income much better.

Diagnostic Plots
The gam.check function returns four diagnostic plots:
1. A quantile-comparison plot of the residuals allows us to look for outliers and heavy tails
2. Residuals versus the linear predictor (simply the fitted values for a continuous response) helps detect nonconstant error variance
3. A histogram of the residuals is good for detecting non-normality
4. Response versus fitted values

Diagnostic Plots (2)
[Figure: gam.check output with four panels: Normal Q-Q Plot (sample vs. theoretical quantiles), Resids vs. linear pred., Histogram of residuals, Response vs. Fitted Values]

Interactions between Smoothed Terms
The gam function in the mgcv package allows you to specify an interaction term between two or more terms.
In the case of an interaction between two terms, and when no other variables are included in the model, we essentially have a multiple nonparametric regression.
Once again we need to graph the relationship in a perspective plot.
While it is possible to fit a higher-order interaction, once we get past the two-way interaction the graph can no longer be interpreted.
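The linearity test, the diagnostic plots and the interaction between smooths described above might be obtained as follows. This is a sketch under assumptions: the Prestige data come from carData, both models are fitted with `gam` so that `anova` compares like with like, and the bivariate smooth `s(income, education)` is one way (not necessarily the lecture's) of specifying the interaction.

```r
library(mgcv)
data(Prestige, package = "carData")         # assumption: carData is installed

mod.lin <- gam(prestige ~ income + education, data = Prestige)         # linear model
mod.add <- gam(prestige ~ s(income) + s(education), data = Prestige)   # additive model
anova(mod.lin, mod.add, test = "F")         # incremental F-test for nonlinearity

gam.check(mod.add)                          # four residual diagnostic plots

# Two-way interaction between smooths: a bivariate smooth of income and education.
mod.int <- gam(prestige ~ s(income, education), data = Prestige)
vis.gam(mod.int, theta = 35, phi = 20)      # perspective plot of the fitted surface
```

Since income and education are measured on very different scales, a tensor-product smooth (`te(income, education)` in mgcv) may be a more suitable choice for the interaction; that is a design decision the lecture does not address.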

Semi-Parametric Models
The generality of the additive model makes it very attractive when there is complicated nonlinearity in the multivariate case.
Nonetheless, the flexibility of the smooth fit comes at the expense of precision and statistical power.
As a result, if a linear trend can be fit, it should be preferred.
This leads us to the semi-parametric model, which allows a mixture of linear and nonparametric components.
The same backfitting procedure that is used for the general additive model is used in fitting this semi-parametric model.

Semi-Parametric Models (2)
Semi-parametric models also make it possible to add categorical variables.
They enter the model in exactly the same way as for linear regression: as a set of dummy regressors.
As said earlier, the gam function in the mgcv package also allows you to specify interaction terms.
Any interactions that can be done in a linear model can also be included in an additive model.
[Model equations not preserved]
The last of these specifies an interaction between a categorical variable $X_2$ (represented by two dummy regressors, $D_1$ and $D_2$) and a quantitative variable $X_1$ for which a smooth trend is specified.
This fits a separate curve for each category.

Interaction for GAMs in R
[R script not preserved]

Interaction for GAMs in R (2)
[Figure: partial-regression plots of s(income, 1) against income, one panel each for Blue Collar, Professional and White Collar occupations]
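A semi-parametric model and a smooth-by-category interaction of the kind described above could be specified in mgcv as follows. This is a hedged sketch: it assumes the Prestige data from carData (where `type` is a three-level factor with a few missing values), and the particular mix of smooth and linear terms is chosen only for illustration.

```r
library(mgcv)
data(Prestige, package = "carData")         # assumption: carData is installed
Prest <- na.omit(Prestige)                  # type is missing for a few occupations

# Semi-parametric model: smooth income, linear education, dummy regressors for type.
mod.semi <- gam(prestige ~ s(income) + education + type, data = Prest)
summary(mod.semi)

# Interaction: a separate smooth of income for each category of type.
mod.by <- gam(prestige ~ type + s(income, by = type) + education, data = Prest)
plot(mod.by, pages = 1)                     # one partial plot per category
```

The main effect of `type` is included alongside `s(income, by = type)` because each by-category smooth is centered and so does not carry the difference in level between categories.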

Concurvity
Concurvity is the generalized additive model analogue to collinearity in linear models.
Two possible problems can arise:
1. A point or group of points that are common outliers in two or more Xs could cause wild tail behavior
2. If two Xs are too highly correlated, backfitting may be unable to find a unique curve. In these cases, the initial linear coefficient will be all that is returned
The graph on the previous page is a good example. Here type and income are too closely related (i.e., professional jobs are high paying, blue collar jobs pay less, and thus we find only linear fits where the lines cross).
As is the case with collinearity, there is no solution to concurvity other than reformulating the research question.

Example 2: Inequality data revisited
Recall that earlier in the course we saw that there was an apparent interaction between gini and democracy in their effects on attitudes towards pay inequality.
Thus far we haven't put much effort into determining the functional form of these effects.
We now do so using a semi-parametric model that fits a smooth term for gini that interacts with a dummy regressor for democracy.
In other words, we fit two separate curves: one for democracies and another for non-democracies.

Example 2: Inequality data revisited (2)
[R script not preserved]

Example 2: Inequality data revisited (3)
[Figure: partial-regression plots of s(gini, 5.05) for democracies and s(gini, 8.77) for non-democracies, each against gini]
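One way to gauge how closely the predictors in an additive model are related is sketched below. It assumes the Prestige data from carData and uses the `concurvity` function in mgcv, which is not mentioned in the lecture; it is offered only as a possible diagnostic.

```r
library(mgcv)
data(Prestige, package = "carData")         # assumption: carData is installed
mod.add <- gam(prestige ~ s(income) + s(education), data = Prestige)

cor(Prestige$income, Prestige$education)    # linear association between the predictors
conc <- concurvity(mod.add, full = TRUE)    # indices near 1 signal strong concurvity
conc
```

Each column refers to one model term and each row to a different version of the index, all scaled so that 0 indicates no problem and 1 a complete lack of identifiability.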

Example 2: Inequality data revisited (4)
We now proceed to test whether the additive model does significantly better than the linear model.
[R output not preserved]
We conclude that the additive model is no more informative.

Cautions about Interpretation
When we fit a linear model we don't believe the model to be correct, but rather that it is a good approximation.
The same goes for additive models, but we hope that they are better approximations.
Having said that, all the same pitfalls possible for linear models are magnified with additive models.
Most importantly, we must be careful not to overinterpret fitted curves.
An examination of standard error bands, an analysis of deviance, and residual plots can help determine whether fitted curves are important.
We can also select and delete variables in a stepwise manner (taking out and then re-adding insignificant terms) to ensure that only important terms remain in the final model.
We do not want unimportant terms influencing otherwise important effects.

Generalized Additive Mixed Models
The gamm function in the mgcv package calls on the lme function in the nlme package to fit generalized additive mixed models.
Recall that earlier we used the British Election Study data to explore how income affected left-right attitudes.
Since observations were clustered within constituencies, we used a mixed model to take this clustering into account, specifying a random intercept.
Assume now that we have reason to believe that income had a nonlinear effect on attitudes. We could test this hypothesis by specifying a smooth for the income effect using the gamm function, and comparing this model with another that specifies a simpler linear trend.

Generalized Additive Mixed Models (2)
[R script not preserved]
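A random-intercept model with a smooth income effect could be set up along the following lines. The British Election Study data are not available here, so this sketch uses simulated data: the variable names `lr`, `income` and `const`, the number of clusters and the (linear) true income effect are all invented for the illustration.

```r
library(mgcv)      # gamm() calls lme() from nlme internally

# Simulated stand-in for clustered survey data (not the British Election Study).
set.seed(3)
n_grp  <- 40
const  <- factor(rep(seq_len(n_grp), each = 25))       # hypothetical constituency id
income <- rnorm(length(const))
u      <- rnorm(n_grp, sd = 0.5)[as.integer(const)]    # random intercepts
lr     <- 0.3 * income + u + rnorm(length(const))      # truly linear income effect
dat    <- data.frame(lr, income, const)

# Random intercept for constituency plus a smooth for income.
mod.smooth <- gamm(lr ~ s(income), random = list(const = ~1), data = dat)

summary(mod.smooth$gam)    # effective df of the income smooth
plot(mod.smooth$gam)       # partial plot of the income effect
summary(mod.smooth$lme)    # variance components from the underlying lme fit
```

An effective df close to 1 for `s(income)` points towards a linear specification, which can then be checked formally by comparing against a model with a linear income term, as the text describes.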

Generalized Additive Mixed Models (3)
[Figure: partial-regression plot of s(income2, 1) against INCOME2]
The plot of the income effect indicates that a linear specification is probably best, so we proceed to a formal test using an analysis of deviance.

Generalized Additive Mixed Models (4)
We can also test for linearity using the anova function to compare the fit of a model specifying a smooth term to a model specifying a linear trend.
[R output not preserved]
We see here that the difference between the models is not statistically significant, suggesting that the linear specification is best.

Missing Data
Missing data can be a problem for any type of model.
It is only seriously problematic, however, if the missing cases have a systematic relationship to the response or the Xs.
If the data are missing at random (i.e., the pattern of missingness is not a function of the response), we are less worried about them. If they are not, however, the problem is even more serious for generalized additive models than for linear regression.
The backfitting algorithm omits all missing observations, and thus their fitted values are set to 0 when the partial residuals are smoothed against the predictor.
Since the fitted curves have a mean of 0, this amounts to assigning the average fitted value to the missing observations.
In other words, it is the same as using mean imputation in linear models, thus resulting in biased estimates.

Problem of Mean Imputation (1)
The following example randomly generates 100 observations from the normal distribution, N(20,2), so that x and y are perfectly correlated.
The example shows what happens if 20% of the data are randomly removed (and thus are missing completely at random) and mean imputation is used in a regression model.

Problem of Mean Imputation (2): Missing on x
I now remove values of x for 20 observations, replacing them with the mean of x.
[R script not preserved]

Problem of Mean Imputation (3): Missing on x
[Figure: density of x for all cases and after mean imputation; N = 100, Bandwidth = 0.3697]

Problem of Mean Imputation (4): Missing on x
The mean imputation does not affect the slope, but it has pulled the intercept downwards.
More importantly, because there is less variation in x, the standard errors will be larger.
[Figure: scatterplot of yall against xmean, titled "20% Mean imputation (x)"]

Problem of Mean Imputation (5): Missing on y
We now randomly replace 20 y-values with the mean of y but retain all values of x.
The mean imputation affects the slope and the intercept, resulting in biased estimates.
[Figure: scatterplot of ymean against xall, titled "20% Mean imputation (y)"]
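The two imputation experiments above can be sketched in base R. This is not the original script: the seed is arbitrary, the 2 in N(20,2) is assumed to be the standard deviation, y is simply set equal to x so that the two are perfectly correlated, and missing values are replaced by the mean of the observed cases.

```r
set.seed(4)
x <- rnorm(100, mean = 20, sd = 2)     # assumption: the 2 in N(20, 2) is the sd
y <- x                                 # x and y perfectly correlated

miss <- sample(100, 20)                # 20% missing completely at random

# Missing on x: replace the missing x-values with the mean of the observed x.
x_imp <- x
x_imp[miss] <- mean(x[-miss])
fit_x <- lm(y ~ x_imp)
coef(fit_x)                            # slope is unchanged at 1

# Missing on y: replace the missing y-values with the mean of the observed y.
y_imp <- y
y_imp[miss] <- mean(y[-miss])
fit_y <- lm(y_imp ~ x)
coef(fit_y)                            # slope is pulled below 1
```

Comparing `summary(fit_x)` with `summary(lm(y ~ x))` also shows how the standard errors change once 20 x-values are collapsed onto the mean.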

Summary and Conclusions
Additive models give a compromise between the ease of interpretation of the linear model and the flexibility of the general nonparametric model.
Complicated nonlinearity problems can be easily accommodated, even for models with many independent variables.
The effects in the model represent partial effects in the same way as coefficients in linear models.
These models should be seen as important models in their own right, but they can also play an important role in diagnosing nonlinearity even if the final model chosen is a regular linear model.

Summary and Conclusions (2)
These models are extremely flexible in that both nonparametric and parametric trends can be specified.
Moreover, even interactions between explanatory variables are possible.
Caution: Since GAMs effectively use mean imputation for missing data (rather than list-wise deletion as in linear models), we must be especially careful to deal appropriately with missing data before fitting the model. Mean imputation can result in biased estimates.
Finally, as we shall see tomorrow, these models can be extended to accommodate limited dependent variables in the same way that generalized linear models extend the general linear model.