Soft Threshold Estimation for Varying-coefficient Models

Artur Klinger, Universität München

ABSTRACT: An alternative penalized likelihood estimator for varying-coefficient regression in generalized linear models is proposed. The estimator leads to a sparse representation of the results as sums of a few basis functions. In the special case of function estimation, it reduces to the soft threshold estimator widely used by the wavelet community. By using appropriate sets of basis functions, varying coefficients characterizing bumps or periodic functions can also be modelled within the same framework.

KEYWORDS: Generalized additive models; Shrinkage estimation; Splines; Varying coefficients; Wavelets

1 Introduction

In many applications the parametric form of common generalized linear models is too restrictive. One way to obtain more flexibility is to let the coefficients be functions $\beta_j(x_j)$ varying over other (metrical) covariates $x_j$. Varying-coefficient models of this general form were introduced by Hastie and Tibshirani (1993). Extending the predictor of generalized linear models to

$$\eta = \beta_0(x_0) + \beta_1(x_1) z_1 + \dots + \beta_p(x_p) z_p, \qquad (1.1)$$

they are a valuable tool for exploring interactions between coded categorical covariates $z_j$ and their effect-modifiers $x_j$. Semiparametric models, where $x_1 \equiv \dots \equiv x_p \equiv 1$, generalized linear models for time series data ($x_0 = \dots = x_p = t$), and generalized additive models ($z_1 \equiv \dots \equiv z_p \equiv 1$) are important special cases of (1.1).

Usually, estimation of the varying coefficients is carried out by penalized likelihood estimation, leading to smoothing splines, or by maximizing local likelihoods. These approaches are oriented towards linear or polynomial regression, and hence they are not always appropriate for modelling baseline functions or intercepts, collecting unobserved variables, and seasonal components.
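To fix ideas, the following minimal Python sketch (illustrative only; the coefficient functions, sample size, and data are invented, not taken from the paper) builds the varying-coefficient predictor (1.1) for a logit model with one varying baseline and one modified effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Effect-modifiers x_j and a coded binary covariate z_1 (all invented data).
x0 = rng.uniform(0, 1, n)   # modifies the baseline beta_0
x1 = rng.uniform(0, 1, n)   # modifies the effect of z_1
z1 = rng.integers(0, 2, n)  # coded categorical covariate

# Hypothetical varying coefficients beta_j(x_j).
def beta0(x):
    return 2 - (5 * x - 2.5) ** 2

def beta1(x):
    return np.sin(2 * np.pi * x)

# Predictor (1.1): eta = beta_0(x_0) + beta_1(x_1) * z_1.
eta = beta0(x0) + beta1(x1) * z1

# For a logit model, responses are binomial with success probability pi(eta).
pi = 1 / (1 + np.exp(-eta))
y = rng.binomial(5, pi)
```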
We propose an alternative penalized likelihood estimator motivated by soft thresholding of wavelet coefficients. To review the basic idea, consider the problem of function estimation in a model $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$, and let $P$ be an orthogonal matrix whose columns are vectors of point evaluations of certain basis functions (e.g. wavelets). These functions are assumed to be smooth and to describe characteristics of the underlying systematic part of $f$. With $\tilde c = P'y$, $\tilde c = (\tilde c_1, \dots, \tilde c_n)'$, $y = (y_1, \dots, y_n)'$, we have independent $\tilde c_i \sim N(c_i, \sigma^2)$. The strategy of soft thresholding is to set all coefficients $\tilde c_i$ with absolute value smaller than a specified noise level $\lambda$ to zero. Since small $\tilde c_i$ correspond to smoother, more desirable functions $f$, the remaining large coefficients are also shrunken towards zero by taking the noise level $\lambda$ off. Formally, the soft threshold strategy is described by

$$\hat c_i = \operatorname{sgn}(\tilde c_i)\,\max(0, |\tilde c_i| - \lambda) = \operatorname{sgn}(\tilde c_i)(|\tilde c_i| - \lambda)_+, \qquad (1.2)$$

and the estimator is $\hat f = P\hat c$, $\hat f = (\hat f(x_1), \dots, \hat f(x_n))'$. Asymptotic optimality results for soft thresholding (1.2) with wavelet basis functions in general function spaces were derived by Donoho and Johnstone (1994) and Donoho, Johnstone, Kerkyacharian and Picard (1995). For orthogonal design, the estimator (1.2) corresponds to a minimum of the absolute penalized least squares criterion

$$\hat c = \arg\min_c\; (y - Pc)'(y - Pc) + \lambda \sum_i |c_i|. \qquad (1.3)$$

For non-orthogonal design matrices $X$, Tibshirani (1996) proposes the restricted least squares estimator

$$\text{minimize } (y - \beta_0 - X\beta)'(y - \beta_0 - X\beta) \quad \text{subject to } \sum_j |\beta_j| \le \gamma, \quad \gamma > 0,$$

in the context of variable selection and shrinkage by the LASSO (least absolute shrinkage and selection operator). Introducing Lagrange multipliers, this leads to the proposed generalized soft threshold estimator defined by

$$\hat\beta = \arg\max_\beta\; l(\beta) - \lambda \sum_j |\beta_j|, \qquad \lambda > 0.$$

These two investigations make one of the main features of soft thresholding or absolute penalties transparent: only a few basis functions, describing characteristics of the unknown functions, are included in the estimator $\hat\beta_j(x_j)$. Results become easier to interpret and analyze. The sparse representation of the estimator is of particular value in applications for the following reasons:

- Results are characterized as features of the varying coefficients, such as maxima, minima or frequency. Soft thresholding directly describes the results as sums of functions with specific characteristics.
- Varying coefficients may be highly correlated or "concurvous" (Buja, Hastie and Tibshirani, 1989). Detection and analysis of this correlation by a parametric approach based on soft thresholding prevents possible misinterpretation.
- Model checking and diagnosis can be performed using only the few coefficients detected by soft thresholding.
- By using locally supported basis functions, soft thresholding adapts well to functions non-homogeneous in smoothness.
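Before turning to the varying-coefficient case, here is a small numerical sketch of the building block (1.2)-(1.3). It is an assumption-laden illustration: an orthonormal cosine basis stands in for the wavelet basis, the noise level is treated as known, and the factor 2 relating the threshold in (1.2) to the penalty weight in (1.3) is made explicit in the code:

```python
import numpy as np

def soft_threshold(c, lam):
    """Soft threshold rule (1.2): sgn(c) * (|c| - lam)_+ ."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

rng = np.random.default_rng(1)
n = 128
x = (np.arange(n) + 0.5) / n

# Orthonormal design P: discrete cosine basis (a stand-in for wavelets).
k = np.arange(n)
P = np.sqrt(2.0 / n) * np.cos(np.pi * np.outer(x, k))
P[:, 0] /= np.sqrt(2.0)

f = 2 - (5 * x - 2.5) ** 2              # true function
sigma = 0.5
y = f + rng.normal(0, sigma, n)         # noisy observations

lam = sigma * np.sqrt(2 * np.log(n))    # a universal-type noise level
c_tilde = P.T @ y                       # empirical coefficients
c_hat = soft_threshold(c_tilde, lam)    # (1.2): most coefficients become zero
f_hat = P @ c_hat                       # sparse reconstruction of f

# For orthogonal P, c_hat minimizes the criterion (1.3); note the factor 2
# relating the threshold in (1.2) to the penalty weight.
def objective(c):
    return np.sum((y - P @ c) ** 2) + 2 * lam * np.sum(np.abs(c))

assert objective(c_hat) <= objective(c_hat + rng.normal(0, 1e-3, n))
```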
2 Generalized Soft Thresholding for Varying-coefficient Models

Introducing vectors of point evaluations $\beta_j = (\beta_j(x_{j1}), \dots, \beta_j(x_{j s_j}))'$, $x_{j1} < \dots < x_{j s_j}$, for the functions $\beta_j(x_j)$, $j = 1, \dots, p$, the predictor at the observed values of the effect-modifiers can be written as $\eta = Z\beta$, $\beta = (\beta_1', \dots, \beta_p')'$. Here $Z$ is a large, usually very sparse matrix built from the $z_1, \dots, z_p$. In the case of few effect-modifiers, sparsity usually holds also for $Z'Z$; otherwise the number of non-zeros depends on the actual design.

Generalized soft thresholding of the functions $\beta_1(x_1), \dots, \beta_p(x_p)$ is carried out by representing them as sums of (orthogonal) basis functions $\phi_{jk}(x_j)$, $k = 1, \dots, S_j$. This framework reduces the initial function estimation problem, via

$$\beta_j(x_{ju}) = \sum_k c_{jk}\,\phi_{jk}(x_{ju}),$$

to estimation of the basis coefficients $c_{jk}$. (Smoothness) restrictions on $\beta_j(x_j)$ lead to restrictions on the $c_{jk}$. These coefficients are linked to the dependent variable by a predictor $\eta = Z\Phi c$, $c = (c_{11}, \dots, c_{p S_p})'$, where $\Phi$ is an (orthogonal) matrix consisting of the $\phi_{jk}(x_{ju})$. The generalized soft threshold estimator is then defined as the absolute penalized likelihood estimator

$$\hat c = \arg\max_c\; l(\eta) - \sum_j \sum_k \lambda_{jk} |c_{jk}|, \qquad \lambda_{jk} \ge 0. \qquad (1.4)$$

Let $s_{jk}(\eta)$ denote the partial derivatives of the log-likelihood. One can show that the following first order conditions are necessary for a maximum of (1.4):

$$|s_{jk}(\hat\eta)| \le \lambda_{jk} \quad \text{if } \hat c_{jk} = 0,$$
$$s_{jk}(\hat\eta) = \lambda_{jk} \quad \text{if } \hat c_{jk} > 0, \qquad (1.5)$$
$$s_{jk}(\hat\eta) = -\lambda_{jk} \quad \text{if } \hat c_{jk} < 0.$$

These equations may be interpreted as follows: if a coefficient $\hat c_{jk}$ is set to zero, its score function $s_{jk}(\hat\eta)$ is smaller in absolute value than $\lambda_{jk}$. Hence the maximum likelihood estimate is also close to zero, or the likelihood is flat in this direction. Maximum likelihood estimation of this coefficient would not increase the likelihood more than inclusion of a covariate vector consisting of pure noise, and thus this coefficient is omitted. By adding the noise level $\lambda_{jk}$ to the score function, the nonzero coefficients $c_{jk}$ are shrunken towards more favorable values, leading to "smooth" $\hat\beta_j(x_j)$.
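The conditions (1.5) are straightforward to verify numerically for a fitted model. The sketch below assumes a binomial logit likelihood, so the score is $B'(y - m\pi)$ for the design matrix $B = Z\Phi$; the function names and tolerance are invented for illustration:

```python
import numpy as np

def score_logit(B, y, m, c):
    """Score s(c) = dl/dc of a binomial logit log-likelihood, eta = B @ c:
    s(c) = B' (y - m * pi(eta)), with y successes out of m trials."""
    pi = 1 / (1 + np.exp(-(B @ c)))
    return B.T @ (y - m * pi)

def check_first_order(B, y, m, c_hat, lam, tol=1e-6):
    """Verify the necessary conditions (1.5) at a candidate maximizer c_hat."""
    s = score_logit(B, y, m, c_hat)
    zero = c_hat == 0
    return (np.all(np.abs(s[zero]) <= lam[zero] + tol)
            and np.all(np.abs(s[c_hat > 0] - lam[c_hat > 0]) <= tol)
            and np.all(np.abs(s[c_hat < 0] + lam[c_hat < 0]) <= tol))
```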
Algorithms

To obtain a fast algorithm for estimation, we follow the proposal of Tishler and Zang (1982) and approximate the absolute penalty $|c_{jk}|$ by the differentiable function

$$h(c_{jk}, \varepsilon) = \begin{cases} -c_{jk}, & \text{if } c_{jk} \le -\varepsilon, \\ c_{jk}^2/(2\varepsilon) + \varepsilon/2, & \text{if } -\varepsilon < c_{jk} < \varepsilon, \\ c_{jk}, & \text{if } c_{jk} \ge \varepsilon. \end{cases} \qquad (1.6)$$

Computation is then done by a modified Gauss-Newton or Fisher scoring procedure:

Algorithm 1 ($Z\Phi$ of full rank): Do while any $|c^{(m+1)}_{jk} - c^{(m)}_{jk}| > \delta$:

1. Compute the vector $d^{(m)}$ with elements $d^{(m)}_{jk} = \operatorname{sgn}(c^{(m)}_{jk})\,I\{|c^{(m)}_{jk}| \ge \varepsilon\}$ and the diagonal matrix $D^{(m)} = \operatorname{diag}(I\{|c^{(m)}_{jk}| < \varepsilon\}/\varepsilon)$.
2. Compute the score vector $s(\eta^{(m)}) = \partial l(\eta^{(m)})/\partial c^{(m)}$ and the (expected) negative second derivative matrix $F(\eta^{(m)}) = -\partial^2 l(\eta^{(m)})/\partial c^{(m)}\,\partial c^{(m)\prime}$.
3. Solve the system $[F(\eta^{(m)}) + \Lambda D^{(m)}]\, c^{(m+1)} = F(\eta^{(m)})\, c^{(m)} + s(\eta^{(m)}) - \Lambda d^{(m)}$, with $\Lambda = \operatorname{diag}(\lambda_{jk})$, to obtain updated values $c^{(m+1)}$.
4. Trim steps crossing zero: if $c^{(m)}_{jk} \ne 0$ and $\operatorname{sgn}(c^{(m+1)}_{jk}) \ne \operatorname{sgn}(c^{(m)}_{jk})$, set $c^{(m+1)}_{jk} = 0$.

Trimming of coefficients in step 4 ensures that for a small termination criterion $\delta$, the coefficients $c_{jk}$ do not alternate around $(-\varepsilon, +\varepsilon)$. At convergence of Algorithm 1 we have

$$s_{jk}(\hat\eta) = \lambda_{jk} d_{jk} = \lambda_{jk}\operatorname{sgn}(\hat c_{jk}) \quad \text{if } |\hat c_{jk}| \ge \varepsilon,$$
$$s_{jk}(\hat\eta) = \lambda_{jk}\hat c_{jk}/\varepsilon, \quad |s_{jk}(\hat\eta)| < \lambda_{jk}, \quad \text{if } |\hat c_{jk}| < \varepsilon,$$

and the necessary conditions (1.5) for a maximum of the absolute penalized log-likelihood are fulfilled up to the termination criterion. The result is checked and improved in a further step by starting Algorithm 1 again with a basis matrix $\Phi_S$ consisting only of those basis functions whose coefficients exceeded $\varepsilon$ in absolute value in the first step.
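A compact sketch of Algorithm 1 for the logit case follows. It is a reconstruction under stated assumptions, not the author's code: the expected information is taken as $F = B'WB$ with $W = \operatorname{diag}\{m\pi(1-\pi)\}$, and the indicator in $d^{(m)}$ is inferred from the convergence conditions above:

```python
import numpy as np

def algorithm1(B, y, m, lam, eps=1e-4, delta=1e-8, max_iter=200):
    """Modified Fisher scoring for the penalized likelihood (1.4), with the
    absolute penalty smoothed as in (1.6). Logit case; a sketch only."""
    c = np.zeros(B.shape[1])
    for _ in range(max_iter):
        pi = 1 / (1 + np.exp(-(B @ c)))
        s = B.T @ (y - m * pi)                         # step 2: score vector
        F = B.T @ (B * (m * pi * (1 - pi))[:, None])   # step 2: expected information
        d = np.sign(c) * (np.abs(c) >= eps)            # step 1: gradient, outer branches
        D = np.where(np.abs(c) < eps, 1.0 / eps, 0.0)  # step 1: curvature, inner branch
        A = F + np.diag(lam * D)                       # step 3: penalized system
        c_new = np.linalg.solve(A, F @ c + s - lam * d)
        crossed = (c != 0) & (np.sign(c_new) != np.sign(c))
        c_new[crossed] = 0.0                           # step 4: trim zero crossings
        if np.max(np.abs(c_new - c)) <= delta:
            return c_new
        c = c_new
    return c
```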
In varying-coefficient models the number of possible basis functions is often very large, and $Z\Phi$ may not be of full rank. The following algorithm exploits the fact that only a small fraction of the coefficients $c_{jk}$ are estimated unequal to zero. To select the global threshold, it is convenient to compute the estimator for a sequence of threshold parameters $\lambda^{(0)} > \dots > \lambda^{(l)} > \dots > \lambda^{(L)}$. We start with the embedded model $\lambda^{(0)} = \infty$, characterized by the coefficients having $\lambda_{jk} = 0$. The embedded model contains at least an intercept term and the coefficients $c_{jk}$ which are not shrunken. For varying-coefficient models this usually corresponds to a common generalized linear model in the covariates $z_1, \dots, z_p$.

Algorithm 2:

1. Let $S$ be the set of all indexes $jk$ with $\lambda_{jk} = 0$ and let $l = 1$.
2. Estimate a generalized linear model using only the columns $(Z\Phi)_S$.
3. Select the threshold values $\lambda_{jk}$ based on this estimate.
4. Do while $l \le L$:
   (a) If there is $jk \notin S$ with $|s_{jk}(\eta)| > \lambda_{jk}\lambda^{(l)}$, then add the index $jk^{*} = \arg\max_{jk \notin S} |s_{jk}(\eta)|/\lambda_{jk}$ to $S$.
   (b) Estimate the coefficients $c_S$ by applying Algorithm 1 only to $(Z\Phi)_S$.
   (c) If $|s_{jk}(\eta)| \le \lambda_{jk}\lambda^{(l)}$ for all $jk \notin S$: keep the result $c^{(l)} = c$ as the estimate for $\lambda^{(l)}$ and set $l = l + 1$.

Algorithm 2 successively adds basis coefficients to the set of non-zero coefficients. When the score function $s_{jk}(\eta)$ for all zero coefficients is smaller than the threshold value, we have an estimate for $\lambda^{(l)}$, and the algorithm proceeds with the next smaller $\lambda^{(l+1)} < \lambda^{(l)}$.
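The active-set loop of Algorithm 2 might be sketched as follows, under the reading that the working threshold for coefficient $jk$ at stage $l$ is $\lambda_{jk}\lambda^{(l)}$; `algorithm1` refers to the previous sketch, and the interface is hypothetical:

```python
import numpy as np

def algorithm2(B, y, m, lam_rel, lam_seq):
    """Active-set estimation over a decreasing global threshold sequence lam_seq.
    lam_rel[j] is the coefficient-specific scaling from step 3; the embedded,
    unshrunken coefficients carry lam_rel[j] == 0 and start in the active set."""
    p = B.shape[1]
    S = list(np.flatnonzero(lam_rel == 0))     # step 1: embedded model
    c = np.zeros(p)
    results = []
    for lam_g in lam_seq:                      # step 4, over lambda^(l)
        while True:
            idx = np.array(S)
            c[:] = 0.0                         # step 4(b): refit the active set
            c[S] = algorithm1(B[:, idx], y, m, lam_g * lam_rel[idx])
            s = B.T @ (y - m / (1 + np.exp(-(B @ c))))
            out = np.setdiff1d(np.arange(p), idx)
            if out.size == 0:
                break
            ratio = np.abs(s[out]) / lam_rel[out]
            if ratio.max() <= lam_g:           # step 4(c): all scores below threshold
                break
            S.append(out[np.argmax(ratio)])    # step 4(a): add the worst violator
        results.append(c.copy())               # estimate for the current lambda^(l)
    return results
```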
3 Selecting the Thresholds

In contrast to common thresholding of wavelet coefficients, the variation of the score functions $s_{jk}(\eta)$ depends on the entries of the matrix $Z\Phi$ and on the actual predictor $\eta$. By choosing a different threshold value $\lambda_{jk}$ for each coefficient in step 3 of Algorithm 2, we take this fact into account. The thresholds $\lambda_{jk}$ are selected according to the variation of the score function under the embedded generalized linear model defined by $\lambda^{(0)} = \infty$. Let $\hat c_S$, $\hat\eta_S$ be estimates of the coefficients and predictor in the embedded model, and let $\hat\sigma(s_{jk}(\hat\eta_S))$ be an estimator of the variation of $s_{jk}(\hat\eta_S)$ based on this model. Thresholds are then chosen in step 3 according to $\lambda_{jk} = \hat\sigma(s_{jk}(\hat\eta_S))$.

3.1 Function estimation

Further considerations on the threshold values are needed for smooth varying coefficients. We outline threshold selection for smoothing in the following settings:

Smoothing splines

As in common penalized likelihood estimation, smooth spline functions can also be estimated with absolute penalties as described above. Let $\phi_{jk}(x_j)$ be the orthogonal spline basis functions as described by Demmler and Reinsch (1975), and let $\gamma_{jk} = \int \{\phi^{(m)}_{jk}(u)\}^2\,du$ with $\int \phi^{(m)}_{jk}(u)\,\phi^{(m)}_{jl}(u)\,du = 0$, $l \ne k$. The penalty for ordinary spline smoothing, $\int \{f^{(m)}_j(u)\}^2\,du$, corresponds to $\sum_k \gamma_{jk} c_{jk}^2$. For soft thresholding we use $\lambda_{jk} = \sqrt{\gamma_{jk}}\,\hat\sigma(s_{jk}(\hat\eta_S))$, which might be regarded as an estimate of the standard deviation of a score function targeting $\beta^{(m)}_j(x_j)$.

[Figure 1 appears here: four panels showing the coefficients for the true function, the true function, the absolute bias, and the variance; smoothing spline: solid, soft thresholding: dashed.]

FIGURE 1. Absolute bias and variance of spline smoothing and soft thresholding for the true function $\beta(x) = 2 - (5x - 2.5)^2$, computed from 100 simulations. Data are based on a logit model for a binomial $B(5, \pi(x_t))$ distribution. The $x_1, \dots, x_{100}$ were simulated according to a uniform $U(0,1)$ distribution.

Figure 1 compares spline smoothing with results obtained by soft thresholding of spline basis functions. The upper left panel shows that the true function can be well approximated using only the first few Demmler-Reinsch basis functions. Soft thresholding has about the same bias and variance as spline smoothing but uses only about 3 basis functions more than linear regression. Many other popular linear smoothers may be incorporated in the same manner by adopting the concept of pseudosplines due to Hastie (1996).
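A Demmler-Reinsch-type basis and the roughness weights $\gamma_{jk}$ can be computed numerically; the sketch below is a standard construction (not taken from the paper) that uses a second-difference penalty matrix as a stand-in for the integrated squared $m$-th derivative:

```python
import numpy as np

n = 50
x = np.sort(np.random.default_rng(2).uniform(0, 1, n))

# Second-difference penalty matrix K as a discrete proxy for the integrated
# squared second derivative (m = 2).
D2 = np.diff(np.eye(n), n=2, axis=0)
K = D2.T @ D2

# Demmler-Reinsch-type basis: eigenvectors of K give orthonormal vectors of
# point evaluations phi_k(x_i); the eigenvalues gamma_k grow with roughness.
gamma, Phi = np.linalg.eigh(K)

# Ordinary spline smoothing penalizes sum_k gamma_k * c_k**2; soft thresholding
# instead uses coefficient-specific thresholds proportional to sqrt(gamma_k).
lam_rel = np.sqrt(np.clip(gamma, 0, None))
```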
Wavelets

If, for example, $\beta_j(x_j)$ is a baseline effect or an intercept term collecting unobserved variables, often no prior assumptions about the structure of the coefficient can be made. Here wavelet basis functions provide a powerful tool. These orthogonal functions have compact support and decompose the $\beta_j(x_j)$ in a hierarchical scheme. They are well suited to describe effects heterogeneous in smoothness. The thresholds may be chosen globally, i.e. $\lambda_{jk} = \hat\sigma(s_{jk}(\hat\eta_S))$, or according to the resolution level $l$, e.g. $\lambda_{jk} = 2^{l}\,\hat\sigma(s_{jk}(\hat\eta_S))$.

Trigonometric series

When time-varying effects are included in the model, seasonality often has to be considered. In the case of periodicity, trigonometric basis functions lead to a sparse representation of the varying coefficients. In principle, combinations of different types of basis functions, such as polynomial-trigonometric series (Eubank and Speckman, 1990) or polynomial regression together with wavelets, can be used to estimate the $\beta_j(x_j)$.

3.2 Selecting the global threshold

[Figure 2 appears here: panels showing the estimate (true function dashed), the estimation error, the coefficients for the true function, and the log-likelihood, the latter plotted versus the number of non-zero coefficients.]

FIGURE 2. One simulation drawn from the true function $\beta(x) = \sin(10x^2)$. The data follow a logit model for a binomial $B(5, \pi(x_t))$ distribution, and 100 $x_t$ were simulated according to a uniform $U(0,1)$ distribution. The upper right panel shows the estimation error $\frac{1}{100}\sum_t (\hat\beta(x_t) - \beta(x_t))^2$, computed for a sequence of $\lambda$'s, plotted versus the number of $\hat c_{jk} \ne 0$.

A good value for the global threshold or smoothing parameter may be chosen by comparing the log-likelihood with the number of non-zero coefficients. If the true systematic part can be well approximated by only a few basis functions, a sharp bend is visible in the log-likelihood plotted versus the number of non-zero coefficients. This bend can be used to select the global threshold, since coefficients to its right do not contribute significantly to the likelihood. Figure 2 is typical of this situation. The log-likelihood increases rapidly with the inclusion of the first five basis functions; here the estimation error decreases. Including more basis functions increases the log-likelihood only slightly while the error increases. If no distinct bend is visible, another set of basis functions may yield a sparser representation of the underlying systematic part and, hence, more precise estimates.
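Locating the bend mechanically is sometimes convenient; the following heuristic sketch (not from the paper; it assumes fits over a decreasing $\lambda$-sequence, e.g. from the Algorithm 2 sketch) picks the fit where the likelihood gain per added coefficient drops most sharply:

```python
import numpy as np

def select_by_bend(loglik, n_nonzero):
    """Pick the fit at the sharpest bend of log-likelihood versus model size.
    loglik and n_nonzero are arrays over the fitted lambda-sequence."""
    order = np.argsort(n_nonzero)
    ll = np.asarray(loglik, dtype=float)[order]
    k = np.asarray(n_nonzero, dtype=float)[order]
    gain = np.diff(ll) / np.maximum(np.diff(k), 1.0)  # gain per added coefficient
    # The bend: the largest drop in per-coefficient gain between successive
    # fits; fits to its right add little likelihood.
    bend = int(np.argmax(gain[:-1] - gain[1:])) + 1
    return order[bend]
```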
Acknowledgments: This work was supported by the Deutsche Forschungsgemeinschaft, Sonderforschungsbereich 386 "Statistische Analyse diskreter Strukturen, Modellierung und Anwendung in Biometrie und Ökonometrie."

References

Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models, Annals of Statistics 17, 453-555.

Demmler, A. and Reinsch, C. (1975). Oscillation matrices with spline smoothing, Numerische Mathematik 24, 375-382.

Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage, Biometrika 81, 425-455.

Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1995). Wavelet shrinkage: asymptopia? (with discussion), Journal of the Royal Statistical Society B 57, 301-369.

Eubank, R. L. and Speckman, P. (1990). Curve fitting by polynomial-trigonometric regression, Biometrika 77, 1-9.

Hastie, T. (1996). Pseudosplines, Journal of the Royal Statistical Society B 58, 379-396.

Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models, Journal of the Royal Statistical Society B 55, 757-796.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society B 58, 267-288.

Tishler, A. and Zang, I. (1982). An absolute deviations curve-fitting algorithm for nonlinear models. In S. H. Zanakis and J. S. Rustagi (eds.), Optimization in Statistics, TIMS Studies in Management Science, Vol. 19, North-Holland.