
DEPARTMENT OF STATISTICS
University of Wisconsin
1210 West Dayton St.
Madison, WI 53706

TECHNICAL REPORT NO. 982
September 30, 1997

Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case

by Dong Xiang and Grace Wahba

Prepared for the 1997 Proceedings of the American Statistical Association Joint Statistical Meetings, Biometrics Section. This short paper was written under the length constraints of those Proceedings. Further details may be found in Xiang (1996) and in work in preparation. Research sponsored in part by NEI under Grant R01 EY09946 and in part by NSF under Grant DMS.

Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case

Dong Xiang, SAS Institute Inc.
Grace Wahba, University of Wisconsin at Madison

September 30, 1997

Dong Xiang is Senior Research Statistician at SAS Institute Inc.; Grace Wahba is Bascom Professor in the Statistics Department, University of Wisconsin at Madison. This paper is supported by the National Science Foundation under Grant DMS and the National Eye Institute under Grant R01 EY09946. A typographical error in the formulas for RanGACV was fixed on March 16.

Key Words: Generalized Approximate Cross Validation, smoothing spline, penalized likelihood regression, generalized cross validation, Kullback-Leibler distance.

Abstract

We consider the use of smoothing splines in generalized additive models with binary responses in the large data set situation. Xiang and Wahba (1996) proposed the Generalized Approximate Cross Validation (GACV) function as a method to choose (multiple) smoothing parameters in the binary data case and demonstrated through simulation that the GACV method compares well to existing iterative methods, as judged by the Kullback-Leibler distance of the estimate to the true function being fitted. However, the calculation of the GACV function involves solving an n by n linear system, where n is the sample size. As the sample size increases, the calculation becomes numerically unstable and eventually infeasible. To reduce these computational problems we propose a randomized version of the GACV function, which is numerically stable. Furthermore, we use a clustering algorithm to choose a set of basis functions with which to approximate the exact additive smoothing spline estimate, which has a basis function for every data point. Combining these two approaches, we are able to extend smoothing spline methods in the binary response case to much larger data sets without sacrificing much accuracy.

1 Introduction

Suppose that $y_i$, $i = 1, \ldots, n$, are independent observations from an exponential family with density of the form
$$ \exp\{[y_i f(x_i) - b(f(x_i))]/a(\phi) + c(y_i, \phi)\}, $$

where $a$, $b$ and $c$ are given, with $b$ a strictly convex function of its argument on any bounded set, the $x_i$ are vectors of covariates, $\phi$ is a nuisance parameter, and $f(x_i)$ is the so-called canonical parameter. The goal is to estimate $f(\cdot)$. For the purpose of exposition, we will assume that $x_i$ is on the real line, but our arguments extend to more general domains for $x$. In the particular case of binary data, $a(\phi) = 1$, $b(f) = \log(1 + e^f)$, $c(y, \phi) = 0$, and $y_i$ is 1 or 0 with probability $p(x_i) = e^{f(x_i)}/(1 + e^{f(x_i)})$. The binary case is of particular interest because of its applicability in risk factor estimation.

In the usual parametric GLIM models, $f(\cdot)$ is assumed to be of a parametric form, and then maximum likelihood methods may be used to estimate and assess the fitted models. A variety of approaches have been proposed to allow more flexibility than is inherent in simple parametric models. We will not review the general literature, other than to note that regression splines have been used for this purpose by, for example, Friedman (1991), Stone (1994) and others. O'Sullivan (1983), O'Sullivan, Yandell and Raynor (1986), Gu (1990), Wahba (1990) and others (Green, Jennison and Seheult 1985, Green and Yandell 1985, Yandell and Green 1986, Green and Silverman 1994 and Hastie and Tibshirani 1990) allow $f(\cdot)$ to take on a more flexible form by assuming that $f(\cdot)$ is an element of some reproducing kernel Hilbert space $\mathcal{H}$ of smooth functions, and estimating $f(\cdot)$ by minimizing a penalized log likelihood. Assuming that $a(\phi) = 1$ (or that $\phi$ is absorbed into $\lambda$ below), define $l(y_i, f(x_i)) = y_i f(x_i) - b(f(x_i))$. The smoothing spline (or penalized log likelihood) estimate $f_\lambda(\cdot)$ of $f(\cdot)$ is the minimizer in $\mathcal{H}$ of
$$ -\sum_{i=1}^n l(y_i, f(x_i)) + \frac{n\lambda}{2} J(f), \qquad (1) $$
where the smoothing parameter $\lambda \ge 0$ balances the tradeoff between minimizing the negative log likelihood and the `smoothness' $J(f)$. Here $J$ is a quadratic penalty functional defined on $\mathcal{H}$. If $J^{1/2}(f)$ is a norm in $\mathcal{H}$, or a semi-norm in $\mathcal{H}$ with a low dimensional null space (the `parametric part') satisfying some conditions, then it is well known that $f_\lambda$, the minimizer of (1), is in a known $n$-dimensional subspace $\mathcal{H}_n$ of $\mathcal{H}$, with basis functions that are known functions of the reproducing kernel for $\mathcal{H}$ and a basis for the null space of $J$. See Wahba (1990), O'Sullivan (1983), Kimeldorf and Wahba (1971), and what follows. In general, we will assume that (1) will be minimized numerically in some $N \le n$ dimensional space $\mathcal{H}_B$; that is,
$$ f(\cdot) = \sum_{j=1}^N \beta_j B_j(\cdot), $$
where the $B_j$ are either a set of suitable basis functions which may span $\mathcal{H}_n$ or may constitute a convenient, sufficiently rich (linearly independent) approximation to a spanning set. See Wahba (1990), Chapter 7, and references cited there. Given $\lambda$, the computational problem is then to find $\beta = (\beta_1, \ldots, \beta_N)^T$ to minimize
$$ I_\lambda(\beta) = -\sum_{i=1}^n l(y_i, f_i(\beta)) + \frac{n\lambda}{2} \beta^T \Sigma \beta, \qquad (2) $$
where $f_i(\beta) = \sum_{j=1}^N \beta_j B_j(x_i)$ and $\Sigma$ is a non-negative definite matrix defined by
$$ \beta^T \Sigma \beta = J\Big(\sum_{j=1}^N \beta_j B_j\Big). $$
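For concreteness, the following is a minimal numerical sketch of the objective (2) in the binary case. It is not code from the paper; the names (B for the basis matrix, Sigma for the penalty matrix) are ours.

```python
import numpy as np

def penalized_neg_loglik(beta, B, Sigma, y, lam):
    """Evaluate I_lambda(beta) of (2) for Bernoulli data.

    B     : n x N matrix with B[i, j] = B_j(x_i)
    Sigma : N x N non-negative definite matrix with beta' Sigma beta = J(sum_j beta_j B_j)
    y     : length-n vector of 0/1 responses
    lam   : smoothing parameter lambda >= 0
    """
    n = len(y)
    f = B @ beta                                   # f_i(beta) = sum_j beta_j B_j(x_i)
    # negative log likelihood: -sum_i [y_i f_i - log(1 + e^{f_i})]
    nll = -np.sum(y * f - np.logaddexp(0.0, f))
    penalty = 0.5 * n * lam * beta @ Sigma @ beta  # (n * lambda / 2) beta' Sigma beta
    return nll + penalty
```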

Letting $l_i(f_i) = l(y_i, f_i)$ and using the fact that all the $l_i(\cdot)$ are strictly concave with respect to $f_i$, we may compute $\beta$ via Newton iterations. Define $w_i = -d^2 l_i/d f_i^2$ and $u_i = -d l_i/d f_i$. Each iteration for $\beta$ is then equivalent to finding $\beta$ to minimize
$$ \frac{1}{n} \sum_{i=1}^n \tilde w_i \big(\tilde y_i - f_i(\beta)\big)^2 + \lambda \beta^T \Sigma \beta, \qquad (3) $$
where $\tilde y_i = \tilde f_i - \tilde u_i/\tilde w_i$ and $\tilde f_i$, $\tilde u_i$, $\tilde w_i$ are the working values of $f_i$, $u_i$ and $w_i$ based on the last iteration. The $\tilde y_i$ will be called the working data here. This problem will have a unique minimizer provided $\beta^T \Sigma \beta = 0$ and $f_i(\beta) = 0$, $i = 1, \ldots, n$, together imply $\beta = 0$. See O'Sullivan et al. (1986) and Gu (1990). Note that, by the properties of the exponential family, if $f(x_i)$ is the true canonical parameter evaluated at $x_i$, then $E y_i = b'(f(x_i))$ and $\mathrm{var}\, y_i = w_i$. If $\mathcal{H}_B = \mathcal{H}_n$, or $\mathcal{H}_B$ is sufficiently large, then a sufficiently small $\lambda$ allows the $f_i$ to effectively interpolate the data, while a sufficiently large $\lambda$ forces the estimate to the null space of $J(f)$ in $\mathcal{H}_B$.

A number of methods have been proposed for choosing the smoothing parameters in the generalized linear model setting. O'Sullivan et al. (1986) adapted the GCV function to the non-Gaussian case by considering the quadratic approximation to the negative log likelihood available at the final stage of their Newton iteration. Yandell (1986) suggested evaluation and minimization of the GCV scores as the iteration proceeded. Gu (1992) compared the two different algorithms and found that the method which chooses $\lambda$ after each iteration of (3) and updates the working data based on the new fit before the next iteration (Algorithm 2) outperforms the method which fixes $\lambda$ during the iteration and chooses $\lambda$ based on the final working data (Algorithm 1). However, a theoretical justification for Algorithm 2 has yet to emerge, and Algorithm 2 does not guarantee convergence. In the same paper, Gu suggested a criterion (the "U" estimate) for binary data which is similar to the unbiased risk (UBR) estimate in Craven and Wahba (1979). He argued that this criterion is a proxy for the symmetrized Kullback-Leibler distance between $f_\lambda(\cdot)$ and the true $f(\cdot)$, and demonstrated via simulation that this criterion, computed via Algorithm 2, gave more favorable results than other methods. Based on this result, Wang (1994) extended the method to the problem of estimating multiple smoothing parameters.

In Xiang and Wahba (1996), we proposed the Generalized Approximate Cross Validation (GACV) function. With some abuse of notation, we write $J(f) = f^T \Sigma f$, where in this context we are letting $f = (f(x_1), \ldots, f(x_n))^T$. For binary data, the GACV is given by
$$ GACV(\lambda) = \frac{1}{n} \sum_{i=1}^n \big(-y_i f_\lambda(x_i) + \log(1 + e^{f_\lambda(x_i)})\big) + \frac{\mathrm{tr}(A)}{n} \cdot \frac{\sum_{i=1}^n y_i \big(y_i - p_\lambda(x_i)\big)}{n - \mathrm{tr}(W^{1/2} A W^{1/2})}, \qquad (4) $$
where $W = \mathrm{diag}\big(p_\lambda(x_1)(1 - p_\lambda(x_1)), \ldots, p_\lambda(x_n)(1 - p_\lambda(x_n))\big)$ and $A = [W(f_\lambda) + n\lambda\Sigma]^{-1}$.
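The following sketch, under the same hypothetical setup as the code above, implements one Newton update via the working data in (3) and then evaluates the GACV score (4) directly. Here Sigma_f denotes the n x n matrix of the "abuse of notation" J(f) = f'Σf; forming and inverting the n x n matrix A is exactly the cost that motivates the randomized version developed below.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def newton_step(beta, B, Sigma, y, lam):
    """One Newton update for (2), written as the weighted problem (3)."""
    n = len(y)
    f = B @ beta
    p = np.clip(sigmoid(f), 1e-10, 1 - 1e-10)   # guard against w_i = 0
    w = p * (1.0 - p)                           # w~_i = -d^2 l_i / d f_i^2
    u = p - y                                   # u~_i = -d l_i / d f_i
    y_work = f - u / w                          # working data y~_i = f~_i - u~_i / w~_i
    # minimize (1/n) sum_i w~_i (y~_i - f_i(beta))^2 + lam * beta' Sigma beta
    lhs = B.T @ (w[:, None] * B) / n + lam * Sigma
    rhs = B.T @ (w * y_work) / n
    return np.linalg.solve(lhs, rhs)

def gacv(f_hat, Sigma_f, y, lam):
    """Exact GACV(lambda) of (4); Sigma_f is n x n with J(f) = f' Sigma_f f.

    Forming A = (W + n*lam*Sigma_f)^{-1} costs O(n^3), which becomes
    unstable and infeasible for large n.
    """
    n = len(y)
    p = sigmoid(f_hat)
    W = np.diag(p * (1.0 - p))
    A = np.linalg.inv(W + n * lam * Sigma_f)
    obo = np.sum(-y * f_hat + np.logaddexp(0.0, f_hat)) / n   # first term of (4)
    trA = np.trace(A)
    trWA = np.trace(W @ A)                      # equals tr(W^{1/2} A W^{1/2})
    return obo + (trA / n) * np.sum(y * (y - p)) / (n - trWA)
```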

The GACV is an approximation to the comparative Kullback-Leibler distance obtained by starting with a leaving-out-one proxy, approximating it by repeated use of a Taylor series expansion, and, finally, replacing individual diagonal entries in certain matrices by the average diagonal entry. In the simulations we conducted, the resulting estimates of $\lambda$ appeared superior to the popular and successful U estimates computed via Algorithm 2. However, to evaluate the GACV function we need to invert an $n \times n$ matrix. As the sample size increases, the calculation of GACV becomes unstable, if it remains feasible at all. In order for this method to be competitive with Gu's U estimates, we need to find a stable numerical method. In this paper we address this problem by combining a clustering method for choosing a subset of basis functions with a randomized version of GACV for selecting the smoothing parameters; instead of solving an $n \times n$ linear system, we only need to solve a $k \times k$ system, and for large data sets $k$ can be much smaller than the actual sample size.

2 Randomized GACV

As we mentioned in the last section, the difficulty in extending the GACV method to large data sets is that it involves solving an $n \times n$ linear system. In this section we address this problem by proposing an approximate solution to the penalized log likelihood based on basis functions selected by a clustering algorithm, together with a randomized GACV function for choosing the smoothing parameters.

2.1 An Approximate Solution

Let $\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1$, where $\mathcal{H}_0$ is the null space of $J$ in $\mathcal{H}$, let $P_1$ be the orthogonal projector onto $\mathcal{H}_1$ in $\mathcal{H}$, and let $J(f) = \|P_1 f\|^2$. This situation covers the cases we are interested in; see Wahba (1990) for details. Let $t_i$ be the representer of evaluation at $x_i$, $i = 1, \ldots, n$, let $\xi_i = P_1 t_i$, and let $\psi_v$, $v = 1, \ldots, M$, span $\mathcal{H}_0$. Then the minimizer of (1) in $\mathcal{H}$ is known to have a representation of the form
$$ f(x) = \sum_{v=1}^M d_v \psi_v(x) + \sum_{i=1}^n c_i \xi_i(x). $$
Now let us define, for some $k \le n$,
$$ \mathcal{H}_k = \mathrm{span}\{\xi_{i_j}, \ j = 1, \ldots, k; \ \psi_v, \ v = 1, \ldots, M\}, $$
where without loss of generality we assume $(\xi_{i_j}, j = 1, \ldots, k) = (\xi_1, \ldots, \xi_k)$. Instead of finding the minimizer of (1) in $\mathcal{H}$, let us consider the problem
$$ \min_{f \in \mathcal{H}_k} \; -\frac{1}{n} \sum_{i=1}^n l(y_i, f(x_i)) + \lambda J(f). \qquad (5) $$
The solution of (5) will have the form $f(x) = \sum_{v=1}^M d_v \psi_v(x) + \sum_{j=1}^k c_j \xi_j(x)$. We call it the solution to the penalized approximate smoothing spline problem.
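A minimal sketch of how problem (5) reduces to an instance of (2) is given below. Purely for illustration it uses a Gaussian kernel as a stand-in for the reproducing kernel of $\mathcal{H}_1$ and takes the null space to be the constants; the actual kernel and null space depend on the smoothing spline model in use, and the function names are ours.

```python
import numpy as np

def gaussian_kernel(x, z, scale=1.0):
    """Stand-in reproducing kernel for H_1 (illustrative choice only)."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * scale ** 2))

def reduced_basis(X, subset, kernel=gaussian_kernel):
    """Build the design and penalty matrices for problem (5).

    X      : n x d array of design points
    subset : indices i_1, ..., i_k of the retained representers
    Returns B (n x (1+k)) and Sigma ((1+k) x (1+k)) so that (5) becomes an
    instance of (2); the columns of B are [constant null-space term,
    xi_{i_1}, ..., xi_{i_k}].
    """
    n, k = X.shape[0], len(subset)
    K_nS = np.array([[kernel(X[i], X[j]) for j in subset] for i in range(n)])
    K_SS = K_nS[subset, :]                      # k x k Gram matrix of the xi's
    B = np.hstack([np.ones((n, 1)), K_nS])      # null space (constants) + representers
    Sigma = np.zeros((1 + k, 1 + k))
    Sigma[1:, 1:] = K_SS                        # J(f) = c' K_SS c; no penalty on null space
    return B, Sigma
```

With B and Sigma in hand, the penalized likelihood and Newton sketches given earlier apply unchanged, now with only $1 + k$ coefficients instead of $n$.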

When $k = n$, the solution is equal to the minimizer of the penalized log likelihood (1). When $k < n$, the difference between the two solutions depends on the projection of the underlying true function onto the space $\mathrm{span}\{\xi_1, \ldots, \xi_n\} \ominus \mathrm{span}\{\xi_1, \ldots, \xi_k\}$. In Xiang (1996), we investigated the difference between the two solutions and found that when design points are very close together (as is common when the sample size is large), the representers corresponding to those design points are highly correlated; the correlation of two representers $t_i$, $t_j$ is defined as the correlation of $(t_i(x_1), \ldots, t_i(x_n))$ and $(t_j(x_1), \ldots, t_j(x_n))$. Including all of those representers in the solution not only makes the problem difficult to solve, because of the large number of parameters, but may also make the calculation unstable. In a simplified situation, Xiang (1996) proved that, in order to achieve the same convergence rate as the penalized smoothing spline, the $k$ for the penalized approximate smoothing spline (PASS) only needs to increase at the rate
$$ k = O\Big(n^{\frac{2m}{(2m-1)(2m+1)}} \big/ \lambda^{\frac{1}{2m-1}}\Big). $$

The above argument shows that we do not need a large $k$ in order for the approximate solution (in $\mathcal{H}_k$) to be very close to the `exact' solution (in $\mathcal{H}$). However, once $k$ is chosen, it is important to choose the $\xi_{i_j}$ well; we would like them to have maximum separation. A crude method would be to use a grid to partition the design space into a number of small cells and then, choosing one design point inside each cell, use only the $\xi_i$ corresponding to those points to generate the solution space $\mathcal{H}_k$. However, this method fails when the design points are not regularly spaced over the design space: most of the cells will be empty, especially when the number of covariates is large, which is nearly always the case with real data. Looking at the problem from another point of view, what we want is to group the design points into $k$ groups with the requirement that the groups be spaced as far apart as possible in an appropriate sense. Thus we can borrow the classical cluster analysis approach to solve our problem.

Cluster analysis is normally used to separate a set of data into its constituent groups or clusters. Even though there is no natural separation among design points in our case, we can still force the algorithm to find $k$ groups that have maximum separation. There are many algorithms for clustering data; from our limited experience, just about any algorithm will provide a reasonable answer, although some are faster than others. We choose the algorithm used in the procedure FASTCLUS in the SAS software package (SAS/STAT User's Guide, SAS Institute Inc., 1988) because it is designed for the disjoint clustering of very large data sets in minimum time. It operates in the following three steps (a minimal sketch is given below):

1. Select the first $k$ design points as the cluster seeds.

2. Clusters are formed by assigning each design point to the nearest seed (Euclidean distance). After all design points are assigned, the cluster seeds are replaced by the cluster means. This step can be repeated until the changes in the cluster seeds are less than or equal to a default fraction of the minimum distance between previous seeds.

3. Final clusters are formed by assigning each observation to the nearest seed.

Within each cluster, we randomly choose one design point as the representative to be included in our subset for the purpose of generating the $\mathcal{H}_k$ subspace.
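The following is a minimal sketch of this seed-based clustering (ours, not the SAS FASTCLUS code); the design points are assumed to be the rows of a numeric array X.

```python
import numpy as np

def select_representatives(X, k, n_passes=2, rng=None):
    """Pick k well-separated design points by a FASTCLUS-like seed procedure.

    Step 1: use the first k design points as seeds.
    Step 2: assign every point to the nearest seed (Euclidean distance) and
            replace each seed by its cluster mean; optionally repeat.
    Step 3: form final clusters from the last seeds and randomly choose one
            design point per non-empty cluster as its representative.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    seeds = X[:k].copy()
    for _ in range(n_passes):
        d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)  # n x k distances
        labels = d.argmin(axis=1)
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                seeds[j] = X[members].mean(axis=0)
    # final assignment and random representative within each cluster
    d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    reps = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size:
            reps.append(rng.choice(members))
    return np.array(reps)   # indices i_1, ..., i_k of the retained design points
```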

In the simulations as well as the real data examples, we found that the approximate solution is very close to the smoothing spline solution (Xiang 1996).

2.2 The Randomized GACV

As we noted in (4), the GACV contains the quantities $\mathrm{tr}(A)$ and $\mathrm{tr}(W^{1/2} A W^{1/2})$. In order to evaluate them, we have to calculate the $n \times n$ matrix $A$ explicitly, which would slow down the calculation of the penalized approximate smoothing spline. In this section we derive a randomized version of the GACV function, similar in spirit to the randomized version of GCV (Girard 1987). This version of GACV does not require the calculation of any matrix inverse. Consider
$$ I_\lambda(f, Y) = -\sum_{j=1}^n l_j(y_j, f(x_j)) + \frac{n\lambda}{2} f^T \Sigma f, \qquad (6) $$
where $Y = (y_1, \ldots, y_n)^T$, $l_j(y_j, f(x_j))$ is the log likelihood and $f = (f(x_1), \ldots, f(x_n))^T$. Thus $I_\lambda(f, Y)$ is the penalized log likelihood for $f$ with respect to the data $Y$. If we add a small disturbance $\epsilon$ to $Y$ to get a new pseudo data set $Y^+ = Y + \epsilon$, we expect $\hat f_\lambda^{Y^+}$ and $\hat f_\lambda^{Y}$, the minimizers of (6) with respect to $Y^+$ and $Y$, to be very close. Since $\hat f_\lambda^{Y^+}$ and $\hat f_\lambda^{Y}$ are minimizers of (6), we have
$$ \frac{\partial I_\lambda}{\partial f}(\hat f_\lambda^{Y^+}, Y^+) = \frac{\partial I_\lambda}{\partial f}(\hat f_\lambda^{Y}, Y) = 0. $$
Similar to the approach we used to derive the GACV function, we expand $\frac{\partial I_\lambda}{\partial f}(\hat f_\lambda^{Y^+}, Y^+)$ in a Taylor series about $(\hat f_\lambda^{Y}, Y)$ and obtain
$$ 0 \approx \frac{\partial I_\lambda}{\partial f}(\hat f_\lambda^{Y}, Y) + \frac{\partial^2 I_\lambda}{\partial f \partial f^T}(\bar f, \bar Y)\big(\hat f_\lambda^{Y^+} - \hat f_\lambda^{Y}\big) + \frac{\partial^2 I_\lambda}{\partial f \partial Y^T}(\bar f, \bar Y)\big(Y^+ - Y\big), \qquad (7) $$
where $(\bar f, \bar Y)$ is a point between $(\hat f_\lambda^{Y}, Y)$ and $(\hat f_\lambda^{Y^+}, Y^+)$. Based on (6), we have
$$ \frac{\partial^2 I_\lambda}{\partial f \partial f^T} = W(f) + n\lambda\Sigma \qquad \text{and} \qquad \frac{\partial^2 I_\lambda}{\partial f \partial Y^T} = -I_{n \times n}. $$

Therefore, equation (7) can be simplified to
$$ \hat f_\lambda^{Y^+} - \hat f_\lambda^{Y} = \big(W(\bar f) + n\lambda\Sigma\big)^{-1}(Y^+ - Y). \qquad (8) $$
When $\epsilon$ is very small, $(\hat f_\lambda^{Y^+}, Y^+)$ is very close to $(\hat f_\lambda^{Y}, Y)$, so $W(\bar f)$ can be approximated by $W(\hat f_\lambda^{Y})$ and (8) becomes
$$ \hat f_\lambda^{Y^+} - \hat f_\lambda^{Y} \approx \big(W(\hat f_\lambda^{Y}) + n\lambda\Sigma\big)^{-1}\epsilon = A\epsilon. \qquad (9) $$
Using this result, we derive a Monte Carlo calculation of $\mathrm{tr}(A)$. Generate a small random error $\epsilon \sim N(0, \sigma^2 I_n)$. For the square matrix $A$ we have
$$ E(\epsilon^T A \epsilon) = \sigma^2 \mathrm{tr}(A) \qquad \text{and} \qquad \mathrm{Var}(\epsilon^T A \epsilon) = \sigma^4 \big(\mathrm{tr}(A^T A) + \mathrm{tr}(A^2)\big). $$
Hence if $\sigma$ is small, $\frac{1}{n}\epsilon^T A \epsilon/\sigma^2$ provides a good estimate of $\frac{1}{n}\mathrm{tr}(A)$. By the approximation in (9), we can use $\hat f_\lambda^{Y^+} - \hat f_\lambda^{Y}$ to approximate $A\epsilon$; therefore we can calculate $\epsilon^T A \epsilon$ as
$$ \epsilon^T A \epsilon = \epsilon^T (A\epsilon) \approx \epsilon^T \big(\hat f_\lambda^{Y^+} - \hat f_\lambda^{Y}\big). $$
Thus
$$ \mathrm{tr}(A) \approx \epsilon^T \big(\hat f_\lambda^{Y^+} - \hat f_\lambda^{Y}\big)/\sigma^2 \qquad \text{and} \qquad \mathrm{tr}(W^{1/2} A W^{1/2}) = \mathrm{tr}(W A) \approx \epsilon^T W \big(\hat f_\lambda^{Y^+} - \hat f_\lambda^{Y}\big)/\sigma^2. $$
The accuracy of this approximation depends on the size of the disturbance ($\sigma^2$) which we add to the data. From the experiments in Xiang (1996), we found that the shape and the minimizer of RanGACV are not very sensitive to $\sigma^2$, as long as $\sigma$ is within a fairly wide range. Now, replacing $\mathrm{tr}(A)$ and $\mathrm{tr}(W^{1/2} A W^{1/2})$ by their Monte Carlo estimates, we get a randomized version of the GACV function,
$$ RanGACV(\lambda) = \frac{1}{n} \sum_{i=1}^n \big(-y_i f_\lambda(x_i) + b(f_\lambda(x_i))\big) + \frac{\epsilon^T \big(\hat f_\lambda^{Y^+} - \hat f_\lambda^{Y}\big)}{n} \cdot \frac{\sum_{i=1}^n y_i \big(y_i - p_\lambda(x_i)\big)}{\epsilon^T \epsilon - \epsilon^T W \big(\hat f_\lambda^{Y^+} - \hat f_\lambda^{Y}\big)}. $$
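Assuming a fitting routine fit(Y, lam) that returns the fitted n-vector $\hat f_\lambda^{Y}$ (for example, the Newton iteration on the reduced basis sketched earlier), the randomized score for a single perturbation can be computed without ever forming A. This is a sketch under those assumptions, not the authors' code.

```python
import numpy as np

def rangacv(fit, Y, lam, sigma=1e-3, rng=None):
    """Randomized GACV for one perturbation vector epsilon.

    fit : callable fit(Y, lam) -> fitted n-vector f_lambda^Y
    Y   : length-n vector of 0/1 responses
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(Y)
    eps = rng.normal(scale=sigma, size=n)     # epsilon ~ N(0, sigma^2 I_n)
    f_hat = fit(Y, lam)
    f_hat_plus = fit(Y + eps, lam)            # fit to the perturbed pseudo data Y+
    delta = f_hat_plus - f_hat                # approximates A @ eps by (9)
    p = 1.0 / (1.0 + np.exp(-f_hat))
    w = p * (1.0 - p)
    obo = np.sum(-Y * f_hat + np.logaddexp(0.0, f_hat)) / n
    num = (eps @ delta / n) * np.sum(Y * (Y - p))
    den = eps @ eps - eps @ (w * delta)       # randomized n - tr(W^{1/2} A W^{1/2})
    return obo + num / den
```

Averaging the second term over several independent draws of epsilon gives the multi-replicate version discussed next.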

Heuristically, as $n \to \infty$,
$$ \big(\epsilon^T A \epsilon/(n\sigma^2) - \mathrm{tr}(A)/n\big) \to_p 0, \qquad \epsilon^T \epsilon/(n\sigma^2) \to_p 1, $$
and also
$$ \big(\epsilon^T W^{1/2} A W^{1/2} \epsilon/(n\sigma^2) - \mathrm{tr}(W^{1/2} A W^{1/2})/n\big) \to_p 0, $$
so we expect $(GACV - RanGACV) \to_p 0$ when the sample size is large.

To calculate RanGACV, we first generate a set of disturbed pseudo data $Y^+$. For a fixed $\lambda$, we minimize (6) with respect to both $Y$ and $Y^+$ to get the fits $\hat f_\lambda^{Y}$ and $\hat f_\lambda^{Y^+}$; based on these fits we can evaluate RanGACV for that $\lambda$. In order to reduce the variation caused by $\epsilon$, we use the same $\epsilon$ for all the $\lambda$. It is very easy to generate $\epsilon$ and calculate $\hat f_\lambda^{Y^+}$ using the penalized approximate smoothing spline approach; for each $\epsilon$ and sequence of $\lambda$ there is a RanGACV curve, and given a set of $\epsilon$'s we can use the average of several RanGACV curves as our final RanGACV. Several averaging methods are discussed in Xiang (1996); we will use
$$ RanGACV(\lambda) = \frac{1}{n} \sum_{i=1}^n \big[-y_i f_\lambda(x_i) + b(f_\lambda(x_i))\big] + \frac{1}{nl} \sum_{r=1}^{l} \frac{\epsilon_{(r)}^T \big(\hat f_\lambda^{Y_{(r)}^+} - \hat f_\lambda^{Y}\big)}{\epsilon_{(r)}^T \epsilon_{(r)} - \epsilon_{(r)}^T W \big(\hat f_\lambda^{Y_{(r)}^+} - \hat f_\lambda^{Y}\big)} \sum_{i=1}^n y_i \big(y_i - p_\lambda(x_i)\big), $$
where $Y_{(r)}^+ = Y + \epsilon_{(r)}$. As we investigated in Xiang (1996), when $l$, the number of replicates of $\epsilon$, is larger than three, RanGACV performs very well in most cases; see Xiang (1996) for further discussion. Based on the extensive simulations in Xiang (1996), we found that RanGACV performs better than the U method in both the univariate and multivariate smoothing parameter cases, in the sense that the final estimates based on the smoothing parameters chosen by RanGACV have a smaller Kullback-Leibler distance from the true function.

References

Craven, P. & Wahba, G. (1979), `Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation', Numer. Math. 31, 377-403.

Friedman, J. (1991), `Multivariate adaptive regression splines', Ann. Statist. 19, 1-141.

Girard, D. (1987), A fast `Monte Carlo cross validation' procedure for large least squares problems with noisy data, Technical Report RR 687-M, IMAG, Grenoble, France.

Green, P. & Silverman, B. (1994), Nonparametric Regression and Generalized Linear Models, Chapman and Hall.

Green, P. & Yandell, B. (1985), Semi-parametric generalized linear models, in R. Gilchrist, ed., `Lecture Notes in Statistics, Vol. 32', Springer, pp. 44-55.

Green, P., Jennison, C. & Seheult, A. (1985), `Analysis of field experiments by least squares smoothing', J. Roy. Stat. Soc. 47, 299-315.

Gu, C. (1990), `Adaptive spline smoothing in non-Gaussian regression models', J. Amer. Statist. Assoc. 85, 801-807.

Gu, C. (1992), `Cross-validating non-Gaussian data', J. Comput. Graph. Stats. 1, 169-179.

Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall.

Kimeldorf, G. & Wahba, G. (1971), `Some results on Tchebycheffian spline functions', J. Math. Anal. Applic. 33, 82-95.

SAS Institute Inc. (1988), SAS/STAT User's Guide, Release 6.03 edition, Cary, NC.

O'Sullivan, F. (1983), The analysis of some penalized likelihood estimation schemes, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI. Technical Report 726.

O'Sullivan, F., Yandell, B. & Raynor, W. (1986), `Automatic smoothing of regression functions in generalized linear models', J. Amer. Statist. Assoc. 81, 96-103.

Stone, C. (1994), `The use of polynomial splines and their tensor products in multivariate function estimation, with discussion', Annals of Statistics 22, 118-184.

Wahba, G. (1990), Spline Models for Observational Data, SIAM. CBMS-NSF Regional Conference Series in Applied Mathematics, v. 59.

Wahba, G., Wang, Y., Gu, C., Klein, R. & Klein, B. (1995), `Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy', Annals of Statistics 23, 1865-1895.

Wang, Y. (1994), Smoothing spline analysis of variance of data from exponential families, PhD thesis, Department of Statistics, University of Wisconsin-Madison, Madison, WI. Technical Report 928.

Xiang, D. (1996), Model fitting and testing for non-Gaussian data with large data sets, PhD thesis, Department of Statistics, University of Wisconsin-Madison, Madison, WI. Technical Report 957.

Xiang, D. & Wahba, G. (1996), `A generalized approximate cross validation for smoothing splines with non-Gaussian data', Statistica Sinica 6, 675-692.

Yandell, B. (1986), Algorithms for nonlinear generalized cross-validation, in T. Boardman, ed., `Computer Science and Statistics: 18th Symposium on the Interface', American Statistical Association, Washington, DC.

Yandell, B. & Green, P. (1986), Semi-parametric generalized linear model diagnostics, in `ASA Proceedings of Statistical Computing Section', American Statistical Association, pp. 48-53.
