Constrained Penalized Splines


Mary C. Meyer, Colorado State University

October 15, 2010

Summary: The penalized spline is a popular method for function estimation when the assumption of smoothness is valid. In this paper, methods for estimation and inference are proposed using penalized splines under the additional constraints of shape, such as monotonicity or convexity. The constrained penalized spline estimator is shown to have the same convergence rates as the corresponding unconstrained penalized spline, although in practice the squared error loss is typically smaller for the constrained versions. The penalty parameter may be chosen with generalized cross-validation, which also provides a method for determining whether the shape restrictions hold. The method is not a formal hypothesis test, but is shown to have nice large-sample properties, and simulations show that it compares well with existing tests for monotonicity. Extensions to the partial linear model, the generalized regression model, and the varying coefficient model are given, and several examples demonstrate the utility of the methods.

Keywords: shape-restricted regression, quadratic programming, cone projection, test for monotonicity, isotonic, convex regression

1 Introduction and Background

Consider the problem of estimating a function $f$ from a scatterplot of data $\{(x_i, y_i)\}$, $i = 1, \ldots, n$, using the model
$$y_i = f(x_i) + \sigma\epsilon_i \qquad (1)$$
where $x_i \in [0, 1]$ and the $\epsilon_i$ are iid random variables with mean zero and unit variance. Smoothing methods include kernel methods and local polynomial regression, smoothing splines, and regression splines. The isotonic regression estimator (Brunk 1955) minimizes the sum of squared residuals subject to the monotonicity assumption; the estimator is a step function. Several methods are available if $f$ is assumed to be monotone increasing as well as smooth. Friedman and Tibshirani (1984) smoothed the isotonic regression estimator using a running average. Mammen (1991) proposed a two-step method in which a kernel smoother is applied to the isotonic regression estimator, or the kernel smoother is isotonized. Utreras (1985) provided the characterization and convergence rates for the monotone smoothing spline estimator that minimizes
$$\sum_{i=1}^n [y_i - f(x_i)]^2 + \lambda \int_0^1 [f^{(k)}(x)]^2\,dx$$
over the monotone non-decreasing functions $f$ with continuous $k$th derivatives. Tantiyaswasdikul and Woodroofe (1994) obtained the solution for $k = 1$. The estimators are difficult to obtain for $k \geq 2$, however, because imposing the constraints at the observations is not sufficient to guarantee that the constraints hold everywhere. Additional knots must be placed between observations, and a general algorithm for the exact placement is not known (Wang and Li 2008). Mammen and Thomas-Agnan (1999) showed that the shape-constrained smoothing spline estimator is equivalent to a two-step procedure and provided convergence rates. Mammen et al. (2001) generalized the idea of projecting a smoothed estimator onto a constrained subset of the estimation space. He and Shi (1998) proposed monotone B-spline smoothing with $L_1$ optimization, using linear programming to obtain the estimator.

In this paper the constrained penalized regression spline estimator is considered. For the method of polynomial regression splines, we assume that $f \in C^{p+1}[0, 1]$ for some integer $p \geq 0$, and specify a set of knots $0 = t_1 < \cdots < t_k = 1$. A set of $m = k + p - 1$ basis functions $\delta_1(x), \ldots, \delta_m(x)$ are defined with corresponding basis vectors $\delta_1, \ldots, \delta_m$, where $\delta_{ji} = \delta_j(x_i)$, so that the functions span the space of smooth piecewise degree-$p$ polynomial regression functions with the given knots, and the basis vectors span an $m$-dimensional subspace of $\mathbb{R}^n$. The coefficients of the linear combination that minimizes the residual sum of squares are found using ordinary least-squares regression techniques, so that if $B$ is the $n \times m$ matrix whose columns are the $\delta_j$ vectors, the coefficients of the least-squares estimator are $\hat b = (B'B)^{-1}B'y$. Then $\hat f(x) = \sum_{j=1}^m \hat b_j \delta_j(x)$ is the regression spline estimate of $f$. There are many possibilities for the set of basis functions; the B-splines are a standard choice, defined so that each basis vector is orthogonal to all but a few of the others. These are shown for $k = 5$ in Figure 1(a) for $p = 2$ and (b) for $p = 3$. The curves represent the basis functions and the dots mark the values of the basis vectors for $n = 50$ equally spaced $x_i$. See de Boor (2001) for a thorough treatment, including formulas for the spline basis functions. Other choices such as the truncated polynomial basis (see Ruppert, Wand, and Carroll 2003, section 3.7) span the same space, but the resulting design matrix might be highly collinear.

The regression splines are simple, flexible, and parsimonious function estimators. Some large-sample theory was derived by Zhou, Shen, and Wolfe (1998) under mild assumptions. The knots are required to be of bounded mesh ratio; that is, the ratio of the maximum gap between the knots to the minimum is bounded, and the maximum difference in gaps is $o(k^{-1})$. The design points must be distributed according to some continuous density that is positive on $(0, 1)$. They derived the asymptotic bias and variance of the regression splines and the asymptotic normality of the estimates, and showed that if the number of knots grows as $n^{1/(2p+3)}$, the optimal point-wise convergence rate $\hat f(x) - f(x) = O_p(n^{-(p+1)/(2p+3)})$ is attained. In practice, the splines are sensitive to knot number and placement.
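To make the regression spline construction above concrete, the following R sketch (using the base splines package) builds a quadratic B-spline basis on five equally spaced knots and computes the least-squares regression spline. The simulated data, the seed, and the object names are illustrative only and are not taken from the paper.

    library(splines)

    set.seed(1)
    n <- 50
    x <- seq(0, 1, length.out = n)
    y <- 1.5 * (2 * x - 1)^3 + rnorm(n)     # illustrative regression function plus noise

    p <- 2                                  # quadratic splines
    knots <- seq(0, 1, length.out = 5)      # k = 5 knots, including the boundaries
    # full knot sequence: boundary knots repeated ord = p + 1 times
    allknots <- c(rep(0, p + 1), knots[2:4], rep(1, p + 1))
    B <- splineDesign(allknots, x, ord = p + 1)   # n x m basis matrix, m = k + p - 1 = 6

    b    <- solve(crossprod(B), crossprod(B, y))  # least-squares coefficients (B'B)^{-1} B'y
    fhat <- B %*% b                               # regression spline fit at the x_i

Any basis spanning the same space (for example, the truncated polynomial basis) would give the same fitted values, but the B-spline design matrix is better conditioned.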

Figure 1: B-spline basis functions with $k = 5$ knots equally spaced in $(0, 1)$. (a) Piecewise quadratic; (b) piecewise cubic.

Penalized splines are more robust, and the user choices are simplified to a single penalty parameter, which may be determined through cross-validation methods. The P-splines of Eilers and Marx (1996) use many knots equally spaced in $(0, 1)$, and penalize the $q$th-order differences in the sizes of the coefficients of adjacent B-splines. The penalized sum of squares is
$$\sum_{i=1}^n \left( y_i - \sum_{j=1}^m b_j \delta_j(x_i) \right)^2 + \lambda \sum_{j=q+1}^m (\Delta^q b_j)^2 \qquad (2)$$
where $\Delta^1 b_j = b_j - b_{j-1}$ and $\Delta^q b_j = \Delta^{q-1} b_j - \Delta^{q-1} b_{j-1}$ for $q > 1$. The expression to minimize may be written in vector form:
$$\psi(b; y) = b'(B'B + \lambda D'D)b - 2y'Bb, \qquad (3)$$
where $D$ is the $q$th-order difference matrix, and the coefficient estimate is found via a weighted projection onto a linear subspace: $\tilde b = (B'B + \lambda D'D)^{-1}B'y$. The estimate of $\mu$, where $\mu_i = f(x_i)$, is $\tilde\mu = B\tilde b$, and the effective degrees of freedom (edf) of the model (see Hastie and Tibshirani 1990, chapter 5) is the trace of $B(B'B + \lambda D'D)^{-1}B'$. For $\lambda = 0$ this is $m$, the number of columns of $B$, and as $\lambda$ gets larger, the edf approaches $\min(p, q-1)$.
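Continuing the sketch above, the P-spline estimate in (3), its effective degrees of freedom, and the GCV criterion (4) defined below can be computed directly; the difference order and the value of $\lambda$ are arbitrary illustrations.

    m <- ncol(B)
    q <- 3                                   # order of the difference penalty
    D <- diff(diag(m), differences = q)      # (m - q) x m difference matrix
    lambda <- 10                             # illustrative penalty parameter

    M    <- crossprod(B) + lambda * crossprod(D)   # B'B + lambda D'D
    btil <- solve(M, crossprod(B, y))              # unconstrained penalized coefficients
    mu   <- B %*% btil                             # fitted values

    edf <- sum(diag(B %*% solve(M, t(B))))         # trace of B (B'B + lambda D'D)^{-1} B'
    gcv <- sum((y - mu)^2) / (1 - edf / length(y))^2   # GCV criterion (4), defined below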

Claeskens, Krivobokova, and Opsomer (2009) investigated the asymptotic properties of the penalized splines. They defined the estimator in more generality as the minimizer of
$$\sum_{i=1}^n \left( y_i - \sum_{j=1}^m b_j \delta_j(x_i) \right)^2 + \lambda \int_0^1 \left( \sum_{j=1}^m b_j \delta_j^{(q)}(x) \right)^2 dx,$$
and noted that the simplifications in (2) do not influence the asymptotic properties of the estimator. For degree of spline $p$ and degree of penalty $q$, they demonstrate that the asymptotic behavior of the penalized spline can be similar to that of the regression spline if the number of knots grows slowly, and similar to that of the smoothing spline for a faster-growing number of knots. In particular, the regression spline behavior is seen when the number of knots grows at the rate $n^{1/(2p+3)}$ and $\lambda = O(n^{(p+2-q)/(2p+3)})$. For smoothing spline behavior, the number of knots grows at the rate $n^{\nu}$, where $1/(2q+1) \leq \nu < 2q/(2q+1)$, and $\lambda = O(n^{1/(2q+1)})$; then the rate $\hat f(x) - f(x) = O_p(n^{-q/(2q+1)})$ is attained.

The penalty parameter may be chosen through generalized cross-validation (GCV) methods (see Ruppert, Wand, and Carroll 2003, chapter 5), where the GCV for the fit $\tilde\mu$ with penalty parameter $\lambda$ is defined as
$$GCV(\lambda) = \frac{\sum_{i=1}^n (y_i - \tilde\mu_i)^2}{(1 - \mathrm{edf}/n)^2}. \qquad (4)$$

Suppose we can assume, in addition to smoothness, that $f$ is monotone or convex. The (unpenalized) monotone spline was introduced by Ramsay (1988) and extended to convex and other shape restrictions by Meyer (2008). The quadratic I-spline basis functions, together with the constant function, span the space of piecewise quadratic splines with the given knots, and have the additional property that the linear combination is monotone if and only if the coefficients of the basis functions are non-negative. The cubic C-splines can be constructed so that the basis functions, together with the constant function and the identity function, span the space of piecewise cubic functions with the given knots, and the linear combination is convex if and only if the coefficients on the basis functions are non-negative.

In the next section, the method for estimating the constrained penalized splines is presented, and it is shown that they have the same large-sample properties as the unconstrained penalized splines. In section 3 the GCV is used to check for monotonicity or convexity, extensions to the partial linear model, the generalized regression model, and the varying coefficient model are considered in section 4, and several examples of applications are given in section 5, demonstrating the broad utility of the methods.

2 Penalized Shape-Constrained Splines

The B-spline bases are used for penalized monotone or convex regression. For the monotone case, we use the quadratic splines and define the $k \times m$ matrix $S$ of slopes at the knots by $S_{ij} = \delta_j'(t_i)$. Then the linear combination $\sum_{j=1}^m b_j \delta_j(x)$ is non-decreasing if and only if the coefficient vector is in the set
$$\mathcal{C} = \{b : Sb \geq 0\} \subseteq \mathbb{R}^m. \qquad (5)$$
(Note that this result cannot hold for cubic or higher-degree splines, where the spline function may be increasing at the knots but decreasing between, and there is no set of linear constraints on $b$ that guarantees monotonicity.) For the convex case, we use cubic splines and $S_{ij} = \delta_j''(t_i)$; then the linear combination is convex if and only if $Sb \geq 0$. (For degree four or higher, there is no set of linear constraints on $b$ that guarantees convexity.) Note that $S$ is $k \times (k+1)$ for the monotone case and $k \times (k+2)$ for the convex case, and both are of full row rank. The extension to monotone-and-convex restrictions requires a $(k+1) \times (k+2)$ matrix $S$, containing the $k$ rows from the convex constraints and an additional row to guarantee a positive slope at the left-hand side of the interval. Clearly, variations such as decreasing and concave, or sigmoidal with known point of inflection, require simple changes to the matrix $S$. The set $\mathcal{C}$ is a convex polyhedral cone, and the problem of minimizing (3) for $b \in \mathcal{C}$ is a quadratic programming problem.
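For the monotone quadratic case, the slope matrix $S$ can be filled in by evaluating the first derivatives of the same B-spline basis at the knots. The sketch below, which reuses the objects from the earlier sketches, is one illustrative way to do this and is not necessarily the construction used in the paper's own code.

    # S[i, j] = delta_j'(t_i): first derivatives of the basis functions at the knots
    S <- splineDesign(allknots, knots, ord = p + 1, derivs = 1)

    # the fit B %*% b is non-decreasing if and only if all(S %*% b >= 0);
    # check whether the unconstrained penalized fit already satisfies the constraints
    all(S %*% btil >= 0)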

A routine such as the R function qpsolve might be used to find the coefficients of the basis functions. The problem may be transformed into a cone projection. Let $LL'$ be the Cholesky decomposition of $B'B + \lambda D'D$, then define $\phi = L'b$ and $z = L^{-1}B'y$, so that (3) can be written as $\|\phi - z\|^2$ and the constraints are $A\phi \geq 0$, where $A = S(L')^{-1}$. The minimizer $\hat\phi$ is the projection of $z$ onto the cone $\mathcal{C} = \{\phi : A\phi \geq 0\} \subseteq \mathbb{R}^m$, and subsequently $\hat b = (L')^{-1}\hat\phi$. Some necessary background for cone projection is given next; more details and proofs of the following claims may be found in Meyer (2003) or Silvapulle and Sen (2005, chapter 3).

To characterize the constrained solution and estimate the effective degrees of freedom of the model, the dimension of the face of the cone containing the solution is determined as follows, for the case where $A$ is $k \times m$ and of full row rank. (All cases discussed in this paper involve full row-rank constraint matrices. However, if $A$ is not of full row rank but is irreducible, the following characterizations may still be given using the methods of Meyer (1999) for constrained estimation with more constraints than dimensions.) First, the edges or generators of $\mathcal{C}$ must be found. Let $e_1, \ldots, e_{m-k}$ span the null space of $A$, and let $\tilde A$ be the square, nonsingular matrix whose first $k$ rows are the rows of $A$ and whose last rows are the $e$ vectors. Then the first $k$ columns of $\tilde A^{-1}$ are the edges $\gamma_1, \ldots, \gamma_k$ of the cone, and the cone can be written as
$$\mathcal{C} = \left\{ \phi : \phi = \sum_{j=1}^{m-k} a_j e_j + \sum_{j=1}^{k} c_j \gamma_j, \text{ for } c_j \geq 0,\ j = 1, \ldots, k \right\}. \qquad (6)$$
The projection of $z$ onto $\mathcal{C}$ lands on a face of the cone, where the $2^k$ faces are indexed by the collection of sets $J \subseteq \{1, \ldots, k\}$, and are defined by
$$F_J = \left\{ \phi : \phi = \sum_{j=1}^{m-k} a_j e_j + \sum_{j \in J} c_j \gamma_j, \text{ for } c_j > 0,\ j \in J \right\}.$$
Note that the interior of the cone $\mathcal{C}$ is a face with $J = \{1, \ldots, k\}$, and the linear space spanned by the $e_j$ vectors is a face with $J = \emptyset$.
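The change of variables and the resulting cone projection can be carried out with an off-the-shelf quadratic programming routine. The sketch below uses the quadprog package rather than the routine named above, and continues the objects from the earlier sketches; it is an illustration of the transformation, not the paper's implementation.

    library(quadprog)

    U <- chol(M)                                   # M = B'B + lambda D'D = LL', with L = t(U)
    z <- forwardsolve(t(U), crossprod(B, y))       # z = L^{-1} B'y
    A <- S %*% solve(U)                            # A = S (L')^{-1}, so constraints are A phi >= 0

    # project z onto the cone {phi : A phi >= 0}: minimize ||z - phi||^2 subject to A phi >= 0
    qp     <- solve.QP(Dmat = diag(ncol(A)), dvec = as.vector(z),
                       Amat = t(A), bvec = rep(0, nrow(A)))
    phihat <- qp$solution
    bhat   <- backsolve(U, phihat)                 # constrained coefficients, (L')^{-1} phihat
    muhat  <- B %*% bhat                           # constrained penalized spline fit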

Further, the faces partition $\mathcal{C}$. The cone projection algorithm determines the face $F_J$ on which the projection falls; then the solution coincides with the ordinary least-squares projection onto the linear space spanned by the $e_j$ vectors and the edges $\gamma_j$ for $j \in J$. To compute the edf associated with the constrained penalized fit, let $U_J$ be the matrix whose columns are the $e$ vectors and the $\gamma_j$ vectors for $j \in J$. Then the constrained estimate of $\mu$ is
$$\hat\mu = B(L')^{-1} U_J (U_J'U_J)^{-1} U_J' L^{-1} B' y =: P_J y,$$
and hence the edf is the trace of $P_J$. Note that if $J = \{1, \ldots, k\}$, i.e., all edges are used, then $U_J(U_J'U_J)^{-1}U_J'$ is the identity matrix, and the constrained edf is identical to the unconstrained edf. This happens when the unconstrained spline satisfies the constraints; otherwise, the edf for the constrained version is smaller. The edf for the constrained version is a random quantity with $k+1$ possible values, the largest of which is that of the unconstrained version.

A cross-validation scheme for the constrained version considers
$$CV(\lambda) = \sum_{i=1}^n [y_i - \hat f_{(-i),\lambda}(x_i)]^2,$$
where $\hat f_{(-i),\lambda}$ is the estimated regression function using the edge vectors with the $i$th observation removed (the knots stay the same). If the index set $J$ of edges used by this projection is the same as that using all the data, then
$$y_i - \hat f_{(-i),\lambda}(x_i) = \frac{y_i - \hat f_\lambda(x_i)}{1 - h_i},$$
where $h_i$ is the $i$th diagonal element of the projection matrix for the face indexed by $J$ and $\hat f_\lambda$ is the estimated regression function using all the data. Replacing $h_i$ with the average over $i$ gives the GCV in the form (4). Although the set $J$ may vary when an observation is ignored, particularly for points of high leverage, the approximation is useful for the selection of $\lambda$ and the check of the shape assumptions in the next section.
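A small helper for the GCV in the form (4) is used in the sketches that follow; the edf is passed in as an argument because, for the constrained fit, it depends on the dimension of the face of the cone on which the projection lands.

    # GCV in the form (4): residual sum of squares scaled by (1 - edf/n)^2
    gcv_score <- function(y, fitted, edf) {
      n <- length(y)
      sum((y - fitted)^2) / (1 - edf / n)^2
    }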

The penalized regression spline fit to (1) where $\mathrm{cov}(\epsilon) = A$ for known positive-definite $A$ is a simple extension of either the constrained or unconstrained model. The expression to be minimized is
$$\psi(b; y) = b'(B'A^{-1}B + \lambda D'D)b - 2y'A^{-1}Bb,$$
which can be accomplished as above with the Cholesky decomposition $LL' = B'A^{-1}B + \lambda D'D$ and $z = L^{-1}B'A^{-1}y$. This extension is important for correlated-errors models, for heteroskedastic data, and for iteratively re-weighted least-squares methods such as those used for generalized regression.

The estimators are more robust to knot choices when shape restrictions are employed, especially if convex or concave constraints are appropriate. The restriction to the cone allows for all the flexibility within the constraints, without allowing the wiggling associated with over-fitting in unconstrained non-parametric function estimation. To illustrate, a data set of size $n = 50$ is used to compare quadratic regression spline fits in Figure 2. In plot (a) the unconstrained, unpenalized spline fits with 3, 4, 5, and 6 equally spaced knots (having edf = 4, 5, 6, and 7) are the most variable. The penalized spline fits, each with 10 knots placed evenly in the span of x-values, are shown in plot (b), with choices of penalty parameter that coincide with 4, 5, 6, and 7 edf. Next, suppose that the relationship must be monotone increasing. In plot (c) we see the unpenalized, constrained splines, again with 3, 4, 5, and 6 knots, and finally, in plot (d) the penalized constrained splines (with the same penalty parameters as in (b)) are seen to vary the least with edf, i.e., are most robust to user choices.

Next we show that the constrained penalized splines have the same convergence rates as the unconstrained penalized splines, if $f'(x) \geq 0$ over $[0, 1]$, under the following assumptions:

A1. The regression function $f$ is in $C^{p+1}[0, 1]$.

A2. The errors $\epsilon_i$ are uncorrelated with zero mean and common variance $\sigma^2$.

A3. There is a constant $M$ such that for $\delta_{\min} = \min_{i=1,\ldots,k-1}(t_{i+1} - t_i)$ and $\delta_{\max} = \max_{i=1,\ldots,k-1}(t_{i+1} - t_i)$, we have $\delta_{\max}/\delta_{\min} \leq M$ for all $n$, and $\delta_{\max} = o(k^{-1})$.

A4. There is a design distribution function $Q$ with positive density so that $\sup_{x \in [0,1]} |Q_n(x) - Q(x)| = o(1/k)$, where $Q_n$ is the empirical distribution function for $x_1, \ldots, x_n$.

A5. The number of knots $k = o(n)$.

Figure 2: Spline fits to simulated data. Each plot shows curves using 4 (black), 5 (red), 6 (green), and 7 (blue) equivalent degrees of freedom: (a) unconstrained, unpenalized; (b) unconstrained, penalized; (c) constrained, unpenalized; and (d) constrained, penalized.

The Kuhn-Tucker conditions (see Silvapulle and Sen 2005, Appendix 1) for the unweighted projection $\hat\phi$ minimizing $\|z - \phi\|^2$ over $\phi \in \mathcal{C}$ are $\langle z - \hat\phi, \hat\phi \rangle = 0$ and $\langle z - \hat\phi, \phi \rangle \leq 0$ for all $\phi \in \mathcal{C}$. These are easily seen to be equivalent to
$$\langle y - \hat\mu, \hat\mu \rangle = \lambda \hat b' D'D \hat b \quad \text{and} \quad \langle y - \hat\mu, \mu \rangle \leq \lambda \hat b' D'D b, \quad \text{for all } \mu = Bb,\ b \in \mathcal{C}.$$
For the unconstrained estimator $\tilde b$, we have $\langle y - \tilde\mu, \mu \rangle = \lambda \tilde b' D'D b$ for all $\mu = Bb$, $b \in \mathcal{C}$. Let $b_0$ be the minimizer of $\psi(b; \mu)$ for $\psi$ defined in (3), so that $\mu_0 = Bb_0$ is the penalized spline fit to the true function values, and suppose $Sb_0 \geq 0$, so that the shape restrictions hold.

Then
$$\begin{aligned} \|\tilde\mu - \mu_0\|^2 &= \|\tilde\mu - \hat\mu\|^2 + \|\hat\mu - \mu_0\|^2 + 2\langle \tilde\mu - \hat\mu, \hat\mu - \mu_0 \rangle \\ &= \|\tilde\mu - \hat\mu\|^2 + \|\hat\mu - \mu_0\|^2 - 2\langle y - \tilde\mu, \hat\mu - \mu_0 \rangle + 2\langle y - \hat\mu, \hat\mu - \mu_0 \rangle \\ &\geq \|\tilde\mu - \hat\mu\|^2 + \|\hat\mu - \mu_0\|^2 - 2\lambda(\tilde b - \hat b)'D'D(\hat b - b_0). \end{aligned}$$
From the properties of B-splines, we have $\|\tilde\mu - \hat\mu\|^2 = \|\tilde b - \hat b\|^2 O_p(n/k)$, so that
$$\|\tilde\mu - \mu_0\|^2 \geq \|\hat\mu - \mu_0\|^2 + \|\tilde b - \hat b\|^2 O_p(n/k) - 2\lambda(\tilde b - \hat b)'D'D(\hat b - b_0).$$
Using the banded nature of $D$ and $\hat b_j = O_p(1)$ for all $j$, and letting $\lambda$ and $k$ grow at the optimal rates given in section 1 (for either regression spline rates or smoothing spline rates), we have that the convergence rate for the constrained penalized splines is at least that for the unconstrained versions.

To compare the constrained and unconstrained fits for small samples, simulations were conducted for two choices of regression function, edf, and sample size. The x-values are equally spaced in $(0, 1)$ and unit model variance is used. The function $f(x) = 1.5(2x-1)^3$ is flat in the middle and steep on either end, while $f(x) = \log(2x + .1)$ is increasing rapidly at the left end of the interval and flattens out at the right end. For the latter, monotone-and-concave constraints were also considered. The penalty parameters for the $n = 50$ case are chosen so that the edf values for the unconstrained estimator are 4 and 7, while the values for $n = 100$ are chosen so that the edf is 5 and 8. In addition, we compute the constrained and unconstrained fits using the minimum-GCV value of the penalty parameter, where the range for $n = 50$ is such that the unconstrained edf varies from 3.5 to 8 by half-steps, and for $n = 100$, the range is. The minimum-GCV value of the penalty parameter is typically smaller for the constrained case, because the constrained estimator allows for a high degree of flexibility without introducing the wiggling that is typically associated with over-fitting in non-parametric regression.

In particular, with the constrained estimator, one can decrease the penalty parameter without (necessarily) increasing the effective degrees of freedom of the fit, because the face of the cone on which the projection lands may be of the same small dimension as for the fit with larger penalty. The computations here and in the rest of the paper use $q = 3$, so that the fit for larger penalty parameter tends toward a cubic polynomial.

The square root of the average squared error loss (SASEL) for the $r$th data set was computed as
$$\mathrm{SASEL}_r = \left[ \frac{1}{n} \sum_{i=1}^n (\hat\mu_{ri} - \mu_i)^2 \right]^{1/2},$$
and the average over 10,000 data sets is reported in Table 1. In each case, the more stringent constraints provide smaller SASEL, with bigger improvements shown for larger edf.

Table 1: SASEL for constrained and unconstrained penalized splines, for $f(x) = 1.5(2x-1)^3$ (monotone and unconstrained fits) and $f(x) = \log(2x + .1)$ (monotone-and-concave, monotone, and unconstrained fits), by $n$, $k$, and edf. The model variance is unity, the $x$ values are equally spaced in $(0, 1)$, and $N = 10{,}000$ datasets per cell.

Comparisons of the penalized spline and the monotone B-spline of He and Shi (1998) give the expected results that the former has smaller MSE for normal errors and the latter has smaller MSE for double-exponential errors. Simulation results for SASEL are given in Table 2 for two regression functions and two sample sizes. The method of He and Shi was coded using the R function lpcdd for the linear programming; there are $m + 2n$ variables with $m - 1 + 3n$ linear equality and inequality constraints, and for larger sample sizes the routine did not always converge.
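A direct transcription of the loss used in the simulations, together with a toy simulation loop for the unconstrained penalized fit, is sketched below using the earlier objects. The number of replications and the regression function are illustrative, and the paper's full design (both fits, several edf values, 10,000 data sets per cell) is not reproduced.

    # square root of the average squared error loss for one simulated data set
    sasel <- function(muhat, mu) sqrt(mean((muhat - mu)^2))

    mu_true <- 1.5 * (2 * x - 1)^3
    losses  <- replicate(100, {
      ysim <- mu_true + rnorm(length(x))
      sasel(B %*% solve(M, crossprod(B, ysim)), mu_true)
    })
    mean(losses)                            # average SASEL over the replications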

Table 2: SASEL for constrained penalized splines (CPS) compared with the monotone B-splines of He and Shi (HS), for $f(x) = 1.5(2x-1)^3$ and $f(x) = \log(2x + .1)$, by $n$ and edf. The model variance is unity, the $x$ values are equally spaced in $(0, 1)$, and $N = 10{,}000$ datasets per cell.

3 Check for Validity of Shape Assumptions

The shape assumptions often fall into the category of a priori knowledge, but occasionally the research question might concern the shape. The classic formal hypothesis test of monotone versus non-monotone is presented in Robertson, Wright, and Dykstra (1988), which tests $H_0$: $f$ is monotone versus $H_a$: $f$ is not monotone, using unsmoothed monotone regression. Their test statistic has the distribution of a mixture of beta random variables if the true function is constant. For $f$ not constant, the test can be badly biased. Bowman, Jones, and Gijbels (1998) formulate a test using kernel regression, where they define the critical bandwidth to be the smallest for which the fit is monotone. Using the errors from this fit, a bootstrap method is used for the p-value. The test proposed by Ghosal, Sen, and van der Vaart (2000) uses a locally-weighted Kendall's tau statistic to determine whether there are any decreasing portions of the regression function. Hall and Heckman (2000) compute a running derivative by fitting lines over portions of the data, and their test statistic is the minimum slope. Juditsky and Nemirovski (2002) formulate a general problem of testing whether the signal in a Gaussian random process is contained in a convex cone.

A simple method to check the validity of the shape assumptions is proposed here, based on the generalized cross-validation (GCV) choices of the penalty parameters.

We choose a set of possible penalty parameters, and compute the constrained and unconstrained penalized spline fits and the GCV for each parameter value. If the minimum GCV value for the constrained splines is smaller than that for the unconstrained splines, that is evidence that the shape restrictions hold. Conversely, if the smallest GCV value is for an unconstrained spline, then we conclude that there is evidence against the shape restriction. In the case of a tie, we decide in favor of the shape restrictions, because a tie often indicates that the unconstrained fit satisfies the constraints.

The procedure is illustrated in Figure 3, using a data set with $n = 50$ generated from the regression function shown as the dashed curve in plot (b). Ten penalty parameters were used, chosen so that the edf values for the unconstrained fit are 3.5, 4.0, ..., 8.0, with 10 knots equally spaced over the range of the x-values. The GCV values are shown in plot (a), where the values for the constrained fit are shown as circles and those for the unconstrained fit are triangles. The winner is the unconstrained spline with penalty parameter chosen to coincide with edf = 7.5. The minimum-GCV constrained and unconstrained fits are shown in plot (b).

Figure 3: (a) The GCV values for the data set of Figure 2, where the black circles are for the constrained version and the red triangles for the unconstrained splines. The method chooses the unconstrained version with edf = 7.5, shown in plot (b) along with the best constrained fit. The dashed curve is the true regression function.
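The check can be organized as a small loop over a grid of penalty parameters, computing both fits and their GCV values at each $\lambda$. The sketch below reuses the earlier objects, uses an illustrative $\lambda$ grid rather than one calibrated to target edf values, and, as a simplification, plugs the unconstrained edf into the GCV for the constrained fit; this is a conservative shortcut assumed for the sketch, since the constrained edf is never larger than the unconstrained edf.

    fit_pair <- function(lambda, B, D, S, y) {
      M <- crossprod(B) + lambda * crossprod(D)
      U <- chol(M)
      z <- forwardsolve(t(U), crossprod(B, y))
      A <- S %*% solve(U)
      # unconstrained penalized fit and its effective degrees of freedom
      bu   <- solve(M, crossprod(B, y))
      edfu <- sum(diag(B %*% solve(M, t(B))))
      # constrained fit via the cone projection
      qp <- quadprog::solve.QP(diag(ncol(A)), as.vector(z), t(A), rep(0, nrow(A)))
      bc <- backsolve(U, qp$solution)
      # the constrained edf should use the dimension of the face containing the
      # projection; the unconstrained edf is used here as a conservative stand-in
      c(unconstrained = gcv_score(y, B %*% bu, edfu),
        constrained   = gcv_score(y, B %*% bc, edfu))
    }

    grid   <- 10^seq(-2, 3, length.out = 10)          # illustrative penalty grid
    scores <- sapply(grid, fit_pair, B = B, D = D, S = S, y = y)
    # decide in favor of monotonicity when the smallest GCV comes from a constrained fit
    min(scores["constrained", ]) <= min(scores["unconstrained", ])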

Although this method is not a formal hypothesis test, we can show that it has nice large-sample properties. First, suppose that the true regression function is strictly increasing, so that $f'(x) \geq \xi > 0$ for $x \in [0, 1]$, and suppose $\lambda$ and $k$ grow with $n$ at the rates necessary to obtain the optimal convergence rates. Then the probability that $\tilde f'(t_j) > 0$ for all $j = 1, \ldots, k$ is
$$P(S\tilde b \geq 0) \geq P(S(\tilde b - b_0) \geq -\xi \mathbf{1}),$$
which goes to one as $n$ increases, by the consistency of the unconstrained estimator $\tilde b$. Therefore, the constraints are asymptotically unbinding, and the probability that the GCVs are identical (and hence the constrained version is chosen) goes to one. Next, suppose that the true regression function is decreasing over an interval, so that $f'(x) \leq \xi < 0$ for $x \in [a, b]$ where $0 \leq a < b \leq 1$. Then SSE$/n$ for the unconstrained version goes to $\sigma^2$, but SSE$/n$ for the constrained version goes to $\sigma^2 + c$, where $c \geq (b-a)^2\xi^2/4$. The limit of the numerator of the GCV, divided by $n$, is larger by $c$ for the constrained version, while the denominators both tend to one, so that the probability that the GCV for the unconstrained fit is smaller (and the unconstrained version is chosen) goes to one.

The small-sample performance of the procedure is illustrated through simulations. The percents of times the correct choice is made are provided in Table 3, for four choices of regression function and three sample sizes, where 10,000 data sets were generated for each combination, and the model variance is $\sigma^2 = 1$. For the smallest sample size $n = 50$, the 10 penalty parameters are chosen so that the edf for the unconstrained fit ranges from 3.5 to 8. For $n = 100$, the range is 4 to 8.5, and for $n = 200$, the range is 4.5 to 9. The regression functions are chosen from the Bowman et al. (1998) paper and the Ghosal et al. (2000) paper. For the former, $f_a(x) = 1 + x - a\exp\{-50(x - .5)^2\}$ for $a = .15$, $.25$, and $.45$. Note that $f_{.15}$ is increasing over the range $[0, 1]$, $f_{.25}$ has a mild dip, and $f_{.45}$ has a more pronounced dip. That paper uses $\sigma = .1$ for their simulations; we use $\sigma = 1$ and $10 f_a$. Their rates of correct choice for $a = .15$ are 97.3, 98.2, and 99.2 percent, for $n = 50$, $n = 100$, and $n = 200$, respectively. For $a = .25$, their rates are 8.8, 10.0, and 17.4 percent correct choices, and for $a = .45$, their rates are 37.4, 54.4, and 87.4 percent.

Ghosal et al. (2000) defined
$$\mu_4(x) = \begin{cases} 10(x - .5)^3 \exp\{-1000(x - .25)^2\}, & x \in [0, .5), \\ 0.1(x - 0.5)\exp\{-100(x - 0.25)^2\}, & x \in [.5, 1], \end{cases}$$
which has a very sharp dip followed by a flat section. In that paper they use $\sigma = .1$; here we use $\sigma = 1$ and $3\mu_4$. The signal-to-noise ratio for our simulations is over three times smaller, but their rates of correct choices are lower: 2.6, 8.0, and 75.7 percent, for $n = 50$, 100, and 200. The method proposed here consistently has higher power than the two competitors, for examples chosen for those methods. The advantage is that the new method does not require the choice of a single penalty parameter or bandwidth, but allows the flexibility of the fit to be chosen by the data. Therefore, the sharp dip of $\mu_4$ and the more regular variation of $f_a$ are fit equally well.

Table 3: Simulations for the GCV method for determination of monotonicity, for the functions $f_{.15}$, $f_{.25}$, $f_{.45}$, and $\mu_4$ with $n = 50$, 100, and 200. The model variance is unity, with $N = 10{,}000$ datasets per cell. Entries are percents of correct choices.

4 Extensions

The partial linear model is a simple extension. An arbitrary number of covariates to be modeled parametrically can be included in the model, so that
$$y_i = z_i'a + f(x_i) + \sigma\epsilon_i \qquad (7)$$
for $z_i \in \mathbb{R}^r$, where again the $\epsilon_i$ are iid random variables with mean zero and unit variance.
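The design and constraint matrices for (7) can be assembled by prepending the covariate columns to the spline basis and padding the shape constraints with zeros, as the next paragraph describes. In this sketch Z is a hypothetical single binary covariate, and leaving the covariate coefficients unpenalized is an assumption of the sketch, not something stated in the paper.

    # hypothetical covariate matrix Z (here r = 1 binary covariate) for model (7)
    Z  <- cbind(rbinom(n, 1, 0.5))
    r  <- ncol(Z)
    W  <- cbind(Z, B)                          # first r columns: covariates; last m: basis
    S2 <- cbind(matrix(0, nrow(S), r), S)      # constraint matrix: zeros for covariate columns
    D2 <- cbind(matrix(0, nrow(D), r), D)      # assumption: covariate coefficients unpenalized
    # the penalized cone projection then proceeds as before, with W, S2, D2 in place of B, S, D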

Let the $n \times r$ matrix $Z$ be the design matrix for the covariates, and suppose that its columns and the $x$ vector form a linearly independent set, and further suppose that the $\mathbf{1}$ vector is not in the column space of $Z$. Estimates of $E(y)$ will be of the form $Za + Bb = Wc$, where the first $r$ columns of $W$ are the covariate vectors and the last $m$ are the B-spline basis vectors. The constraint $Sc \geq 0$ is again imposed, where now the first $r$ columns of the $k \times (r + m)$ matrix $S$ contain zeros. The cone projection proceeds as described in section 2, except that now the null space of $A$ is of dimension $m + r - k$, so that there are more unconstrained dimensions in the constraint set $\mathcal{C}$. The ability to add covariates to a model is important if there is a possible confounding variable or to account for another source of variation.

For the generalized regression problem, the responses $y_i$ are independent observations from a distribution written in the form of an exponential family:
$$f(y_i) = \exp\{[y_i\theta_i - b(\theta_i)]/\tau^2 + c(y_i, \tau)\},$$
where the specifications of $b$ and $c$ determine the sub-family of models. Common examples are $b(\theta) = \log(1 + e^{\theta})$ for the Bernoulli and $b(\theta) = \exp(\theta)$ for the Poisson model. The vector $\mu$ is defined as $E(y)$, and $\mu_i = b'(\theta_i)$. The variance of $y_i$ is $b''(\theta_i)\tau^2$, and the variance function is written in terms of the mean as $\mathrm{var}(y_i) = V(\mu_i)$. The mean vector is related to a design matrix of predictor variables through a link function $g(\mu_i) = \eta_i$; for the generalized linear model, $\eta_i = x_i'\beta$, where $x_i$ is the $i$th row of the design matrix and $\beta$ is the parameter vector to be estimated. Here, we let $\eta_i = f(x_i) + z_i'\alpha$, where $f$ has known shape and degree of smoothness, and $z_i$ is an $r$-vector of covariates. It is often natural to impose monotonicity restrictions on the mean function; for example, the probability of cancer in a lab rat might be increasing with the amount of carcinogen administered, or the expected count of water bird nests at a lake might be known to be decreasing with the amount of boat traffic. If the link function is one-to-one, this is equivalent to constraining the function component $f$ to be monotone.

The algorithm involves iteratively re-weighted quadratic programming problems, following the same ideas for the generalized linear model as found in McCullagh & Nelder (1989). Starting with $\eta^0 \in \mathcal{C}$, the estimate $\eta^{k+1}$ is obtained from $\eta^k$ by constructing a data vector $u^k$, where
$$u_i^k = \eta_i^k + (y_i - \mu_i^k)\left(\frac{d\eta}{d\mu}\right)_{ki},$$
$\mu_i^k = g^{-1}(\eta_i^k)$, and the derivative of the link function is evaluated at $\mu_i^k$. Weights $w_i^k$ are defined by $1/w_i^k = (d\eta/d\mu)^2_{ki} V(\mu_i^k)$. The estimate $\eta^{k+1}$ is obtained as the weighted penalized constrained regression spline with data $u^k$; i.e., the weighted projection of $u^k$ onto the cone defined by the study design. The algorithm is guaranteed to converge to the penalized maximum likelihood estimator; the proof is similar to the convergence proof in Meyer and Woodroofe (2004). The GCV may be used as a check for the validity of shape assumptions, where minus two times the log likelihood is used in place of the sum of squared residuals, and the edf for the constrained version is computed according to the penalty parameter and the dimension of the cone face containing the solution.

The varying coefficient linear regression model was introduced by Hastie and Tibshirani (1993). A simple version is
$$y_i = \beta_0(u_i) + \beta_1(u_i)x_i + \epsilon_i,$$
where the functions $\beta_0$ and $\beta_1$ are assumed to be smooth. Here we impose a shape such as monotonicity or convexity on either or both coefficient functions. We define basis functions $\delta_j(u)$, $j = 1, \ldots, m$, over the range of $u$ values, and create the matrix $B$ with columns $\delta_1, \ldots, \delta_m, \delta_1 x, \ldots, \delta_m x$. The $2m \times 2m$ penalty matrix $D$ is block diagonal with identical $m \times m$ $q$th-order difference matrices. The constraint matrix $S$ is constructed according to the desired shapes for $\beta_0$ and $\beta_1$. Then the estimation and inference methods of the previous sections can be implemented.
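The varying-coefficient design just described can be sketched as follows. The modifier u, the knot choice, and the use of rectangular difference blocks (playing the role of the block-diagonal penalty above) are illustrative assumptions of the sketch.

    # basis in the modifier u; quadratic B-splines with knots over (0, 1) (illustrative)
    u   <- runif(n)
    ku  <- seq(0, 1, length.out = 6)
    aku <- c(rep(0, 3), ku[2:5], rep(1, 3))
    Bu  <- splineDesign(aku, u, ord = 3)             # n x m_u basis matrix in u

    Bvc <- cbind(Bu, Bu * x)                         # columns: delta_j(u), then delta_j(u) * x_i
    m_u <- ncol(Bu)
    Dq  <- diff(diag(m_u), differences = 3)          # q = 3 difference matrix for one block
    Dvc <- rbind(cbind(Dq, matrix(0, nrow(Dq), m_u)),
                 cbind(matrix(0, nrow(Dq), m_u), Dq))   # block-diagonal penalty
    # a constraint matrix for, say, a decreasing beta_1 would act only on the last m_u columns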

5 Examples and Discussion

The British coal mine accident data are used to illustrate the method of validation of the monotonicity assumption for count data. These data were used in the Eilers and Marx (1996) paper on penalized splines. Counts of serious mine accidents per year from 1850 to 1961 are shown in Figure 4(a). Suppose a historian believes that the risk of accident is decreasing over the time span, and that the spike in the 1930s can be explained by random chance. Is there evidence in the data to contradict this belief? Constrained and unconstrained penalized quadratic spline fits were computed for penalty parameters chosen so that the effective degrees of freedom for the unconstrained spline increases from 3.5 to 7.5 by half-steps. The GCV values are shown in plot (b); the minimum value is attained by the constrained version, so we find no evidence that the historian is incorrect.

Figure 4: Analysis of British coal mine accident data. (a) Counts of accidents per year, along with estimated mean functions. The dashed curve is the penalized quadratic spline with third-order penalties, with penalty parameter chosen so that the effective degrees of freedom of the model is 6.5. The solid curve is the constrained version with the same penalty parameter. (b) Generalized cross-validation values for nine values of the penalty parameter, chosen so that the unconstrained spline fits have edf ranging from 3.5 to 7.5.
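Yearly counts resembling this series can be tabulated, at least approximately, from the coal-mining disaster dates in the boot package; this is only a stand-in for the series used by Eilers and Marx (1996) and may not match it exactly, and the constrained Poisson fit itself would use the iteratively re-weighted algorithm of section 4.

    # tabulate serious accidents per year from the boot package's disaster dates
    dates  <- boot::coal$date
    years  <- floor(dates)
    counts <- table(factor(years, levels = min(years):max(years)))
    plot(as.integer(names(counts)), as.integer(counts),
         xlab = "year", ylab = "accidents per year")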

The ability to include covariates when modeling the relationship between the response and the main predictor of interest is important when there may be a confounding effect, or simply to account for another source of variation. For example, we use the leuk data set from the MASS library in R, containing survival times and white blood cell counts of 33 leukemia patients. A two-level covariate records the presence or absence of a morphologic characteristic of the blood cells. These data are depicted in Figure 5(b), where the log of the survival time is plotted against the log of the white blood cell count, and the presence or absence of the characteristic is indicated by the plot character. Plot (a) contains the GCV values for fits constrained to be decreasing in the continuous predictor, and for the unconstrained fits. For the smallest effective degrees of freedom, the unconstrained and constrained fits coincide, but the minimum GCV occurs for the constrained fit at the fourth-largest penalty parameter. The best fit is shown as the solid curves in plot (b). The dashed curves are the smallest-GCV unconstrained fit. These are decreasing, but oversmoothed compared to the best constrained fit.

Figure 5: (a) GCV values for constrained and unconstrained fits to the leukemia data. (b) The solid curves are the constrained fit with penalty parameter chosen so that the unconstrained fit has edf = 6. The dashed curves are the fit with edf = 4.
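The leukemia data can be loaded and plotted as below; the subsequent decreasing, partial-linear fit would combine this with the covariate construction sketched in section 4. The plotting choices are illustrative.

    # survival times, white blood cell counts, and the morphology indicator (MASS::leuk)
    data(leuk, package = "MASS")
    logtime <- log(leuk$time)
    logwbc  <- log(leuk$wbc)
    present <- leuk$ag == "present"                 # two-level morphologic covariate
    plot(logwbc, logtime, pch = ifelse(present, 16, 1),
         xlab = "log(white blood cell count)", ylab = "log(survival time)")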

For an example of the varying coefficient model, consider data from a lawsuit from the early 1990s. The United Auto Workers sought pay equity adjustments for civil service jobs in the state of Michigan (Killingsworth 2002), claiming that women were systematically underpaid by the state based on comparable worth arguments. Both parties agreed to the appointment of a committee to assign points to civil service job categories, meant to represent worth. The data set then consists of three variables for each of 176 categories in Michigan civil service jobs: the points assigned by the committee, the percent of women in the job category, and the maximum wages (in dollars per hour) for the category. The data are shown in Figure 6(a) and (b), with wages plotted against each of the two predictors. The wages appear to be a decreasing function of the percent of women employees, but perhaps the job categories with more women have, on average, lower points. As seen in plot (b), wages rise sharply with increasing points, with a curvature that could be accounted for if the slope and intercept of the linear relationship depend on the percent women.

Figure 6: (a) Wages versus percent women for the UAW data; (b) wages versus points; (c) fit to the varying coefficient model for 0, 30, 50, 70, and 100% women.

The model is $y_i = \beta_0(u_i) + \beta_1(u_i)x_i + \epsilon_i$, where $y_i$ is wages for the $i$th job category, $x_i$ is the points, and $u_i$ is percent women. Interest is in whether the slope function is decreasing with the percent of women in the job category. The function $\beta_0$ is unconstrained. Penalty parameters that result in between 6 and 12.5 effective degrees of freedom for the unconstrained fit are chosen, and the GCV scores for the constrained and unconstrained fits are shown in Figure 7(a). The dotted line near the top indicates the GCV for the constant-slope model with varying intercept, indicating that the varying slope function provides a better fit. The constrained and unconstrained fits coincide for the seven largest penalty parameters, and the GCV for the unconstrained fit is increasing past where they diverge.

Figure 7: (a) The GCV values for the constrained and unconstrained slope functions are identical for the first seven values, and the smallest GCV occurs for an edf of 6.5. (b) The estimated intercept function for edf = 6.5, and (c) the estimated slope function.

The minimum-GCV fits for the intercept and slope functions are shown in plots (b) and (c). Finally, example fits to the scatterplot are shown in Figure 6(c), where the lines correspond to 0, 30, 50, 70, and 100% women, from the top to the bottom on the right.

In conclusion, we have found that regression spline fits are more robust to user-defined choices when shape constraints are imposed. The constrained, penalized splines are computed via a weighted projection onto a polyhedral convex cone, which is a subset of the model space for the unconstrained penalized splines. All the flexibility of the penalized splines is retained, within the bounds of the shape restrictions. The convergence rates for the constrained versions are the same as for the unconstrained versions, but in practice the constrained fits tend to have smaller squared error loss than the unconstrained fits, as well as more robustness to penalty parameter choice. The GCV method of choosing the penalty parameter provides a simple way to determine whether the shape assumptions are valid; this is not a formal hypothesis test but has been shown to have good large-sample properties and compares well to two competing methods. A straightforward extension to the additive models has been given, where the fit is obtained through a single cone projection; no back-fitting or two-step procedures are necessary.

The ability to account for covariates in data analysis is important, to check for possible confounding effects and to model another source of variation. The generalized regression model and the varying-coefficient model are also simple extensions, and covariates are easily included in the model. User-friendly R code for all methods presented in this paper can be found at meyer/penspl.htm.

References

[1] Bowman, Jones, and Gijbels (1998) Testing monotonicity of regression. Journal of Computational and Graphical Statistics 7(4).

[2] de Boor, C. (2001) A Practical Guide to Splines, revised edition. Springer, New York.

[3] Claeskens, G., Krivobokova, T., and Opsomer, J.D. (2009) Asymptotic properties of penalized spline estimators. Biometrika 96(3).

[4] Eilers, P.H.C. and Marx, B.D. (1996) Flexible smoothing with B-splines and penalties. Statistical Science 11(2).

[5] Friedman, J. and Tibshirani, R. (1984) The monotone smoothing of scatterplots. Technometrics 26(3).

[6] Ghosal, Sen, and van der Vaart (2000) Testing monotonicity of regression. Annals of Statistics 28(4).

[7] Hall, P. and Heckman, N.E. (2000) Testing for monotonicity of a regression mean by calibrating for linear functions. Annals of Statistics 28(1).

[8] Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models. Chapman & Hall, London.

[9] Juditsky, A. and Nemirovski, A. (2002) On nonparametric tests of positivity/monotonicity/convexity. Annals of Statistics 30(2).

[10] Killingsworth, M.R. (2002) Comparable worth and pay equity: recent developments in the United States. Canadian Public Policy 28.

[11] Mammen, E. (1991) Estimating a smooth monotone regression function. Annals of Statistics 19(2).

[12] Mammen, E. and Thomas-Agnan (1999) Smoothing splines and shape restrictions. Scandinavian Journal of Statistics.

[13] McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, Second Edition. Chapman & Hall, New York.

[14] Meyer, M.C. (1999) An extension of the mixed primal-dual bases algorithm to the case of more constraints than dimensions. Journal of Statistical Planning and Inference 81.

[15] Meyer, M.C. (2003) A test for linear versus convex regression function using shape-restricted regression. Biometrika 90(1).

[16] Meyer, M.C. and Woodroofe, M. (2004) Estimation of a unimodal density using shape restrictions. Canadian Journal of Statistics 32(1).

[17] Meyer, M.C. (2008) Inference using shape-restricted regression splines. Annals of Applied Statistics 2(3).

[18] Ramsay, J.O. (1988) Monotone regression splines in action. Statistical Science 3(4).

[19] Robertson, T., Wright, F.T., and Dykstra, R.L. (1988) Order Restricted Statistical Inference. John Wiley & Sons, New York.

[20] Ruppert, D., Wand, M.P., and Carroll, R.J. (2003) Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.

[21] Silvapulle, M.J. and Sen, P.K. (2005) Constrained Statistical Inference. John Wiley & Sons, New York.

[22] Tantiyaswasdikul, C. and Woodroofe, M. (1994) Isotonic smoothing splines under sequential designs. Journal of Statistical Planning and Inference.

[23] Utreras, F.I. (1985) Smoothing noisy data under monotonicity constraints: existence, characterization, and convergence rates. Numerische Mathematik.

[24] Wang, X. and Li, F. (2008) Isotonic smoothing splines regression. Journal of Computational and Graphical Statistics 17(1).

[25] Zhou, S., Shen, X., and Wolfe, D.A. (1998) Local asymptotics for regression splines and confidence regions. Annals of Statistics 26(5).


More information

Nonparametric Frontier Estimation: The Smooth Local Maximum Estimator and The Smooth FDH Estimator

Nonparametric Frontier Estimation: The Smooth Local Maximum Estimator and The Smooth FDH Estimator Nonparametric Frontier Estimation: The Smooth Local Maximum Estimator and The Smooth FDH Estimator Hudson da S. Torrent 1 July, 2013 Abstract. In this paper we propose two estimators for deterministic

More information

Generalized Additive Models

Generalized Additive Models :p Texts in Statistical Science Generalized Additive Models An Introduction with R Simon N. Wood Contents Preface XV 1 Linear Models 1 1.1 A simple linear model 2 Simple least squares estimation 3 1.1.1

More information

Lecture 17: Smoothing splines, Local Regression, and GAMs

Lecture 17: Smoothing splines, Local Regression, and GAMs Lecture 17: Smoothing splines, Local Regression, and GAMs Reading: Sections 7.5-7 STATS 202: Data mining and analysis November 6, 2017 1 / 24 Cubic splines Define a set of knots ξ 1 < ξ 2 < < ξ K. We want

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Divide and Conquer Kernel Ridge Regression

Divide and Conquer Kernel Ridge Regression Divide and Conquer Kernel Ridge Regression Yuchen Zhang John Duchi Martin Wainwright University of California, Berkeley COLT 2013 Yuchen Zhang (UC Berkeley) Divide and Conquer KRR COLT 2013 1 / 15 Problem

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

A General Greedy Approximation Algorithm with Applications

A General Greedy Approximation Algorithm with Applications A General Greedy Approximation Algorithm with Applications Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, NY 10598 tzhang@watson.ibm.com Abstract Greedy approximation algorithms have been

More information

Model selection and validation 1: Cross-validation

Model selection and validation 1: Cross-validation Model selection and validation 1: Cross-validation Ryan Tibshirani Data Mining: 36-462/36-662 March 26 2013 Optional reading: ISL 2.2, 5.1, ESL 7.4, 7.10 1 Reminder: modern regression techniques Over the

More information

Topics in Machine Learning-EE 5359 Model Assessment and Selection

Topics in Machine Learning-EE 5359 Model Assessment and Selection Topics in Machine Learning-EE 5359 Model Assessment and Selection Ioannis D. Schizas Electrical Engineering Department University of Texas at Arlington 1 Training and Generalization Training stage: Utilizing

More information

Locally Weighted Least Squares Regression for Image Denoising, Reconstruction and Up-sampling

Locally Weighted Least Squares Regression for Image Denoising, Reconstruction and Up-sampling Locally Weighted Least Squares Regression for Image Denoising, Reconstruction and Up-sampling Moritz Baecher May 15, 29 1 Introduction Edge-preserving smoothing and super-resolution are classic and important

More information

Supplementary Material for A Locally Linear Regression Model for Boundary Preserving Regularization in Stereo Matching

Supplementary Material for A Locally Linear Regression Model for Boundary Preserving Regularization in Stereo Matching Supplementary Material for A Locally Linear Regression Model for Boundary Preserving Regularization in Stereo Matching Shengqi Zhu 1, Li Zhang 1, and Hailin Jin 2 1 University of Wisconsin - Madison 2

More information

Nonparametric Survey Regression Estimation in Two-Stage Spatial Sampling

Nonparametric Survey Regression Estimation in Two-Stage Spatial Sampling Nonparametric Survey Regression Estimation in Two-Stage Spatial Sampling Siobhan Everson-Stewart, F. Jay Breidt, Jean D. Opsomer January 20, 2004 Key Words: auxiliary information, environmental surveys,

More information

Doubly Cyclic Smoothing Splines and Analysis of Seasonal Daily Pattern of CO2 Concentration in Antarctica

Doubly Cyclic Smoothing Splines and Analysis of Seasonal Daily Pattern of CO2 Concentration in Antarctica Boston-Keio Workshop 2016. Doubly Cyclic Smoothing Splines and Analysis of Seasonal Daily Pattern of CO2 Concentration in Antarctica... Mihoko Minami Keio University, Japan August 15, 2016 Joint work with

More information

This is called a linear basis expansion, and h m is the mth basis function For example if X is one-dimensional: f (X) = β 0 + β 1 X + β 2 X 2, or

This is called a linear basis expansion, and h m is the mth basis function For example if X is one-dimensional: f (X) = β 0 + β 1 X + β 2 X 2, or STA 450/4000 S: February 2 2005 Flexible modelling using basis expansions (Chapter 5) Linear regression: y = Xβ + ɛ, ɛ (0, σ 2 ) Smooth regression: y = f (X) + ɛ: f (X) = E(Y X) to be specified Flexible

More information

Statistics & Analysis. Fitting Generalized Additive Models with the GAM Procedure in SAS 9.2

Statistics & Analysis. Fitting Generalized Additive Models with the GAM Procedure in SAS 9.2 Fitting Generalized Additive Models with the GAM Procedure in SAS 9.2 Weijie Cai, SAS Institute Inc., Cary NC July 1, 2008 ABSTRACT Generalized additive models are useful in finding predictor-response

More information

Generalized additive models I

Generalized additive models I I Patrick Breheny October 6 Patrick Breheny BST 764: Applied Statistical Modeling 1/18 Introduction Thus far, we have discussed nonparametric regression involving a single covariate In practice, we often

More information

Linear penalized spline model estimation using ranked set sampling technique

Linear penalized spline model estimation using ranked set sampling technique Hacettepe Journal of Mathematics and Statistics Volume 46 (4) (2017), 669 683 Linear penalized spline model estimation using ranked set sampling technique Al Kadiri M A Abstract Benets of using Ranked

More information

Multiresponse Sparse Regression with Application to Multidimensional Scaling

Multiresponse Sparse Regression with Application to Multidimensional Scaling Multiresponse Sparse Regression with Application to Multidimensional Scaling Timo Similä and Jarkko Tikka Helsinki University of Technology, Laboratory of Computer and Information Science P.O. Box 54,

More information

Improved smoothing spline regression by combining estimates of dierent smoothness

Improved smoothing spline regression by combining estimates of dierent smoothness Available online at www.sciencedirect.com Statistics & Probability Letters 67 (2004) 133 140 Improved smoothing spline regression by combining estimates of dierent smoothness Thomas C.M. Lee Department

More information

Section 4 Matching Estimator

Section 4 Matching Estimator Section 4 Matching Estimator Matching Estimators Key Idea: The matching method compares the outcomes of program participants with those of matched nonparticipants, where matches are chosen on the basis

More information

Mathematical Programming and Research Methods (Part II)

Mathematical Programming and Research Methods (Part II) Mathematical Programming and Research Methods (Part II) 4. Convexity and Optimization Massimiliano Pontil (based on previous lecture by Andreas Argyriou) 1 Today s Plan Convex sets and functions Types

More information

60 2 Convex sets. {x a T x b} {x ã T x b}

60 2 Convex sets. {x a T x b} {x ã T x b} 60 2 Convex sets Exercises Definition of convexity 21 Let C R n be a convex set, with x 1,, x k C, and let θ 1,, θ k R satisfy θ i 0, θ 1 + + θ k = 1 Show that θ 1x 1 + + θ k x k C (The definition of convexity

More information

A C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions

A C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions A C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions Nira Dyn School of Mathematical Sciences Tel Aviv University Michael S. Floater Department of Informatics University of

More information

Automatic Fatigue Detection System

Automatic Fatigue Detection System Automatic Fatigue Detection System T. Tinoco De Rubira, Stanford University December 11, 2009 1 Introduction Fatigue is the cause of a large number of car accidents in the United States. Studies done by

More information

Lecture 7: Splines and Generalized Additive Models

Lecture 7: Splines and Generalized Additive Models Lecture 7: and Generalized Additive Models Computational Statistics Thierry Denœux April, 2016 Introduction Overview Introduction Simple approaches Polynomials Step functions Regression splines Natural

More information

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time Today Lecture 4: We examine clustering in a little more detail; we went over it a somewhat quickly last time The CAD data will return and give us an opportunity to work with curves (!) We then examine

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14

More information

A Locally Adaptive Penalty for Estimation of. Functions with Varying Roughness.

A Locally Adaptive Penalty for Estimation of. Functions with Varying Roughness. A Locally Adaptive Penalty for Estimation of Functions with Varying Roughness. Curtis B. Storlie, Howard D. Bondell, and Brian J. Reich Date: June 5, 2008 Abstract We propose a new regularization method

More information

Regression on a Graph

Regression on a Graph Regression on a Graph arxiv:0911.1928v1 [stat.me] 10 Nov 2009 Arne Kovac and Andrew D.A.C. Smith Abstract The Signal plus Noise model for nonparametric regression can be extended to the case of observations

More information

A Trimmed Translation-Invariant Denoising Estimator

A Trimmed Translation-Invariant Denoising Estimator A Trimmed Translation-Invariant Denoising Estimator Eric Chicken and Jordan Cuevas Florida State University, Tallahassee, FL 32306 Abstract A popular wavelet method for estimating jumps in functions is

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

Modeling Criminal Careers as Departures From a Unimodal Population Age-Crime Curve: The Case of Marijuana Use

Modeling Criminal Careers as Departures From a Unimodal Population Age-Crime Curve: The Case of Marijuana Use Modeling Criminal Careers as Departures From a Unimodal Population Curve: The Case of Marijuana Use Donatello Telesca, Elena A. Erosheva, Derek A. Kreader, & Ross Matsueda April 15, 2014 extends Telesca

More information

One-mode Additive Clustering of Multiway Data

One-mode Additive Clustering of Multiway Data One-mode Additive Clustering of Multiway Data Dirk Depril and Iven Van Mechelen KULeuven Tiensestraat 103 3000 Leuven, Belgium (e-mail: dirk.depril@psy.kuleuven.ac.be iven.vanmechelen@psy.kuleuven.ac.be)

More information

Lecture 19: Convex Non-Smooth Optimization. April 2, 2007

Lecture 19: Convex Non-Smooth Optimization. April 2, 2007 : Convex Non-Smooth Optimization April 2, 2007 Outline Lecture 19 Convex non-smooth problems Examples Subgradients and subdifferentials Subgradient properties Operations with subgradients and subdifferentials

More information

Lecture 13: Model selection and regularization

Lecture 13: Model selection and regularization Lecture 13: Model selection and regularization Reading: Sections 6.1-6.2.1 STATS 202: Data mining and analysis October 23, 2017 1 / 17 What do we know so far In linear regression, adding predictors always

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Box-Cox Transformation for Simple Linear Regression

Box-Cox Transformation for Simple Linear Regression Chapter 192 Box-Cox Transformation for Simple Linear Regression Introduction This procedure finds the appropriate Box-Cox power transformation (1964) for a dataset containing a pair of variables that are

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

GAMs, GAMMs and other penalized GLMs using mgcv in R. Simon Wood Mathematical Sciences, University of Bath, U.K.

GAMs, GAMMs and other penalized GLMs using mgcv in R. Simon Wood Mathematical Sciences, University of Bath, U.K. GAMs, GAMMs and other penalied GLMs using mgcv in R Simon Wood Mathematical Sciences, University of Bath, U.K. Simple eample Consider a very simple dataset relating the timber volume of cherry trees to

More information

A Polygonal Line Algorithm for Constructing Principal Curves

A Polygonal Line Algorithm for Constructing Principal Curves A Polygonal Line Algorithm for Constructing s Balázs Kégl, Adam Krzyżak Dept. of Computer Science Concordia University 45 de Maisonneuve Blvd. W. Montreal, Canada H3G M8 keglcs.concordia.ca krzyzakcs.concordia.ca

More information

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects Patralekha Bhattacharya Thinkalytics The PDLREG procedure in SAS is used to fit a finite distributed lagged model to time series data

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

Nonparametric Risk Attribution for Factor Models of Portfolios. October 3, 2017 Kellie Ottoboni

Nonparametric Risk Attribution for Factor Models of Portfolios. October 3, 2017 Kellie Ottoboni Nonparametric Risk Attribution for Factor Models of Portfolios October 3, 2017 Kellie Ottoboni Outline The problem Page 3 Additive model of returns Page 7 Euler s formula for risk decomposition Page 11

More information

Journal of Statistical Software

Journal of Statistical Software JSS Journal of Statistical Software October 2015, Volume 67, Issue 3. doi: 10.18637/jss.v067.i03 Threshold Value Estimation Using Adaptive Two-Stage Plans in R Shawn Mankad Cornell University George Michailidis

More information

BART STAT8810, Fall 2017

BART STAT8810, Fall 2017 BART STAT8810, Fall 2017 M.T. Pratola November 1, 2017 Today BART: Bayesian Additive Regression Trees BART: Bayesian Additive Regression Trees Additive model generalizes the single-tree regression model:

More information

CH9.Generalized Additive Model

CH9.Generalized Additive Model CH9.Generalized Additive Model Regression Model For a response variable and predictor variables can be modeled using a mean function as follows: would be a parametric / nonparametric regression or a smoothing

More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

2014 Stat-Ease, Inc. All Rights Reserved.

2014 Stat-Ease, Inc. All Rights Reserved. What s New in Design-Expert version 9 Factorial split plots (Two-Level, Multilevel, Optimal) Definitive Screening and Single Factor designs Journal Feature Design layout Graph Columns Design Evaluation

More information