Density Estimation using Support Vector Machines

J. Weston, A. Gammerman, M. Stitson, V. Vapnik, V. Vovk, C. Watkins

Technical Report CSD-TR-97-23, February 5, 1998

Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, England

1 Introduction

In this report we describe how the Support Vector (SV) technique of solving linear operator equations can be applied to the problem of density estimation [4]. We present a new optimization procedure and a set of kernels, closely related to current SV techniques, that guarantee the monotonicity of the approximation. The technique estimates densities with a mixture of bumps (Gaussian-like shapes), with the usual SV property that only some coefficients are non-zero. Both the width and the height of each bump are chosen adaptively, by considering a dictionary of several kernel functions. There is empirical evidence that the regularization parameter gives good control of the approximation, that the choice of this parameter is universal with respect to the number of sample points, and that it can even be fixed with good results.

2 The density estimation problem

We wish to approximate the density function $p(x)$, where

$F(x) = P(X \le x) = \int_{-\infty}^{x} p(t)\,dt.$   (1)

We consider densities on the interval $[0,1]$. Finding the required density means solving the linear operator equation

$\int \theta(x - t)\,p(t)\,dt = F(x),$   (2)

where $\theta(u) = 1$ if $u > 0$ and $\theta(u) = 0$ otherwise, and where instead of knowing the distribution function $F(x)$ we are given only the iid (independently and identically distributed) data

$x_1, \ldots, x_\ell.$   (3)

We consider the multi-dimensional case, where the data $x_1, \ldots, x_\ell$ are vectors of dimension $d$.

The problem of density estimation is known to be ill-posed: when finding $f \in \mathcal{F}$ satisfying the equality $Af = F$, small deviations in the right-hand side can produce large deviations in the solution $f$. In our terms, a small change in the cumulative distribution function of the continuous random variable $x$ can cause large changes in its derivative, the density function. One can use regularization techniques to obtain a sequence of solutions that converges to the desired one.

Using the data (3) we construct the empirical distribution function

$F_\ell(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} \theta(x^1 - x^1_i)\cdots\theta(x^d - x^d_i), \qquad x = (x^1, \ldots, x^d),$

which we use in place of the unknown right-hand side of (2). We then apply the SV method to the regression problem of approximating this right-hand side, using the data

$(x_1, F_\ell(x_1)), \ldots, (x_\ell, F_\ell(x_\ell)).$

Applying the SV method of solving linear operator equations [3], the parameters of the regression function can then be used to express the corresponding density.
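To make the construction above concrete, the following short Python sketch (an illustration of ours, not code from the report) computes the multi-dimensional empirical distribution function $F_\ell$ and the regression targets at the sample points; the helper name empirical_cdf is hypothetical.

```python
import numpy as np

def empirical_cdf(samples, points):
    """Empirical distribution function F_l of iid data, evaluated at `points`.

    samples: (l, d) array of data x_1, ..., x_l
    points:  (m, d) array of evaluation points
    Returns an (m,) array with F_l(x) = (1/l) * sum_i prod_k theta(x^k - x_i^k),
    where theta(u) = 1 for u > 0 and 0 otherwise.
    """
    samples = np.atleast_2d(samples)
    points = np.atleast_2d(points)
    # indicator[m, i] is True when every coordinate of points[m] exceeds samples[i]
    indicator = (points[:, None, :] > samples[None, :, :]).all(axis=2)
    return indicator.mean(axis=1)

# Regression data (x_i, F_l(x_i)) for a one-dimensional sample in [0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(100, 1))
y = empirical_cdf(x, x)
```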

The advantage of this approach is that we can control regularization through the free parameter $\varepsilon$ of the SVM. For any point $x_i$ the random value $F_\ell(x_i)$ is unbiased and has the standard deviation

$\sigma_i = \sqrt{\frac{1}{\ell}\, F(x_i)\,(1 - F(x_i))},$

so we characterize the accuracy of our approximation at $x_i$ with

$\varepsilon_i = c\,\sigma_i = c\,\sqrt{\frac{1}{\ell}\, F_\ell(x_i)\,(1 - F_\ell(x_i))},$

where $c$ is usually chosen to be $1 + \delta$ for some small value $\delta$. One therefore constructs the triples

$(x_1, F_\ell(x_1), \varepsilon_1), \ldots, (x_\ell, F_\ell(x_\ell), \varepsilon_\ell).$   (4)

3 SV Method for Solving Linear Operator Equations

To solve the density estimation problem we use the SV method for solving linear operator equations

$A f(t) = F(x),$   (5)

where the operator $A$ is a one-to-one mapping between two Hilbert spaces. We solve a regression problem in image space (the right-hand side of the equation), and this solution, which is an expansion on the support vectors, can be used to describe the solution in pre-image space (the left-hand side of the equation, before the operator $A$ has been applied). The method is as follows: choose a set of functions $f(t, w)$ in which to solve the problem in pre-image space that are linear in some flattening space,

$f(t, w) = \sum_{r=1}^{\infty} w_r \varphi_r(t) = (W \cdot \Phi(t)).$   (6)

That is, the set of functions is a linear combination of the functions

$\Phi(t) = (\varphi_1(t), \ldots, \varphi_N(t), \ldots),$   (7)

which can be thought of as a hyperplane in some flattening space, where the linear combination $W$ that defines the parameters of the set of functions can also be viewed as the coefficients of the hyperplane,

$W = (w_1, \ldots, w_N, \ldots).$   (8)

The mapping from pre-image to image space by the operator $A$ can then be expressed as a linear combination of functions in another Hilbert space, defined thus:

$F(x, w) = A f(x, w) = \sum_{r=1}^{\infty} w_r \psi_r(x) = (W \cdot \Psi(x)),$   (9)

where $\psi_r$ is the $r$-th function from our set of functions after the linear operator $A$ has been applied, i.e. $\psi_r(x) = A \varphi_r(x)$. The problem of finding the required density (finding the vector $W$ in pre-image space) is equivalent to finding the vector of coefficients $W$ in image space, where $W$ is an expansion on the support vectors,

$W = \sum_{i=1}^{\ell} \beta_i \Psi(x_i),$

giving the approximation to the desired density

$f(t, w) = \sum_{i=1}^{\ell} \beta_i\, (\Psi(x_i) \cdot \Phi(t)).$
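As an illustration (ours, not from the report) of how the two expansions are used once coefficients $\beta_i$ are available, the sketch below evaluates the distribution estimate in image space and the density estimate in pre-image space from a kernel/cross-kernel pair; concrete kernel pairs are given in the following sections.

```python
def distribution_estimate(beta, sv, kernel, x):
    """Image-space estimate F(x) = sum_i beta_i K(x_i, x)."""
    return sum(b * kernel(xi, x) for b, xi in zip(beta, sv))

def density_estimate(beta, sv, cross_kernel, t):
    """Pre-image (density) estimate f(t) = sum_i beta_i Kbar(x_i, t)."""
    return sum(b * cross_kernel(xi, t) for b, xi in zip(beta, sv))
```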

To find the required density we solve a linear regression problem in image space by minimizing the same functional used to solve standard regression problems ([2, 3]). Instead of directly finding the infinite-dimensional vector $W$, which is equivalent to finding the parameters that describe the density function, we use kernel functions to describe the mapping from input space to the image and pre-image Hilbert spaces. In image space we use the kernel

$K(x_i, x_j) = \sum_{r=1}^{\infty} \psi_r(x_i)\,\psi_r(x_j)$

to define the inner product induced by the set of functions. We solve the corresponding regression problem in image space and use its coefficients to define the density function in pre-image space:

$f(t, \alpha, \alpha^*) = \sum_{i=1}^{\ell} \beta_i\, \bar{K}(x_i, t), \qquad \text{where } \bar{K}(x_i, t) = \sum_{r=1}^{\infty} \psi_r(x_i)\,\varphi_r(t).$   (10)

We call $\bar{K}$ the cross kernel.

4 Spline Approximation of a Density

We can look for the solution of equation (2) in any set of functions for which one can construct a corresponding kernel and cross kernel. For example, consider the set of constant splines with an infinite number of nodes; that is, we approximate the unknown density by a function of the form

$p(x) = \int_0^1 g(\tau)\,\theta(x - \tau)\,d\tau + a,$

whose distribution function is

$F(x) = \int_0^1 g(\tau)\Big[\int_0^x \theta(t - \tau)\,dt\Big]d\tau + a x = \int_0^1 g(\tau)\,(x - \tau)_+\,d\tau + a x,$   (11)

where the function $g(\tau)$ and the parameter $a$ are to be estimated. The corresponding kernel is

$K(x_i, x_j) = \int_0^1 (x_i - \tau)_+ (x_j - \tau)_+\,d\tau + x_i x_j = \frac{1}{2}(x_i \wedge x_j)^2 (x_i \vee x_j) - \frac{1}{6}(x_i \wedge x_j)^3 + x_i x_j,$   (12)

and the corresponding cross kernel is

$\bar{K}(x, t) = \int_0^1 (x - \tau)_+\,\theta(t - \tau)\,d\tau + x = x\,(x \wedge t) - \frac{1}{2}(x \wedge t)^2 + x.$
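A minimal Python sketch (ours, not from the report) of this kernel/cross-kernel pair for one-dimensional data in $[0,1]$, using the closed forms reconstructed above:

```python
import numpy as np

def spline_kernel(xi, x):
    """Image-space spline kernel (12): K(xi, x) = 1/2 u^2 v - 1/6 u^3 + xi*x,
    with u = min(xi, x), v = max(xi, x)."""
    u, v = np.minimum(xi, x), np.maximum(xi, x)
    return 0.5 * u**2 * v - u**3 / 6.0 + xi * x

def spline_cross_kernel(xi, t):
    """Cross kernel: Kbar(xi, t) = xi*min(xi, t) - 1/2 min(xi, t)^2 + xi."""
    m = np.minimum(xi, t)
    return xi * m - 0.5 * m**2 + xi
```

These two functions can be passed directly to the distribution_estimate and density_estimate helpers sketched above.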

Using kernel (12) and triples (4) we obtain the support vector coefficients $\beta_i = \alpha_i - \alpha_i^*$, only some of which are non-zero, from the standard SV regression approximation with the generalized $\varepsilon$-insensitive loss function, by maximizing the quadratic form

$W(\alpha, \alpha^*) = -\sum_{i=1}^{\ell} \varepsilon_i (\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i (\alpha_i - \alpha_i^*) - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j)$

subject to the constraints

$0 \le \alpha_i \le C, \quad 0 \le \alpha_i^* \le C, \qquad i = 1, \ldots, \ell.$   (13)

These coefficients define the approximation to the density

$f(t) = \sum_{i} \beta_i\, \bar{K}(x_i, t),$

where the $x_i$ are the support vectors, i.e. the vectors with non-zero coefficients $\beta_i$.

5 Considering a Monotonic Set of Functions

Unfortunately, the technique described so far does not guarantee that the chosen density is always positive (recall that a density is always nonnegative, and that the distribution function is monotonically increasing). This is because the set of functions $F(x, w)$ from which we choose our regression in image space can contain non-monotonic functions. We could choose a set of monotonic regression functions and require that the coefficients $\beta_i$, $i = 1, \ldots, \ell$, are all positive. However, many sets of monotonic functions expressed with Mercer kernels are too weak in expressive power to find the desired regression; for example, the set of polynomials with only positive coefficients. A set of linear constraints can be introduced into the optimization problem to guarantee monotonicity, but this becomes computationally unacceptable in the multi-dimensional case, and in our experiments it did not give good results even in the one-dimensional case. We require a set of functions that has high VC dimension but is guaranteed to approximate the distribution function with a monotonically increasing function.
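Since nothing in (13) enforces monotonicity, a fitted estimate is easy to check numerically. The following sketch (ours) verifies nonnegativity of the density and monotonicity of the distribution estimate on a grid; the kernel and cross_kernel arguments can be, for instance, the spline pair sketched earlier.

```python
import numpy as np

def check_valid_density(beta, sv, kernel, cross_kernel, grid_size=1001):
    """Return (density_nonnegative, distribution_monotonic) on a grid over [0, 1]."""
    t = np.linspace(0.0, 1.0, grid_size)
    density = sum(b * cross_kernel(xi, t) for b, xi in zip(beta, sv))
    distribution = sum(b * kernel(xi, t) for b, xi in zip(beta, sv))
    return bool(np.all(density >= 0.0)), bool(np.all(np.diff(distribution) >= 0.0))
```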

6 Another Approach to SV Regression Estimation

In the SV approach, regression estimation problems are solved as the quadratic optimization problem (13), giving the approximation

$F(x) = \sum_{i=1}^{\ell} \beta_i K(x_i, x).$

If we choose a function from this set directly, without the regularizing term $(W \cdot W)$ (minimizing the norm of the coefficients), we are only required to solve a Linear Programming (LP) problem [4]. In this alternative approach we can choose as regularizing term the sum of the support vector weights. This is justified by bounds obtained for the problem of pattern recognition, which state that the probability of test error is less than the minimum of three terms:

1. A function of the number of free parameters. This can be billions, and although it is often used as a regularizer in classical theory, it is in fact ignored in the SV method.

2. A function of the size of the margin. This is the justification for the regularizer used in the usual SV approach (maximizing the margin).

3. A function of the number of support vectors. This is the justification for the new approach.

So, to solve regression problems we can minimize

$\sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) + C \sum_{i=1}^{\ell} \xi_i + C \sum_{i=1}^{\ell} \xi_i^*$   (14)

under the constraints

$y_i - \varepsilon - \xi_i \le \sum_{j=1}^{\ell} (\alpha_j - \alpha_j^*) K(x_i, x_j) + b \le y_i + \varepsilon + \xi_i^*, \qquad i = 1, \ldots, \ell,$   (15)

$\alpha_i,\ \alpha_i^*,\ \xi_i,\ \xi_i^* \ge 0, \qquad i = 1, \ldots, \ell.$   (16)

This regularizing term can also be seen as a measure, in some sense, of smoothness in input space: a small number of support vectors means a less complex decision function. Minimizing the sum of coefficients can thus be seen as an approximation to minimizing the number of support vectors.

7 Linear Programming Approach to SV Density Estimation

There exist powerful basis functions that guarantee the monotonicity of the regression but are non-symmetrical, and thus violate Mercer's condition of describing an inner product in some Hilbert space. If we do not describe an inner product, we cannot use the quadratic programming approach to the support vector machine, which minimizes the norm of the coefficients of a hyperplane in feature space. Using the linear programming approach to the Support Vector Machine we do not have this restriction; in fact, in this approach $K(x, y)$ can be any function from $L_2(P)$. To estimate a density we choose a monotonic basis function $K(x, y)$ that has a corresponding cross kernel $\bar{K}(x, t)$, and then minimize

$\sum_{i=1}^{\ell} \beta_i + C \sum_{i=1}^{\ell} \xi_i + C \sum_{i=1}^{\ell} \xi_i^*$   (17)

under the constraints

$y_i - \xi_i - \varepsilon_i \le \sum_{j=1}^{\ell} \beta_j K(x_j, x_i) \le y_i + \xi_i^* + \varepsilon_i, \qquad i = 1, \ldots, \ell,$   (18)

$\sum_{i=1}^{\ell} \beta_i K(x_i, 0) = 0,$   (19)

$\sum_{i=1}^{\ell} \beta_i K(x_i, 1) = 1,$   (20)

$\beta_i,\ \xi_i,\ \xi_i^* \ge 0, \qquad i = 1, \ldots, \ell.$   (21)
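The linear program (17)-(21) can be handed to any LP solver. Below is a minimal sketch (ours, not the authors' implementation) using scipy.optimize.linprog for a one-dimensional sample in $[0,1]$; the kernel argument follows the convention K(support vector, evaluation point) and is assumed to be one for which the boundary constraints are satisfiable, such as the spline kernel of Section 4.

```python
import numpy as np
from scipy.optimize import linprog

def lp_density_fit(x, kernel, C=1000.0, c_eps=1.0):
    """Solve the LP (17)-(21) for 1-D samples x in [0, 1].

    Variables are stacked as z = [beta (l), xi (l), xi* (l)], all >= 0.
    Returns the coefficient vector beta.
    """
    x = np.asarray(x, dtype=float).ravel()
    l = len(x)
    y = (x[None, :] < x[:, None]).mean(axis=1)            # F_l(x_i)
    eps = c_eps * np.sqrt(y * (1.0 - y) / l)              # epsilon_i from the triples (4)

    G = kernel(x[None, :], x[:, None])                    # G[i, j] = K(x_j, x_i)
    k0 = kernel(x, 0.0)                                   # K(x_i, 0)
    k1 = kernel(x, 1.0)                                   # K(x_i, 1)

    c = np.concatenate([np.ones(l), C * np.ones(2 * l)])  # objective (17)

    Z, I = np.zeros((l, l)), np.eye(l)
    A_up = np.hstack([G, Z, -I])                          # upper half of (18)
    A_lo = np.hstack([-G, -I, Z])                         # lower half of (18)
    A_ub = np.vstack([A_up, A_lo])
    b_ub = np.concatenate([y + eps, -y + eps])

    # boundary conditions (19) and (20): F(0) = 0, F(1) = 1
    A_eq = np.vstack([np.concatenate([k0, np.zeros(2 * l)]),
                      np.concatenate([k1, np.zeros(2 * l)])])
    b_eq = np.array([0.0, 1.0])

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (3 * l))
    # res.success should be checked in practice
    return res.x[:l]
```

Combined with the corresponding cross kernel, the returned coefficients give the density estimate $f(t) = \sum_i \beta_i \bar{K}(x_i, t)$.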

This differs from the usual SV LP regression (Section 6) in the following ways:

1. We require only positive coefficients, so we have only one set of Lagrange multipliers.

2. We require the constraints (19) and (20) to guarantee $F(0) = 0$ and $F(1) = 1$.

3. We no longer need a threshold $b$ for our hyperplane, because we require $F(0) = 0$ and $F(1) = 1$ and choose kernels that can satisfy these conditions.

4. We allow each vector $x_i$, $i = 1, \ldots, \ell$, its own $\varepsilon$-insensitivity $\varepsilon_i$, so that we can find the corresponding support vectors for our triples (4).

To show that there exist powerful linear combinations of monotonic non-symmetrical kernels, consider the following basis function, the so-called step kernel:

$K(x, y) = \prod_{i=1}^{d} \theta(y^i - x^i).$

For each support vector (basis function) this gives the effect of a step around the point of the vector in input space (we can also think of this as constant splines with a node at each vector). The coefficient, or weight, of the vector can make the step arbitrarily large. This kernel can approximate any distribution and is known to converge to the desired one as the number of training examples increases. However, the non-smoothness of the resulting function renders it useless for estimating densities.

8 Gaussian-like Approximation of a Density

We would like to approximate the unknown density with a mixture of bumps (Gaussian-like shapes). This means approximating the regression in image space with a mixture of sigmoidal functions. Consider sigmoids of the form

$K(x, y) = \frac{1}{1 + e^{\gamma (x - y)}},$

where the parameter $\gamma$ controls the distance from the support vector at which the sigmoid approaches flatness. The approximation of the density is then

$f(x) = \sum_{i=1}^{\ell} \beta_i\, \bar{K}(x_i, x), \qquad \text{where } \bar{K}(x, y) = \frac{\gamma}{(1 + e^{\gamma (x - y)})(1 + e^{-\gamma (x - y)})}.$   (22)

The distribution function is approximated with a linear combination of sigmoidal shapes, which can estimate the desired regression well. The derivative of the regression function (the approximation of the density) is a mixture of Gaussian-like shapes. The centres of the bumps are given by the support vectors, and their heights by the size of the corresponding weights.
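A small sketch (ours) of this kernel pair, with the support vector as first argument and the evaluation point as second; gamma is the width parameter discussed above.

```python
import numpy as np

def sigmoid_kernel(xi, x, gamma=20.0):
    """Image-space sigmoid kernel: K(xi, x) = 1 / (1 + exp(gamma * (xi - x)))."""
    return 1.0 / (1.0 + np.exp(gamma * (xi - x)))

def sigmoid_cross_kernel(xi, t, gamma=20.0):
    """Gaussian-like bump (22): the derivative of the sigmoid with respect to t."""
    u = gamma * (xi - t)
    return gamma / ((1.0 + np.exp(u)) * (1.0 + np.exp(-u)))
```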

9 Constant Spline Approximation of a Density

We can also estimate the density with a mixture of uniform densities. This means approximating the regression in image space with a mixture of linear pieces, using a kernel of the form

$K(x, y) = \begin{cases} 0, & y - x < -\Delta, \\ \dfrac{(y - x) + \Delta}{2\Delta}, & -\Delta \le y - x \le \Delta, \\ 1, & y - x > \Delta, \end{cases}$

where $\Delta$ is the width parameter. The approximation of the density is then

$f(x) = \sum_{i=1}^{\ell} \beta_i\, \bar{K}(x_i, x), \qquad \text{where } \bar{K}(x, y) = \begin{cases} 0, & y - x < -\Delta, \\ \dfrac{1}{2\Delta}, & -\Delta \le y - x \le \Delta, \\ 0, & y - x > \Delta. \end{cases}$

10 Adaptive kernel width

This technique does not allow us to choose the width of each piece of our density function, only its height and centre. The width of the pieces is decided by the free parameter $\Delta$, and all of the widths have to be the same. We would like to remove this free parameter and allow the SV technique to choose the widths adaptively. This can be achieved if for each centre $x_i$ we have a dictionary of $k$ kernels, giving the approximation to the density

$f(x) = \sum_{i=1}^{\ell} \big(\beta_i^1 \bar{K}_1(x_i, x) + \beta_i^2 \bar{K}_2(x_i, x) + \cdots + \beta_i^k \bar{K}_k(x_i, x)\big),$

where each vector $x_i$, $i = 1, \ldots, \ell$, has coefficients $\beta_i^j$, $j = 1, \ldots, k$. We then have a corresponding dictionary of $k$ cross kernels, where $\bar{K}_n$ has width $\Delta_n$ from a chosen dictionary of widths, for example $\Delta_1 = 0.1, \Delta_2 = 0.3, \ldots$ As usual, many of these coefficients will be zero. This technique allows us to remove the free width parameter, and it also tends to reduce the number of support vectors required to approximate a function, because of the power of the dictionary. We can thus generalize the Linear Programming SV regression technique (Section 7) to the following optimization problem: minimize

$\sum_{i=1}^{\ell} \sum_{n=1}^{k} \beta_i^n + C \sum_{i=1}^{\ell} \xi_i + C \sum_{i=1}^{\ell} \xi_i^*$   (23)

with constraints

$y_i - \xi_i - \varepsilon_i \le \sum_{j=1}^{\ell} \sum_{n=1}^{k} \beta_j^n K_n(x_j, x_i) \le y_i + \xi_i^* + \varepsilon_i, \qquad i = 1, \ldots, \ell,$

$\sum_{i=1}^{\ell} \sum_{n=1}^{k} \beta_i^n K_n(x_i, 0) = 0,$

$\sum_{i=1}^{\ell} \sum_{n=1}^{k} \beta_i^n K_n(x_i, 1) = 1,$

$\beta_i^n,\ \xi_i,\ \xi_i^* \ge 0, \qquad i = 1, \ldots, \ell,\ n = 1, \ldots, k.$
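One way to set up problem (23) is to treat the dictionary as extra columns of the constraint matrix. The sketch below (ours, not from the report) builds the stacked kernel matrices for a dictionary of ramp kernels of different widths; these could then be passed to an LP solver exactly as in the earlier sketch.

```python
import numpy as np

def ramp_kernel(xi, x, width):
    """Constant-spline (ramp) kernel of half-width `width`, centred at xi."""
    return np.clip(((x - xi) + width) / (2.0 * width), 0.0, 1.0)

def dictionary_matrices(x, widths):
    """Stacked matrices for problem (23) with a dictionary of ramp kernels.

    Returns G of shape (l, l*k) with G[i, n*l + j] = K_n(x_j, x_i),
    plus boundary rows g0 and g1 with entries K_n(x_j, 0) and K_n(x_j, 1).
    """
    x = np.asarray(x, dtype=float).ravel()
    G = np.hstack([ramp_kernel(x[None, :], x[:, None], w) for w in widths])
    g0 = np.concatenate([ramp_kernel(x, 0.0, w) for w in widths])
    g1 = np.concatenate([ramp_kernel(x, 1.0, w) for w in widths])
    return G, g0, g1

# An example dictionary of increasing widths
G, g0, g1 = dictionary_matrices(np.random.default_rng(0).uniform(size=50),
                                [0.1, 0.2, 0.3, 0.5])
```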

11 Choice of regularizer

When considering density estimates as mixtures of kernel functions in which all the coefficients are positive, the coefficients must sum to 1 (if the chosen kernel has been scaled appropriately) to ensure that the integral of the estimated density is also 1. This can easily be translated into a linear constraint, which can replace the boundary conditions $F(0) = 0$ and $F(1) = 1$ to give a density estimate without finite support (this is more suitable for estimates using a mixture of Gaussian shapes, because of the tails of the distribution). However, the linear regularizer used in the optimization problem (23) is then no longer a good approximation to minimizing the number of support vectors, precisely because the coefficients sum to 1. This means we must choose another regularizer, which gives the more general optimization problem: minimize

$\Omega(\beta) + C \sum_{i=1}^{\ell} \xi_i + C \sum_{i=1}^{\ell} \xi_i^*$   (24)

with constraints

$y_i - \xi_i - \varepsilon_i \le \sum_{j=1}^{\ell} \sum_{n=1}^{k} \beta_j^n K_n(x_j, x_i) \le y_i + \xi_i^* + \varepsilon_i, \qquad i = 1, \ldots, \ell,$

$\sum_{i=1}^{\ell} \sum_{n=1}^{k} \beta_i^n = 1,$

$\beta_i^n,\ \xi_i,\ \xi_i^* \ge 0, \qquad i = 1, \ldots, \ell,\ n = 1, \ldots, k.$

There are many possible choices of $\Omega(\beta)$. For example, we could attempt a more accurate minimization of the number of support vectors by approximating the indicator function (which has no useful gradients) with a sigmoid,

$\Omega(\beta) = \sum_{i=1}^{\ell} \sum_{n=1}^{k} \frac{1}{1 + \exp(-a\,\beta_i^n + 6)}$

for some large constant $a$. However, this leads to a nonlinear, non-convex optimization problem. In the case where one has only a single kernel function, if the cross kernel (the kernel that describes the density estimate, rather than the estimate of the distribution function) satisfies Mercer's condition and so describes a hyperplane in some feature space, then one can choose the regularizer

$\Omega(\beta) = \sum_{i,j=1}^{\ell} \beta_i \beta_j\, \bar{K}(x_i, x_j),$

which is equivalent to minimizing $W \cdot W$, the squared norm of the vector of coefficients describing the hyperplane in feature space. This is interesting because we can use the same regularizer as in the usual support vector pattern recognition and regression estimation cases, since we are regularizing the derivative of our function, even though the kernel used to solve the regression problem does not satisfy Mercer's condition. So we can regularize our density estimate in this way if we use the radial basis (Gaussian-like) cross kernel (22). However, when considering a dictionary of kernels one must consider how the different kernels combine in the regularizer, since they describe a set of hyperplanes in different Hilbert spaces. In our experiments we chose a much simpler regularizer, a weighted sum of the support vector coefficients,

$\Omega(\beta) = \sum_{i=1}^{\ell} \sum_{n=1}^{k} w_n\, \beta_i^n,$

where $w_n$ is chosen to penalize kernels of small width ($w_n = 1/n$ if the $k$ kernels are ordered smallest width first). This keeps the optimization problem linear; however, the quality of the solution could probably be improved by choosing other regularizers.
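As a sketch of the simple weighted regularizer (ours; the weighting $w_n = 1/n$ follows the reconstruction above), the LP objective and the sum-to-one constraint of problem (24) could be assembled as follows and combined with the stacked matrices from the previous sketch.

```python
import numpy as np

def weighted_lp_objective(l, k, C=1000.0):
    """Objective vector for (24) with Omega(beta) = sum_{i,n} w_n * beta_i^n, w_n = 1/n.

    Variables are stacked as [beta^1 (l), ..., beta^k (l), xi (l), xi* (l)].
    """
    weights = np.concatenate([np.full(l, 1.0 / n) for n in range(1, k + 1)])
    return np.concatenate([weights, C * np.ones(2 * l)])

def sum_to_one_constraint(l, k):
    """Equality row enforcing that all beta coefficients sum to 1."""
    return np.concatenate([np.ones(l * k), np.zeros(2 * l)]), 1.0
```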

Approximating a mixture of normal distributions 9 estimation cases, because we are regularizing the derivative of our function, even though we are using a kernel to solve our regression problem that does not satisfy Mercer's condition. So we can regularize our density estimate if we use the radial basis (Gaussian-like) kernel (). However, when considering a dictionary of kernels one must consider how the different kernels are combined in the regularizer, i.e they describe a set of hyperplanes in dierent Hilbert spaces. In our experiments we chose a much simpler regularizer, a weighted sum of the support vector coecients: () = n= w n n i where w n is chosen to penalize kernels of small width (w n = n if the k kernels are in order, smallest width rst). This gives a linear optimization problem, however the quality of the solution can probably be improved by choosing other regularizers. Approximating a mixture of normal distributions We considered a density generated from a mixture of two normal distributions (x? ) + p(x; ; ) = p () exp? p () exp? x where =?4, =. We drew 5, and examples generated by this distribution, and estimated the density using our technique. Here, =, and a dictionary of four kernel widths was used - = :, = :, 3 = :3 and 4 = :5. The results are shown in gure. (5) 4.5 Density example - Royal Holloway Center for Computer Learning Density example - Royal Holloway Center for Computer Learning Density example - Royal Holloway Center for Computer Learning 4 3.5 training set estimated density estimated distbn func 4.5 4 training set estimated density estimated distbn func 5 4.5 training set estimated density estimated distbn func 3 3.5 3.5 4 3.5 3 3.5.5.5 3.5.5.5.5.5.5.5.5.5..4.6.8 -...4.6.8. -...4.6.8. -...4.6.8. Figure The real density (far left) scaled to the interval [,] is estimated using from left to right 5, and data points. The number of support vectors required were 9, and 3 respectively. 3 Conclusions and Further Research We have described a new SV technique for estimating multi-dimensional densities. The estimation can be a mixture of bumps where the height, width and centres are chosen by the training algorithm. We can also use other sets of functions desscribed by other kernels, for example constant splines. As is usual for SV techniques, the description of the approximation can be short as it depends on the number of support vectors.

13 Conclusions and Further Research

We have described a new SV technique for estimating multi-dimensional densities. The estimate can be a mixture of bumps whose heights, widths and centres are chosen by the training algorithm. We can also use other sets of functions described by other kernels, for example constant splines. As is usual for SV techniques, the description of the approximation can be short, as it depends only on the number of support vectors.

The control of the regularization in this method appears to be independent of the number of examples. Fixing the insensitivities $\varepsilon_i$ to be greater than the estimated standard deviations $\sigma_i$ (by choosing the constant $c$ appropriately) and $C$ to a high value (close to infinity), the technique becomes a parameterless estimator of densities.

The key idea of our method is the following: find the combination of weights that describes the smoothest function (according to some chosen measure of smoothness) lying inside the epsilon tube of the training data. Hypothetically we could choose to have infinitely many widths, giving an infinite set of kernel functions at each point of the training set. Even with this infinite number of variables, the optimization problem still has a finite (and probably very small) number of non-zero coefficients, because not many bumps are needed to describe the smoothest function inside the epsilon tube. This gives the method a great advantage over other kernel methods, for example the Parzen windows method. The Parzen method is a fixed-width estimator (the width is a free parameter that must be chosen appropriately) and the number of coefficients is equal to the number of training points. Thus our method has two advantages over the Parzen method: we remove the free width parameter, and we introduce sparsity into the decision function.

The idea of dictionaries of kernel functions may also be useful in other problems (pattern recognition, regression estimation). Dictionaries of different sets of functions could also be considered, not just different parameters of the same kernel type.

Although the method looks promising, when the sample size is small (say fewer than 50 points) the estimates could be improved. To estimate well, one must choose a good measure of smoothness. The measure used in the current implementation is rather poor: a weighted sum of coefficients is minimized, which is not ideal because the coefficients must sum to 1. This is a rather ad hoc choice, and there are many other regularizers we could use. For example, since the density estimate describes a hyperplane in feature space (when a radial basis kernel is used), we could minimize the norm of the vector of coefficients describing this hyperplane, as in the usual support vector case.

References

[1] Cortes, C. and Vapnik, V. 1995. Support-Vector Networks. Machine Learning 20:273-297.

[2] Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.

[3] Vapnik, V., Golowich, S. and Smola, A. 1997. Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing. In: M. Mozer, M. Jordan and T. Petsche (eds.), Advances in Neural Information Processing Systems, Vol. 9. MIT Press, Cambridge, MA (in press).

[4] Vapnik, V. N. 1998. Statistical Learning Theory. J. Wiley, New York (forthcoming).