Multi Layer Perceptron trained by Quasi Newton learning rule


Feed-forward neural networks provide a general framework for representing nonlinear functional mappings between a set of input variables and a set of output variables (Bishop 2006). One can achieve this goal by representing the nonlinear function of many variables as a composition of non-linear activation functions of one variable:

$y_k = \sum_{j=0}^{M} w_{kj}^{(2)}\, g\!\left(\sum_{i=0}^{d} w_{ji}^{(1)} x_i\right)$   (1)

A Multi-Layer Perceptron (MLP) may be represented by a graph: the input layer ($x_i$) is made of a number of perceptrons equal to the number of input variables ($d$); the output layer, on the other hand, has as many neurons as the output variables ($K$). The network may have an arbitrary number of hidden layers (in most cases one), which in turn may have an arbitrary number of perceptrons ($M$). In a fully connected feed-forward network, each node of a layer is connected to all the nodes in the adjacent layers. Each connection is represented by an adaptive weight, which represents the strength of the synaptic connection between neurons ($w_{kj}^{(l)}$). The response of each perceptron to the inputs is represented by a non-linear function $g$, referred to as the activation function. Notice that equation (1) assumes a linear activation function for the neurons in the output layer. We shall refer to the topology of an MLP together with the weight matrix of its connections as the model. In order to find the model that best fits the data, one has to provide the network with a set of examples: the training phase thus requires the Knowledge Base (KB), i.e. the training set.
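As a concrete illustration of equation (1), the following is a minimal NumPy sketch of the forward pass of a one-hidden-layer MLP with a non-linear hidden activation and a linear output layer. It is a sketch for illustration only: the array names (W1, W2) and the choice of tanh as activation g are assumptions, not part of the original implementation.

import numpy as np

def mlp_forward(x, W1, W2, g=np.tanh):
    # Forward pass of a one-hidden-layer MLP, equation (1).
    # x  : input vector of length d
    # W1 : (M, d+1) hidden-layer weights; column 0 acts as bias
    # W2 : (K, M+1) output-layer weights; column 0 acts as bias
    x = np.append(1.0, x)   # prepend the bias unit x_0 = 1
    z = g(W1 @ x)           # hidden activations g(sum_i w_ji x_i)
    z = np.append(1.0, z)   # bias unit for the output layer
    return W2 @ z           # linear output: y_k = sum_j w_kj z_j

# Example: d = 3 inputs, M = 5 hidden units, K = 2 outputs,
# weights initialized with small random values in [-1, 1].
rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, size=(5, 4))
W2 = rng.uniform(-1, 1, size=(2, 6))
print(mlp_forward(rng.normal(size=3), W1, W2))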
The learning rule of our MLP is the Quasi Newton Algorithm (QNA). In general, Quasi Newton Algorithms are variable metric methods used to find local maxima and minima of functions (Davidon 1968), and in the case of MLPs they can be used to find the stationary (i.e. zero gradient) point of the learning function. The Newton method is the general basis for a whole family of so-called Quasi Newton methods. One of those methods, implemented here, is the L-BFGS algorithm (Byrd et al. 1994; Broyden 1970; Fletcher 1970; Goldfarb 1970; Shanno 1970). More rigorously, the QNA is an optimization of the learning rule, because, as described below, the implementation is based on a statistical approximation of the Hessian by cyclic gradient calculation, which, as said in the previous section, is at the base of the Back Propagation method (BP; Bishop 2006).

The classical Newton method uses the Hessian of a function: the step of the method is defined as the product of the inverse Hessian matrix and the function gradient. If the function is a positive definite quadratic form, the minimum is reached in one step; in case of an indefinite quadratic form (which has no minimum), the method reaches a maximum or a saddle point instead. In short, the method finds the stationary point of a quadratic form. In practice, we usually deal with functions which are not quadratic forms, but a smooth function is described sufficiently well by a quadratic form in the neighborhood of a minimum. However, the Newton method can converge both to a minimum and to a maximum (taking a step in the direction in which the function increases). Quasi Newton methods solve this problem as follows: they use a positive definite approximation instead of the Hessian. If the Hessian is positive definite, the step is made using the Newton method; if the Hessian is indefinite, it is first modified to make it positive definite, and the step is then performed using the Newton method.

The step is always performed in the direction in which the function decreases. In case of a positive definite Hessian, the Hessian is used to generate a quadratic approximation of the surface, which should improve the convergence. If the Hessian is indefinite, the method simply moves to where the function decreases. Some modifications of Quasi Newton methods perform a precise linear search for the minimum along the indicated line, but it can be proved that it is enough to decrease the function value sufficiently, and not necessary to find the precise minimum. The L-BFGS algorithm tries to perform a step using the Newton method; if the step does not decrease the function value, it shortens the step length until a smaller function value is found. Up to here it seems quite simple, but it is not: the Hessian of a function is not always available, and in many cases it is far too complicated to compute. More often we can only calculate the function gradient. Therefore, the following approach is used: an approximation of the Hessian is generated on the basis of N consecutive gradient calculations, and the Quasi Newton step is then performed.
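The step-length reduction just described can be summarized by a simple backtracking loop. The sketch below illustrates the idea only and is not the actual L-BFGS implementation; f is any scalar error function supplied by the caller.

import numpy as np

def damped_step(f, w, direction, lam=1.0, shrink=0.5, max_halvings=30):
    # Move along a descent direction, shortening the step length
    # until the function value actually decreases (backtracking).
    f0 = f(w)
    for _ in range(max_halvings):
        w_new = w + lam * direction
        if f(w_new) < f0:       # accept the first step that decreases f
            return w_new
        lam *= shrink           # otherwise lessen the step length
    return w                    # no decrease found: keep the old point

# Example on a quadratic bowl, stepping along the negative gradient:
w = np.array([3.0, -2.0])
print(damped_step(lambda u: float(u @ u), w, direction=-2.0 * w))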

There is a special formula which allows one to obtain a Hessian approximation iteratively; at each approximation step the matrix remains positive definite. The L-BFGS algorithm does not generate the Hessian itself, but directly its inverse matrix, so no time is wasted inverting the Hessian.

In order to better understand the process at the base of the Quasi Newton method, let us start from the classical Gradient Descent Algorithm (Back Propagation; Bishop 2006). In the standard GDA, the direction of each updating step is calculated from the descent gradient of the error, while its length is determined by the learning rate. A more sophisticated approach is to move in the negative direction of the gradient (the line search direction) not by a fixed length, but until the minimum of the function along that direction is reached. This is possible by calculating the descent gradient and analyzing it as the learning rate varies (Brescia 2012). Suppose that at step t the current weight vector is $w^{(t)}$, and consider the search direction $d^{(t)} = -\nabla E^{(t)}$. If we select the parameter $\lambda$ so as to minimize $E(\lambda) = E(w^{(t)} + \lambda d^{(t)})$, the new weight vector can then be expressed as:

$w^{(t+1)} = w^{(t)} + \lambda\, d^{(t)}$   (2)

The line search problem is in practice a one-dimensional minimization problem. A simple solution is to vary $\lambda$ in small intervals, evaluate the error function at each new position, and stop when the error starts to increase. Many other methods exist to solve this problem. For example, the parabolic search calculates the parabolic curve crossing pre-defined learning rate points: the minimum of the parabola crossing the fixed points with the lowest error values is a good approximation of the minimum of $E(\lambda)$. There are also trust region strategies for finding a minimum of an error function, whose main concept is to iteratively grow or contract a region of the function over which a quadratic model function approximates the error function. In this sense the technique is considered dual to line search, because it tries to find the best size of the region after having preliminarily fixed the moving step (the opposite of the line search strategy, which always chooses the step direction before selecting the step size) (Celis et al. 1985).

Up to now we have supposed that the optimal search direction for a method based on the line search is given at each step by the negative gradient. That is not always true! If the minimization is done along the negative gradient, the next search direction (the new gradient) will be orthogonal to the previous one. In fact, when the line search finds the minimum, we have:

$\frac{\partial}{\partial \lambda} E(w^{(t)} + \lambda\, d^{(t)}) = 0$   (3)

and hence

$g^{(t+1)T} d^{(t)} = 0$   (4)

where $g \equiv \nabla E$. If subsequent directions are always chosen equal to the negative gradient, oscillations on the error function arise that slow down the convergence process.
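The parabolic search mentioned above can be sketched in a few lines: fit a parabola through three trial values of the learning rate and take the abscissa of its vertex as the next estimate of the minimum of $E(\lambda)$. A minimal sketch, assuming E is a callable returning the error for a given $\lambda$ (the three-point vertex formula is the standard one used in successive parabolic interpolation):

def parabolic_min(l1, l2, l3, E):
    # One step of parabolic line search: fit a parabola through three
    # trial learning rates and return the abscissa of its minimum.
    e1, e2, e3 = E(l1), E(l2), E(l3)
    num = (l2 - l1) ** 2 * (e2 - e3) - (l2 - l3) ** 2 * (e2 - e1)
    den = (l2 - l1) * (e2 - e3) - (l2 - l3) * (e2 - e1)
    return l2 - 0.5 * num / den   # vertex of the interpolating parabola

# Example: E has its minimum at lambda = 0.7, recovered exactly here
# because E is itself a parabola.
E = lambda lam: (lam - 0.7) ** 2 + 1.0
print(parabolic_min(0.0, 0.5, 1.0, E))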
The solution is to select subsequent directions such that the gradient component parallel to the previous search direction (which is zero) remains unchanged at each step. Suppose we have already minimized along the direction $d^{(t)}$, starting from the point $w^{(t)}$ and reaching the point $w^{(t+1)}$. At the point $w^{(t+1)}$, condition (4) reads $g^{(t+1)T} d^{(t)} = 0$, and by choosing the next direction $d^{(t+1)}$ so as to keep the gradient component parallel to $d^{(t)}$ equal to zero, it is possible to build a sequence of directions $d$ such that each direction is conjugate to the previous one in the $w$-dimensional search space (the conjugate gradients method; Golub et al. 1999).

In the presence of a quadratic error function, an algorithm based on this technique has a weight update of the form:

$w^{(t+1)} = w^{(t)} + \alpha^{(t)} d^{(t)}$   (5)

with

$\alpha^{(t)} = -\frac{d^{(t)T} g^{(t)}}{d^{(t)T} H\, d^{(t)}}$   (6)

Furthermore, $d$ can be initialized with the negative gradient and then obtained as a linear combination of the current gradient and of the previous search direction:

$d^{(t+1)} = -g^{(t+1)} + \beta^{(t)} d^{(t)}$   (7)

with

$\beta^{(t)} = \frac{g^{(t+1)T} H\, d^{(t)}}{d^{(t)T} H\, d^{(t)}}$   (8)

This algorithm finds the minimum of a quadratic error function in at most $w$ steps, where $w$ is the dimension of the search space. On the other hand, the computational cost of each step is high, because determining the values of $\alpha$ and $\beta$ requires the Hessian matrix $H$, which is very expensive to calculate. Fortunately, the coefficients $\alpha$ and $\beta$ can be obtained from analytical expressions that do not use the Hessian matrix explicitly. For example, $\beta$ can be calculated in one of the following ways:

1) expression of Polak-Ribière:
$\beta^{(t)} = \frac{g^{(t+1)T}\,(g^{(t+1)} - g^{(t)})}{g^{(t)T} g^{(t)}}$

2) expression of Hestenes-Stiefel:
$\beta^{(t)} = \frac{g^{(t+1)T}\,(g^{(t+1)} - g^{(t)})}{d^{(t)T}\,(g^{(t+1)} - g^{(t)})}$

3) expression of Fletcher-Reeves:
$\beta^{(t)} = \frac{g^{(t+1)T} g^{(t+1)}}{g^{(t)T} g^{(t)}}$

These expressions are equivalent if the error function is quadratic; otherwise they assume different values. Typically the Polak-Ribière expression obtains better results because, if the algorithm is slow and consecutive gradients are quite similar to each other, it produces values of $\beta$ such that the search direction tends to assume the negative gradient direction (Vetterling et al. 1992), which corresponds to a restart of the procedure. Concerning the parameter $\alpha$, its value can be obtained by using the line search method directly.

The method of conjugate gradients reduces the number of steps needed to minimize the error to at most $w$, because there can be at most $w$ conjugate directions in a $w$-dimensional space. In practice, however, the algorithm is slower, because during the learning process the conjugacy of the search directions tends to deteriorate. To avoid the deterioration, it is useful to restart the algorithm after $w$ steps, resetting the search direction to the negative gradient direction.
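A minimal sketch of the conjugate gradient scheme of equations (5)-(8) with the Polak-Ribière expression for $\beta$ and periodic restarts, assuming grad_E and line_search are supplied by the caller (a routine such as parabolic_min above could serve as the line search):

import numpy as np

def conjugate_gradient(grad_E, line_search, w, n_steps=100, restart_every=None):
    # Nonlinear conjugate gradients with the Polak-Ribiere beta.
    if restart_every is None:
        restart_every = w.size              # restart after w steps (see text)
    g = grad_E(w)
    d = -g                                  # first direction: negative gradient
    for t in range(n_steps):
        alpha = line_search(w, d)           # alpha from the line search
        w = w + alpha * d
        g_new = grad_E(w)
        if (t + 1) % restart_every == 0:
            beta = 0.0                      # periodic restart: pure gradient step
        else:
            beta = g_new @ (g_new - g) / (g @ g)   # Polak-Ribiere expression
        d = -g_new + beta * d               # eq. (7)
        g = g_new
    return w

# Example on a 2-d quadratic with exact line search, eq. (6); the
# minimum A^{-1} b = (0.2, 0.4) is reached in at most w = 2 steps.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_E = lambda w: A @ w - b
line_search = lambda w, d: -(d @ grad_E(w)) / (d @ A @ d)
print(conjugate_gradient(grad_E, line_search, np.zeros(2), n_steps=2))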

By using a local quadratic approximation of the error function, we can obtain an expression for the position of the minimum. The gradient at any point $w$ is in fact given by:

$\nabla E = H\,(w - w^{*})$   (9)

where $w^{*}$ corresponds to the minimum of the error function, which therefore satisfies the condition:

$w^{*} = w - H^{-1} \nabla E$   (10)

The vector $H^{-1} \nabla E$ is known as the Newton direction, and it is the basis for a variety of optimization strategies, such as the QNA, which, instead of calculating the matrix $H$ and then its inverse, uses a series of intermediate steps of lower computational cost to generate a sequence of matrices which are increasingly accurate approximations of $H^{-1}$.

From the Newton formula (10) we note that the weight vectors at steps t and t+1 are related to the corresponding gradients by the formula:

$w^{(t+1)} - w^{(t)} = H^{-1}\,(g^{(t+1)} - g^{(t)})$   (11)

known as the Quasi Newton Condition. The approximation $G$ of the inverse Hessian is therefore built to satisfy this condition. The formula for $G$ is:

$G^{(t+1)} = G^{(t)} + \frac{p\,p^{T}}{p^{T} v} - \frac{G^{(t)} v\, v^{T} G^{(t)}}{v^{T} G^{(t)} v} + (v^{T} G^{(t)} v)\, u\, u^{T}$   (12)

where the vectors are

$p = w^{(t+1)} - w^{(t)}, \qquad v = g^{(t+1)} - g^{(t)}, \qquad u = \frac{p}{p^{T} v} - \frac{G^{(t)} v}{v^{T} G^{(t)} v}$   (13)

Initializing the procedure with the identity matrix is equivalent to taking, step by step, the direction of the negative gradient, while at each subsequent step the direction $-G g$ is guaranteed to be a descent direction. The above expression could carry the search outside of the interval of validity of the quadratic approximation; the solution is then to use the line search to find the minimum of the function along the search direction. With such a scheme, the weight update expression (5) becomes:

$w^{(t+1)} = w^{(t)} - \alpha^{(t)} G^{(t)} g^{(t)}$   (14)

where $\alpha$ is obtained by the line search.

The following algorithm describes the MLP trained by the QNA method. Let us consider a generic MLP with $w^{(t)}$ the weight vector at time t.

1) Initialize all weights $w^{(0)}$ with small random values (typically normalized in [-1, 1]), set the error threshold $\varepsilon$, set t = 0 and $G^{(0)} = I$;
2) Present the whole training set to the network and calculate $E(w^{(t)})$, the error function for the current weight configuration;
3) If t = 0
   a) then $d^{(t)} = -g^{(t)}$
   b) else $d^{(t)} = -G^{(t-1)} g^{(t-1)}$;
4) Calculate $w^{(t+1)} = w^{(t)} + \alpha^{(t)} d^{(t)}$, where $\alpha$ is obtained by the line search expression (6);
5) Calculate $G^{(t+1)}$ with equation (12);
6) If $E(w^{(t+1)}) > \varepsilon$ then t = t + 1 and go to 2, else STOP.

One of the main advantages of QNA, compared with conjugate gradients, is that the line search does not require the calculation of $\alpha$ with high precision, because it is not a critical parameter. The downside, on the contrary, is that it requires a large amount of memory to store the $w \times w$ matrix $G$ for large $w$.
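The inverse-Hessian update (12)-(13) translates directly into code. A minimal NumPy sketch for illustration, with p, v and u as defined above (in practice one would also guard against a near-zero denominator $p^T v$):

import numpy as np

def bfgs_update(G, w_new, w_old, g_new, g_old):
    # BFGS update of the inverse-Hessian approximation G, eq. (12)-(13).
    p = w_new - w_old                  # p = w(t+1) - w(t)
    v = g_new - g_old                  # v = g(t+1) - g(t)
    pv = p @ v
    Gv = G @ v
    vGv = v @ Gv
    u = p / pv - Gv / vGv              # eq. (13)
    return (G
            + np.outer(p, p) / pv      # + p p^T / (p^T v)
            - np.outer(Gv, Gv) / vGv   # - G v v^T G / (v^T G v)
            + vGv * np.outer(u, u))    # + (v^T G v) u u^T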

One way to reduce the required memory is to replace the matrix $G$ at each step with the identity matrix. With such a replacement, and multiplying through by $g$ (the current gradient), we obtain:

$d^{(t+1)} = -g^{(t+1)} + A\,p + B\,v$   (15)

Note that if the line search returns exact values, the above equation produces mutually conjugate directions. A and B are scalar values defined as:

$A = -\left(1 + \frac{v^{T} v}{p^{T} v}\right) \frac{p^{T} g^{(t+1)}}{p^{T} v} + \frac{v^{T} g^{(t+1)}}{p^{T} v}, \qquad B = \frac{p^{T} g^{(t+1)}}{p^{T} v}$   (16)

As discussed, we use a slightly modified version of the QNA, known as L-QNA or limited-memory QNA (L-BFGS; Nocedal 1980). The algorithm of the MLP with L-QNA is the following. Let us consider a generic MLP with $w^{(t)}$ the weight vector at time t.

1) Initialize all weights $w^{(0)}$ with small random values (typically normalized in [-1, 1]), set the error threshold $\varepsilon$, set t = 0;
2) Present the whole training set to the network and calculate $E(w^{(t)})$, the error function for the current weight configuration;
3) If t = 0
   a) then $d^{(t)} = -g^{(t)}$
   b) else $d^{(t)} = -g^{(t)} + A\,p + B\,v$, where $p = w^{(t)} - w^{(t-1)}$ and $v = g^{(t)} - g^{(t-1)}$;
4) Calculate $w^{(t+1)} = w^{(t)} + \alpha^{(t)} d^{(t)}$, where $\alpha$ is obtained by the line search equation (6);
5) Calculate A and B for the next iteration, as reported in (16);
6) If $E(w^{(t+1)}) > \varepsilon$ then t = t + 1 and go to 2, else STOP.

Note that the algorithm works well even for approximate values of $\alpha$. During the exploration of the parameter space, in search of the minimum error direction, QNA starts in the wrong direction: at the first step the method can only follow the error gradient, so it takes the direction of steepest descent. In subsequent steps, however, it incorporates the information from the gradients at the steps already taken to build up an approximate model of the Hessian.

As is well known, all line search methods, being based on techniques that search for the minimum error by exploring the error function surface, are liable to get stuck in a local minimum, and many solutions to this problem have been proposed over the years (Floudas and Jongen 2005). Incorporating a random component into the weight update is one general way to escape a local minimum. Genetic Algorithms have also been employed to deal with this problem, by proceeding through multiple initial weight settings and recombining the trained weights during the process (Fu 1994). The cost of both approaches, however, is a prolonged training time.

In order to accelerate the convergence of GDA, Newton's method uses the information on the second-order derivatives. QNA is able to better optimize the convergence time by approximating second-order information with first-order terms (Shanno 1990). With the information from the second derivatives, QNA is able to avoid local minima of the error function and to follow the error function trend more precisely, revealing a natural capability to find the absolute minimum error of the optimization problem. This last feature, however, can also be a downside of the model, especially when the signal-to-noise ratio of the data is very poor; with clean data, such as the high-quality spectroscopic redshifts used for model training, the QNA performance turns out to be extremely precise.
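Going back to step 3b of the L-QNA algorithm, the memoryless direction update of equations (15)-(16) can be sketched in a few lines; p and v are the most recent weight and gradient differences, and in practice one would fall back to the plain negative gradient whenever $p^T v$ is close to zero:

import numpy as np

def lqna_direction(g_new, p, v):
    # Limited-memory (memoryless BFGS) search direction, eq. (15)-(16).
    pv = p @ v
    B = (p @ g_new) / pv                               # eq. (16), scalar B
    A = -(1.0 + (v @ v) / pv) * B + (v @ g_new) / pv   # eq. (16), scalar A
    return -g_new + A * p + B * v                      # eq. (15)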

In the L-BFGS version of the algorithm, for problems of large dimension, the amount of memory required to store a Hessian is too large, along with the machine time required to process it. Therefore, instead of using the complete series of gradient values to generate a Hessian, a smaller number of values can be used. On the one hand, the convergence slows down; on the other hand, the performance could even improve. At first sight this statement seems paradoxical, but it contains no contradiction: the convergence is measured by the number of iterations, whereas the performance depends on the amount of processor time spent to calculate the result.

Related to the computational cost there is also the strategy adopted for the stopping criteria of the method. As is well known, the process of adjusting the weights based on the gradients is repeated until a minimum is reached. In practice, one has to decide the stopping condition of the algorithm. Several criteria are in use; the most common are: (i) the algorithm is terminated once the gradient is sufficiently small (by definition the gradient is zero at a minimum); (ii) a fixed threshold on the error to be minimized; (iii) cross validation. Cross validation can be used to monitor the generalization performance during training and to terminate the algorithm when there is no further improvement. The basic mechanism consists in dividing the data into a training and a test set: the network is trained on the training set and its performance is evaluated on the test set. Statistically significant results are obtained by trying multiple independent data partitions and averaging the performance. The first two criteria are mainly sensitive to the choice of their specific parameters and may lead to poor results if the parameters are improperly set. Cross validation does not suffer from this drawback: it can avoid overfitting the data and is able to improve the generalization performance of the model. However, it is much more computationally expensive.
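The cross-validation stopping criterion (iii) reduces to a simple early-stopping loop. A minimal sketch, assuming train_step performs one training pass over the training set and test_error evaluates the error on the held-out set; both names are illustrative, not part of the original implementation:

def train_with_early_stopping(train_step, test_error, w,
                              max_epochs=1000, patience=10):
    # Stop training when the held-out error shows no further improvement.
    best_err, best_w, stall = float("inf"), w, 0
    for epoch in range(max_epochs):
        w = train_step(w)          # one pass over the training set
        err = test_error(w)        # generalization estimate on the test set
        if err < best_err:
            best_err, best_w, stall = err, w, 0
        else:
            stall += 1             # no improvement this epoch
            if stall >= patience:
                break              # terminate: no more improvement
    return best_w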

REFERENCES

Bishop, C. M., Pattern Recognition and Machine Learning. 2006, Springer, ISBN 0-387-31073-8.

Brescia, M., New Trends in E-Science: Machine Learning and Knowledge Discovery in Databases. 2012, contribution to the volume Horizons in Computer Science Research, T. S. Clary (ed.), Series Horizons in Computer Science, Vol. 7, Nova Science Publishers, ISBN 978-1-61942-774-7.

Broyden, C. G., The convergence of a class of double-rank minimization algorithms. 1970, Journal of the Institute of Mathematics and Its Applications, Vol. 6, pp. 76-90.

Byrd, R. H., et al., 1994, Mathematical Programming, 63, 4, pp. 129-156.

Celis, M.; Dennis, J. E.; Tapia, R. A., A trust region strategy for nonlinear equality constrained optimization. 1985, in Numerical Optimization, P. Boggs, R. Byrd and R. Schnabel (eds.), SIAM, Philadelphia, pp. 71-82.

Davidon, W. C., 1968, Comput. J., 10, 406.

Fletcher, R., A New Approach to Variable Metric Algorithms. 1970, Computer Journal, Vol. 13, pp. 317-322.

Floudas, C. A.; Jongen, H. Th., Global Optimization: Local Minima and Transition Points. 2005, Journal of Global Optimization, Vol. 32, No. 3, pp. 409-415.

Fu, L., Neural Networks in Computer Intelligence. 1994, E. M. Munson and L. Goldberg (eds.), McGraw-Hill, New York.

Goldfarb, D., A Family of Variable Metric Updates Derived by Variational Means. 1970, Mathematics of Computation, Vol. 24, pp. 23-26.

Golub, G. H.; Ye, Q., Inexact Preconditioned Conjugate Gradient Method with Inner-Outer Iteration. 1999, SIAM Journal of Scientific Computation, Vol. 21, pp. 1305-1320.

Nocedal, J., Updating Quasi-Newton Matrices with Limited Storage. 1980, Mathematics of Computation, Vol. 35, pp. 773-782.

Shanno, D. F., Conditioning of quasi-Newton methods for function minimization. 1970, Mathematics of Computation, Vol. 24, pp. 647-656.

Shanno, D. F., Recent Advances in Numerical Techniques for Large-Scale Optimization. 1990, in Neural Networks for Control, MIT Press, Cambridge, MA.

Vetterling, W. T.; Flannery, B. P., Conjugate Gradients Methods in Multidimensions. 1992, in Numerical Recipes in C: The Art of Scientific Computing, W. H. Press and S. A. Teukolsky (eds.), Cambridge University Press, 2nd edition.