CHAPTER VI
BACK PROPAGATION ALGORITHM

6.1 Introduction

In the previous chapter we saw that multilayer perceptrons can be applied effectively to difficult problems when they are trained in a supervised manner with the widely accepted error back-propagation algorithm. The algorithm operates on an error-correction learning law and can be viewed as a generalization of the equally popular least mean square (LMS) filtering algorithm. Error back-propagation training consists of two computational passes through the network layers: a forward pass and a backward pass. In the forward pass an input vector is applied to the nodes of the network, and the effect of each layer's output propagates to the next layer until a set of outputs is produced as the actual response of the network. During the forward pass the network weights are fixed. In the backward pass, on the other hand, the weights are adjusted according to an error-correction rule. The error signal is the desired response of the network minus the actual response (Junichi Hino 2006:742-749). This error signal is propagated backward through the network, in the direction opposite to the synaptic connections, and the weights are tuned so as to move the actual response of the network closer to the desired response.

A multilayer perceptron has three distinctive features:

1) Each neuron in the network has a nonlinear activation function. A commonly used form is the sigmoid, defined by the logistic function

y = 1 / (1 + e^(-x))

Another commonly used function worth mentioning is the hyperbolic tangent:

Jain University 2015-16 Page 50
y = (1 - e^(-x)) / (1 + e^(-x))

The presence of these nonlinearities is essential: without them the input-output relation of the network reduces to that of a single-layer perceptron.

2) The network contains one or more layers of hidden neurons that are part of neither the input nor the output. These hidden neurons enable the network to learn complex tasks.

3) The network exhibits a high degree of connectivity. Any change in the connectivity of the network requires a change in the population of weights.

6.2 FLOW CHART

Figure 6.1: Basic Flowchart showing working of BPA (Source: Kumar 2009)
6.3 Types of Transfer Function

The activation or transfer function, denoted by Φ(.), defines a neuron's output according to its level of input activity. A transfer function is associated with each neuron of an ANN and gives its output. The transfer functions used in the MATLAB software are presented in Table 6.1 (appendix) (Demuth and Beale, 2004).

The log-sigmoid transfer function (logsig) accepts inputs between negative and positive infinity and compresses the output into the range [0, 1]:

f(x) = 1 / (1 + e^(-x))

Another important transfer function is the hyperbolic tangent sigmoid function (tansig). Its input likewise ranges from negative to positive infinity, while its output varies from -1 to +1:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

In the linear function (purelin) the output equals the input; it is employed at the output stage of the neural network (E.M. Bezerra 2007:177-185):

f(x) = x

6.4 Usefulness of back propagation technique

Back propagation is used to train multilayer ANNs. It is a systematic technique with a firm mathematical foundation, and it extends the range of problems to which ANNs can be applied. Each neuron receives a number of inputs, either from the previous layer or from outside the network. Each input is multiplied by a weight and the products are summed to give the quantity denoted NET; the activation function f is then applied to NET, producing the signal OUT:

OUT = 1 / (1 + e^(-NET))    (6.1)
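The three transfer functions above can be sketched in Python as follows. The function names mirror the MATLAB toolbox names for readability; this is an illustrative sketch, not the toolbox implementation:

```python
import math

def logsig(x):
    """Log-sigmoid: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tansig(x):
    """Hyperbolic tangent sigmoid: squashes input into (-1, +1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def purelin(x):
    """Linear transfer function: output equals input."""
    return x
```

Note that tansig is mathematically identical to tanh(x), written out here to match the formula in the text.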
NET = X1W1 + X2W2 + ... + XNWN    (6.2)

OUT = f(NET)    (6.3)

∂OUT/∂NET = OUT(1 - OUT)    (6.4)

This function is known as the sigmoid. Since OUT is this function of NET, it lies between zero and one. If the nonlinearity were removed, multiple layers of the network would have no more representational power than a single layer. For the back-propagation algorithm the activation function must be differentiable everywhere, a condition the sigmoid satisfies.

6.4.1 Multi layer network

A multilayer network may be trained with the back-propagation algorithm. The first set of neurons serves only to connect the inputs, acting as a distribution point: it performs no summation of inputs. The input signal is simply passed through the weights to the outputs. Each subsequent layer produces NET and OUT signals.

6.4.2 Overview of training

The aim of training is to adjust the weights of the network so that applying a given set of inputs produces the required set of outputs. These input-output sets are referred to as vectors. Training assumes that each input vector is paired with a target vector representing the required output; together they form a training pair. The network is trained over many training pairs, collectively called a training set. To ensure that the network is not driven into saturation by large weight values, the initial weights must be small random numbers; this also avoids other training pathologies over the complete training set. The steps of the back-propagation training algorithm are:

1) Select the next training pair from the training set and apply the input vector to the input of the network.

2) Calculate the output of the network.
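Equations (6.1)-(6.4) describe what a single neuron computes. A minimal Python sketch (the helper name neuron_forward is hypothetical):

```python
import math

def neuron_forward(inputs, weights):
    # NET = X1*W1 + X2*W2 + ... + XN*WN                 (6.2)
    net = sum(x * w for x, w in zip(inputs, weights))
    # OUT = f(NET), with f the sigmoid of (6.1)         (6.3)
    out = 1.0 / (1.0 + math.exp(-net))
    # dOUT/dNET = OUT * (1 - OUT)                       (6.4)
    d_out = out * (1.0 - out)
    return net, out, d_out
```

The derivative in (6.4) is what makes the sigmoid convenient for back propagation: it is computed directly from OUT itself, with no extra exponentials.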
3) Calculate the error, the difference between the output of the network and the required (target) output.

4) Adjust the weights so as to minimize the error.

5) Repeat steps 1 to 4 for every vector in the training set until the error over the whole set is reduced to the desired level.

The calculations in steps 1 and 2 match the way the trained network is used: an input vector is applied and the computation proceeds layer by layer. First the outputs of the neurons in layer j are calculated; these form the inputs to layer k, whose outputs are then calculated and constitute the output vector of the network. In step 3, each network output OUT is subtracted from the corresponding component of the target vector to produce an error. In step 4 this error is used by the training algorithm to adjust the network's weights, determining both the sign and the magnitude of each weight change. The four steps are repeated until the error between target and actual output falls to an acceptable level, at which point the network is said to be trained and can be used for recognition with the weights held constant.

Steps 1 and 2 can be expressed in vector form: an input vector X produces an output vector Y, and the input-target vector pair (X, T) comes from the training set. As noted, the calculations in a multilayer network are executed layer by layer, beginning with the layer nearest the input. In the first layer the NET value of each neuron is computed as the weighted sum of the neuron's inputs. The activation function f then squashes NET to produce the OUT value for each neuron in that layer. Once the set of outputs for a layer has been found, it serves as the input to the next layer. This process is repeated until the final set of network outputs is obtained.
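The five training steps above can be sketched for a small two-layer network as follows. This is an illustrative Python sketch, not the MATLAB implementation; bias terms are omitted for brevity, and the weight adjustments use the delta-rule updates described in sections 6.4.3 and 6.4.4:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(pairs, n_in, n_hidden, n_out, eta=0.5, epochs=1000):
    """Steps 1-5 for a two-layer network (hidden layer j, output layer k).
    Weights start as small random numbers to avoid saturation."""
    w_jh = [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
            for _ in range(n_hidden)]
    w_kh = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
            for _ in range(n_out)]
    for _ in range(epochs):
        for x, target in pairs:                    # step 1: next training pair
            # step 2: forward pass, layer j then layer k
            out_j = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_jh]
            out_k = [sigmoid(sum(w * h for w, h in zip(ws, out_j))) for ws in w_kh]
            # step 3: error = target - actual, per output component
            # step 4: delta-rule adjustments (sections 6.4.3 and 6.4.4)
            delta_k = [o * (1 - o) * (t - o) for o, t in zip(out_k, target)]
            delta_j = [out_j[j] * (1 - out_j[j]) *
                       sum(delta_k[k] * w_kh[k][j] for k in range(n_out))
                       for j in range(n_hidden)]
            for k in range(n_out):
                for j in range(n_hidden):
                    w_kh[k][j] += eta * delta_k[k] * out_j[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    w_jh[j][i] += eta * delta_j[j] * x[i]
        # step 5: repeat over the whole training set
    return w_jh, w_kh
```

After training, recognition uses only the forward-pass lines with the weights held constant, exactly as the text describes.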
6.4.3 Adjusting the weights of the output layer

Since a target value is available for every neuron in the output layer, its weights can be adjusted with a modification of the delta rule. The interior layers are called hidden layers because they have no comparable target values. The output of a neuron in output layer k is subtracted from its target value to give an error signal, which is multiplied by the derivative of the squashing function, OUT(1 - OUT), to obtain the δ value for that layer:

δ = OUT(1 - OUT)(Target - OUT)

This δ is then multiplied by OUT of neuron j, the source neuron for the weight in question. That product is in turn multiplied by the training-rate coefficient η (typically 0.01 to 1.0), and the result is added to the weight.

6.4.4 Adjusting the weights of the hidden layers

The same procedure cannot be applied directly to the hidden layers, as they have no target vectors. Back propagation therefore trains the hidden layers by propagating the output error back through the network and adjusting the weights layer by layer. The δ values for a hidden layer are generated without reference to any target vector. First the δ value of each neuron in the output layer is calculated and the weights feeding the output layer are adjusted. The hidden layer's weights are then adjusted using its own δ values, and the same back propagation is repeated for all preceding layers (T. Gowri Manohar 2008:19-25).

Consider a neuron in the hidden layer just preceding the output layer. In the forward pass this neuron's output is propagated through the interconnecting weights to the neurons of the output layer. During training these weights operate in reverse, passing the δ values from the output layer back to the hidden layer. Each of these weights is multiplied by the δ value of the neuron it connects to in the output layer, and the δ value needed for a hidden neuron is obtained by summing all such products and multiplying the sum by the derivative of the squashing function:

δ_pj = OUT_pj (1 - OUT_pj) Σ(k=1..n) δ_pk w_jk

where the sum runs over the n neurons of the output layer and w_jk is the weight from hidden neuron j to output neuron k.

6.5 Training Algorithms

There are many different training algorithms based on back-propagation. Their computational and storage requirements vary, and no single algorithm is best suited to all situations. Table 6.2 (appendix) summarizes the training algorithms included in the MATLAB software. The most important ones are briefly described below.

6.5.1 Resilient Back-propagation (trainrp)

Sigmoid transfer functions are typically employed in the hidden layers of multilayer networks. They are known as squashing functions because they compress an infinite input range into a finite output range. A significant characteristic of sigmoid functions is that their slope must approach zero as their inputs become large. This causes a problem when steepest descent is used to train a multilayer network with sigmoid functions: the gradient can have a very small magnitude, producing only small changes in the weights and biases even though the weights are still far from their optimal values. The resilient back-propagation training algorithm eliminates these harmful effects of the magnitudes of the partial derivatives.

6.5.2 Scaled Conjugate Gradient (trainscg)

The conjugate gradient algorithms require a line search at each iteration. This is computationally expensive, since the response of the network to all training inputs must be computed several times for each search. The scaled conjugate gradient algorithm (SCG) was designed to avoid this time-consuming line search, although it is too complicated to explain fully in a few lines. The basic idea is to combine two approaches: the conjugate gradient approach and the model-trust region approach.
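The key idea of resilient back-propagation (section 6.5.1) is to use only the sign of the partial derivative, together with a step size adapted individually for each weight, so that vanishingly small sigmoid gradients no longer stall training. A minimal sketch of one such update; the increase/decrease factors and step limits are commonly quoted defaults, assumed here for illustration:

```python
def rprop_update(w, grad, prev_grad, step, step_inc=1.2, step_dec=0.5,
                 step_min=1e-6, step_max=50.0):
    """One resilient-backprop update for a single weight.
    Only the sign of the gradient is used, never its magnitude."""
    if grad * prev_grad > 0:        # same direction as last time: accelerate
        step = min(step * step_inc, step_max)
    elif grad * prev_grad < 0:      # sign change: we overshot, back off
        step = max(step * step_dec, step_min)
        grad = 0.0                  # skip this update (iRprop- style variant)
    if grad > 0:
        w -= step
    elif grad < 0:
        w += step
    return w, step, grad
```

In a full implementation this rule is applied elementwise to every weight and bias, with one step size maintained per parameter.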
6.5.3 Levenberg-Marquardt (trainlm)

The Levenberg-Marquardt algorithm was designed to approach second-order training speed without computing the Hessian matrix. When the performance function has the form of a sum of squares, as is typical in training feed-forward networks, the Hessian matrix can be approximated and the gradient computed from the Jacobian matrix, which is obtained through a standard back-propagation technique (S.K. Lahiri 2010:1497-1509).
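Under the sum-of-squares assumption, the Hessian is approximated as H ≈ JᵀJ and the gradient is computed as Jᵀe, giving the Levenberg-Marquardt step Δw = (JᵀJ + μI)⁻¹ Jᵀe, where μ is the damping parameter. For a model with a single weight the matrix algebra collapses to scalars, which a short sketch can illustrate (the helper name lm_step is hypothetical):

```python
def lm_step(w, jacobian, errors, mu):
    """One Levenberg-Marquardt step for a model with a single weight w.
    jacobian[i] = d(output_i)/dw; errors[i] = target_i - output_i.
    H is approximated by J^T J; for one parameter the matrix inverse
    reduces to a scalar division."""
    jtj = sum(j * j for j in jacobian)                   # J^T J
    jte = sum(j * e for j, e in zip(jacobian, errors))   # J^T e
    delta = jte / (jtj + mu)                             # (J^T J + mu I)^-1 J^T e
    return w + delta
```

With μ = 0 this reduces to a Gauss-Newton step, and for large μ it approaches small-step gradient descent, which is how the algorithm interpolates between the two.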