CHAPTER 4 DETERMINATION OF OPTIMAL PATTERN RECOGNITION TECHNIQUE BY SOFT COMPUTING ANALYSIS


4.1 INTRODUCTION

Pattern recognition is the act of taking in raw data and taking an action based on the category of the pattern. Over the years, Artificial Neural Networks (ANNs) have become recognised as a powerful tool for pattern recognition and are considered a remarkable approach to solving difficult pattern recognition problems. These networks are capable of recognising spatial or temporal relationships and performing tasks such as classification. A distinctive feature of artificial neural networks is that the acquired knowledge is stored in the synaptic weights of the processing elements (Mohamed 2009). There are a number of ANN types and algorithms which allow the design of neural networks and the computation of weight values. The challenging issue in designing an ANN for a particular application, however, is to find a specialised architecture in terms of the required number of neurons, the number of hidden layers and the learning procedure (Nelson 2002). This is all the more vital when the network must meet requirements on sensitivity, specificity and repeatability. This chapter focuses on finding the optimal architecture and learning algorithm for the particular application of identifying E.coli from other pathogens together with its life stage.

The simulation tool used in this study was NeuroSolutions software (v. 4.2, NeuroDimension Inc., Gainesville, Florida, USA). From the literature survey, it was found that the pattern recognition methods used for this type of analysis were the back propagation neural network (Pavlou 2002a, Pavlou 2004, Hao 2003, Xing 2005, Fend 2006, Hong 2007), the genetic algorithm (Pavlou 2002a, Pavlou 2004), the self organising map (Ritaban 2002) and fuzzy c-means clustering (Ping 1997, Ritaban 2002). In all these works the percentage of recognition varied from 85% to 100%. This chapter focuses on finding a model which produces 100% recognition. Accordingly, in this work various pattern recognition techniques, namely the Multi Layer Perceptron (MLP), the Principal Component Analysis Neural Network (PCANN) and the Support Vector Machine (SVM), were employed to find the architecture best suited to this application. In all three models, a genetic algorithm was invoked for network optimisation: selecting the training parameters, selecting the inputs and evolving the network architecture.

4.2 APPLYING GA FOR NETWORK ARCHITECTURE OPTIMISATION

The topology of a network, that is, the number of nodes and the location and number of connections among them, has a significant impact on the performance of the network and its generalisation skills. An optimised network offers benefits such as fast response (which requires a minimum network size) and compatibility with VLSI hardware implementation (which requires minimum connectivity). Genetic Algorithms (GAs) are an alternative method for arriving at an optimal neural network design (Fiezelew 2007). They employ a parallel multipoint probabilistic search strategy that is biased toward reinforcing search points of high fitness. The most distinguishing feature of GAs is their flexibility and applicability to a wide range of optimisation problems. In the domain of neural networks, GAs are useful as global search methods for synthesising the weights of generally interconnected networks, optimal network architectures, learning parameters, optimal learning rules, etc. (Kermani 1999, Montana 1989).

As the performance of a neural network is critically dependent on the choice of processing elements and network architecture, this investigation focuses on implementing a genetic algorithm for optimising the architecture of the pattern recognition network.

In a GA, a genotype is an array of genes, where every gene takes a value from a properly defined domain (Goldberg 1991). Each genotype codes a phenotype, or candidate solution, for the domain of interest (here, a class of neural architectures). Such codings may use genes that take numeric values to represent a few parameters, or complex structures of symbols that are turned into phenotypes (neural networks) by means of a proper decoding process (Illeana 2004, Fiezelew 2007). The resulting neural networks (the phenotypes) are also equipped with learning algorithms that train them using a stimulus data set. The evaluation of a phenotype determines the fitness of its corresponding genotype (Rich 1991). The evolutionary procedure works on a population of such genotypes, preferentially selecting genotypes that code phenotypes with high fitness and reproducing them. Genetic operators such as mutation, crossover and selection are used to introduce variety into the population and to test variants of the candidate solutions represented in the current population. In this way, over several generations, the population gradually moves towards genotypes that correspond to phenotypes with high fitness (Srivastava 1998). In this work, the genotype codes only the architecture of a neural network with forward connections; the training of the weights for those connections is carried out by the various learning algorithms (Figure 4.1).

Figure 4.1 Optimisation of ANN by genetic algorithm

The hybrid algorithm for the generation of neural networks uses a direct coding scheme and proceeds through the following steps:

Step 1: Create an initial population of individuals (neural networks) with random topologies and learning parameters.
Step 2: Train each neural network with a learning algorithm.
Step 3: Calculate the fitness f(x) of each string x in the population.
Step 4: Select parents from the population by the selection operator.
Step 5: Recombine both parents by the crossover operator with probability p_c to produce two offspring.
Step 6: Mutate each offspring randomly by the mutation operator with probability p_m, train each child network using the learning algorithm and place the offspring into the new population.
Step 7: Repeat Steps 2 to 6 for a given number of generations.

For its computation, the genetic algorithm maintains a collection of samples called a population of strings; the computation model was given by Vose and Liepins (1991). A sketch of the hybrid loop is given below.
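To make the hybrid procedure concrete, the sketch below outlines one way the genotype-to-phenotype loop could be organised. It is an illustrative simplification, not the NeuroSolutions implementation: the genotype holds only hidden-layer sizes and a learning rate, the train_and_score function is a hypothetical stand-in for training the decoded network and returning its fitness, and the operator details are deliberately minimal.

```python
import random

# Hypothetical genotype: hidden-layer sizes plus a learning rate.
def random_genotype(max_layers=2, max_neurons=20):
    layers = [random.randint(1, max_neurons) for _ in range(random.randint(1, max_layers))]
    return {"hidden": layers, "lr": random.uniform(0.1, 0.9)}

def train_and_score(genotype):
    """Stand-in for decoding the genotype into an ANN, training it and
    returning a fitness value (e.g. 1 / (1 + cross-validation MSE))."""
    # Placeholder fitness: prefers small networks with moderate learning rates.
    penalty = sum(genotype["hidden"]) * 0.01 + abs(genotype["lr"] - 0.5)
    return 1.0 / (1.0 + penalty)

def evolve(pop_size=50, generations=100, p_c=0.9, p_m=0.01):
    population = [random_genotype() for _ in range(pop_size)]        # Step 1
    for _ in range(generations):                                     # Step 7
        fitness = [train_and_score(g) for g in population]           # Steps 2-3
        ranked = [g for _, g in sorted(zip(fitness, population),
                                       key=lambda t: t[0], reverse=True)]
        next_pop = ranked[:2]                                        # keep the two best (for illustration)
        while len(next_pop) < pop_size:
            p1, p2 = random.sample(ranked[:pop_size // 2], 2)        # Step 4: parents from the fitter half
            child = {"hidden": list(p1["hidden"]), "lr": p1["lr"]}
            if random.random() < p_c:                                # Step 5: crossover
                child["lr"] = p2["lr"]
            if random.random() < p_m:                                # Step 6: mutation
                child["hidden"][random.randrange(len(child["hidden"]))] = random.randint(1, 20)
            next_pop.append(child)
        population = next_pop
    return max(population, key=train_and_score)

print(evolve(pop_size=10, generations=5))
```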

From the initial population, subsequent populations are computed by employing three genetic operators: selection, crossover and mutation. More detailed treatments of these operators are found in the NeuroSolutions manual.

Selection Operator

Selection is a genetic operator that chooses a chromosome from the current generation's population for inclusion in the next generation's population. Before making it into the next generation's population, the selected chromosomes may undergo crossover and/or mutation, depending on the probabilities of crossover and mutation. The purpose of selection is to emphasise the fitter individuals in the population, in the hope that their offspring will in turn have even higher fitness (Melanie 1999). Selection has to be balanced with the variation introduced by crossover and mutation: if selection is too strong, suboptimal but highly fit individuals take over the population, reducing the diversity needed for further change and progress; if selection is too weak, evolution proceeds too slowly. Different selection methods, such as roulette and tournament selection, are available.

Crossover Operator

Crossover is a genetic operator that mates two chromosomes from the parental set to produce a new offspring chromosome. The idea behind crossover is that the new chromosome may be better than both of its parents if it takes the best characteristics from each of them. Crossover occurs during evolution according to a user-definable crossover probability P_c. Crossover can exhibit simultaneously high levels of preservation, survival and construction, since it shares information between fit individuals. Of the three genetic operators, crossover is the most crucial for obtaining global results; it is responsible for mixing the partial information contained in the strings of the population. Based on empirical evidence, reasonable values for the probability of crossover have been found to lie in the range of 0.6 to 0.99 (Jong 1975, Grefenstette 1986, Schaffer 1989). Different crossover operators, such as one-point and uniform crossover, are available.

Mutation Operator

Mutation is a genetic operator that alters one or more gene values in a chromosome from its initial state. It is an important part of the genetic search, as it helps to prevent the population from stagnating at a local optimum. Because mutation is considered a disruptor of new schemas, it is kept at a low value and is often implemented with a parameter that remains constant during the genetic algorithm search. Mutation can introduce entirely new gene values into the gene pool, and with these new gene values the genetic algorithm may be able to arrive at a better solution than was previously possible. Various mutation operators exist, such as flip bit, boundary, Gaussian and uniform mutation. Mutation forces diversity into the population and explores the search space, allowing the search to overcome local minima. Applying mutation too frequently, however, destroys the highly fit strings in the population, which slows and impedes convergence to the solution. Empirically, it has been found that reasonable values for the probability of mutation are small, of the order of 0.01 (Jong 1975, Grefenstette 1986, Schaffer 1989). A sketch of these three operators is given below.
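As an illustration of the three operators, the following sketch implements roulette-wheel selection, uniform crossover and uniform mutation for a simple fixed-length, real-valued chromosome. It is a minimal, generic example and is not taken from the NeuroSolutions implementation used in this work; the crossover probability of 0.9 and mixing ratio of 0.5 merely echo the values adopted later in this chapter.

```python
import random

def roulette_select(population, fitness):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitness)
    pick = random.uniform(0.0, total)
    running = 0.0
    for chrom, fit in zip(population, fitness):
        running += fit
        if running >= pick:
            return chrom
    return population[-1]

def uniform_crossover(parent1, parent2, p_c=0.9, mix_ratio=0.5):
    """With probability p_c, each gene of the child is taken from parent2
    with probability mix_ratio; otherwise the child copies parent1."""
    if random.random() > p_c:
        return list(parent1)
    return [g2 if random.random() < mix_ratio else g1
            for g1, g2 in zip(parent1, parent2)]

def uniform_mutation(chrom, p_m=0.01, low=0.0, high=1.0):
    """Each gene is replaced by a fresh random value with probability p_m."""
    return [random.uniform(low, high) if random.random() < p_m else g
            for g in chrom]

# Example: a population of 4 chromosomes, each with 5 genes in [0, 1].
population = [[random.random() for _ in range(5)] for _ in range(4)]
fitness = [sum(c) for c in population]          # toy fitness: sum of genes
p1 = roulette_select(population, fitness)
p2 = roulette_select(population, fitness)
child = uniform_mutation(uniform_crossover(p1, p2))
print(child)
```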

Parameters Setting for Genetic Algorithm

The next decision in implementing a genetic algorithm is to choose the values of the various parameters, such as population size, crossover rate and mutation rate. These parameters typically interact with one another nonlinearly, so they cannot be optimised one at a time. De Jong's experiments identified a best population size, a best single-point crossover rate of approximately 0.6 per pair of parents and a best (small) mutation rate per bit (Jong 1975). Schaffer et al. found that the best settings for population size, crossover rate and mutation rate were independent of the problem in their test suite (Schaffer 1992); these settings were similar to those found by Grefenstette, namely a population size of 20 to 30 together with the corresponding crossover and mutation rates (Grefenstette 1986).

For this research work, design parameters such as the number of inputs, the number of hidden layers, the learning rate, the learning parameter and the number of processing elements in the hidden layer are genetically optimised. The population size is fixed at 50 with a maximum of 100 generations. The operators of the genetic algorithm, including selection, crossover and mutation, were varied to obtain the required sensitivity and specificity. Of the numerous selection schemes, those available in the NeuroSolutions software were employed in this work: roulette, tournament, top and best selection. Different crossover methods, namely one-point, two-point and uniform crossover, were employed. Mutation was implemented with a parameter that was kept constant during the genetic algorithm search, as uniform mutation with probability 0.01. For one set of data, these parameters were applied and the performances measured are shown in Figures 4.2, 4.3 and 4.4 and in Table 4.1.

Figure 4.2 Performance of LM algorithm

Figure 4.3 Performance of GA with LM algorithm: Selection - Roulette, Crossover - Uniform, Mutation - Uniform

Figure 4.4 Performance of GA with LM algorithm: Selection - Best, Crossover - Uniform, Mutation - Uniform

Table 4.1 Performance of GA with different Selection and Crossover operators. For each selection operator (Roulette, Tournament, Top, Best) and each crossover operator (One Point, Two Point, Uniform), the table reports the generation at which the average fitness was reached and the corresponding MSE.

From the above figures and table, the performance measures can be compared; the average fitness of all individuals, evaluated over the evaluation steps, was analysed across the various genetic optimisation models. Of these models, best selection with uniform crossover (crossover probability 0.9, mixing ratio 0.5) gave the better results. Hence, in this work the selection method was kept as best selection, the crossover operator as uniform crossover with probability 0.9 and mixing ratio 0.5, and the mutation as uniform mutation with probability 0.01.

4.3 CLASSIFICATION METHODOLOGY

The performance characteristic of adaptive systems is used directly to change the parameters, through systematic procedures called learning or training rules, so that the system output improves with respect to the desired goal. Before learning, the data set was divided into a training set and a testing set.

Training Phase

During the learning process, the network adjusts its parameters, the synaptic weights, in response to a stimulus input so that its actual output converges to the desired output.

At this stage, the synaptic weights of each processing unit are dynamically modified to reach a defined error level according to an optimisation criterion called the learning algorithm. This is done in order to identify the best architecture with a given number of neurons for a specific problem. When the error falls below a threshold level and the actual output response matches the desired one, the network has completed the learning phase.

Testing Phase

Once the learning phase is completed, the network is subjected to the test inputs. Even if the classification error on the learning samples reaches zero, there is a probability that samples with similar but not identical characteristics will be misclassified. The testing set therefore comprises samples with characteristics similar, but not identical, to the learning ones.

Cross Validation

More features, more hidden units and longer training times enable the neural network to learn the data in its training set with greater accuracy. Sometimes, due to over fitting, instead of learning to approximate the function present in the data, the network simply memorises every training example. The noise in the training data is then memorised as part of the function, often destroying the ability of the network to generalise. With good generalisation as the goal, it becomes very difficult to define the stopping criterion by looking only at the training learning curve; in particular, the network may end up over fitting the training data if the training session is not stopped at the right time. Two popular techniques for improving the generalisation of artificial neural network models are Bayesian regularisation and cross-validated early stopping. As this research work uses the NeuroSolutions tool, the cross-validated early stopping method is used for improved generalisation. The onset of over fitting can be detected by using cross validation, in which the training exemplars are split into a training subset and a validation subset.

The training subset is used to train the network in the usual way, with one small modification: the training session is stopped periodically (every certain number of epochs) and the network is evaluated on the validation set after each training period. The early stopping heuristic suggests that the minimum point on the validation learning curve should be used as the point at which to stop the training session (Fiezelew 2007).

Parameter Setting for Various Learning Algorithms

After each pass through the training set, training was suspended and each vector in the cross validation set was fed to the neural network's input units. The value produced at the neural network's output unit in response to each vector was compared with the desired output value, and from this the MSE between the desired and actual output values was calculated over the entire cross validation set. The criterion for when to stop training was the number of passes through the training set that minimised the MSE on the cross validation set. To account for the possibility of local minima in error space, each architecture was trained several times, using a different random initialisation of the weights on each occasion. This was then repeated using the cross validation set as the training set and vice versa. For each learning technique of the ANN, the performance was evaluated by the Mean Squared Error (MSE), the Normalised Mean Squared Error (NMSE), the correlation coefficient (r) and the percentage of recognition. A sketch of the early stopping procedure is given below.
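The following sketch illustrates the cross-validated early stopping criterion just described. It is a schematic toy example, not the NeuroSolutions training code: a single-weight linear model stands in for the neural network, but the essential pattern, evaluating the cross validation MSE after every pass and retaining the weights of the pass that minimises it, is the same.

```python
import random

def early_stopping_fit(train_set, cv_set, max_passes=200, lr=0.1):
    """Fit a single weight w to data pairs (x, t) by gradient descent on MSE,
    keeping the w that minimises the MSE on the cross validation set."""
    def mse(w, data):
        return sum((t - w * x) ** 2 for x, t in data) / len(data)

    w = random.uniform(-1.0, 1.0)
    best_w, best_mse, best_pass = w, mse(w, cv_set), 0
    for n_pass in range(1, max_passes + 1):
        for x, t in train_set:                       # one pass through the training set
            w += lr * (t - w * x) * x                # delta-rule update
        cv_error = mse(w, cv_set)                    # evaluate on the cross validation set
        if cv_error < best_mse:                      # remember the best pass so far
            best_w, best_mse, best_pass = w, cv_error, n_pass
    return best_w, best_pass, best_mse               # weights from the minimum-CV-MSE pass

train = [(x / 10.0, 0.8 * x / 10.0 + random.gauss(0, 0.05)) for x in range(10)]
cv = [(x / 10.0, 0.8 * x / 10.0 + random.gauss(0, 0.05)) for x in range(5)]
print(early_stopping_fit(train, cv))
```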

Generally, while designing an ANN, one hidden layer is sufficient for most problems. Two hidden layers are required for modelling data with discontinuities such as a saw-tooth wave pattern. Using two hidden layers rarely improves the model and may introduce a greater risk of converging to a local minimum; there is no theoretical reason for using more than two hidden layers. Hence, in this work, for all learning algorithms the architecture was built with a maximum of two hidden layers. A simple experiment was conducted with different activation functions for the neurons in the hidden layer (Figure 4.5). The activation function at the output layer was kept as the softmax axon, as suggested in the NeuroSolutions tool manual.

Figure 4.5 Percentage classification for different transfer functions at the hidden layer

The percentage classification in Figure 4.5 suggests that the tanh activation is better, and hence it was used as the activation function for all neurons in the hidden layer for the various learning algorithms. For the output neurons, the softmax activation function was used.

4.4 DATA SET

The exemplars obtained from the sensor array were divided into three sets for training, cross validation and testing. Each exemplar consists of 12 values, corresponding to the responses of the 12 sensors. The data set obtained for E.coli identification contained 75 exemplars and is referred to throughout this chapter as data set 1 (DS1). Similarly, the data set for discriminating the life stages of E.coli contained 100 exemplars and is referred to throughout this chapter as data set 2 (DS2).

The typical proportion of training exemplars lies between 30% and 90% of the complete data set, and the typical proportion of cross validation exemplars lies between 10% and 90%. In this work, to find the optimum number of hidden layer neurons for deciding the network configuration, the numbers of exemplars for each class were kept almost equal and the data set was divided into training, cross validation and testing sets in the ratio of 70-80%, 10-15% and 10-15% respectively.

Data Set 1 (DS1)

Out of the 75 sets of readings, seventy percent of the randomised sample data set was used as the training set (53 samples), with 10 samples for cross validation and 12 samples for testing. The four different bacteria groups together with the one control medium were further grouped into three classes: E.coli, control and other pathogens. This grouping was done in order to identify E.coli and discriminate it from the others, including the control medium. The number of input neurons used is 12 and the number of output neurons is three.

Data Set 2 (DS2)

The data set consists of 100 samples, of which seventy-eight randomised samples were used for training, 7 for cross validation and 15 for testing. The output is grouped into four classes: lag, log, stationary and death phase. The number of input neurons used is 12 and the number of output neurons is four.
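As a small illustration of the partitioning described above, the sketch below randomises a list of exemplars and splits it in the stated proportions for DS1; the exemplar contents are placeholders, since the actual sensor responses are not reproduced here.

```python
import random

def split_exemplars(exemplars, n_train, n_cv, n_test):
    """Randomise the exemplars and split them into training,
    cross validation and testing sets of the given sizes."""
    assert n_train + n_cv + n_test == len(exemplars)
    shuffled = exemplars[:]
    random.shuffle(shuffled)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_cv],
            shuffled[n_train + n_cv:])

# DS1: 75 exemplars of 12 sensor readings each -> 53 / 10 / 12 split.
ds1 = [[0.0] * 12 for _ in range(75)]        # placeholder sensor responses
train, cv, test = split_exemplars(ds1, 53, 10, 12)
print(len(train), len(cv), len(test))        # 53 10 12
```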

4.5 MULTI LAYER PERCEPTRON (MLP)

The most common neural network model is the multi layer perceptron (MLP). It is a feed forward network (Figure 4.6). This network learns a correct mapping between input and output patterns via a learning algorithm; the process of learning used here is supervised learning.

Figure 4.6 Architecture of Multi Layer Perceptron Network

Design Approach

The most common form of learning is back propagation (BP), as it is easy to use, has few parameters to adjust and is applicable to a wide range of problems. Here the weights are changed based on their previous value and a correction term derived from the generalised delta learning rule (Haykin 2003). The error function E over the input patterns is given by Equation (4.1):

E = (1/2) Σ_{k=1}^{P} (t_k - y_k)^2     (4.1)

where y_k is the actual output for the k-th pattern, t_k is the desired output and P is the total number of training patterns. However, this algorithm has several shortcomings, such as the inability to guarantee that an arbitrary mapping will be learned and slow learning. Several alternative algorithms have been developed to improve the training speed and to reach the global minimum, such as back propagation with momentum, quick propagation, etc. (Fausett 1994). The MLP model used in this study was trained using different algorithms: back propagation with an adaptive learning rate and momentum, conjugate gradient, quick propagation, delta-bar-delta and the Levenberg-Marquardt algorithm.

All weights were initially set to small random values, and then the set of training inputs was presented sequentially to the network. The number of training epochs was fixed, all samples were subjected to batch learning, and the recommended values were used for all learning parameters.

Momentum

Empirical evidence shows that the use of a term called momentum in the back propagation algorithm can be helpful in speeding up convergence and avoiding local minima. This momentum term is known as the heavy ball method in numerical analysis. The role of momentum is to filter out rapid changes in the error surface (Phansalkar 1994); it keeps the weight changes going in the same direction even when a local minimum is encountered (Drago 1995). The momentum term is added to the weight update equation, and the value of momentum should be in the range 0 to 0.9, with a learning rate of 0.5 to 0.9 (Kevin 2005). The idea of using momentum is to stabilise the weight change by making non-radical revisions, combining the gradient-descent term with a fraction of the previous weight change (Yu 1993). The change in weight, Δw, is given by Equation (4.2):

Δw(i) = -η ∂E/∂w(i) + α Δw(i-1)     (4.2)

where Δ denotes the weight change step, η is the learning rate, α is the momentum and i is the index of the current weight change. In this work, the initial weights were set randomly up to 0.5 and the non-linearity offset was set to a fixed value. The momentum and the learning rate were varied from 0.1 to 0.9 and from 0.5 to 0.9 respectively. A sketch of the update is given below.
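A minimal sketch of the update in Equation (4.2) is shown below for a single weight. The gradient value is a stand-in for ∂E/∂w computed by back propagation, and the constants are illustrative rather than those used in the NeuroSolutions runs.

```python
def momentum_update(w, prev_dw, grad, lr=0.1, momentum=0.5):
    """One weight update with momentum: combine the gradient-descent step
    with a fraction of the previous weight change (Equation 4.2)."""
    dw = -lr * grad + momentum * prev_dw
    return w + dw, dw

# Toy example: minimise E(w) = (w - 2)^2, so dE/dw = 2 * (w - 2).
w, dw = 0.0, 0.0
for _ in range(30):
    w, dw = momentum_update(w, dw, grad=2.0 * (w - 2.0))
print(round(w, 3))   # approaches 2.0
```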

The performance analyses of this learning algorithm for various architectures are given in Figures 4.7, 4.8, 4.9 and 4.10.

Figure 4.7 Performance measure of MLP with momentum for various neuron elements in one hidden layer (DS1)

Figure 4.8 Performance measure of MLP with momentum for various neuron elements in two hidden layers (DS1)

Figure 4.9 Performance measure of MLP with momentum for various neuron elements in one hidden layer (DS2)

Figure 4.10 Performance measure of MLP with momentum for various neuron elements in two hidden layers (DS2)

From the figures it can be noted that the performance is good with momentum values of 0.9 for the hidden layer and 0.5 for the output layer. However, this learning algorithm does not perform well in classification, as verified by the percentage error obtained on the training and cross validation sets.

Conjugate Gradient

The conjugate gradient (CG) technique was developed by Hestenes and Stiefel in 1952 and was later improved by Moller in 1993 (Fletcher 1964). As an optimisation technique, the conjugate gradient method copes well with large numbers of weights. It performs a series of line searches across the error surface: it determines the direction of steepest descent, projects a line in that direction to locate the minimum, and makes one weight update per epoch. Another search is then performed along a conjugate direction from this point. This direction is chosen to ensure that all directions that have already been minimised remain minimised, under the assumption that the error surface is quadratic. If the quadratic assumption is wrong and the chosen direction does not slope downward, the algorithm falls back to the line of steepest descent and searches in that direction (Nawi 2006). Each epoch involves searching one specific direction. The resulting search does not generally follow the steepest descent, but it often converges faster than a search along the steepest direction, as it searches one direction at a time. As the algorithm moves closer to the minimum point, the quadratic assumption becomes more accurate and the minimum is then located quickly (Bayati 2009). The update equation is given by Equation (4.3):

w_i^(k+1) = w_i^k + 2 η ε^k x_i^k     (4.3)

where η is the learning rate, ε^k is the error at iteration step k, x_i^k is the input value to weight i at iteration k and w_i^k is the value of weight i at iteration k. In this work, the initial weights were set randomly up to 0.5 and the non-linearity offset was set to a fixed value. The learning rate was varied between 0.5 and 0.9. A sketch of the conjugate direction search is given below.
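To illustrate the conjugate direction idea (rather than the NeuroSolutions implementation), the sketch below applies the classical conjugate gradient method to a small quadratic error surface E(w) = 0.5 w^T A w - b^T w, for which the line search step has a closed form. On a quadratic surface of dimension n it reaches the minimum in at most n iterations; the matrix A and vector b are purely illustrative.

```python
def conjugate_gradient(A, b, w, iters=2):
    """Minimise E(w) = 0.5*w^T A w - b^T w (gradient A w - b) by
    searching along successive conjugate directions."""
    def mat_vec(M, v):
        return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    r = [bi - avi for bi, avi in zip(b, mat_vec(A, w))]     # negative gradient
    d = list(r)                                             # first direction: steepest descent
    for _ in range(iters):
        Ad = mat_vec(A, d)
        alpha = dot(r, r) / dot(d, Ad)                      # exact line search on the quadratic
        w = [wi + alpha * di for wi, di in zip(w, d)]
        r_new = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        beta = dot(r_new, r_new) / dot(r, r)                # Fletcher-Reeves coefficient
        d = [rni + beta * di for rni, di in zip(r_new, d)]  # next conjugate direction
        r = r_new
    return w

# Example: the minimum of E lies at w = [1, 2] for this A and b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [5.0, 5.0]
print(conjugate_gradient(A, b, w=[0.0, 0.0]))
```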

Similar to MLP with momentum, the performance measures can be analysed from Figures 4.11, 4.12, 4.13 and 4.14.

Figure 4.11 Performance measure of MLP with CG for various neuron elements in one hidden layer (DS1)

Figure 4.12 Performance measure of MLP with CG for various neuron elements in two hidden layers (DS1)

Figure 4.13 Performance measure of MLP with CG for various neuron elements in one hidden layer (DS2)

Figure 4.14 Performance measure of MLP with CG for various neuron elements in two hidden layers (DS2)

From the figures, it can be noted that the performance is satisfactory with five neurons in one hidden layer for both DS1 and DS2. For two hidden layers, the best architectures found for DS1 and for DS2 give a comparatively low MSE. However, this learning algorithm does not perform well in classification, as verified by the percentage error obtained on the training and cross validation sets.

Levenberg Marquardt

The Levenberg-Marquardt (LM) algorithm is basically a Hessian-based algorithm for nonlinear least squares optimisation. Hessian-based algorithms allow the network to learn more subtle features of a complicated mapping (Deepak 2005). The training process converges quickly as the solution is approached, because the Hessian does not vanish at the solution. In the LM algorithm, all inputs are presented to the network and their corresponding outputs are computed; for those outputs, the mean squared errors are calculated (Fletcher 1987). Then the Jacobian matrix J(z) is computed, where z represents the weights and biases of the network, and the Levenberg-Marquardt weight update equation is solved to obtain Δz. The error is recomputed using z + Δz. If this new error is smaller than the previously computed error, the training parameter μ is reduced by a user-defined factor; if the error is not reduced, μ is increased by a user-defined factor. This is repeated until the training process reaches a minimum error. The algorithm is assumed to have converged when the norm of the gradient is less than some predetermined value, or when the error has been reduced to some error goal. The weight update vector Δz is calculated from Equations (4.4) and (4.5):

Δz = [J^T(z) J(z) + μI]^(-1) J^T(z) E     (4.4)

where E is the error vector of size P, calculated as

E = [t_1 - y_1, t_2 - y_2, ..., t_P - y_P]^T     (4.5)

Here J^T(z)J(z) is referred to as the Hessian matrix, I is the identity matrix and μ is the learning parameter. For μ = 0 the algorithm becomes the Gauss-Newton method; for very large μ the LM algorithm becomes steepest descent, that is, the error back propagation algorithm. The parameter μ is automatically adjusted at each iteration in order to secure convergence. The LM algorithm requires computation of the Jacobian matrix J(z) and the inversion of the square matrix J^T(z)J(z) at each iteration step. The initial value of μ was set to a small value; μ was incremented by a factor of 10 when the error increased and decremented by a factor of 0.1 when the error decreased. A sketch of a single LM step is given below.
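As an illustration of Equations (4.4) and (4.5), the sketch below applies the LM update to a one-parameter model y = exp(a·x), for which J^T J + μI reduces to a scalar. It is a toy example under these assumptions, not the NeuroSolutions network training code, but the accept/reject logic for μ follows the description above.

```python
import math

def lm_fit(xs, ts, a=0.0, mu=0.001, iters=20):
    """Levenberg-Marquardt fit of the one-parameter model y = exp(a*x)."""
    def errors(a):
        return [t - math.exp(a * x) for x, t in zip(xs, ts)]       # E, Equation (4.5)
    def sse(e):
        return sum(ei * ei for ei in e)

    e = errors(a)
    for _ in range(iters):
        jac = [x * math.exp(a * x) for x in xs]                    # dy/da for each pattern
        # Equation (4.4) in scalar form: delta = (J^T E) / (J^T J + mu)
        delta = sum(j * ei for j, ei in zip(jac, e)) / (sum(j * j for j in jac) + mu)
        a_new = a + delta
        e_new = errors(a_new)
        if sse(e_new) < sse(e):          # error reduced: accept the step, decrease mu
            a, e, mu = a_new, e_new, mu * 0.1
        else:                            # error increased: reject the step, increase mu
            mu *= 10.0
    return a

# Synthetic data generated with a = 0.7; the fit should recover roughly 0.7.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ts = [math.exp(0.7 * x) for x in xs]
print(round(lm_fit(xs, ts), 3))
```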

The performance measures are analysed in Figures 4.15, 4.16, 4.17 and 4.18.

Figure 4.15 Performance measure of MLP with LM for various neuron elements in one hidden layer (DS1)

Figure 4.16 Performance measure of MLP with LM for various neuron elements in two hidden layers (DS1)

Figure 4.17 Performance measure of MLP with LM for various neuron elements in one hidden layer (DS2)

Figure 4.18 Performance measure of MLP with LM for various neuron elements in two hidden layers (DS2)

From the figures, it can be noted that the performance is satisfactory with the selected architectures for DS1 and DS2. Moreover, this learning algorithm does perform well in classification, as verified by the percentage error it gives on the training and cross validation sets.

Quick Propagation

Quick Propagation (QP), formulated by Patterson, is a batch update algorithm. It works out the average gradient of the error surface across all cases before updating the weights once at the end of the epoch. QP works by assuming that the error surface is locally quadratic, with the axes of the hyper-ellipsoid error surface aligned with the weights. The algorithm converges on the minimum very rapidly. The weight changes Δw are calculated using the quick propagation formula given in Equation (4.6):

Δw(t) = a Δw(t-1)     (4.6)

where a is the acceleration coefficient. In this work, the initial weights were set randomly up to 0.5 and the non-linearity offset was set to a fixed value. The performance analyses of this learning algorithm for various architectures are given in Figures 4.19 and 4.20.

Figure 4.19 Performance measure of MLP with QP for various neuron elements in one hidden layer (DS1)

Figure 4.20 Performance measure of MLP with QP for various neuron elements in one hidden layer (DS2)

From the figures, it can be noted that the performance is good with momentum values of 0.8 for the hidden layer and 0.5 for the output layer. For DS1 and DS2, particular one-hidden-layer architectures were found to perform better than the others.

Delta Bar Delta

Delta bar delta was developed by Robert Jacobs in 1988 to improve the learning rate of standard back propagation networks. The delta bar delta network uses the same architecture as a back propagation network; the difference lies in its unique algorithmic method of learning. The delta bar delta paradigm uses a learning method in which each weight has its own self-adapting coefficient, and it does not use the momentum factor of the back propagation architecture. The remaining operations of the network, such as feed forward recall, are identical to the normal back propagation architecture. Delta bar delta is a heuristic approach to training artificial networks: past error values can be used to infer future error values, and knowing the probable errors enables the system to take intelligent steps in adjusting the weights. Every connection weight of a network should have its own learning rate, the claim being that the step size appropriate for one connection weight may not be appropriate for all weights in that layer; furthermore, these learning rates should be allowed to vary over time. By assigning a learning rate to each connection and permitting this learning rate to change continuously over time, more degrees of freedom are introduced, which reduces the time to convergence. With different learning rates permitted for each connection weight, the connection weights are updated on the basis of the partial derivatives of the error with respect to the weight itself. The learning rate update rule is given by Equation (4.7).

Δη_ij(n+1) = κ,            if S_ij(n-1) D_ij(n) > 0
           = -φ η_ij(n),   if S_ij(n-1) D_ij(n) < 0
           = 0,            otherwise     (4.7)

where η_ij is the learning rate for each weight, D_ij(n) is the gradient, κ and φ are the constant increment and the decrement factor respectively, and S_ij(n) = (1 - ξ) D_ij(n) + ξ S_ij(n-1), where ξ is a small constant.

In this work, the initial weights were set randomly up to 0.5 and the non-linearity offset was set to a fixed value. The learning rate was varied between 0.5 and 0.9. A brief sketch of the adaptation rule is given below.
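The sketch below illustrates the per-weight learning rate adaptation of Equation (4.7) on a single weight. The values chosen for κ, φ and ξ and the toy gradient are illustrative assumptions rather than the NeuroSolutions settings.

```python
def delta_bar_delta_step(eta, s_prev, grad, kappa=0.01, phi=0.2, xi=0.7):
    """Adapt one weight's learning rate as in Equation (4.7) and return the
    new learning rate together with the updated smoothed gradient S."""
    if s_prev * grad > 0:        # consistent gradient sign: increase eta by a constant
        eta += kappa
    elif s_prev * grad < 0:      # sign change: decrease eta proportionally
        eta -= phi * eta
    s = (1.0 - xi) * grad + xi * s_prev   # exponentially smoothed gradient trace
    return eta, s

# Toy run: minimise E(w) = (w - 1)^2 with a per-weight adaptive learning rate.
w, eta, s = 4.0, 0.05, 0.0
for _ in range(50):
    grad = 2.0 * (w - 1.0)
    eta, s = delta_bar_delta_step(eta, s, grad)
    w -= eta * grad              # weight update with the adapted learning rate
print(round(w, 3))               # approaches 1.0
```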

The performance analyses of this learning algorithm for various architectures are given in Figures 4.21, 4.22, 4.23 and 4.24.

Figure 4.21 Performance measure of MLP with DBD for various neuron elements in one hidden layer (DS1)

Figure 4.22 Performance measure of MLP with DBD for various neuron elements in two hidden layers (DS1)

Figure 4.23 Performance measure of MLP with DBD for various neuron elements in one hidden layer (DS2)

Figure 4.24 Performance measure of MLP with DBD for various neuron elements in two hidden layers (DS2)

From the figures, it can be noted that the performance is satisfactory with the selected architectures for DS1 and DS2. However, this learning algorithm does not perform well in classification, as verified by the percentage error it gives on the training and cross validation sets.

Performance Measures for MLP for Various Learning Algorithms

A breadboard created for the MLP using the NeuroSolutions software is shown in Figure 4.25; it contains the genetic control inspector, the learning inspector (here Levenberg-Marquardt), the Axon inspector and the TanhAxon inspector. To find the best learning algorithm among the different algorithms, the data sets DS1 and DS2 were subjected to training, cross validation and testing.

Figure 4.25 Active Breadboard for MLP

For learning using momentum, breadboards were saved with one and two hidden layers and various initial weights. Out of the various training runs, the better result was obtained from an architecture with two hidden layers, nine processing elements in the first hidden layer and eight processing elements in the second hidden layer for DS1. The momentum values were 0.9, 0.9 and 0.5 for the first hidden, second hidden and output layer respectively. This architecture gave a correlation coefficient of 47.9% on cross validation. It recognised the log phase with only 25% accuracy, and the lag, stationary and death phases with 80%, 66.6% and 100% accuracy respectively. For quick propagation, the momentum values were kept at 0.8 and 0.5 for the hidden and output layer respectively; for DS2, an architecture with one hidden layer of 14 processing elements gives the better result. Similarly, for all learning algorithms, the satisfactorily rated results for their best-suited architectures were tabulated according to the correlation percentage and MSE on the cross validation set (Figures 4.26 and 4.27).

Figure 4.26 Performance measure of MLP for various training algorithms (DS1)

Figure 4.27 Performance measure of MLP for various training algorithms (DS2)

From Figures 4.26 and 4.27, it can be noted that for E.coli discrimination (DS1) and for life phase identification (DS2), the LM learning algorithm gives lower percentage errors than the other algorithms. It gives a correlation coefficient of 99.71% for DS1 and a correlation coefficient of 93.93% for DS2.

4.6 PRINCIPAL COMPONENT ANALYSIS NEURAL NETWORK (PCANN)

The Principal Component Analysis Neural Network is a feed forward neural network. A feed forward network is usually trained by an external teacher, i.e. by supervised learning, in which the network separates the input parameter space on the basis of features associated with it during learning. A feed forward network may also be trained unsupervised, according to learning rules which impose a certain condition on its output. Unsupervised feed forward neural networks measure the correlation of the input data, identify certain features or perform principal component analysis (Hertz 1991). A Principal Component Analysis (PCA) network is simply a procedure for plain Hebbian learning with constrained weight vector growth; it measures the correlation of the input data by identifying certain features. The Hebbian learning rule is a biologically inspired scheme strongly associated with unsupervised learning (Kung 1994). For the unsupervised training of feed forward networks, two robust learning rules are available for implementing the Hebbian principle: Oja's and Sanger's learning rules. Of the two, Sanger's rule was preferred for PCA, as it naturally orders the PCA components by magnitude.

Sanger's PCA

Sanger proposed a learning rule for the unsupervised training of a single-layered neural network formed by linear units, implementing a PCA network (Figure 4.28). This network consists of a Sanger's synapse that compresses the input data down to its N largest "components", where N denotes the number of processing elements in the axon (Sanger 1989, Kung 1994, Haykin 2003). Thus, the actual information contained in the input can be efficiently compressed.

Figure 4.28 General architecture for Sanger's PCA

Sanger's rule finds exactly the first N principal components, and the update equation is given by Equation (4.8):

Δw_ij = η y_j ( x_i - Σ_{k=1}^{j} w_ik y_k )     (4.8)

where w_ij defines the synaptic weight, or connection strength, between the i-th input and the j-th output neuron, x and y are the input and output vectors respectively, and η is the learning rate parameter. A sketch of this update is given below.
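As an illustration of Equation (4.8), the sketch below applies Sanger's rule to a small set of two-dimensional inputs. It is a minimal, generic example (not the NeuroSolutions Sanger synapse), extracting N = 2 components whose weight vectors move towards the leading principal directions of the data; the data set and learning rate are assumptions made for the example.

```python
import random

def sanger_update(W, x, eta=0.05):
    """One Sanger's rule update (Equation 4.8). W[j][i] holds the weight from
    input i to output j; the sum over k runs only up to the current output j."""
    y = [sum(W[j][i] * x[i] for i in range(len(x))) for j in range(len(W))]
    for j in range(len(W)):
        for i in range(len(x)):
            back = sum(W[k][i] * y[k] for k in range(j + 1))   # reconstruction up to unit j
            W[j][i] += eta * y[j] * (x[i] - back)
    return W

# Toy data: zero-mean 2-D samples stretched along the direction (1, 1).
random.seed(0)
data = [(t + random.gauss(0, 0.1), t + random.gauss(0, 0.1))
        for t in [random.uniform(-1, 1) for _ in range(200)]]

W = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(2)]  # 2 outputs x 2 inputs
for _ in range(50):                 # repeated passes over the data
    for x in data:
        W = sanger_update(W, list(x))
print([[round(w, 2) for w in row] for row in W])   # first row close to +/-(0.71, 0.71)
```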

Performance Measures for PCANN for Various Learning Algorithms

As in the competitive network, the PCA network can be used as a pre-processor for a supervised network. The PCA network significantly reduces the dimensionality of the input to the supervised network, creating a smaller, easier-to-train network. The outputs are related to the eigenvalues and can be used as input to another supervised network for classification. The supervised network used in this investigation is the Multi Layer Perceptron, trained with back propagation by the Levenberg-Marquardt (LM) algorithm. The active breadboard is shown in Figure 4.29, and the performance analyses for various numbers of principal components are shown in Figures 4.30 and 4.31.

Figure 4.29 Active Breadboard for PCANN

Figure 4.30 Performance measure of various PCANN (DS1)

Figure 4.31 Performance measure of various PCANN (DS2)

From the analysis, it can be stated that six principal components with 8 processing elements in one hidden layer give the better performance for DS1, while eight principal components with 8 processing elements in one hidden layer give the better performance for DS2.

4.7 SUPPORT VECTOR MACHINE (SVM)

The SVM is a supervised machine learning technique. Instead of minimising the error on the training data, like ordinary neural networks, the Support Vector Machine minimises a bound on the expected generalisation error (Haykin 2003, Gunn 1998). The benefits of the SVM are that it gives a globally optimal solution and that its computational time is much less than that of other computing techniques. It gives a sparser model solution and minimises the expected generalisation error, rather than the empirical error, on the training data (Cortes 1995).

The SVM technique relies on a margin on either side of a hyperplane that separates the data classes. The margin is maximised by creating the largest possible distance between the separating hyperplanes, which reduces the upper bound on the expected generalisation error (Burges 1998, Xiao-Dong 2005, Pardo 2005). An optimum hyperplane can be found by minimising the squared norm of the separating hyperplane, with certain data points found to lie on its margin. These data points are called Support Vector (SV) points, and the optimal solution is obtained by expanding over these points only. Once the hyperplane has been created, the kernel function is used to map new points into the feature space for classification. A great benefit arises from the kernel formulation: though it is possible to design specific kernel mappings incorporating domain knowledge that separate the training data precisely, it is much easier to use one from a family of kernel functions that represent familiar machine learning techniques. Different kernels such as the polynomial, Gaussian radial basis function, exponential radial basis function, multilayer perceptron, Fourier series and spline kernels are available.

Radial Basis Function

In this work the kernel is the Radial Basis Function (RBF), which places a Gaussian at each data sample. Here, the RBF network uses back propagation to train a linear combination of the Gaussians to produce the final result, whereas the SVM uses the idea of large margin classifiers for training. This decouples the capacity of the classifier from the input space and at the same time provides good generalisation.

Kernel Adatron Algorithm

In this work, the Kernel Adatron (KA) algorithm extended to the RBF network is used to implement the SVM. The advantages of the Adatron algorithm are its simple structure and the fact that its implementation is simple and straightforward.

It allows the implementation of both hard and soft margin SVMs with very few changes. The algorithm recasts the Adatron in the feature space of the SVM and is hence called the Kernel Adatron, or KA, algorithm. It precomputes the inner products (the kernel computation) and is iterative: it maps the inputs to a high dimensional feature space and then optimally separates the data into their respective classes by isolating those inputs which fall close to the class boundaries. Therefore, this algorithm is especially effective in separating sets of data which share complex boundaries. The step size should be chosen experimentally. The algorithm is as given below:

Step 0: Define
        f_AD(x_i) = y_i Σ_{j=1}^{m} y_j α_j K(x_i, x_j) + b
        M_AD = min_i f_AD(x_i), i ∈ {1, ..., m}
Step 1: Initialisation: set the Lagrange multipliers α_i, i ∈ {1, ..., m}, the learning rate η, the bias b and a small threshold t.
Step 2: While M_AD ≤ t:
Step 3:     Choose a pattern x_i, i ∈ {1, ..., m}.
Step 4:     Calculate the update Δα_i = η (1 - f_AD(x_i)).
Step 5:     If (α_i + Δα_i) > 0, then set α_i ← α_i + Δα_i and b ← b + y_i Δα_i.
Step 6: End while.

This algorithm is called the Kernel Adatron and can adapt an RBF network to have an optimal margin. A sketch of the procedure is given below.
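The following sketch implements the steps above for a Gaussian (RBF) kernel on a tiny two-class problem. It is an illustrative reading of the algorithm, with an assumed stopping rule based on a fixed number of sweeps rather than the margin threshold t and an assumed clipping of negative multipliers; it is not the NeuroSolutions component.

```python
import math

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel K(a, b) = exp(-||a - b||^2 / (2*sigma^2))."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def kernel_adatron(X, y, eta=0.5, sweeps=100):
    """Train the multipliers and bias with the Kernel Adatron update (Steps 2-5)."""
    m = len(X)
    K = [[rbf_kernel(X[i], X[j]) for j in range(m)] for i in range(m)]  # Step 0 precomputation
    alpha = [0.1] * m          # Step 1: initial Lagrange multipliers
    b = 0.0                    # Step 1: initial bias
    for _ in range(sweeps):    # stand-in for the "while M_AD <= t" loop
        for i in range(m):     # Step 3: visit each pattern in turn
            f_ad = y[i] * (sum(y[j] * alpha[j] * K[i][j] for j in range(m)) + b)
            delta = eta * (1.0 - f_ad)          # Step 4
            if alpha[i] + delta > 0:            # Step 5
                alpha[i] += delta
                b += y[i] * delta
            else:
                alpha[i] = 0.0                  # assumed clipping for the margin constraint
    return alpha, b

def classify(x, X, y, alpha, b):
    return 1 if sum(y[j] * alpha[j] * rbf_kernel(x, X[j]) for j in range(len(X))) + b > 0 else -1

# Tiny two-class example: class +1 near (1, 1), class -1 near (-1, -1).
X = [[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]]
y = [1, 1, -1, -1]
alpha, b = kernel_adatron(X, y)
print(classify([0.9, 1.1], X, y, alpha, b), classify([-1.1, -0.9], X, y, alpha, b))
```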

Performance Measures for SVM

The typical SVM built is shown by the active breadboard (Figure 4.32). The first three components implement the expansion of the dimensionality by placing a Gaussian at each input; the second three components implement the large margin classifier that trains the parameters. The step size was varied from 0.01 to 0.1 and the performance can be analysed from Figures 4.33 and 4.34. From these figures, it can be noted that a step size of 0.02 gives satisfactory results for DS1 and a step size of 0.06 gives satisfactory results for DS2.

Figure 4.32 Active Breadboard for SVM

Figure 4.33 Performance measure of SVM for various step sizes (DS1)

Figure 4.34 Performance measure of SVM for various step sizes (DS2)

4.8 INFERENCE

Out of the different algorithms used, the MLP with LM learning proved to be more efficient than the others, as explained in Chapter 6.


More information

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper

More information

Dr. Qadri Hamarsheh Supervised Learning in Neural Networks (Part 1) learning algorithm Δwkj wkj Theoretically practically

Dr. Qadri Hamarsheh Supervised Learning in Neural Networks (Part 1) learning algorithm Δwkj wkj Theoretically practically Supervised Learning in Neural Networks (Part 1) A prescribed set of well-defined rules for the solution of a learning problem is called a learning algorithm. Variety of learning algorithms are existing,

More information

The Binary Genetic Algorithm. Universidad de los Andes-CODENSA

The Binary Genetic Algorithm. Universidad de los Andes-CODENSA The Binary Genetic Algorithm Universidad de los Andes-CODENSA 1. Genetic Algorithms: Natural Selection on a Computer Figure 1 shows the analogy between biological i l evolution and a binary GA. Both start

More information

Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach

Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach 1 Sparse Matrices Reordering using Evolutionary Algorithms: A Seeded Approach David Greiner, Gustavo Montero, Gabriel Winter Institute of Intelligent Systems and Numerical Applications in Engineering (IUSIANI)

More information

Identification of intelligent computational models by evolutionary and gradient based learning algorithms

Identification of intelligent computational models by evolutionary and gradient based learning algorithms Identification of intelligent computational models by evolutionary and gradient based learning algorithms Ph.D. Thesis Booklet János Botzheim Supervisor: László T. Kóczy, Ph.D., D.Sc. Budapest University

More information

Multi-Objective Optimization Using Genetic Algorithms

Multi-Objective Optimization Using Genetic Algorithms Multi-Objective Optimization Using Genetic Algorithms Mikhail Gaerlan Computational Physics PH 4433 December 8, 2015 1 Optimization Optimization is a general term for a type of numerical problem that involves

More information

IN recent years, neural networks have attracted considerable attention

IN recent years, neural networks have attracted considerable attention Multilayer Perceptron: Architecture Optimization and Training Hassan Ramchoun, Mohammed Amine Janati Idrissi, Youssef Ghanou, Mohamed Ettaouil Modeling and Scientific Computing Laboratory, Faculty of Science

More information

Review: Final Exam CPSC Artificial Intelligence Michael M. Richter

Review: Final Exam CPSC Artificial Intelligence Michael M. Richter Review: Final Exam Model for a Learning Step Learner initially Environm ent Teacher Compare s pe c ia l Information Control Correct Learning criteria Feedback changed Learner after Learning Learning by

More information

Lecture 6: Genetic Algorithm. An Introduction to Meta-Heuristics, Produced by Qiangfu Zhao (Since 2012), All rights reserved

Lecture 6: Genetic Algorithm. An Introduction to Meta-Heuristics, Produced by Qiangfu Zhao (Since 2012), All rights reserved Lecture 6: Genetic Algorithm An Introduction to Meta-Heuristics, Produced by Qiangfu Zhao (Since 2012), All rights reserved Lec06/1 Search and optimization again Given a problem, the set of all possible

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

MLPQNA-LEMON Multi Layer Perceptron neural network trained by Quasi Newton or Levenberg-Marquardt optimization algorithms

MLPQNA-LEMON Multi Layer Perceptron neural network trained by Quasi Newton or Levenberg-Marquardt optimization algorithms MLPQNA-LEMON Multi Layer Perceptron neural network trained by Quasi Newton or Levenberg-Marquardt optimization algorithms 1 Introduction In supervised Machine Learning (ML) we have a set of data points

More information

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs) Data Mining: Concepts and Techniques Chapter 9 Classification: Support Vector Machines 1 Support Vector Machines (SVMs) SVMs are a set of related supervised learning methods used for classification Based

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

Dynamic Analysis of Structures Using Neural Networks

Dynamic Analysis of Structures Using Neural Networks Dynamic Analysis of Structures Using Neural Networks Alireza Lavaei Academic member, Islamic Azad University, Boroujerd Branch, Iran Alireza Lohrasbi Academic member, Islamic Azad University, Boroujerd

More information

Monika Maharishi Dayanand University Rohtak

Monika Maharishi Dayanand University Rohtak Performance enhancement for Text Data Mining using k means clustering based genetic optimization (KMGO) Monika Maharishi Dayanand University Rohtak ABSTRACT For discovering hidden patterns and structures

More information

Chap.12 Kernel methods [Book, Chap.7]

Chap.12 Kernel methods [Book, Chap.7] Chap.12 Kernel methods [Book, Chap.7] Neural network methods became popular in the mid to late 1980s, but by the mid to late 1990s, kernel methods have also become popular in machine learning. The first

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Approximation Algorithms and Heuristics November 21, 2016 École Centrale Paris, Châtenay-Malabry, France Dimo Brockhoff Inria Saclay Ile-de-France 2 Exercise: The Knapsack

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Approximation Algorithms and Heuristics November 6, 2015 École Centrale Paris, Châtenay-Malabry, France Dimo Brockhoff INRIA Lille Nord Europe 2 Exercise: The Knapsack Problem

More information

Chapter 14 Global Search Algorithms

Chapter 14 Global Search Algorithms Chapter 14 Global Search Algorithms An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Introduction We discuss various search methods that attempts to search throughout the entire feasible set.

More information

Artificial Intelligence Application (Genetic Algorithm)

Artificial Intelligence Application (Genetic Algorithm) Babylon University College of Information Technology Software Department Artificial Intelligence Application (Genetic Algorithm) By Dr. Asaad Sabah Hadi 2014-2015 EVOLUTIONARY ALGORITHM The main idea about

More information

Introduction to Genetic Algorithms

Introduction to Genetic Algorithms Advanced Topics in Image Analysis and Machine Learning Introduction to Genetic Algorithms Week 3 Faculty of Information Science and Engineering Ritsumeikan University Today s class outline Genetic Algorithms

More information

Lecture 4. Convexity Robust cost functions Optimizing non-convex functions. 3B1B Optimization Michaelmas 2017 A. Zisserman

Lecture 4. Convexity Robust cost functions Optimizing non-convex functions. 3B1B Optimization Michaelmas 2017 A. Zisserman Lecture 4 3B1B Optimization Michaelmas 2017 A. Zisserman Convexity Robust cost functions Optimizing non-convex functions grid search branch and bound simulated annealing evolutionary optimization The Optimization

More information

GENETIC ALGORITHM with Hands-On exercise

GENETIC ALGORITHM with Hands-On exercise GENETIC ALGORITHM with Hands-On exercise Adopted From Lecture by Michael Negnevitsky, Electrical Engineering & Computer Science University of Tasmania 1 Objective To understand the processes ie. GAs Basic

More information

An evolutionary annealing-simplex algorithm for global optimisation of water resource systems

An evolutionary annealing-simplex algorithm for global optimisation of water resource systems FIFTH INTERNATIONAL CONFERENCE ON HYDROINFORMATICS 1-5 July 2002, Cardiff, UK C05 - Evolutionary algorithms in hydroinformatics An evolutionary annealing-simplex algorithm for global optimisation of water

More information

HYBRID GENETIC ALGORITHM WITH GREAT DELUGE TO SOLVE CONSTRAINED OPTIMIZATION PROBLEMS

HYBRID GENETIC ALGORITHM WITH GREAT DELUGE TO SOLVE CONSTRAINED OPTIMIZATION PROBLEMS HYBRID GENETIC ALGORITHM WITH GREAT DELUGE TO SOLVE CONSTRAINED OPTIMIZATION PROBLEMS NABEEL AL-MILLI Financial and Business Administration and Computer Science Department Zarqa University College Al-Balqa'

More information

In this assignment, we investigated the use of neural networks for supervised classification

In this assignment, we investigated the use of neural networks for supervised classification Paul Couchman Fabien Imbault Ronan Tigreat Gorka Urchegui Tellechea Classification assignment (group 6) Image processing MSc Embedded Systems March 2003 Classification includes a broad range of decision-theoric

More information

Pattern Classification Algorithms for Face Recognition

Pattern Classification Algorithms for Face Recognition Chapter 7 Pattern Classification Algorithms for Face Recognition 7.1 Introduction The best pattern recognizers in most instances are human beings. Yet we do not completely understand how the brain recognize

More information

Traffic Signal Control Based On Fuzzy Artificial Neural Networks With Particle Swarm Optimization

Traffic Signal Control Based On Fuzzy Artificial Neural Networks With Particle Swarm Optimization Traffic Signal Control Based On Fuzzy Artificial Neural Networks With Particle Swarm Optimization J.Venkatesh 1, B.Chiranjeevulu 2 1 PG Student, Dept. of ECE, Viswanadha Institute of Technology And Management,

More information

Outline. Motivation. Introduction of GAs. Genetic Algorithm 9/7/2017. Motivation Genetic algorithms An illustrative example Hypothesis space search

Outline. Motivation. Introduction of GAs. Genetic Algorithm 9/7/2017. Motivation Genetic algorithms An illustrative example Hypothesis space search Outline Genetic Algorithm Motivation Genetic algorithms An illustrative example Hypothesis space search Motivation Evolution is known to be a successful, robust method for adaptation within biological

More information

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION 6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm

More information

Hierarchical Learning Algorithm for the Beta Basis Function Neural Network

Hierarchical Learning Algorithm for the Beta Basis Function Neural Network Third International Conference on Systems, Signals & Devices March 2-24, 2005 Sousse, Tunisia Volume III Communication and Signal Processing Hierarchical Learning Algorithm for the Beta Basis Function

More information

RIMT IET, Mandi Gobindgarh Abstract - In this paper, analysis the speed of sending message in Healthcare standard 7 with the use of back

RIMT IET, Mandi Gobindgarh Abstract - In this paper, analysis the speed of sending message in Healthcare standard 7 with the use of back Global Journal of Computer Science and Technology Neural & Artificial Intelligence Volume 13 Issue 3 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global

More information

Reducing Graphic Conflict In Scale Reduced Maps Using A Genetic Algorithm

Reducing Graphic Conflict In Scale Reduced Maps Using A Genetic Algorithm Reducing Graphic Conflict In Scale Reduced Maps Using A Genetic Algorithm Dr. Ian D. Wilson School of Technology, University of Glamorgan, Pontypridd CF37 1DL, UK Dr. J. Mark Ware School of Computing,

More information

Classification: Linear Discriminant Functions

Classification: Linear Discriminant Functions Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial

More information

Using Genetic Algorithms in Integer Programming for Decision Support

Using Genetic Algorithms in Integer Programming for Decision Support Doi:10.5901/ajis.2014.v3n6p11 Abstract Using Genetic Algorithms in Integer Programming for Decision Support Dr. Youcef Souar Omar Mouffok Taher Moulay University Saida, Algeria Email:Syoucef12@yahoo.fr

More information

Introduction to Design Optimization: Search Methods

Introduction to Design Optimization: Search Methods Introduction to Design Optimization: Search Methods 1-D Optimization The Search We don t know the curve. Given α, we can calculate f(α). By inspecting some points, we try to find the approximated shape

More information

Artificial Neural Networks MLP, RBF & GMDH

Artificial Neural Networks MLP, RBF & GMDH Artificial Neural Networks MLP, RBF & GMDH Jan Drchal drchajan@fel.cvut.cz Computational Intelligence Group Department of Computer Science and Engineering Faculty of Electrical Engineering Czech Technical

More information

Genetic Algorithms. Kang Zheng Karl Schober

Genetic Algorithms. Kang Zheng Karl Schober Genetic Algorithms Kang Zheng Karl Schober Genetic algorithm What is Genetic algorithm? A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization

More information

Support Vector Machines (a brief introduction) Adrian Bevan.

Support Vector Machines (a brief introduction) Adrian Bevan. Support Vector Machines (a brief introduction) Adrian Bevan email: a.j.bevan@qmul.ac.uk Outline! Overview:! Introduce the problem and review the various aspects that underpin the SVM concept.! Hard margin

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

Clustering with Reinforcement Learning

Clustering with Reinforcement Learning Clustering with Reinforcement Learning Wesam Barbakh and Colin Fyfe, The University of Paisley, Scotland. email:wesam.barbakh,colin.fyfe@paisley.ac.uk Abstract We show how a previously derived method of

More information

DERIVATIVE-FREE OPTIMIZATION

DERIVATIVE-FREE OPTIMIZATION DERIVATIVE-FREE OPTIMIZATION Main bibliography J.-S. Jang, C.-T. Sun and E. Mizutani. Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice Hall, New Jersey,

More information

Vulnerability of machine learning models to adversarial examples

Vulnerability of machine learning models to adversarial examples Vulnerability of machine learning models to adversarial examples Petra Vidnerová Institute of Computer Science The Czech Academy of Sciences Hora Informaticae 1 Outline Introduction Works on adversarial

More information

EE 589 INTRODUCTION TO ARTIFICIAL NETWORK REPORT OF THE TERM PROJECT REAL TIME ODOR RECOGNATION SYSTEM FATMA ÖZYURT SANCAR

EE 589 INTRODUCTION TO ARTIFICIAL NETWORK REPORT OF THE TERM PROJECT REAL TIME ODOR RECOGNATION SYSTEM FATMA ÖZYURT SANCAR EE 589 INTRODUCTION TO ARTIFICIAL NETWORK REPORT OF THE TERM PROJECT REAL TIME ODOR RECOGNATION SYSTEM FATMA ÖZYURT SANCAR 1.Introductıon. 2.Multi Layer Perception.. 3.Fuzzy C-Means Clustering.. 4.Real

More information

Research Article A New Optimized GA-RBF Neural Network Algorithm

Research Article A New Optimized GA-RBF Neural Network Algorithm Computational Intelligence and Neuroscience, Article ID 982045, 6 pages http://dx.doi.org/10.1155/2014/982045 Research Article A New Optimized GA-RBF Neural Network Algorithm Weikuan Jia, 1 Dean Zhao,

More information

Learning. Learning agents Inductive learning. Neural Networks. Different Learning Scenarios Evaluation

Learning. Learning agents Inductive learning. Neural Networks. Different Learning Scenarios Evaluation Learning Learning agents Inductive learning Different Learning Scenarios Evaluation Slides based on Slides by Russell/Norvig, Ronald Williams, and Torsten Reil Material from Russell & Norvig, chapters

More information

MAXIMUM LIKELIHOOD ESTIMATION USING ACCELERATED GENETIC ALGORITHMS

MAXIMUM LIKELIHOOD ESTIMATION USING ACCELERATED GENETIC ALGORITHMS In: Journal of Applied Statistical Science Volume 18, Number 3, pp. 1 7 ISSN: 1067-5817 c 2011 Nova Science Publishers, Inc. MAXIMUM LIKELIHOOD ESTIMATION USING ACCELERATED GENETIC ALGORITHMS Füsun Akman

More information