Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions

ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION, Spring 2002
Assignment 2
Aravind Sundaresan (aravinds@glue.umd.edu)


1. Pattern Classification using Linear Networks

A set of N = 300 training samples was used to train a 3×3 linear network, where the input is the 3-dimensional vector $X = [x/100,\ y/100,\ 0.5]^T$ (the bias is chosen to be 0.5). The LMS algorithm was used to train the weights iteratively. The output is a 3-dimensional vector $Z$; its target value $T$ has the $i$-th element set to 1 if the input is from the $i$-th class and to 0 otherwise. The output of the linear network is calculated as follows.

$$Z = W X, \qquad O = \arg\max_i Z_i,$$

where $X$ is the input and $W$ is the weight matrix.

Strategy: The learning rate needs to be chosen carefully, because large values cause the error to diverge and make the algorithm unstable. The learning rate $\eta(t)$ is a function of the iteration index $t$ and is parameterized by its initial value $\eta_0$. It is a good idea to normalize the input so that the input values lie in $[0, 1]$ or $[-1, 1]$. In the implementation the inputs have been scaled so that they lie in $[0, 1]^d$.

Figure 1.1: The performance of the network for different learning rates
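The LMS training and the arg-max decision rule described above can be sketched as follows. This is a minimal illustration and not the code used for the assignment: the function names, the shuffling of samples, the zero initial weights, and the constant default learning-rate schedule are all assumptions made for the sketch.

```python
import random

def forward(W, x):
    """Linear network output Z = W X (W is a list of rows, x the input vector)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def classify(W, x):
    """Decision rule O = argmax_i Z_i."""
    z = forward(W, x)
    return max(range(len(z)), key=lambda i: z[i])

def lms_train(samples, labels, n_classes=3, eta0=0.008, epochs=20,
              schedule=None, seed=0):
    """Online LMS training; schedule(t) returns the learning rate at step t.

    The constant default schedule is an assumption made for this sketch.
    """
    if schedule is None:
        schedule = lambda t: eta0
    rng = random.Random(seed)
    W = [[0.0] * len(samples[0]) for _ in range(n_classes)]
    t = 0
    for _ in range(epochs):
        order = list(range(len(samples)))
        rng.shuffle(order)
        for n in order:
            t += 1
            x = samples[n]
            z = forward(W, x)
            eta = schedule(t)
            for i in range(n_classes):
                target = 1.0 if labels[n] == i else 0.0  # one-of-three target
                err = target - z[i]
                for j, xj in enumerate(x):
                    # Gradient step on the energy 0.5 * ||Z - T||^2
                    W[i][j] += eta * err * xj
    return W
```

A decaying schedule can be supplied through the `schedule` argument, for example `schedule=lambda t: 0.008 / (1 + t / 400)`, which is only an illustrative choice.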

Results: The rate of convergence of the error for three different values of $\eta_0$ is illustrated in Figure 1.1. Convergence is faster and the error (the energy function, which is set equal to $\frac{1}{2}\|Z - T\|^2$) is lower for higher learning rates, but as observed earlier the algorithm becomes unstable when the learning rate is too high, and the error function diverges. The original configuration and the classification achieved by the linear network after training with learning rate $\eta_0 = 0.008$ are illustrated in Figure 1.2.

Conclusions: The performance of the network is limited by its linearity. As can be observed from Figure 1.2, only linear discrimination can be performed. In this case, where the input is from a 2-dimensional space, the input space is split into regions (classes) separated by lines (hyperplanes in the general case).

Figure 1.2: The output of the network

2. Pattern Classification using Multi-Layer Perceptrons

A set of N = 2000 training samples was used to train a 3-h-1 network (a multi-layer perceptron network) using the back-propagation algorithm. The input is a 3-dimensional vector $X = [x,\ y,\ 50]^T$. The desired output is a scalar that takes the value +1 if the input is from the foreground and −1 if it is from the background.

Strategy: The initial weights are uniformly and independently distributed in $[-0.5C,\ 0.5C]$, where $C$ is a scaling constant that is inversely proportional to the average magnitude of the input. The training rate $\eta$ is calculated as follows.

$$\eta(t) = \frac{\eta_0}{1 + t/400}, \qquad \text{where } \eta_0 = 0.03$$
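The weight initialization and the learning-rate schedule for the MLP can be sketched as below. The schedule uses the values given in the text ($\eta_0 = 0.03$, time constant 400). The proportionality constant `k` and the reading of "average magnitude" as the mean Euclidean norm of the input vectors are assumptions, since the text does not specify them.

```python
import random

def learning_rate(t, eta0=0.03, tau=400.0):
    """eta(t) = eta0 / (1 + t / tau), with eta0 = 0.03 and tau = 400."""
    return eta0 / (1.0 + t / tau)

def init_weights(n_rows, n_cols, inputs, k=1.0, seed=0):
    """Draw weights uniformly and independently from [-0.5*C, 0.5*C].

    C = k / (mean Euclidean norm of the inputs). The constant k is
    hypothetical: the text only says C is inversely proportional to
    the average input magnitude.
    """
    rng = random.Random(seed)
    avg_mag = sum(sum(v * v for v in x) ** 0.5 for x in inputs) / len(inputs)
    C = k / avg_mag
    W = [[rng.uniform(-0.5 * C, 0.5 * C) for _ in range(n_cols)]
         for _ in range(n_rows)]
    return W, C
```

For a 3-h-1 network with h = 25, `init_weights(25, 3, training_inputs)` would give the input-to-hidden weights, where `training_inputs` stands for the list of training vectors.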

The tan-sigmoid function is chosen as the activation function. The activation function and its derivative are calculated as

$$f(x) = 1.7\tanh(0.7x), \qquad f'(x) = 1.19\left(1 - \tanh^2(0.7x)\right).$$

The error function is calculated as

$$J = \sum_{n=1}^{N} \frac{1}{2}\,(Z_n - T_n)^2,$$

where $T_n$ is the desired output and $Z_n$ is the actual output. The weights are updated for every sample input (online training) according to the back-propagation algorithm. The input is not scaled, and therefore the actual weight increment is multiplied by a scaling factor (inversely proportional to the average magnitude of the input vector) to obtain the modified weight increment.

The training strategy is to continue training the network until the training set error is below a predetermined threshold. Since the error functions of both the training set and the validation set may have multiple minima, deciding when to stop becomes complicated if the decision is based on a minimum of either error function. Here, since there is a very clear demarcation between the foreground and the background, the error of the validation set does not attain a minimum even after several iterations. Therefore a good stopping criterion is one based on the value of the training set error. In the MLP network implemented, training is stopped after 2000 iterations or when $J(t) < E_{threshold}$, whichever occurs first.

Results: Table 2.1 illustrates how the validation error varies with the number of hidden units, the stopping criterion being $J(t) < E_{threshold} = 0.045$. Figure 2.1 shows the output of the network (without thresholding) for several values of h.
Hidden units   Stopping iteration   Error of Training Set   Error of Validation Set
10             2,000                0.070052                0.070181
15             2,000                0.066416                0.070210
20             2,000                0.054751                0.058752
25             973                  0.044987                0.049210

Table 2.1: Number of hidden units for optimal performance

As both Table 2.1 and Figure 2.1 indicate, the optimal choice for the number of hidden units seems to be 25. Figure 2.2 and Figure 2.3 illustrate the performance of the network with 25 hidden units.
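The activation function, its derivative, and a single online back-propagation update for the 3-h-1 network of this section can be sketched as follows. This is a hedged illustration rather than the implementation used for the results above: it assumes the same tan-sigmoid is applied at the hidden and output units, that there is no separate bias on the hidden layer, and that `scale` stands for the scaling factor applied to the weight increment. All function names are hypothetical.

```python
import math

def f(x):
    """Activation f(x) = 1.7 tanh(0.7 x)."""
    return 1.7 * math.tanh(0.7 * x)

def f_prime(x):
    """Derivative f'(x) = 1.19 (1 - tanh^2(0.7 x))."""
    return 1.19 * (1.0 - math.tanh(0.7 * x) ** 2)

def mlp_forward(W1, w2, x):
    """Forward pass; returns hidden net inputs, hidden outputs, output net input, output."""
    net_h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    y = [f(a) for a in net_h]
    net_o = sum(w * yi for w, yi in zip(w2, y))
    return net_h, y, net_o, f(net_o)

def backprop_step(W1, w2, x, target, eta, scale=1.0):
    """One online update on 0.5 * (z - target)^2; weights are modified in place.

    Returns the error for this sample measured before the update.
    """
    net_h, y, net_o, z = mlp_forward(W1, w2, x)
    delta_o = (target - z) * f_prime(net_o)
    # Hidden sensitivities use the output weights from before the update.
    delta_h = [delta_o * w2[j] * f_prime(net_h[j]) for j in range(len(w2))]
    for j in range(len(w2)):
        w2[j] += eta * scale * delta_o * y[j]
        for i, xi in enumerate(x):
            W1[j][i] += eta * scale * delta_h[j] * xi
    return 0.5 * (z - target) ** 2
```

Training would call `backprop_step` once per sample and stop after 2000 iterations or once the accumulated error falls below the threshold, as described above.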

Figure 2.1: The performance of the MLP network for different values of h

Figure 2.2: Performance of MLP network for h = 25

Figure 2.3: The error of the MLP network with h = 25

Optimal Brain Damage: Because of the random nature of the initialization process, and possibly other factors, the optimal performance of the MLP network is obtained with more hidden units than may actually be necessary. Thus, some of the weights in the network with the optimal number of hidden units may be superfluous or redundant. These redundant weights or units may be removed by a process called Optimal Brain Damage, which sets to zero the weights that do not affect the output or the performance of the network. This has been implemented in the following manner.

1. Train the network using $h \ge h_{opt}$ hidden units, that is, at least as many as required in the optimal case determined earlier (in this case the number of hidden units is chosen as 25).
2. Determine the saliency of each of the weights in the input-to-hidden layer and set to zero the three weights that have the smallest saliency.
3. Train the network (keeping the discarded weights equal to zero) until the training set error is less than the threshold or until 2000 iterations are completed. If the final error is less than the threshold, there is scope for further pruning: repeat step 2. If the final error is greater than the threshold, it can be concluded that the number of nonzero weights remaining is less than the number necessary: go to step 4.
4. Use the most recent weight vector that gave a training set error less than the threshold.

Using an initial value of h = 25 and pruning the weights with $E_{threshold} = 0.045$, we ended up with a network that had 45 nonzero weights and 20 hidden units. The performance of the pruned network is illustrated in Figure 2.4. The results of the pruning are summarized in Table 2.2. The number of weights has been reduced by 40%, and 5 (20%) of the hidden units have been removed.

                 Hidden units   Weights   Error of Training Set   Error of Validation Set
Before Pruning   25             75        0.04499                 0.04672
After Pruning    20             45        0.04500                 0.04843

Table 2.2: Summary of the pruning

Figure 2.4: Performance of MLP network after pruning

3. Function Approximation using Radial Basis Functions

The objective is to train an RBF network using N = 1000 sample points. Although the input is a 3-dimensional vector as before (giving a 3-h-1 network), the bias does not make any difference, because the bias of every "function centre" is the same as the bias of the input and therefore cancels in the distance $\|x - x_i\|$.

Strategy: The strategy is to select the function centres randomly from the training set. The function used in the network is the inverse multiquadric basis function, defined as

$$\varphi_i(x) = \frac{1}{\sqrt{1 + \|x - x_i\|^2 / \sigma^2}},$$

where $x$ is the input and $x_i$ is the function centre.

The "variance" or spread, $\sigma$, is set according to the number of function centres chosen (the hidden units). The experiment is repeated for different values of $h$, the number of hidden units. The value of $\sigma$ for a given value of $h$ is calculated as follows ($h$ is proportional to the ratio of the area of the domain of the mapping to $\sigma^2$).

$$\sigma = 0.7\sqrt{\frac{100^2}{h}}$$

The weights $W$ are determined iteratively using the LMS algorithm. The weights are trained until the validation set error has increased for 3 consecutive epochs or the number of iterations exceeds 200. The network is trained and the results are compared for different values of $h$.

Figure 3.1: Performance of RBF network for different values of h

Hidden units   σ       Error of Training Set   Error of Validation Set
40             11.07   0.036770                0.044179
60             9.04    0.028014                0.028211
80             7.83    0.024689                0.025554
100            7.00    0.027671                0.02591

Table 3.1: Performance of RBF network for different values of h
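The spread formula, the basis function, and the LMS update of the output weights can be sketched as below. The spread values it produces agree with the $\sigma$ column of Table 3.1 (for example about 11.07 for h = 40). The basis function is assumed to take the inverse multiquadric form $1/\sqrt{1 + \|x - x_i\|^2/\sigma^2}$, and the function names are hypothetical.

```python
import math

def spread(h, side=100.0, k=0.7):
    """sigma = 0.7 * sqrt(side^2 / h) for a side-by-side input domain."""
    return k * math.sqrt(side * side / h)

def inv_multiquadric(x, centre, sigma):
    """phi_i(x) = 1 / sqrt(1 + ||x - x_i||^2 / sigma^2) (assumed form)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, centre))
    return 1.0 / math.sqrt(1.0 + d2 / (sigma * sigma))

def rbf_output(weights, centres, sigma, x):
    """Network output: weighted sum of the basis-function responses."""
    return sum(w * inv_multiquadric(x, c, sigma)
               for w, c in zip(weights, centres))

def lms_step(weights, centres, sigma, x, target, eta):
    """One LMS update of the output weights; centres and sigma stay fixed.

    Returns the output error measured before the update.
    """
    phi = [inv_multiquadric(x, c, sigma) for c in centres]
    err = target - sum(w * p for w, p in zip(weights, phi))
    for i, p in enumerate(phi):
        weights[i] += eta * err * p
    return err
```

Because only the output weights are adapted, each step is a linear LMS update on the fixed feature vector of basis-function responses.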

Results: The results for different values of h are listed in Table 3.1 and the respective outputs of the network are illustrated in Figure 3.1. The performance of the network for h = 80 is illustrated in Figure 3.2 and Figure 3.3. The RBF network performs rather poorly because we do not train the function centres or the "variance" of the radial basis functions. Training these parameters using the EM algorithm or the gradient descent algorithm should result in much better performance. Besides, the performance of the RBF network depends heavily on the choice of the radial basis function, and the network is better suited to (smooth) function approximation than to the current scenario. The RBF network is not able to define the boundary regions sharply because of the inherent smoothness of the basis function.

Figure 3.2: Performance of RBF network for h = 80

Figure 3.3: The error of the RBF network with h = 80

4. Optical Character Reader

To implement an OCR we require a multi-output, multi-layer network. The input is a 16×16 grayscale image and a bias. The simplest network architecture would have 257 input nodes, h hidden units, and 10 output nodes: a 257-h-10 MLP network.

Strategy: The training set can be obtained by using manufactured data that provides for translational, rotational, and scale invariance in the network. The target output is set as follows.

$$T_i = \begin{cases} +1, & \text{input} \in \text{class } i \\ -1, & \text{input} \notin \text{class } i \end{cases}$$

The network is trained using the manufactured data. The manufactured data has a translation (in pixels) that is uniformly distributed in $[-1.5, 1.5]$, a rotation (in degrees) that is uniformly distributed in $[-9, 9]$, and a scale factor that is uniformly distributed in $[0.9, 1.1]$. A subset of the training set is presented in Figure 4.1. The output of the neural network is chosen as follows.

$$O = \arg\max_i Z_i$$
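The generation of manufactured data, the target encoding, and the decision rule can be sketched as follows. The parameter ranges are those stated above. Everything else is an assumption for illustration: the transformation is applied about the image centre, the matrix is used as the map from output pixels to source coordinates, and resampling is nearest-neighbour with zero fill.

```python
import math
import random

def sample_affine(rng, size=16):
    """Sample a random 2x3 affine matrix for a size-by-size image.

    Translation ~ U[-1.5, 1.5] pixels, rotation ~ U[-9, 9] degrees,
    scale ~ U[0.9, 1.1], with rotation and scaling about the image centre.
    """
    tx, ty = rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5)
    theta = math.radians(rng.uniform(-9.0, 9.0))
    s = rng.uniform(0.9, 1.1)
    c = (size - 1) / 2.0
    a, b = s * math.cos(theta), s * math.sin(theta)
    return [[a, -b, c - a * c + b * c + tx],
            [b, a, c - b * c - a * c + ty]]

def warp(image, M):
    """Resample a square image (list of rows) with nearest-neighbour lookup.

    M maps each output pixel (col, row) to source coordinates; pixels that
    fall outside the source image are set to 0.
    """
    n = len(image)
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for col in range(n):
            u = M[0][0] * col + M[0][1] * r + M[0][2]
            v = M[1][0] * col + M[1][1] * r + M[1][2]
            su, sv = int(round(u)), int(round(v))
            if 0 <= su < n and 0 <= sv < n:
                out[r][col] = image[sv][su]
    return out

def target_vector(digit, n_classes=10):
    """T_i = +1 for the true class and -1 for every other class."""
    return [1.0 if i == digit else -1.0 for i in range(n_classes)]

def decide(z):
    """O = argmax_i Z_i."""
    return max(range(len(z)), key=lambda i: z[i])
```

A manufactured sample for a digit image `img` would then be `warp(img, sample_affine(random.Random(seed)))`, paired with `target_vector(digit)`.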

Training is continued for 1000 iterations or until the number of misclassified samples in the validation set remains consistently higher than the sum of the minimum value achieved and a threshold.

Figure 4.1: Manufactured data for rotational, translational, and scale invariance

Dimensionality reduction using PCA: In the previous case the input dimension is rather large, which increases the computation because the number of weights to be trained depends on the number of input nodes. If the image can be represented by a smaller vector, training becomes much less computationally intensive. To this end, the input vector can be transformed using Principal Component Analysis. An estimate of the autocorrelation matrix is obtained from the training set data and, using this estimate, the k principal eigenvectors (the eigenvectors corresponding to the k largest eigenvalues) are obtained. The projections of the input vector onto these k components are packed into a k-dimensional vector, which retains as much information as is necessary to correctly identify the digit. This has the additional advantage that some noise (unnecessary information) is also filtered out, which results in better performance. In the implementation k is set to 30. Thus, including the bias, the dimension of the input vector is 31.

Results: The results of the training for both the normal case and the PCA case are summarized in Table 4.1. The performances of the normal and PCA cases are also illustrated in Figure 4.2 and Figure 4.3 respectively.

Type     Input Dimension   Hidden units   Iterations   Misclassified Samples   Error of Training Set   Error of Validation Set
Normal   257               30             1000         0.60%                   0.1003                  0.4595
PCA      31                30             530          0.20%                   0.1305                  0.2292

Table 4.1: Summary of performances for Normal and PCA cases
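The PCA step described in this section (estimate the autocorrelation matrix, take the k principal eigenvectors, project the input and append the bias) can be sketched as follows. Power iteration with deflation is a technique chosen here only to keep the sketch free of external libraries; the text does not say how the eigenvectors were computed, and the default bias value of 1.0 is likewise an assumption.

```python
import random

def autocorrelation(data):
    """R = (1/N) * sum_n x_n x_n^T (no mean removal, as described in the text)."""
    d, N = len(data[0]), len(data)
    R = [[0.0] * d for _ in range(d)]
    for x in data:
        for i in range(d):
            for j in range(d):
                R[i][j] += x[i] * x[j] / N
    return R

def _matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def principal_eigenvectors(R, k, iters=500, seed=0):
    """Top-k eigenvectors of symmetric R via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(R)
    A = [row[:] for row in R]
    vecs = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        n0 = sum(c * c for c in v) ** 0.5
        v = [c / n0 for c in v]
        for _ in range(iters):
            w = _matvec(A, v)
            norm = sum(c * c for c in w) ** 0.5
            if norm < 1e-12:  # nothing left to extract
                break
            v = [c / norm for c in w]
        lam = sum(a * b for a, b in zip(v, _matvec(A, v)))
        vecs.append(v)
        # Deflate: subtract the component just found.
        for i in range(d):
            for j in range(d):
                A[i][j] -= lam * v[i] * v[j]
    return vecs

def project(x, vecs, bias=1.0):
    """k projections of x onto the eigenvectors, followed by the bias input."""
    return [sum(a * b for a, b in zip(x, v)) for v in vecs] + [bias]
```

With k = 30 eigenvectors of the 256-dimensional image vectors, `project` returns the 31-dimensional network input. For inputs of that size a library eigen-solver would normally replace the pure-Python power iteration.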

Figure 4.2: Performance of the network: Direct Input

Figure 4.3: Performance of the network: PCA

As can be seen, using PCA to reduce the dimension of the input leads to far better performance (both in terms of speed of convergence and validation set error), with the number of misclassified samples in the validation set falling as low as 0.20% (2 in 1000 samples). In a more general setting it may be a good idea to use a general transformation such as the DCT and select the low-frequency components to represent the image.

5. References

1. Yann Le Cun, John S. Denker, and Sara A. Solla, Optimal Brain Damage. AT&T Bell Laboratories, NJ.
2. Richard Duda, Peter Hart, and David Stork, Pattern Classification. Wiley Interscience, New York, 2001.