Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions


ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION, Spring 2002
Assignment 2: Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions
Aravind Sundaresan (aravinds@glue.umd.edu)


1. Pattern Classification using Linear Networks

A set of N = 300 training samples was used to train a 3×3 linear network, where the input is the 3-dimensional vector X = [x/100, y/100, 0.5]^T (the bias is chosen to be 0.5). The LMS algorithm was used to train the weights iteratively. The desired output is a 3-dimensional vector whose i-th element is set to 1 if the input is from the i-th class and to 0 otherwise. The output Z of the linear network and the class decision O are calculated as

Z = W X,    O = argmax_i Z_i,

where X is the input and W is the weight matrix.

Strategy: The learning rate needs to be chosen carefully, since large values cause the error to diverge and make the algorithm unstable. The learning rate is a function of the iteration index t, given by η(t) = η_0 / t. It is a good idea to normalize the input so that its values lie in [0, 1] or [−1, 1]; in this implementation the inputs have been scaled to lie in [0, 1]^d. A minimal sketch of this LMS training loop is given after Figure 1.1.

Figure 1.1: Performance of the network for different learning rates
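To make the training loop concrete, here is a minimal NumPy sketch of the LMS update for the 3×3 linear network, assuming the scaled input X = [x/100, y/100, 0.5]^T and the decaying learning rate η(t) = η_0/t described above. The data-handling names (`samples`, `labels`) and the epoch count are placeholders, not part of the original implementation.

```python
import numpy as np

def train_lms(samples, labels, eta0=0.008, epochs=50):
    """LMS training of a 3x3 linear network Z = W X.

    samples: (N, 2) array of raw (x, y) points; labels: (N,) ints in {0, 1, 2}.
    Inputs are scaled by 1/100 and a constant bias of 0.5 is appended,
    matching X = [x/100, y/100, 0.5]^T in the text.
    """
    X = np.hstack([samples / 100.0, 0.5 * np.ones((len(samples), 1))])  # (N, 3)
    T = np.eye(3)[labels]        # targets: 1 for the true class, 0 otherwise
    W = np.zeros((3, 3))         # weight matrix
    for t in range(1, epochs + 1):
        eta = eta0 / t           # decaying learning rate eta(t) = eta0 / t
        for x, target in zip(X, T):
            z = W @ x                              # network output Z = W X
            W += eta * np.outer(target - z, x)     # LMS (delta-rule) update
    return W

def classify(W, samples):
    X = np.hstack([samples / 100.0, 0.5 * np.ones((len(samples), 1))])
    return np.argmax(X @ W.T, axis=1)              # O = argmax_i Z_i
```

The per-sample update and the decaying step size mirror the strategy above; if η_0 is chosen too large, the error diverges, as noted in the text.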

Results: The rate of convergence of the error for three different values of η_0 is illustrated in Figure 1.1. Convergence is faster and the error (the energy function, defined as E = (1/2) ‖Z − T‖²) is smaller for higher learning rates, but, as noted above, the algorithm becomes unstable and the error diverges if the learning rate is too high. The original configuration and the classification achieved by the linear network after training with η_0 = 0.008 are illustrated in Figure 1.2.

Conclusions: The performance of the network is clearly limited by its linearity. As Figure 1.2 shows, only linear discrimination can be performed: in this case, where the input comes from a 2-dimensional space, the input space is split into regions (classes) separated by lines (hyperplanes in the general case).

Figure 1.2: The output of the network

2. Pattern Classification using Multi-Layer Perceptrons

A set of N = 2000 training samples was used to train a 3-h-1 network (multi-layer perceptron) using the back-propagation algorithm. The input is the 3-dimensional vector X = [x, y, 50]^T. The desired output is a scalar which takes the value +1 if the input is in the foreground and −1 if it is from the background.

Strategy: The initial weights are uniformly (and independently) distributed in [−0.5C, 0.5C], where C is a scaling constant that is inversely proportional to the average magnitude of the input. The training rate η is calculated as

η(t) = η_0 / (1 + t/400), where η_0 = 0.03.

The tan-sigmoid function is chosen as the activation function. The activation function and its derivative are

f(x) = 1.7 tanh(0.7x),    f'(x) = 1.19 (1 − tanh²(0.7x)).

The error function is calculated as

J = (1/N) Σ_{n=1}^{N} (1/2) (Z_n − T_n)²,

where T_n is the desired output and Z_n is the actual output. The weights are updated for every sample input (online training) according to the back-propagation algorithm. The input is not scaled, and therefore a scaling factor (inversely proportional to the average magnitude of the input vector) is multiplied with the weight increment to obtain the modified weight increment.

The training strategy is to continue training the network until the training-set error falls below a predetermined threshold. Since the error functions of both the training set and the validation set may have multiple minima, a stopping decision based on a minimum of either error function can in general be quite complicated. Here, since there is a very clear demarcation between the foreground and the background, the validation-set error does not reach a minimum even after several iterations, so a good stopping criterion can be based on the training-set error alone. In the MLP network implemented, training is stopped after 2000 iterations or when J_t ≤ E_threshold, whichever occurs first. A minimal sketch of the online back-propagation update is given below, after Table 2.1.

Results: Table 2.1 shows how the validation error varies with the number of hidden units, the stopping criterion being J_t ≤ E_threshold = 0.045. Figure 2.1 shows the output of the network (without thresholding) for several values of h.

Hidden units   Stopping iteration   Error of Training Set   Error of Validation Set
10             2,000                0.070052                0.070181
15             2,000                0.066416                0.070210
20             2,000                0.054751                0.058752
25             973                  0.044987                0.049210
Table 2.1: Number of hidden units for optimal performance

As both Table 2.1 and Figure 2.1 indicate, the optimal choice for the number of hidden units appears to be 25. Figure 2.2 and Figure 2.3 illustrate the performance of the network with 25 hidden units.
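To make the online update concrete, the following is a minimal NumPy sketch of one back-propagation step for the 3-h-1 network with the tan-sigmoid activation f(x) = 1.7 tanh(0.7x). The variable names are illustrative, and the extra scaling factor on the weight increment mentioned above is omitted for brevity.

```python
import numpy as np

def f(a):
    """Tan-sigmoid activation f(a) = 1.7 tanh(0.7 a)."""
    return 1.7 * np.tanh(0.7 * a)

def f_prime(a):
    """Derivative f'(a) = 1.19 (1 - tanh^2(0.7 a))."""
    return 1.19 * (1.0 - np.tanh(0.7 * a) ** 2)

def backprop_step(W1, W2, x, t, eta):
    """One online back-propagation update for a 3-h-1 MLP.

    W1: (h, 3) input-to-hidden weights; W2: (h,) hidden-to-output weights;
    x: 3-vector input [x, y, 50]; t: target in {+1, -1}; eta: learning rate.
    Returns the updated weights and the sample error 0.5 (z - t)^2.
    """
    # Forward pass
    a1 = W1 @ x                  # hidden pre-activations, shape (h,)
    y1 = f(a1)                   # hidden activations
    a2 = W2 @ y1                 # output pre-activation (scalar)
    z = f(a2)                    # network output

    # Backward pass for the squared-error loss 0.5 (z - t)^2
    delta2 = (z - t) * f_prime(a2)           # output delta
    delta1 = f_prime(a1) * (W2 * delta2)     # hidden deltas, shape (h,)

    # Gradient-descent weight updates
    W2 = W2 - eta * delta2 * y1
    W1 = W1 - eta * np.outer(delta1, x)
    return W1, W2, 0.5 * (z - t) ** 2
```

Training would loop over the samples with η(t) = η_0/(1 + t/400), η_0 = 0.03, stopping after 2000 iterations or when J_t ≤ E_threshold, as described above.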

Figure 2.1: Performance of the MLP network for different values of h

Figure 2.2: Performance of the MLP network for h = 25

Figure 2.3: The error of the MLP network with h = 25

Optimal Brain Damage: Because of the random nature of the initialization process, and possibly other factors, the optimal performance of the MLP network is obtained with more hidden units than may actually be necessary. Thus, some of the weights in the network with the optimal number of hidden units may be superfluous or redundant. These redundant weights or units may be removed by a process called Optimal Brain Damage [1], which sets to zero the weights that do not affect the output or the performance of the network. This has been implemented in the following manner (a sketch of the pruning loop is given after the list):

1. Train the network using h ≥ h_opt hidden units, i.e., at least as many as required in the optimal case determined earlier (in this case the number of hidden units is chosen as 25).
2. Determine the saliency of each of the weights in the input-to-hidden layer and set to zero the three weights with the smallest saliency.
3. Train the network (keeping the discarded weights equal to zero) until the training-set error is less than the threshold or until 2000 iterations are completed. If the final error is less than the threshold, there is scope for further pruning: repeat step 2. If the final error is greater than the threshold, the number of non-zero weights has fallen below the number necessary: go to step 4.
4. Use the most recent weight vector that gave a training-set error below the threshold.

Using an initial value of h = 25 and pruning the weights with E_threshold = 0.045, we ended up with a network that had 45 non-zero weights and 20 hidden units.
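The following is a rough sketch of the pruning loop above. The assignment does not spell out how the saliency was computed, so, as a simplifying assumption, this sketch uses the squared weight magnitude in place of the Hessian-based saliency of Le Cun et al. [1]; the `train` and `error_fn` callables are placeholders for the back-propagation routine and the training-set error J.

```python
import numpy as np

def prune_obd(W1, W2, train, error_fn, e_threshold=0.045, prune_per_round=3):
    """Iterative pruning of input-to-hidden weights (steps 1-4 above).

    train(W1, W2, mask) -> retrains while keeping masked-out weights at zero
                           and returns the updated (W1, W2).
    error_fn(W1, W2)    -> training-set error J.
    """
    mask = np.ones_like(W1, dtype=bool)          # True = weight still active
    best = (W1.copy(), W2.copy())                # last weights meeting the threshold
    while True:
        # Step 2: zero the active weights with the smallest (approximate) saliency
        saliency = np.where(mask, W1 ** 2, np.inf)
        for idx in np.argsort(saliency, axis=None)[:prune_per_round]:
            mask.flat[idx] = False
        W1 = np.where(mask, W1, 0.0)
        # Step 3: retrain with the discarded weights clamped to zero
        W1, W2 = train(W1, W2, mask)
        if error_fn(W1, W2) <= e_threshold:
            best = (W1.copy(), W2.copy())        # scope for further pruning: repeat
        else:
            return best                          # Step 4: fall back to the last good weights
```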

The performance of the pruned network is illustrated in Figure 2.4, and the results of the pruning are summarized in Table 2.2. The number of weights has been reduced by 40%, and 5 of the hidden units (20%) have been removed.

                 Hidden units   Weights   Error of Training Set   Error of Validation Set
Before pruning   25             75        0.04499                 0.04672
After pruning    20             45        0.04500                 0.04843
Table 2.2: Summary of the pruning

Figure 2.4: Performance of the MLP network after pruning

3. Function Approximation using Radial Basis Functions

The objective is to train an RBF network using N = 1000 sample points. Though the input is a 3-dimensional vector as before, the bias does not make any difference here, because the bias component of every function centre is the same as the bias component of the input.

Strategy: The strategy is to randomly select the function centres from the training set. The basis function used in the network is the inverse multiquadric, defined as

φ_i(x) = 1 / √(1 + ‖x − x_i‖² / σ²),

where x is the input and x_i is the i-th function centre.

The spread ("variance") σ is set according to the number of function centres (hidden units) chosen, and the experiment is repeated for different values of h. For a given h, σ is calculated as

σ = 0.7 × 100 / √h

(h is proportional to the ratio of the area of the domain of the mapping to σ²). The weights W are determined iteratively using the LMS algorithm; training continues until the validation-set error increases continuously for 3 epochs or the number of iterations exceeds 200. The network is trained and the results are compared for different values of h; a short sketch of this construction follows Table 3.1.

Figure 3.1: Performance of the RBF network for different values of h

Hidden units   σ       Error of Training Set   Error of Validation Set
40             11.07   0.036770                0.044179
60             9.04    0.028014                0.028211
80             7.83    0.024689                0.025554
100            7.00    0.027671                0.02591
Table 3.1: Performance of the RBF network for different values of h
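The following NumPy sketch illustrates this construction: function centres drawn at random from the training set, the inverse-multiquadric design matrix with σ = 0.7 × 100/√h, and an LMS loop for the output weights. The early-stopping rule on the validation error is omitted for brevity, and the names are illustrative rather than taken from the original code.

```python
import numpy as np

def phi(X, centres, sigma):
    """Inverse multiquadric: phi_i(x) = 1 / sqrt(1 + ||x - x_i||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)   # (N, h) squared distances
    return 1.0 / np.sqrt(1.0 + d2 / sigma ** 2)

def train_rbf(X, T, h, eta=0.05, max_epochs=200, seed=0):
    """Pick h centres from the training set and train the output weights with LMS."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=h, replace=False)]   # random function centres
    sigma = 0.7 * 100.0 / np.sqrt(h)                          # spread from the formula above
    Phi = phi(X, centres, sigma)                              # (N, h) design matrix
    w = np.zeros(h)
    for _ in range(max_epochs):
        for p, t in zip(Phi, T):
            w += eta * (t - p @ w) * p                        # online LMS update
    return centres, sigma, w

def predict(X, centres, sigma, w):
    return phi(X, centres, sigma) @ w
```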

Results: The results for different values of h are listed in Table 3.1, and the corresponding outputs of the network are illustrated in Figure 3.1. The performance of the network for h = 80 is illustrated in Figure 3.2 and Figure 3.3. The RBF network performs rather poorly because we do not train the function centres or the "variance" of the radial basis functions; training these parameters using the EM algorithm or gradient descent should give much better performance. Moreover, the performance of the RBF network depends heavily on the choice of radial basis function and is better suited to (smooth) function approximation than to the current scenario: the network cannot define the boundary regions sharply because of the inherent smoothness of the basis function.

Figure 3.2: Performance of the RBF network for h = 80

Figure 3.3: The error of the RBF network with h = 80

4. Optical Character Reader

To implement an OCR we require a multi-output, multi-layer network. The input is a 16×16 grayscale image and a bias, so the simplest network architecture has 257 input nodes, h hidden units, and 10 output nodes: a 257-h-10 MLP network.

Strategy: The training set is obtained using manufactured data that provides the network with translational, rotational, and scale invariance. The target output is set as

T_i = +1 if the input is from class i; T_i = −1 otherwise.

The network is trained using the manufactured data, which has a translation (in pixels) uniformly distributed in [−1.5, 1.5], a rotation (in degrees) uniformly distributed in [−9, 9], and a scale factor uniformly distributed in [0.9, 1.1]. A subset of the training set is presented in Figure 4.1, and a small sketch of the data manufacturing step is given after the figure. The output of the neural network is chosen as

O = argmax_i Z_i.

Training is continued for 1000 iterations or until the number of misclassified validation-set samples remains consistently higher than the minimum value achieved plus a threshold.

Figure 4.1: Manufactured data for rotational, translational, and scale invariance
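Below is an illustrative Python/SciPy sketch of how such manufactured samples could be generated; the original implementation is not shown in the report, so the function names, the interpolation order, and the crop-and-pad handling are assumptions rather than the author's code.

```python
import numpy as np
from scipy import ndimage

def manufacture(image, rng):
    """Randomly shift, rotate, and scale a 16x16 digit image.

    Translation in [-1.5, 1.5] pixels, rotation in [-9, 9] degrees,
    scale factor in [0.9, 1.1], as described in the text.
    """
    shift = rng.uniform(-1.5, 1.5, size=2)
    angle = rng.uniform(-9.0, 9.0)
    scale = rng.uniform(0.9, 1.1)

    out = ndimage.shift(image, shift, order=1, mode='constant')
    out = ndimage.rotate(out, angle, reshape=False, order=1, mode='constant')
    out = ndimage.zoom(out, scale, order=1)

    # Centre the scaled image on a 16x16 canvas (crop if larger, pad if smaller)
    canvas = np.zeros((16, 16))
    h, w = out.shape
    src_r, src_c = max(0, (h - 16) // 2), max(0, (w - 16) // 2)
    dst_r, dst_c = max(0, (16 - h) // 2), max(0, (16 - w) // 2)
    rows, cols = min(16, h), min(16, w)
    canvas[dst_r:dst_r + rows, dst_c:dst_c + cols] = out[src_r:src_r + rows, src_c:src_c + cols]
    return canvas

# Example: generate five perturbed copies of each training image
# rng = np.random.default_rng(0)
# augmented = [manufacture(img, rng) for img in training_images for _ in range(5)]
```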

Dimensionality reduction using PCA: In the previous case the input dimension is rather large, which increases the computational load since the number of weights to be trained depends on the number of input nodes. If the image can be represented by a smaller vector, training becomes much less computationally intensive. To this end, the input vector can be transformed using Principal Component Analysis: an estimate of the autocorrelation matrix is obtained from the training data, and from this estimate the k principal eigenvectors (those corresponding to the largest eigenvalues) are computed. The projections of the input vector onto these k components are packed into a k-dimensional vector, which retains as much information as is necessary to correctly identify the digit. An additional advantage is that some noise (unnecessary information) is filtered out, which results in better performance. In the implementation k is set to 30; including the bias, the dimension of the input vector is therefore 31. A short sketch of this projection appears after Table 4.1.

Results: The results of the training for both the direct-input (normal) case and the PCA case are summarized in Table 4.1, and the corresponding performances are illustrated in Figure 4.2 and Figure 4.3, respectively.

Type     Input dimension   Hidden units   Iterations   Misclassified samples   Error of Training Set   Error of Validation Set
Normal   257               30             1000         0.60%                   0.1003                  0.4595
PCA      31                30             530          0.20%                   0.1305                  0.2292
Table 4.1: Summary of performances for the normal and PCA cases
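As an illustration of the projection described above, the following NumPy sketch estimates the autocorrelation matrix from the training images, keeps the k = 30 principal eigenvectors, and builds the reduced 31-dimensional network input. The bias value appended here (1.0) is an assumption; the report only states that a bias is included.

```python
import numpy as np

def pca_basis(X, k=30):
    """k principal eigenvectors of the autocorrelation matrix estimated from X.

    X: (N, 256) matrix of vectorised 16x16 training images.
    Returns a (256, k) matrix whose columns are the principal eigenvectors.
    """
    R = X.T @ X / len(X)                      # estimated autocorrelation matrix
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    return eigvecs[:, order[:k]]

def project(X, V, bias=1.0):
    """Project images onto the principal components and append the bias term."""
    Z = X @ V                                             # (N, k) reduced representation
    return np.hstack([Z, bias * np.ones((len(X), 1))])    # (N, k + 1) network input
```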

Figure 4.2: Performance of the network (direct input)

Figure 4.3: Performance of the network (PCA input)

As can be seen, using PCA to reduce the input dimension leads to far better performance, both in speed of convergence and in validation-set error, with the number of misclassified validation samples falling as low as 0.20% (2 in 1000 samples). In a more general setting it may also be a good idea to use a fixed transform such as the DCT and select the low-frequency components to represent the image.

5. References

1. Yann Le Cun, John S. Denker, and Sara A. Solla, "Optimal Brain Damage." AT&T Bell Laboratories, NJ.
2. Richard Duda, Peter Hart, and David Stork, Pattern Classification. Wiley-Interscience, New York, 2001.