
An Algorithm for Incremental Construction of Feedforward Networks of Threshold Units with Real Valued Inputs

Dhananjay S. Phatak
Electrical Engineering Department, State University of New York, Binghamton, NY 13902-6000
email: phatak@ee.binghamton.edu
(Proceedings of WCNN'96, San Diego, CA, pp. 999-1002)

ABSTRACT

A novel algorithm is presented to construct feedforward networks of threshold units. The inputs of the threshold units (including external network inputs) are not restricted to be binary: they can assume any REAL values. The algorithm is derived from the Cascade-Correlation and Adaline algorithms. The cascade architecture is examined from a linear systems perspective, which reveals a connection between the training of hidden units and raising the rank of a set of vectors consisting of the network inputs and the outputs of the hidden units. This leads to our algorithm, which is shown to work successfully on several standard benchmarks. The algorithm can generate a UNIQUE solution (the topology as well as the weight and bias values), irrespective of the random initializations. The algorithm is flexible, and the search-based steps therein can be completely replaced by well known methods from linear algebra, which could lead to a considerable reduction in learning time. Merits and drawbacks of this method are discussed in the context of other relevant approaches. Several further extensions are suggested.

I Introduction

The sigmoid function is commonly used as the nonlinear squashing function of the processing elements (or units) in Feedforward ANNs (Artificial Neural Networks). Reasons for the widespread use of the sigmoid function are (i) it is a one-to-one continuous function with well-defined derivatives of all orders, and (ii) it approximates a step (threshold) function: it can be considered to be a soft step. Most training algorithms involve optimization of an objective function, such as total error in backpropagation, correlation in the Cascade-Correlation [1], etc., with respect to the adjustable parameters (i.e., the weights and biases). The optimization is typically achieved via gradient based methods. Hence, the squashing function of the units must be differentiable. A step function is a highly many-to-one function and is not differentiable everywhere. Therefore a network of threshold units cannot be trained by gradient based methods. From a hardware implementation perspective, however, a step function is highly desirable: it is a lot simpler to realize in hardware than a sigmoid. Hence, several researchers have addressed the construction of feedforward networks of threshold units [3, 5, 6, 9]. The inputs, however, are assumed to be binary in most cases. Most tasks of practical utility, including regression tasks (real valued inputs and continuously variable real outputs) as well as classification tasks (real valued inputs and discrete outputs), specify real valued inputs, as illustrated by several benchmarks [2]. Hence the restriction to binary valued inputs necessitates quantization and binary encoding of real valued inputs. This might lead to a very large number of inputs to the network if the number of quantization levels is high.

This paper presents a method to incrementally build networks of threshold units with real valued inputs. The method is based on the Cascade-Correlation (abbreviated Cascor) [1] and Adaline algorithms. The next section analyses the cascade architecture from a linear systems perspective.

This leads to our algorithm, which is outlined in Section III along with a tabular summary of its performance on several benchmarks. The last section presents conclusions and extensions.

II Cascade Architecture from a Linear Systems Perspective

As mentioned above, the algorithm is similar to the Cascade-Correlation (please refer to [1, 10] for details). We therefore examine the Cascor from a linear systems perspective. In the following, vectors are indicated with an overline (here rendered as a bar), while matrices are indicated by an underline. Assume that there are $P$ patterns (samples) in the training set. Then, the output of unit $k$ (which can be a hidden or an output unit) for pattern $i = 1, \ldots, P$ is denoted by $y_{ki}$ and is given by

$$ y_{ki} = f(u_{ki}), \quad \text{where } u_{ki} = [\text{weighted sum of inputs} + \text{bias}] \text{ for unit } k \text{ and pattern } i, \qquad (1) $$

and $f$ denotes the squashing function of the unit. In vector notation, equation (1) can be rewritten as

$$ \bar{y}_k = f(\bar{u}_k), \qquad (2) $$

where the squashing function $f$ is applied independently to each of the $P$ components of vector $\bar{u}_k$ to obtain vector $\bar{y}_k$. For the purpose of illustration, assume that the problem specified has $n$ inputs and 1 output, without loss of generality (it is easy to extend the ensuing analysis to more than one output, and the final algorithm does handle multiple outputs). Typically, the number of samples $P \gg n$, the number of inputs. Assume the output unit to be linear. Cascor starts off by connecting the $n$ inputs to the output unit and minimizes the total error

$$ E_{total} = \sum_{i=1}^{P} (d_i - y_i)^2, \quad \text{where } d_i \text{ and } y_i \text{ are the desired (target) and actual outputs for pattern } i. \qquad (3) $$

Since the output unit is linear, $E_{total}$ is a quadratic function of the weights, which implies that there is a unique minimum. Any gradient based descent method can lead the search to this unique minimum. Linearity of the output unit implies that minimizing $E_{total}$ in equation (3) is equivalent to solving the linear system

$$ \underline{M}^{(0)} \bar{w}^{(0)} = \bar{d} \qquad (4) $$

in the "least-squares" sense, where

$$ \underline{M}^{(0)} = [\bar{I}_1, \ldots, \bar{I}_{n+1}], \quad \bar{d} = [d_1 \; d_2 \; \ldots \; d_P]^T, \quad \text{and} \quad \bar{w}^{(0)} = [w_1 \; w_2 \; \ldots \; w_{n+1}]^T. \qquad (5) $$

In equation (4), vector $\bar{d}$ represents the target outputs for each of the $P$ samples, $\bar{w}^{(0)}$ is a vector with $n+1$ components corresponding to the bias and the $n$ weights of the output unit, and $\underline{M}^{(0)}$ is a $P \times (n+1)$ dimensional matrix whose columns are the vectors $\bar{I}_1, \ldots, \bar{I}_{n+1}$. The $j$th component of vector $\bar{I}_m$ is the $m$th external input to the network for training sample $j$ (the first column $\bar{I}_1$ corresponds to the bias unit). The minimum norm solution $\bar{w}^{(0)}_{opt}$ (which minimizes $E_{total}$ defined in equation (3)) to the system of equations (4) is unique and is given by

$$ \bar{w}^{(0)}_{opt} = \underline{M}^{(0)\dagger} \bar{d}, \quad \text{where } \underline{M}^{(0)\dagger} \text{ is the Moore-Penrose pseudoinverse of matrix } \underline{M}^{(0)}. \qquad (6) $$

After this training, if the desired error bound is not met, then more units are installed one by one in a cascade topology [1, 10]. Every new unit that gets installed is connected to the output unit, so that after installing $k$ hidden units, minimization of $E_{total}$ in equation (3) is equivalent to solving the linear system

$$ \underline{M}^{(k)} \bar{w}^{(k)} = \bar{d} \qquad (7) $$

in the least-squares sense, where

$$ \underline{M}^{(k)} = [\bar{I}_1, \ldots, \bar{I}_{n+1}, \bar{H}_1, \ldots, \bar{H}_k] \quad \text{and} \quad \bar{w}^{(k)} = [w_1 \; w_2 \; \ldots \; w_{n+1} \; \ldots \; w_{n+1+k}]^T. \qquad (8) $$

In (7), the superscript "(k)" indicates that $k$ hidden units have been installed so far. The $P$ dimensional vectors $\bar{H}_1, \ldots, \bar{H}_k$ correspond to the outputs of the $k$ hidden units:

$$ \bar{H}_j = f(\bar{u}_j), \quad j = 1, \ldots, k. \qquad (9) $$

For instance, the $j$th component of vector $\bar{H}_i$ is the output of hidden unit $i$ for pattern $j$. Note that every hidden unit in effect adds a new column to matrix $\underline{M}$.
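As a concrete illustration of equations (3)-(6), the following minimal numpy sketch solves the linear output-unit training problem via the Moore-Penrose pseudoinverse. The variable names and the synthetic data are illustrative assumptions for this sketch, not part of the original paper.

```python
import numpy as np

# Illustrative (synthetic) data: P samples, n external inputs.
P, n = 200, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(P, n))       # external inputs, one row per pattern
d = rng.normal(size=P)            # target outputs d_1 ... d_P

# M^(0) = [I_1, ..., I_{n+1}]: first column is the bias, the rest are the inputs (Eq. (5)).
M0 = np.column_stack([np.ones(P), X])

# Minimum-norm least-squares solution w_opt^(0) = pinv(M^(0)) d  (Eq. (6)).
w_opt = np.linalg.pinv(M0) @ d

# Total error E_total of Eq. (3), evaluated at the unique minimum of the quadratic objective.
E_total = np.sum((d - M0 @ w_opt) ** 2)
print(E_total)
```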
A close examination of Cascade-Correlation in its original [1] and modified [10] forms indicates that every new hidden unit that is installed raises the rank of the matrix $\underline{M}$, i.e.,

$$ \mathrm{rank}(\underline{M}^{(k)}) = \mathrm{rank}([\underline{M}^{(k-1)} \;\; \bar{H}_k]) > \mathrm{rank}(\underline{M}^{(k-1)}). \qquad (10) $$

As long as the hidden unit's squashing function $f$ (which must be nonlinear) and its weights are such that its output vector $\bar{H}$ is linearly independent of the output vectors of the previously installed hidden units and the inputs to the network, the unit will further reduce the error when it gets installed.
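To make the rank condition of equation (10) concrete, here is a small sketch, assuming numpy and a hypothetical helper name, of the linear-independence test a constructive algorithm could apply to a candidate hidden unit's output vector before installing it.

```python
import numpy as np

def raises_rank(M, H_k):
    """Return True if appending the candidate hidden-unit output vector H_k
    (shape (P,)) as a new column raises the rank of M (shape (P, cols)),
    i.e. H_k is linearly independent of the bias/input columns and of the
    outputs of previously installed hidden units (Eq. (10))."""
    return np.linalg.matrix_rank(np.column_stack([M, H_k])) > np.linalg.matrix_rank(M)
```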

This is the key to the success of any algorithm that generates a cascade architecture: it must utilize the nonlinearity $f$ of the hidden units in such a way that each hidden unit $k$, upon installation, generates an output vector $\bar{H}_k$ which raises the rank of matrix $\underline{M}$. This observation is the foundation of our algorithm, which is presented next.

III The Algorithm and its Performance on Benchmarks

If the task at hand specifies continuously variable real outputs, then the output units are assumed to be linear, i.e., $y = f(u) = u$. Otherwise, for discrete outputs (required for classification tasks), the squashing function of the output units is a threshold or Step function:

$$ y = \mathrm{Step}(u) = \begin{cases} +1 & \text{if } u \ge 0 \\ -1 & \text{otherwise} \end{cases}, \quad \text{where } u = (\text{weighted sum of inputs} + \text{bias}). \qquad (11) $$

All hidden units are threshold units. The algorithm is summarized by the following steps:

Step 1. Connect all inputs to each output unit and train all weights feeding all output units to minimize the objective function

$$ C = \sum_{o} \sum_{i=1}^{P} (d_{oi} - u_{oi})^2, \quad \text{where the sum over } o \text{ covers all outputs}. \qquad (12) $$

Note that this is similar to the Adaline algorithm, where the inputs to the units are used to calculate the weight adjustments. If the specified error tolerance is not met, proceed to the next step.

Step 2. Install hidden units one by one. Each hidden unit $k = 1, 2, \ldots$ is installed in three steps.

2.1 In the first step, the input of the new hidden unit is connected to all (external) network inputs as well as the outputs of all previously installed hidden units. Its output is not connected anywhere in the network. The input side connections of the unit are trained to minimize a discrepancy

$$ D = \sum_{o} \sum_{p=1}^{P} (e_{op} - u_{kp})^2, \quad \text{where } e_{op} = \text{residual error at output unit } o \text{ for pattern } p = y_{op} - d_{op}, \qquad (13) $$

and $u_{kp}$ is the resultant input to hidden unit $k$ (which is being installed) for pattern $p$. Once again, note that the input $u$ of the unit being installed is used to calculate the weight adjustments. Hence, $D$ is a quadratic function of the weights, and minimization of $D$ leads to a unique minimum solution for the weight values (which can also be obtained via the pseudoinverse).

2.2 Now examine whether $\bar{H}_k = f(\bar{u}_k) = \mathrm{Step}(\bar{u}_k)$ is linearly independent of all the network inputs and the outputs of the previous hidden units, i.e., whether the vectors $\{\bar{I}_1, \ldots, \bar{I}_{n+1}, \bar{H}_1, \ldots, \bar{H}_k\}$ form a linearly independent set. If they do, then proceed to part 2.3. Otherwise, try other weight sets (via trial and error or random initializations) till the linear independence condition is met. We would like to point out that in all the benchmark simulations tried so far, including the highly nonlinear and complex two-spirals classification task [1, 2], minimization of $D$ in step 2.1 above has always led to a vector $\bar{H}_k$ that is linearly independent of the previous ones.

2.3 Once the linear independence condition is met, the input side connections of the hidden unit are frozen forever. Its output is now connected to the network output units, and all connections feeding the output units are trained to minimize the objective function $C$ defined in equation (12).

Step 3. Iterate step 2, installing more units one at a time, till the desired error bound is met or some pre-determined maximum number of units is exceeded (which is deemed to be a failure).

We have run this algorithm on several benchmarks. Illustrative results for some benchmarks from the CMU collection [2] are shown in Table 1.

Benchmark                                  | Units (input, output); hidden | No. of non-output units = inputs + 1 (bias) + hidden | Rank of matrix M at successful termination | Test set errors
Two Spirals classification [1, 2]          | (2, 1); 102                   | 105                                                  | 105                                        | 88.66%
Sonar data classification [4]              | (60, 1); 42                   | 103                                                  | 103                                        | 55.77%
Speaker independent vowel recognition [11] | (10, 4); 24                   | 35                                                   | 35                                         | 57.79%

Table 1: Performance of the algorithm on benchmarks from the CMU collection [2]
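The following is a hedged numpy sketch of how Steps 1-3 could be realized with the least-squares machinery of Section II. The function and variable names, the tolerance and unit-cap parameters, and the reduction of the multi-output discrepancy of equation (13) to a fit of the per-pattern mean residual are illustrative assumptions; the trial-and-error retry of step 2.2 is omitted, and this is not the authors' implementation.

```python
import numpy as np

def step(u):
    # Threshold squashing function of Eq. (11): +1 if u >= 0, else -1 (assumed).
    return np.where(u >= 0.0, 1.0, -1.0)

def lsq(M, t):
    # Minimum-norm least-squares weights via the Moore-Penrose pseudoinverse.
    return np.linalg.pinv(M) @ t

def build_cascade(X, T, tol=1e-3, max_hidden=100):
    """Sketch of the incremental construction: X is (P, n) real-valued inputs,
    T is (P, m) targets. Returns the column matrix M (bias, inputs, hidden
    outputs), the frozen hidden-unit weight vectors, and output weights W."""
    P, _ = X.shape
    M = np.column_stack([np.ones(P), X])     # columns I_1 (bias), I_2 ... I_{n+1}
    hidden = []

    # Step 1: train the output layer on the pre-squashing sums (objective C, Eq. (12)).
    W = lsq(M, T)

    while np.sum((T - M @ W) ** 2) > tol and len(hidden) < max_hidden:
        # Step 2.1: train the candidate unit's input weights against the residual
        # errors (objective D, Eq. (13)); for several outputs this amounts to
        # fitting u_k to the per-pattern mean residual.
        e = (M @ W - T).mean(axis=1)
        v = lsq(M, e)                        # input-side weights of the new unit
        H_k = step(M @ v)                    # its output over all P patterns

        # Step 2.2: install the unit only if H_k raises the rank of M
        # (the retry-with-other-weights branch is omitted in this sketch).
        if np.linalg.matrix_rank(np.column_stack([M, H_k])) <= np.linalg.matrix_rank(M):
            break

        # Step 2.3: freeze the unit's input weights, then retrain the output layer.
        hidden.append(v)
        M = np.column_stack([M, H_k])
        W = lsq(M, T)

    return M, hidden, W
```

For a classification task, the trained network's predictions would be Step(M @ W) in the sense of equation (11), rather than the raw linear outputs M @ W used here during training.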
IV Discussion and Conclusion

The algorithm successfully learns the benchmarks listed in Table 1 (and several others, which were omitted for the sake of brevity). It handles real valued inputs, thereby obviating their quantization and binary encoding. Besides the simplicity of threshold units, one of the main advantages of our method is that unique weight values can be obtained at each step, by seeking out the exact minimum of the corresponding objective function (C or D, both of which are quadratics). This implies that, given a problem, it is possible to generate a fixed topology and weights irrespective of the random initializations. In fact, the quadratic minimization can be achieved via the least-squares method. Consequently, gradient calculations can be completely avoided. There are several well known linear algebra packages for least-squares solution, singular value decomposition, etc., which can be easily incorporated in this algorithm.

The main drawbacks (of the current version) are (i) the number of hidden units required is considerably larger than for a corresponding net with sigmoidal units; and (ii) the generalization performance (in terms of percentage errors on the test set) is also not as good as that of a net with sigmoidal units. These outcomes can be expected, since the threshold function is a highly many-to-one and discontinuous function. A smooth continuous function like the sigmoid will naturally give better interpolation than a discontinuous function such as a step.

Several further extensions are possible. Selection of an intermediate objective function (instead of the discrepancy D) which will guarantee that the rank of M gets raised and which will expedite convergence is being investigated. Other possibilities include a mixture of sigmoidal and threshold units along with other nonlinearities f (such as clamped linear, multiple steps, etc.), and layered architectures with restricted depth and fan-in [10].

References

[1] Fahlman, S. E., and Lebiere, C. "The Cascade-Correlation Learning Architecture". In Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed., Morgan Kaufmann, 1990, pp. 524-532.
[2] Fahlman, S. E., et al. Neural Nets Learning Algorithms and Benchmarks Database, maintained at the Computer Science Dept., Carnegie Mellon University.
[3] Frattale Mascioli, F. M., and Martinelli, G. "A constructive algorithm for binary neural networks: the oil-spot algorithm". IEEE Transactions on Neural Networks, vol. 6, no. 3, May 1995, pp. 794-797.
[4] Gorman, R. P., and Sejnowski, T. J. "Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets". Neural Networks, vol. 1, 1988, pp. 75-89.
[5] Gray, D. L., and Michel, A. N. "A training algorithm for binary feedforward neural networks". IEEE Transactions on Neural Networks, vol. 3, 1992, pp. 176-194.
[6] Marchand, M., Golea, M., and Rujan, P. "A convergence theorem for sequential learning in two-layer perceptrons". Europhysics Letters, vol. 11, 1990, pp. 487-492.
[7] Martinelli, G., and Mascioli, F. M. "Cascade Perceptron". IEE Electronics Letters, vol. 28, 1992, pp. 947-949.
[8] Martinelli, G., Mascioli, F. M., and Bei, G. "Cascade neural network for binary mapping". IEEE Transactions on Neural Networks, vol. 4, 1993, pp. 148-150.
[9] Muselli, M. "On sequential construction of binary neural networks". IEEE Transactions on Neural Networks, vol. 6, no. 3, May 1995, pp. 678-690.

[10] Phatak, D. S., and Koren, I. "Connectivity and Performance Tradeoffs in the Cascade Correlation Learning Architecture". IEEE Transactions on Neural Networks, vol. 5, Nov. 1994, pp. 930-935.
[11] Robinson, A. J., and Fallside, F. "A Dynamic Connectionist Model for Phoneme Recognition". In Proc. of nEuro'88, Paris, June 1988.