Extensive research has been conducted, aimed at developing

Size: px

Start display at page:

Download "Extensive research has been conducted, aimed at developing"

Dorthy Sparks
5 years ago
Views:

1 Chapter 4 Supervised Learning: Multilayer Networks II Extensive research has been conducted, aimed at developing improved supervised learning algorithms for feedforward networks. 4.1 Madalines A \Madaline" is a combination of many Adalines, trained an algorithm that follows the \Minimum Disturbance" principle. Learning algorithms for Madalines have gone through three stages of development: MRI (Madaline Rule I), MRII and MRIII. 1

2 2 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II Adaline x 1 OR x 2 Adaline Figure 4.1: Madaline I network with two Adalines \Madaline I" has an adaptive hidden layer and a xed logic device (AND/OR/majority gate) at the outer layer. The goal of its training procedure is to decrease network error (at each input presentation) by making the smallest possible perturbation to the weights of some Adaline. Note that the Adaline output must change in sign in order to have any eect on the network output. If the net input into an Adaline is large, its output reversal requires a large amount ofchange in its incoming weights. Therefore, the algorithm attempts to nd an existing Adaline (hidden node) whose net input is smallest in magnitude, and whose output reversal would reduce

3 4.1. MADALINES 3 network error. This output reversal is accomplished by applying the LMS algorithm to weights on connections leading into that Adaline. The number of Adalines for which output reversal is necessary depends on the choice of the outermost node function. If the outermost node computes the OR of its inputs, then network output can be changed from 0 to 1by modifying a single Adaline, but changing network output from 1 to 0 requires modifying all the Adalines for which current outputis1. * ::::::::::::::::::::::::::::::::::::::::::::::::::* Madaline II architecture uses Adalines with modiable weights at the output layer of the network, instead of xed logic devices. The weights are initialized to small random values, and training patterns are repeatedly presented. The algorithm successively modies weights in earlier layers before later ones, using the LMS or -LMS rule. Adalines whose weights have already been adjusted

4 4 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II Adaline Adaline Adaline Figure 4.2: Madaline II with one hidden layer may be revisited, when new input patterns are presented. It is possible to adjust the weights of several Adalines simultaneously. Since the node functions are not dierentiable, there can be no explicit gradient descent. The algorithm may repeatedly cycle through the same sequence of states. * ::::::::::::::::::::::::::::::::::::::::::::::::::* Madaline III networks are feedforward, with sigmoid node functions. Weights of all nodes are adapted in each iteration, using the MRIII training algorithm. Although the approach is equivalent to backpropagation, each weight change involves considerably more computation.

5 4.1. MADALINES 5 Algorithm MRIII repeat Present a training pattern i to the network Compute node outputs for h = 1 to the number of hidden layers, do for each nodek in layer h, do Let E be the current network error Let network error be E k when >0 (small) is added to the kth node's net input Adjust weights on connections to kth node using or w = ;i(e 2 k ; E2 )= w = ;ie(e k ; E)= end-for end-for until network error is satisfactory or upper bound on no. of iterations is reached. Figure 4.3: MRIII Training Algorithm for Madaline III

6 6 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II * ::::::::::::::::::::::::::::::::::::::::::::::::::* 4.2 Adaptive Multilayer Networks Small networks are more desirable than large networks that perform the same task because: 1. training is faster, 2. the number of parameters in the system is smaller, 3. fewer training samples are needed, 4. the system is more likely to generalize well for new test samples. There are three approaches to build a network of \optimal" size: 1. A large network may be built and then \pruned" by eliminating nodes and connections that can be considered unimportant. 2. Starting with a very small network, the size of the network is repeatedly increased by small increments

7 4.2. ADAPTIVE MULTILAYER NETWORKS 7 until performance is satisfactory. 3. A suciently large network is trained, and unimportant connections and nodes are then pruned, following which new nodes with random weights are re-introduced and the network is retrained. This process is continued until a network of acceptable size and performance levels is obtained, or further pruning attempts become unsuccessful. These methods are based on sound heuristics, but are not guaranteed to result in an optimal size network. * ::::::::::::::::::::::::::::::::::::::::::::::::::* Network pruning algorithms train a large network and then repeatedly `prune' it until size and performance of the network are both satisfactory. Pruning algorithms use penalties to help determine if some of the nodes and links in a trained network can be removed, leading to a network with better generalization properties. A generic algorithm for such methods is described.

8 8 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II Algorithm Network-Pruning: Train a network large enough for the current problem repeat Find a node or connection whose removal does not penalize performance beyond a predetermined level Delete this node or connection (Optional:) Retrain the resulting network until further pruning degrades performance excessively. Figure 4.4: Generic network pruning algorithm.

9 4.2. ADAPTIVE MULTILAYER NETWORKS 9 The following are some pruning procedures: 1. Connections associated with weights of small magnitude may be eliminated from the trained network. Nodes whose associated connections have small magnitude weights may also be pruned. 2. Connections whose existence does not signicantly aect network outputs (or error) may be pruned. These may be detected by examining the change in network output when a connection weightischanged to 0 or by testing is negligible, i.e., if the network outputs change little when a weight is modied. 3. Input nodes can be pruned if the resulting change in network output is negligible. This results in a reduction in relevant input dimensionality, by detecting whichnetwork inputs are unnecessary for output computation.

10 10 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II 4. Lecun, Denker and Solla (1990) identify the weights that can be pruned from the network, by examining the second derivatives of the error function. Note that w+1 2 (E00 )(w) 2 where E If a network has already been trained until a local minimum of E has been reached, using an algorithm such as backpropagation, 0, and the above equation simplies to E 1 2 (E00 )(w) 2 : Pruning a connection corresponds to changing the connection weight fromw to 0, i.e., w = ;w. The connection is pruned if E = 1 2 (E00 )(;w) 2 =(E 00 )w 2 =2 is small enough. The pruning criterion hence examines whether (E 00 )w 2 =2 is below some threshold. :

11 4.2. ADAPTIVE MULTILAYER NETWORKS 11 Marchand's algorithm obtains an \optimal" size network for classication problems, repeatedly adding a perceptron node to the hidden layer. Let T + k and T ; k represent the training samples of two classes that remain to be correctly classied at the kth iteration. At each iteration, a node's weights are trained such that either jt + k+1 j < jt + k j or jt ; k+1 j < jt ; k (T ; k+1 [ T + k+1 ) (T ; k [ T + k j, and ), ensuring that the algorithm terminates eventually at the mth step, when either T ; m or T + m is empty. Eventually, wehave two sets of hidden units (H ; and H + ), that together give the correct answers on all samples of each class. The connection from the kth hidden node to the output node carries the weight w k = 8 < : 1=2 k if the kth node belongs to H + ;1=2 k if the kth node belongs to H ; The above weight assignment scheme ensures that nodes added later do not modify the correct results obtain in earlier iterations of the algorithm.

12 12 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II * ::::::::::::::::::::::::::::::::::::::::::::::::::* Upstart algorithm develops additional \subnets" for samples misclassied in earlier iterations. Each new node is trained using the pocket algorithm. If any samples are misclassied by a node, two subnets are introduced between the input nodes and other nodes in the network. Each subnet attempts to correctly separate samples of one class misclassied by the \parent" node from all samples of the other class for which the parent node was invoked. This is a simpler task than the one confronted by the parent node, ensuring that the recursive algorithm terminates. Large magnitude weights are attached to the connections from new nodes to the parent node, in such away that neither of the new nodes can aect performance of the parent nodeon samples correctly classied by the parent alone. The upstart algorithm can be extended to introduce a small network module at a time, instead of a single node, as in the \block-start" algorithm.

13 4.2. ADAPTIVE MULTILAYER NETWORKS 13 * ::::::::::::::::::::::::::::::::::::::::::::::::::* Neural Tree algorithm develops a decision tree, in which the output of a node (2 f0 1g) determines whether the nal answer is to come from its left subtree or its right subtree. Each node is trained using the LMS or pocket algorithm. A node is a leaf node (new subtrees are not required) if all or most of the corresponding training samples belong to the same class. * ::::::::::::::::::::::::::::::::::::::::::::::::::* Cascade correlation: This algorithm has two important features: (1) the cascade architecture development, and (2) correlation learning. Moreover, the architecture is not strictly feedforward. In the cascade architecture development process, new single-node hidden layers are successively added to a steadily growing layered neural network until performance is judged adequate. Each node may employ a nonlinear node function such asthehyperbolic tangent, whose output lies in the closed interval [;1:0 1:0].

14 14 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II Fahlman and Lebiere suggest using the Quickprop learning algorithm. When a node is added, its input weights are trained rst. Then all the weights on the connections to the output layer are trained while leaving other weights unchanged. Weights to each new hidden node are trained to maximize covariance with current network error, i.e., to maximize S(w new )= KX k=1 PX p=1 (x new p ; x new )(E k p ; Ek ) where w new is the vector of weights to the new node from all the pre-existing input and hidden units, x new p is the output of the new node for the pth input, E k p is the error of the kth output node for the pth sample before the new node is added, and x new and Ek are averages (over the training set).

15 4.2. ADAPTIVE MULTILAYER NETWORKS 15 Output node Output node Output node First hidden node First hidden node x y 1 x y 1 x y 1 Input nodes Input nodes Input nodes Output node Output node Second hidden node Second hidden node First hidden node First hidden node x y 1 x y 1 Input nodes Input nodes Figure 4.5: (modied Fig.4.19 from text) Cascade correlation example solid lines indicate connections currently being modied.

16 16 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II Tiling algorithm develops a multilayer feedforward network, successively adding new hidden layers into which ow the outputs of the previous layer. Within each layer, a \master" node is rst trained using the pocket algorithm, and then \ancillary" nodes are successively added to separate samples of dierent classes that are not yet separated by nodes currently in that layer. Each layer is constructed such that for every pair of training samples that belong to dierent classes, some node in that layer produces dierent outputs. * ::::::::::::::::::::::::::::::::::::::::::::::::::* 4.3 Prediction Networks Prediction problems constitute a special subclass of function approximation problems, in which the values of variables need to be determined from values at previous instants. Two classes of neural networks have been used for prediction tasks: recurrent networks and feedforward networks.

17 4.3. PREDICTION NETWORKS 17 * ::::::::::::::::::::::::::::::::::::::::::::::::::* Recurrent Networks Recurrent neural networks contain connections from output nodes to hidden layer and/or input layer nodes, and they allow interconnections between nodes of the same layer, particularly between the nodes of hidden layers. Rumelhart, Hinton, and Williams's (1986) training procedure is essentially the same as the backpropagation algorithm. They view recurrent networks as feedforward networks with a large number of layers. Each layer is thought of as representing a time delay in the network. Using this approach, the fully connected neural network with three nodes is considered equivalent to a feedforward neural network with k hidden layers, for some value of k. Weights in dierent layers are constrained to be identical, to capture the structure of a recurrent network: w (` `;1) i j = w (`;1 `;2) i j.

18 18 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II Output Output Output Time w 3,1 k+1 w 1,1 k w 1, w 1,1. w 3,1 1 w 1,3 Input Input Input 0 (a) (b) Figure 4.6: (a) Fully connected recurrent neural network with 3 nodes (b) Equivalent feedforward version, for Rumelhart's training procedure.

19 4.3. PREDICTION NETWORKS 19 Output Output Input Figure 4.7: Recurrent network with hidden nodes, to which Williams and Zipser's algorithm can be applied. Williams and Zipser proposed another training procedure for a recurrent network with hidden nodes. The net input to the kth node consists of inputs from other nodes as well as external inputs. The output of each node depends on outputs of other nodes at the previous instant: o k (t +1)=f(net k (t)). * ::::::::::::::::::::::::::::::::::::::::::::::::::* Feedforward networks for forecasting The generic network model consists of a preliminary preprocessing component that transforms an external

20 20 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II Algorithm Recurrent Network Training: Assume randomly chosen weights, t = 0, k i j =0 for each i j k: while MSE is unsatisfactory and k (t i j bounds are not exceeded, do Modify the weights: w i j (t) = X k2u (d k (t) ; o k i j where U is the set of nodes with a specied target value d k (t) = f 0 (net k (t)) " X `2U w k i j # + i k z j (t) Increment t end-while. Figure 4.8: Williams and Zipser's Recurrent network training algorithm

21 4.3. PREDICTION NETWORKS 21 input vector x(t) into a preprocessed vector x(t). The feedforward network is trained to compute the desired output value for a specic input x(t). x(t) Preprocessor x(t) Predicted Neural x(t+1) Network Predictor Figure 4.9: Generic neural network model for prediction. Tapped delay-line neural network (TDNN): Consider a prediction task where x(t) is to be predicted from x(t ; 1) x(t ; 2): In this simple case, x at time t consists of a single input x(t), and x at time t consists of the vector (x(t ; 1) x(t ; 2)) supplied as input to the feedforward network. For this example, preprocessing consists merely of storing past values of the variable and supplying them to the network along with the latest value. Many preprocessing transformations for prediction problems can be described as: x i (t) = tx =0 c i (t ; )x()

22 22 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II where c is a known \kernel" function. For example, in \exponential trace memories," c i (j) =(1; i ) j i, where i 2 [;1 1]. The kernel function for \discrete-time gamma memories" is c i (j) = 8 < : ; j l i (1 ; i ) l i +1 j;l i i 0 otherwise if j l i where delay l i is a non-negative integer and i 2 [0 1]. * ::::::::::::::::::::::::::::::::::::::::::::::::::* 4.4 Radial Basis Functions (RBF) A function is radially symmetric (or is an RBF) if its output depends on the distance of the input vector from a stored vector specic to that function. Neural networks whose node functions are radially symmetric functions are referred to as RBF-nets. RBF nodes compute a non-increasing function of distance, with (u 1 ) (u 2 ) whenever u 1 <u 2. The Gaussian, g (u) / e ;(u=c)2

23 4.4. RADIAL BASIS FUNCTIONS (RBF) 23 is the most widely used RBF. RBF-nets are generally called upon for use in function approximation problems, particularly for interpolation. In many function approximation problems, we needto determine the behavior of the function at a new input, given the behavior of the function at training samples. Such problems are often solved by linear interpolation. (x 1 :11) (x 2 :8) D 2 D 3 (x 3 :9) (x 6 :17) (x D 5 x 8 :10) 0 (x 7 :15) D 4 (x 5 :8) (x 4 :3) Figure 4.10: D j is the Euclidean distance between x j and x 0. (x 5 : 8) indicates that f(x 5 )=8. The four nearest observations can be used for interpolation at x 0,giving (8D ;1 2 +9D ;1 3 +3D ;1 4 +8D ;1 5 )=(D ;1 2 + D ;1 3 + D ;1 4 + D ;1 5 ). * ::::::::::::::::::::::::::::::::::::::::::::::::::* If nodes in an RBF-net with linear radial basis node functions 0 (D) = D ;1 corresponding to each of the

24 24 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II training samples x p for p =1 ::: P 0 surround the new input x, the output of the network at x is proportional to 1 X P 0 p d p 0 (kx ; x p k) where d p = f(x p ) is the output corresponding to x p, the p-th pattern. To avoid the problem of deciding which of the training samples do surround the new input, we may use the output 1 where d 1 ::: d P P PX p=1 d p (kx ; x p k) are the outputs corresponding to the entire set of training samples, and is any RBF. For network size to be reasonably small, we cannot have one node to representeach x p. Hence similar training samples are clustered together, and output NX o = 1 N i=1 where N is the number of clusters, i is the center of ' i (k i ; xk) the ith cluster, and ' i is the desired mean output of all samples of the ith cluster.

25 4.4. RADIAL BASIS FUNCTIONS (RBF) 25 Training involves learning the values of w 1 = ' 1 N ::: w N = ' N N minimizing E = PX p=1 E p = PX p=1 1 ::: N (d p ; o p ) 2 : Gradient descent suggests the following update rules: w i = i (d p ; o p )(kx p ; i k)and i j = ; i j w i (d p ; o p )R 0 (kx p ; i k 2 )(x p j ; i j ) where R is dened for convenience such thatr(z 2 ) = (z). Note that i and i j may dier for each w i and i j. This learning algorithm requires considerable computation. A faster alternative, \partially o-line training," consists of two steps: 1. Some clustering procedure is used to estimate the centroid ( i ) and spread for each cluster. 2. Using one node per cluster (with xed i ), gradient descent one is invoked to nd weights (w i ). * ::::::::::::::::::::::::::::::::::::::::::::::::::*

26 26 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II 4.5 Polynomial Networks Many practical problems require computing or approximating functions that are polynomials of the input variables, a task that can require many nodes and extensive training if familiar node functions (sigmoids, Gaussians, etc.) are employed. Networks whose node functions allow them to directly compute polynomials and functions of polynomials are referred to as \polynomial networks". The following are some examples. 1. Higher-order networks contain no hidden layers, but have nodes referred to as \higher-order processing units" that perform complex computations. A higherorder node applies a nonlinear function f to a polynomial in the input variables, and the resulting node output is f(w 0 + X j 1 w j1 i j1 + :::+ X j 1 j 2 ::: j k w j1 j 2 ::: j k i j1 i j2 :::i jk )

27 4.5. POLYNOMIAL NETWORKS 27 The LMS training algorithm can be applied to train such a network. However, the number of weights required increases rapidly with the input dimensionality n and polynomial degree k. 2. Sigma-pi networks contain \sigma-pi units" that apply nonlinear functions to sums of weighted products of input variables: f(w 0 + :::+ X j 1 6=j 2 6=j 3 6=j 1 w j1 j 2 j 3 i j1 i j2 i j3 + :::) This model does not allow terms with higher powers of input variables, therefore sigma-pi networks are not universal approximators. 3. \Product units" are nodes that compute products n j=1 ip j i j, where each p j i is an integer whose value is to be determined. Networks may contain these in addition to some \ordinary" nodes that apply sigmoids or step functions to their net input. The learning algorithm must determine both the weights and exponents p ji for product units, in addition to deter-

28 28 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II mining weights for ordinary nodes. Backpropagation training may beused,andlearningisslow. 4. Networks with functional link architecture conduct linear regression, estimating the weights in P j w j j (i), where each \basis function" j is chosen from a predened set of components (such as linear functions, products and sinusoidal functions). 5. Pi-sigma networks contain a hidden layer, in which each hidden node computes w k 0 + P j w k ji j, and the output node computes f Q k(w k 0 + P j w k ji j ) : Weights between the input layer and the hidden layer are adapted during the training process, and can be trained quickly using the LMS algorithm. 6. Ridge-polynomial networks consist of components that generalize Pi-sigma networks, and are trained using an adaptive network construction algorithm. These networks have universal approximation capabilities. * ::::::::::::::::::::::::::::::::::::::::::::::::::*

29 4.6. REGULARIZATION Regularization Regularization involves optimizing a function E +jp j 2, where E is the original cost (or error) function, P is a \stabilizer" that incorporates a priori problem-specic requirements or constraints, and is a constant that controls the relative importance of E and P. Regularization can be implemented explicitly or implicitly. Examples of explicit regularization: Introduction of P = P j w2 j into the cost function penalizes large magnitude weights. A weight decay term may be introduced into the generalized delta rule, so that the weight change w = ;(@E=@w) ; w depends on the negative gradient of the mean square error as well as the current weight. This favors the development ofnetworks with smaller weight magnitudes.

30 30 CHAPTER 4. SUPERVISED LEARNING: MULTILAYER NETWORKS II \Smoothing" penalties may beintroduced into the error function, e.g., terms proportional to j 2 E=wi 2 j and other higher-order derivatives of the error function. Such penalties prevent a network's output function from having very high curvature, and may prevent anetwork from over-specializing on the training data to account for outliers in the data. Examples of implicit regularization: Training data may be perturbed articially, using Gaussian noise. This is equivalent to using a stabilizer that imposes a smoothness constraint on the derivative of the squared error function with respect to the input variables. Adding random noise to the weights imposes a smoothness constraint on the derivatives of the squared error function with respect to the weights. RBF-nets constitute a special case of networks that accomplish regularization.

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES A. Likas, K. Blekas and A. Stafylopatis National Technical University of Athens Department