Neural Networks. CE-725: Statistical Pattern Recognition, Sharif University of Technology, Spring 2013, Soleymani



1 Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani

2 Outline
- Biological and artificial neural networks
- Feed-forward neural networks
  - Single layer networks
  - Multi-Layer Perceptron (MLP)
  - Back-propagation
- RBF networks


3 Biological Neural Network
How do humans recognize patterns?
- Human brain structure: neurons and the connections between them
- Learning: altering the strengths of connections between neurons (adding or deleting connections)
[Figures: a single neuron; connection between neurons]

4 Artificial Neural Networks
Artificial Neural Networks (ANNs): mathematical models inspired by biological neural networks.
[Figure: a unit that sums its weighted inputs ($\Sigma$) and passes the sum through an activation function]

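5 Neuron
Neuron, unit, or processing element: $y = f\left(\sum_{i=1}^{d} w_i x_i + w_0\right)$, where $w_0$ is the bias (or activation threshold).
- Equivalent to adding a constant input $x_0 = 1$ whose weight is the bias: $y = f\left(\sum_{i=0}^{d} w_i x_i\right)$
- Binary McCulloch-Pitts neuron: $f(a) = 1$ if $a \ge 0$ and $f(a) = 0$ if $a < 0$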
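A minimal sketch of a binary threshold neuron may make the definition concrete. The function name `mcculloch_pitts` and the AND weights are illustrative choices, not taken from the slides; the threshold convention (output 1 when the weighted sum plus bias is non-negative) is an assumption.

```python
def mcculloch_pitts(x, w, w0):
    """Binary threshold unit: returns 1 if w.x + w0 >= 0, else 0."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if a >= 0 else 0

# Hand-picked (hypothetical) weights that make the unit compute logical AND.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [mcculloch_pitts(x, w=[1.0, 1.0], w0=-1.5) for x in inputs]
print(and_outputs)
```

Moving the bias to `w0=-0.5` would turn the same unit into a logical OR, which shows how the bias acts as an activation threshold.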

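6 Activation function
The most common activation functions:
- Unit step
- Sigmoid: $f(a) = \frac{1}{1+\exp(-\beta a)}$
[Figure: unit step and sigmoid curves for $\beta = 2$, $\beta = 1$, $\beta = 0.5$]
Bipolar activation functions can be defined as well; these are usually more attractive:
$\tanh(\beta a/2) = \frac{1-\exp(-\beta a)}{1+\exp(-\beta a)}$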
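As a quick numerical check of the identity between the bipolar function and tanh, the sketch below evaluates both sides at a few points. It assumes the slope parameter is called `beta`; the function names are illustrative.

```python
import math

def sigmoid(a, beta=1.0):
    """Binary sigmoid with slope parameter beta; values lie in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-beta * a))

def bipolar(a, beta=1.0):
    """Bipolar sigmoid; values lie in (-1, 1)."""
    return (1.0 - math.exp(-beta * a)) / (1.0 + math.exp(-beta * a))

for a in (-2.0, 0.0, 0.5, 3.0):
    # The bipolar sigmoid equals tanh(beta * a / 2) ...
    assert abs(bipolar(a, 2.0) - math.tanh(2.0 * a / 2.0)) < 1e-12
    # ... and is a rescaled binary sigmoid: 2 * sigmoid(a) - 1.
    assert abs(bipolar(a) - (2.0 * sigmoid(a) - 1.0)) < 1e-12
```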

7 Feed-Forward Neural Networks (FFNN)
- Neurons are arranged in layers.
- Each unit receives input only from units in the preceding layer.
- Weights on links can be adapted using training data and a learning algorithm.
[Figure: input layer (non-processing units), hidden layers, output layer]

8 Feed-Forward Neural Networks (FFNN)
- The most commonly used networks for pattern classification tasks
- Expressive while efficient
- Propagate data from input to output: from the input nodes, data goes through the hidden nodes (if any) and then to the output nodes

9 Single Layer FFNNs
A single-layer network can be used as a linear decision boundary: $f(\boldsymbol{w}^T\boldsymbol{x})$ shows the class of $\boldsymbol{x}$ (with a constant input $x_0 = 1$ whose weight is the bias).
Types of single-layer networks:
- Hebb (Hebb, 1949)
- Perceptron (Rosenblatt, 1962)
- ADALINE (Widrow and Hoff, 1960)

10 Training of Single Layer FFNNs
First initialize $\boldsymbol{w}$ and then iteratively update it: $\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} + \Delta\boldsymbol{w}^{t}$
The learning process goes through all training examples (an epoch) a number of times, until reaching a stopping criterion.

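11 Training of Single Layer FFNNs
Weight update for a training pair $(\boldsymbol{x}^{(n)}, y^{(n)})$, with learning rate $\eta$:
- Hebb: $\Delta\boldsymbol{w} = \eta\, y^{(n)} \boldsymbol{x}^{(n)}$
- Perceptron: if $\mathrm{sign}(\boldsymbol{w}^T\boldsymbol{x}^{(n)}) \neq y^{(n)}$ then $\Delta\boldsymbol{w} = \eta\, y^{(n)} \boldsymbol{x}^{(n)}$, else $\Delta\boldsymbol{w} = \boldsymbol{0}$
- ADALINE (Widrow-Hoff, LMS, or delta rule): $\Delta\boldsymbol{w} = \eta\,(y^{(n)} - \boldsymbol{w}^T\boldsymbol{x}^{(n)})\,\boldsymbol{x}^{(n)}$, a gradient step on the squared error $J^{(n)}(\boldsymbol{w}) = \frac{1}{2}(y^{(n)} - \boldsymbol{w}^T\boldsymbol{x}^{(n)})^2$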
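The perceptron and delta (ADALINE) updates can be sketched in a few lines. This is a minimal illustration under stated assumptions: labels are bipolar (+1/-1), each input carries a leading constant 1 for the bias, and the AND data set, learning rates, and epoch counts are hypothetical choices rather than values from the slides.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, x):
    """Class label from the sign of w.x (ties count as +1)."""
    return 1 if dot(w, x) >= 0 else -1

def train_perceptron(data, eta=1.0, max_epochs=100):
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if predict(w, x) != y:               # update only on errors
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                        # a full epoch without errors
            break
    return w

def train_adaline(data, eta=0.05, epochs=500):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            err = y - dot(w, x)                  # error on the linear output
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

# Logical AND with bipolar targets; the first component is the bias input.
AND_DATA = [((1, 0, 0), -1), ((1, 0, 1), -1), ((1, 1, 0), -1), ((1, 1, 1), 1)]
w_perceptron = train_perceptron(AND_DATA)
w_adaline = train_adaline(AND_DATA)
```

The perceptron stops as soon as an epoch produces no mistakes, while the delta rule keeps shrinking the squared error of the linear output even after every point is classified correctly.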

12 Perceptron vs. Delta Rule
- Perceptron learning rule: guaranteed to succeed if the training examples are linearly separable
- Delta rule: guaranteed to converge to the hypothesis with the minimum squared error
  - succeeds if the learning rate is sufficiently small
  - even when the training data contain noise or are not separable by a hyperplane
  - can also be used for regression problems

13 Training Mode in Gradient Descent
- Batch mode (gradient descent): in each iteration, the weight update depends on the entire training data: $\Delta\boldsymbol{w} = -\eta\,\nabla_{\boldsymbol{w}} J$, with $J = \sum_{n=1}^{N} J^{(n)}$
- Incremental or sequential mode (stochastic gradient descent): iteratively update the weight vector based on one data point at a time (cycle through the data points in sequence or select them at random with replacement): $\Delta\boldsymbol{w} = -\eta\,\nabla_{\boldsymbol{w}} J^{(n)}$
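The two training modes can be compared on a one-parameter least-squares problem. The sketch below is illustrative only: the data points, learning rate, iteration counts, and the per-example cost $J^{(n)}(w) = \frac{1}{2}(y - wx)^2$ are assumptions made for the example.

```python
import random

# Hypothetical 1-D regression data: (x, y) pairs roughly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]

def grad_n(w, x, y):
    """Gradient of J_n(w) = 0.5 * (y - w*x)**2 with respect to w."""
    return -(y - w * x) * x

def batch_gd(w=0.0, eta=0.01, iters=500):
    for _ in range(iters):
        # One update uses the gradient summed over the whole training set.
        w -= eta * sum(grad_n(w, x, y) for x, y in data)
    return w

def sgd(w=0.0, eta=0.01, iters=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(iters):
        x, y = rng.choice(data)      # one point, sampled with replacement
        w -= eta * grad_n(w, x, y)
    return w

w_batch = batch_gd()
w_sgd = sgd()
```

Batch descent settles on the least-squares solution, whereas the stochastic version keeps fluctuating around it with a spread that shrinks as `eta` decreases.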

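14 Multi-Layer Perceptron (MLP)
Two-layer MLP (the number of layers of adaptive weights is counted):
- Hidden units: $z_j = h\left(\sum_{i=0}^{d} w_{ji} x_i\right)$, $j = 1, \dots, M$, with $x_0 = 1$
- Output units: $o_k = f\left(\sum_{j=0}^{M} w_{kj} z_j\right)$, $k = 1, \dots, K$, with $z_0 = 1$
[Figure: input layer, hidden layer with activation $h$, output layer]
Usually, $f$ and $h$ are sigmoid activation functions: $\sigma(a) = \frac{1}{1+\exp(-a)}$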

15 Multi-Layer Perceptron (MLP)
- An MLP is a generalized linear model: $o_k(\boldsymbol{x}) = f\left(\sum_{j} w_{kj}\,\phi_j(\boldsymbol{x})\right)$
  - For the classification problem, $f(\cdot)$ is a nonlinear activation function.
  - The form of the nonlinearity (the basis functions $\phi_j$) is adapted from the training data (not fixed in advance).
  - $\phi_j$ is defined based on parameters which can also be adapted during training (e.g., $\phi_j(\boldsymbol{x}) = h\left(\sum_{i} w_{ji} x_i\right)$).
- The MLP is of greatest practical use.
- Hidden units enable us to express complicated nonlinear functions.

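16 XOR Problem
$x_1 \oplus x_2 = (x_1 \lor x_2) \land \lnot (x_1 \land x_2)$
[Figure: a two-layer network that computes XOR, and the resulting decision regions; from Duda, Hart & Stork]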
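A two-layer network of threshold units can realize this decomposition directly. The weights below are hand-picked for illustration (they are not the values from the Duda, Hart & Stork figure): one hidden unit computes OR, the other computes AND, and the output unit fires when the first is on and the second is off.

```python
def step(a):
    """Unit step activation: 1 if a >= 0, else 0."""
    return 1 if a >= 0 else 0

def xor_net(x1, x2):
    z1 = step(x1 + x2 - 0.5)      # hidden unit 1: x1 OR x2
    z2 = step(x1 + x2 - 1.5)      # hidden unit 2: x1 AND x2
    return step(z1 - z2 - 0.5)    # output: z1 AND NOT z2

xor_table = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(xor_table)
```

No single threshold unit can produce this table, because the two classes are not linearly separable in the input space; the hidden layer maps them to a space where they are.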

17 MLP Outputs as Discriminant Functions
Classification problem: an MLP with $K$ output units, where $K$ is the number of classes.
- We can view the network as computing $K$ discriminant functions $o_k = g_k(\boldsymbol{x})$ ($k = 1, \dots, K$)
- and classify $\boldsymbol{x}$ according to the largest discriminant function $g_k(\boldsymbol{x})$.

18 MLP Universal Approximator
- A feed-forward network with a single hidden layer and linear outputs can approximate any continuous function on a compact domain to arbitrary accuracy
  - under mild assumptions on the activation function, e.g., sigmoid activation functions (Cybenko, 1989)
  - when a sufficiently large (but finite) number of hidden units is used
- The result is of greater theoretical than practical interest: constructing such a network requires the nonlinear activation functions and the weight values, which are unknown.

19 MLP with Different Number of Layers: Separability Properties
MLP with unit step activation function; decision region found by an output unit.
Structure | Type of decision regions | Interpretation | Example of region
Single layer (no hidden layer) | Half space | Region found by a hyperplane | [figure]
Two layer (one hidden layer) | Polyhedral (open or closed) region | Intersection of half spaces | [figure]
Three layer (two hidden layers) | Arbitrary regions | Union of polyhedra | [figure]

20 MLP Training: Backpropagation
- Training algorithm used to adjust the weights in MLP networks (based on the training data)
- The backpropagation algorithm is based on gradient descent
- MLP with sigmoidal activation functions in the hidden layers: differentiable w.r.t. the parameters

21 Backpropagation
- Sum-of-squares error cost function: $J = \sum_{n=1}^{N} J^{(n)}$, with $J^{(n)} = \frac{1}{2}\sum_{k=1}^{K}\left(y_k^{(n)} - o_k^{(n)}\right)^2$
- The backpropagation algorithm uses stochastic gradient descent to find weights minimizing the above cost function. In the following, we drop the superscript $(n)$.
- It is a computationally efficient algorithm for learning multiple layers of weights.

22 Backpropagation
First step: forward propagation. Feed the input vector to the network and calculate the activations of all hidden and output nodes (the $z_j$'s and $o_k$'s).
[Figure: two-layer network with inputs $x_i$ ($i = 0, \dots, d$), hidden units $z_j$ ($j = 1, \dots, M$) with activation $h$, and outputs $o_k$]

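23 Backpropagation (hidden-to-output) Weight Adaptation
With $net_k = \sum_{j} w_{kj} z_j$ and $o_k = f(net_k)$:
$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}} = -\delta_k z_j$
$\delta_k$ is the sensitivity of output unit $k$:
$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial o_k}\,\frac{\partial o_k}{\partial net_k} = (y_k - o_k)\, f'(net_k)$
Weight update (or learning rule) for the hidden-to-output weights:
$\Delta w_{kj} = -\eta\,\frac{\partial J}{\partial w_{kj}} = \eta\,\delta_k z_j$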

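24 Backpropagation (input-to-hidden) Weight Adaptation
With $net_j = \sum_{i} w_{ji} x_i$ and $z_j = h(net_j)$:
$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial z_j}\,\frac{\partial z_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}$
$\frac{\partial J}{\partial z_j} = \frac{\partial}{\partial z_j}\left[\frac{1}{2}\sum_{k}(y_k - o_k)^2\right] = -\sum_{k}(y_k - o_k)\, f'(net_k)\, w_{kj} = -\sum_{k} w_{kj}\,\delta_k$
Sensitivity for a hidden unit:
$\delta_j = -\frac{\partial J}{\partial net_j} = h'(net_j)\sum_{k} w_{kj}\,\delta_k$
Weight update (or learning rule) for the input-to-hidden weights:
$\Delta w_{ji} = -\eta\,\frac{\partial J}{\partial w_{ji}} = \eta\,\delta_j x_i$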

25 Backpropagation of Errors
$\delta_k = (y_k - o_k)\, f'(net_k)$
$\delta_j = h'(net_j)\sum_{k} w_{kj}\,\delta_k$

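26 Stochastic Back-propagation
initialize $M$, $\boldsymbol{w}$, $\eta$, $\theta$, $t \leftarrow 0$
do
    $t \leftarrow t + 1$
    $\boldsymbol{x} \leftarrow$ randomly chosen pattern among the training data
    apply $\boldsymbol{x}$ to the network (forward propagate) and find the activations of all the units
    evaluate $\delta_k$ for all the output units
    backpropagate the $\delta_k$'s to obtain $\delta_j$ for all hidden units
    $w_{ji} \leftarrow w_{ji} + \eta\,\delta_j x_i$
    $w_{kj} \leftarrow w_{kj} + \eta\,\delta_k z_j$
until $J(\boldsymbol{w}) < \theta$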
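The loop above can be sketched in plain Python for a two-layer network with sigmoid units in both layers. This is a minimal illustration, not the course's reference implementation: the function names, the uniform initialization range, the defaults for `M`, `eta`, `theta`, and `max_iters`, and the XOR training set in the usage lines are all assumptions.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, W2):
    """x[0] is the bias input 1. Returns hidden z (with z[0] = 1) and outputs o."""
    z = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = [sigmoid(sum(w * zj for w, zj in zip(row, z))) for row in W2]
    return z, o

def cost(x, y, W1, W2):
    _, o = forward(x, W1, W2)
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, o))

def deltas(x, y, W1, W2):
    z, o = forward(x, W1, W2)
    # Output sensitivities: (y_k - o_k) f'(net_k), with f' = o (1 - o).
    dk = [(yk - ok) * ok * (1.0 - ok) for yk, ok in zip(y, o)]
    # Hidden sensitivities: h'(net_j) * sum_k w_kj delta_k, for j = 1..M.
    dj = [z[j] * (1.0 - z[j]) * sum(W2[k][j] * dk[k] for k in range(len(dk)))
          for j in range(1, len(z))]
    return z, dk, dj

def sgd_step(x, y, W1, W2, eta):
    z, dk, dj = deltas(x, y, W1, W2)     # all deltas use the old weights
    for k, row in enumerate(W2):
        for j in range(len(row)):
            row[j] += eta * dk[k] * z[j]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += eta * dj[j] * x[i]

def train(data, M=3, eta=0.5, theta=0.01, max_iters=20000, seed=0):
    rng = random.Random(seed)
    d, K = len(data[0][0]), len(data[0][1])
    W1 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(M)]
    W2 = [[rng.uniform(-1, 1) for _ in range(M + 1)] for _ in range(K)]
    for _ in range(max_iters):
        x, y = rng.choice(data)          # randomly chosen training pattern
        sgd_step(x, y, W1, W2, eta)
        if sum(cost(xn, yn, W1, W2) for xn, yn in data) < theta:
            break                        # stopping criterion J(w) < theta
    return W1, W2

# Usage on XOR (bias input first); convergence depends on the random start.
XOR = [([1.0, 0.0, 0.0], [0.0]), ([1.0, 0.0, 1.0], [1.0]),
       ([1.0, 1.0, 0.0], [1.0]), ([1.0, 1.0, 1.0], [0.0])]
W1, W2 = train(XOR, max_iters=200)
```

Both sensitivities are computed before either weight matrix is changed, so the hidden-layer deltas see the same hidden-to-output weights that produced the forward pass.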

27 Cost function in BP algorithm
- Non-linear activation functions yield a non-convex cost function in general.
- The error surface (cost function) depends on the training data.
- It may have many local minima.
- Networks with multiple hidden layers are more prone to getting stuck in a local minimum.
- Other training criteria (e.g., cross entropy) can be used instead of the least-squares cost function.

28 Backpropagation: Training Mode
Advantages of stochastic gradient descent:
- useful for training neural networks on large training sets
- possibility of escaping from local minima (higher degree of randomness during training)
- can approximate gradient descent arbitrarily closely if $\eta$ is small enough
Advantages of batch gradient descent:
- a better estimate of the gradient, and thus more well-behaved convergence

29 Backpropagation & Local minima
To find a good minimum:
- We usually run the training algorithm multiple times with different random initializations of the weights and select one of the results according to its performance on a validation set.
To decide when to stop training:
- During training, the training set is used more than once until the algorithm converges.
- We do not want to overtrain the network (which decreases the generalization of the model), so we stop training at a minimum of the error on the validation set.

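30 Back-propagation: MLP Activation Function
- The sigmoid is the most widely used activation function.
- Sigmoid properties: smooth, differentiable, nonlinear, monotonic, saturating
- A hidden layer of sigmoids affords a global representation of the input.
- Sigmoid derivative, for $f$ with slope parameter $\beta$:
  - Binary, $f(a) = \frac{1}{1+\exp(-\beta a)}$: $f'(a) = \beta f(a)\,(1 - f(a))$
  - Bipolar, $f(a) = \tanh(\beta a/2)$: $f'(a) = \frac{\beta}{2}\,(1 - f(a)^2)$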
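The closed-form derivatives can be checked against a central finite difference. The sketch assumes the binary sigmoid $1/(1+\exp(-\beta a))$ with derivative $\beta f(1-f)$ and the bipolar sigmoid $\tanh(\beta a/2)$ with derivative $\frac{\beta}{2}(1-f^2)$; the helper names are illustrative.

```python
import math

def binary(a, beta=1.0):
    return 1.0 / (1.0 + math.exp(-beta * a))

def bipolar(a, beta=1.0):
    return math.tanh(beta * a / 2.0)

def binary_prime(a, beta=1.0):
    f = binary(a, beta)
    return beta * f * (1.0 - f)          # derivative written via the output f

def bipolar_prime(a, beta=1.0):
    f = bipolar(a, beta)
    return 0.5 * beta * (1.0 - f * f)

def numeric_derivative(fn, a, eps=1e-6):
    """Central finite difference, used only as a reference value."""
    return (fn(a + eps) - fn(a - eps)) / (2.0 * eps)

for a in (-1.5, 0.0, 0.7):
    assert abs(binary_prime(a) - numeric_derivative(binary, a)) < 1e-8
    assert abs(bipolar_prime(a) - numeric_derivative(bipolar, a)) < 1e-8
```

Because each derivative is a function of the unit's output alone, backpropagation can reuse the activations stored during the forward pass instead of recomputing exponentials.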

31 Stopping Criteria
Elementary stopping criterion: $J(\boldsymbol{w}) < \theta$
Other stopping criteria:
- Reaching a minimum (i.e., the training error fails to improve) or an acceptable level of error
- When the rate of improvement drops below a certain level
- When a certain number of epochs have passed
  - Number of epochs: the number of presentations of the full training set
  - Indicates the relative amount of training
- When the error on a separate validation set reaches a minimum
  - Stopping the training before completing the gradient descent may help avoid overfitting
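The validation-based criterion is often implemented by tracking the best validation error seen so far. The sketch below is one possible version; the `patience` parameter (how many epochs without improvement are tolerated) and the error sequence are hypothetical and do not come from the slides.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch index with the lowest validation error seen before
    `patience` consecutive epochs pass without any improvement."""
    best, best_epoch = float("inf"), -1
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                         # no improvement for too long
    return best_epoch

# Hypothetical validation errors recorded after each training epoch.
errors = [0.90, 0.70, 0.50, 0.45, 0.47, 0.50, 0.55, 0.30]
stop_at = early_stopping_epoch(errors, patience=3)
print(stop_at)
```

In this sequence the later drop to 0.30 is never reached, which illustrates the trade-off: a small patience saves training time but can stop at a shallow minimum of the validation error.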

32 Number of Hidden Units
- Determines the expressive power of the network.
- Determines the total number of weights, i.e., the number of degrees of freedom.
- Select among networks with different numbers of hidden units by training these networks and then evaluating them on a validation set.
  - For large networks and large training sets, this is inefficient.
- Constructive techniques
- Pruning techniques
[Figure: training error and validation error versus number of hidden units]

33 Training of MLP using BP: Many Parameters
Many parameters to tune:
- Learning rate: $\eta$
- Stopping parameter: threshold $\theta$ or number of epochs
- Number of hidden layers, number of hidden units

34 Radial-Basis Function (RBF) Network
Radial-Basis Function (RBF) networks can also be considered two-layer feed-forward NNs.
- Hidden layer: the input is mapped onto each RBF in the hidden layer.
  - An RBF is a function with a built-in distance criterion with respect to a center.
  - It is commonly taken to be Gaussian.
- Output layer:
  - In regression problems: a linear combination of the hidden layer values
  - In classification problems: typically a sigmoid function of a linear combination of the hidden layer values

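35 RBF Network
Radial Basis Function (RBF) network: $o(\boldsymbol{x}) = w_0 + \sum_{j=1}^{M} w_j\,\phi(\boldsymbol{x}, \boldsymbol{c}_j)$
- Transform the data into an $M$-dimensional space by representing the instances with a number of prototypes $\boldsymbol{c}_1, \dots, \boldsymbol{c}_M$: $\phi_j(\boldsymbol{x}) = \phi(\boldsymbol{x}, \boldsymbol{c}_j)$
- Then use a linear (i.e., single-layer) model to find the output.
- Can be easily generalized to more than one output unit.
[Figure: RBF network with a bias input 1 and hidden units $\phi(\boldsymbol{x}, \boldsymbol{c}_j)$ feeding a linear output]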

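36 RBF Network
RBF kernel (Gaussian): $\phi(\boldsymbol{x}, \boldsymbol{c}_j) = \exp\left(-\frac{\|\boldsymbol{x} - \boldsymbol{c}_j\|^2}{2\sigma_j^2}\right)$
XOR problem example: $\boldsymbol{c}_1 = [0, 0]$, $\boldsymbol{c}_2 = [1, 1]$, $\sigma_1 = \sigma_2 = 1$
[Figure: the XOR points mapped into the $(\phi_1, \phi_2)$ space]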
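The XOR example can be reproduced numerically. The sketch assumes a Gaussian kernel of the form $\exp(-\|\boldsymbol{x}-\boldsymbol{c}\|^2 / (2\sigma^2))$ with $\sigma = 1$ and centers $(0,0)$ and $(1,1)$; the output weights and bias in `rbf_output` are hand-picked for illustration, not trained.

```python
import math

def gaussian_rbf(x, c, sigma=1.0):
    """Gaussian RBF; assumes the exp(-||x - c||^2 / (2 sigma^2)) form."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

CENTERS = [(0.0, 0.0), (1.0, 1.0)]

def hidden(x):
    """Map an input to the 2-D space of RBF activations (phi_1, phi_2)."""
    return [gaussian_rbf(x, c) for c in CENTERS]

def rbf_output(x, w0=1.29, w=(-1.0, -1.0)):
    # Linear output layer with hypothetical weights: positive means XOR = 1.
    return w0 + sum(wj * pj for wj, pj in zip(w, hidden(x)))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
predictions = [1 if rbf_output(x) > 0 else 0 for x in points]
print(predictions)
```

In the transformed space the points $(0,1)$ and $(1,0)$ coincide and lie on one side of the line $\phi_1 + \phi_2 = 1.29$, while $(0,0)$ and $(1,1)$ lie on the other, so a single linear unit separates the classes.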


38 RBF Network Training
RBF networks are trained by:
- deciding on the number of hidden units
- deciding on their centers and the sharpnesses (standard deviations) of their Gaussians
- training the output layer

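39 Selecting the centers
- The first idea is to set a center on each training data point.
- In practice, we usually use fewer centers than training points ($M < N$):
  - higher generalization capabilities of the model (avoids overfitting)
  - reduction in computational complexity
Selecting center locations:
- Fixing the centers before weight adaptation
  - e.g., randomly selected training points, or centers found by a clustering algorithm
- Training the centers and standard deviations along with weight adaptation, by gradient descent on $J(\boldsymbol{w}, \boldsymbol{c}, \boldsymbol{\sigma})$:
  - $\Delta w_j = -\eta\,\frac{\partial J(\boldsymbol{w}, \boldsymbol{c}, \boldsymbol{\sigma})}{\partial w_j}$, $j = 0, \dots, M$
  - $\Delta \boldsymbol{c}_j = -\eta\,\frac{\partial J(\boldsymbol{w}, \boldsymbol{c}, \boldsymbol{\sigma})}{\partial \boldsymbol{c}_j}$, $j = 1, \dots, M$
  - $\Delta \sigma_j = -\eta\,\frac{\partial J(\boldsymbol{w}, \boldsymbol{c}, \boldsymbol{\sigma})}{\partial \sigma_j}$, $j = 1, \dots, M$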

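40 Kernel SVM vs. RBF
- A kernel SVM automatically computes all the unknown parameters, including the number of centers.
- In the SVM approach, the number of nodes and the centers are found according to the optimization problem:
  $\hat{y} = \mathrm{sign}\left(w_0 + \sum_{n:\,\alpha_n > 0} \alpha_n\, y^{(n)}\, k(\boldsymbol{x}, \boldsymbol{x}^{(n)})\right)$
- Only the support vectors contribute; $M$: number of SVs.
[Figure: network with a bias input 1 and hidden units $k(\boldsymbol{x}, \boldsymbol{x}^{(n)})$, one per support vector]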

41 RBF vs. MLP
- The activation responses are of a local nature in RBF networks and of a global nature in MLP networks.
  - MLPs exhibit improved generalization properties, especially for regions that are not represented sufficiently in the training set.
  - In RBFs, a large number of centers is required to fill in the space (thus an exponential dependence on the input dimension: the curse of dimensionality).
- RBF networks do not suffer from local minima (when the RBFs are fixed ahead of time).
  - The only parameters adjusted in the learning process are those of the linear mapping from the hidden layer to the output layer.
  - Linearity ensures that the error surface is quadratic and has a single minimum.
  - The output weights can be adapted with the delta rule or the perceptron rule; MLPs learn slower.
- However, RBF networks (when the RBFs are fixed ahead of time) show less flexibility.

42 Neural Network Models: Properties
Properties:
- ability to learn complex nonlinear input-output relationships
- sequential training procedures
- adapting the network parameters to the data
Neural network models and statistical models:
- Most neural network models are similar or equivalent to classical pattern recognition models [Jain et al. 2000].
- Neural networks have been described as "statistics for amateurs": they conceal the statistics from the user [Anderson].

43 Other Neural Networks for Pattern Recognition
- Self-Organizing Map (SOM) or Kohonen network: clustering and dimensionality reduction
- Auto-associative neural networks: can be used for dimensionality reduction (feature extraction)
- Recurrent Neural Networks (RNNs): propagate data also from later processing stages to earlier stages; they are general sequence processors
- State-of-the-art networks: deep learning (Hinton, 2006)
  - uses restricted Boltzmann machines (RBMs) to find an efficient learning procedure for deep models
  - effective feature extractors

44 Neural Networks (MLP): Summary
Advantages:
- can be used for huge data sets
- learn a feature extractor and a classifier simultaneously
- do not make any assumption regarding the underlying probability density
- resistant to outliers
Should not be used when traditional methods are appropriate.
Disadvantages:
- many parameters to be set
- usually a slow training process (many epochs are required); very slow in networks with multiple hidden layers
- will find a local, not necessarily global, minimum of the error function
- these models may be very hard to interpret
