Neural Networks. Theory And Practice. Marco Del Vecchio 19/07/2017. Warwick Manufacturing Group University of Warwick

1 Neural Networks Theory And Practice Marco Del Vecchio Warwick Manufacturing Group University of Warwick 19/07/2017

2 Outline I 1 Introduction 2 Linear Regression Models 3 Linear Classification Models 4 Feed-forward Neural Networks 5 Training 6 Regularisation

3 Outline II 7 Tensorflow

4 Introduction Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

5 Introduction Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

6 Introduction Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

7 Introduction Training Data Training Data Definition (Training Data) Let the training data be denoted as $\{(\mathbf{x}^{(n)}, y^{(n)}) \in \mathbb{R}^D \times \mathbb{R} \mid n = 1, \ldots, N\}$ in the case where the response variable is a scalar, and $\{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)}) \in \mathbb{R}^D \times \mathbb{R}^K \mid n = 1, \ldots, N\}$ when it is a multidimensional vector, where $N$ denotes the total number of training examples.

8 Introduction Loss Functions Loss Functions Definition (Loss Function) A loss function $L(\mathbf{X}, \mathbf{y}, \mathbf{w})$ is a single, overall measure of the loss incurred in taking any of the available decisions or actions. In particular, in this context, we define a loss function to be a mapping that quantifies how unhappy we would be if we used $\mathbf{w}$ to make a prediction on $\mathbf{X}$ when the correct output is $\mathbf{y}$.

9 Introduction Generalisation And Overfitting Generalisation And Overfitting I Definition (Overfitting) A model is said to overfit the data if it does not generalise to out-of-sample cases even though it fits the training data very well. More specifically, a model which explains the random error or noise in the data instead of the underlying relationship is said to be overfitting.

10 Introduction Generalisation And Overfitting Generalisation And Overfitting II Ideally, we would like to choose a model which performs best (i.e. minimises the loss) on new, unseen data. That is, we would like a model that generalises well beyond the data used during training. However, by the very nature of the problem, unseen data is not available.

11 Introduction Generalisation And Overfitting Generalisation And Overfitting III One solution to this is to train and validate the model on two different datasets. Definition (K-Fold Cross-Validation) In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds are then averaged to produce a single measure of performance.

12 Linear Regression Models Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

13 Linear Regression Models Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

14 Linear Regression Models Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

15 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: An Overview I Given a training set $\{(\mathbf{x}^{(n)}, y^{(n)}) \in \mathbb{R}^D \times \mathbb{R} \mid n = 1, \ldots, N\}$, the goal of regression is to predict the value of $y^{(n)}$ given $\mathbf{x}^{(n)}$. In the simplest approach, this can be done by directly constructing a function $h(\mathbf{x}; \mathbf{w}) \approx y$, where $\mathbf{w}$ is the vector of parameters specifying $h$. The simplest regression models are those which constrain the relationship between the model parameters and $y$ to be linear.

16 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: An Overview II That is, we let $h(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D = w_0 + \sum_{i=1}^{D} w_i x_i$. We can get rid of the special treatment reserved for $w_0$ by augmenting the input vector: $\{(\mathbf{x}^{(n)}, y^{(n)}) \in \mathbb{R}^{D+1} \times \mathbb{R} \mid n = 1, \ldots, N, \; x_0 = 1\}$, so that we can write $h(\mathbf{x}; \mathbf{w}) = \sum_{i=0}^{D} w_i x_i = \mathbf{w}^T \mathbf{x}$. This is commonly known as a linear regression model.

17 Linear Regression Models Linear Regression Models: Training Linear Regression Models: Training I Question: Given some training data and a linear regression model parametrised by $\mathbf{w}$, how can we set $\mathbf{w}$ so that we achieve the lowest error?

18 Linear Regression Models Linear Regression Models: Training Linear Regression Models: Training II Ordinary Least Squares (OLS) solution: Let the error on the training set be captured by the sum-of-squared-errors loss function, i.e. let the loss function be given by $L(\mathbf{X}, \mathbf{y}, \mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (y^{(n)} - h(\mathbf{x}^{(n)}; \mathbf{w}))^2 = \frac{1}{2} \sum_{n=1}^{N} (y^{(n)} - \mathbf{w}^T \mathbf{x}^{(n)})^2$. Then, our goal is to find $\hat{\mathbf{w}}$ s.t. $\hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathbb{R}^{D+1}} L(\mathbf{X}, \mathbf{y}, \mathbf{w}) = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, where the last equality is due to the fact that this specific loss function is convex and smooth, and therefore there exists an analytic solution to the above optimisation problem.
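As a small illustration, a minimal NumPy sketch of the closed-form OLS solution above; the synthetic data and the names (X_aug, true_w) are assumptions made for the example, not part of the slides.
```python
import numpy as np

# Hypothetical training data: N examples, D features (values chosen purely for illustration).
N, D = 100, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
true_w = np.array([1.0, 2.0, -0.5, 0.3])          # w_0 plus D slopes
y = true_w[0] + X @ true_w[1:] + rng.normal(scale=0.1, size=N)

# Augment the inputs with x_0 = 1 so the bias w_0 needs no special treatment.
X_aug = np.hstack([np.ones((N, 1)), X])

# Closed-form OLS solution w_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over an explicit matrix inverse for numerical stability).
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(w_hat)  # should be close to true_w
```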

19 Linear Regression Models Linear Basis Function Models Linear Basis Function Models I When we defined a linear regression model we stated that the relationship between the adjustable parameters $\mathbf{w}$ and $y$ must be linear; however, this does not apply to the relationship between $\mathbf{x}$ and $y$. We can extend this class of models by considering linear combinations of fixed non-linear functions of the input variables.

20 Linear Regression Models Linear Basis Function Models Linear Basis Function Models II Definition (Linear Basis Function Model) $h(\mathbf{x}; \mathbf{w}) = \sum_{i=0}^{M} w_i \phi_i(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$, where $\boldsymbol{\phi} : \mathbb{R}^{D+1} \to \mathbb{R}^{M+1}$ is a vector function such that $\boldsymbol{\phi}(\mathbf{x})_i = \phi_i(\mathbf{x})$, with $\phi_0 \equiv 1$. The $\phi_i(\mathbf{x})$ are known as basis functions.

21 Linear Regression Models Linear Basis Function Models Linear Basis Function Models III Let us consider an example: suppose that we want to create a model which allows us to approximate the relationship between $x$ and $y$ given in the plot below. This can be done easily by specifying a linear basis function model as follows: $h(x; \mathbf{w}) = w_0 + w_1 x + w_2 \sin(x)$, that is, we let $\phi_1(x) = x$ and $\phi_2(x) = \sin(x)$.
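A sketch of fitting this particular basis function model with the same OLS machinery, just applied in feature space; the synthetic data and coefficients are assumptions for illustration.
```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 0.5 * x + 2.0 * np.sin(x) + rng.normal(scale=0.2, size=x.shape)  # hypothetical target

# Design matrix in feature space: phi_0(x) = 1, phi_1(x) = x, phi_2(x) = sin(x)
Phi = np.column_stack([np.ones_like(x), x, np.sin(x)])

# OLS in feature space (lstsq solves the least-squares problem robustly).
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_pred = Phi @ w_hat
```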

22 Linear Regression Models Linear Basis Function Models Linear Basis Function Models IV Figure: Prediction of the linear basis function model specified as $h(x; \mathbf{w}) = w_0 + w_1 x + w_2 \sin(x)$, as given by its Ordinary Least Squares (OLS) solution.

23 Linear Regression Models Linear Basis Function Models Linear Basis Function Models V Figure: Side view in $\{(x, y, z) \in \mathbb{R}^3 \mid y = \sin(x)\}$ space. Figure: Data and hyperplane view in $\{(x, y, z) \in \mathbb{R}^3 \mid y = \sin(x)\}$ space.

24 Linear Classification Models Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

25 Linear Classification Models Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

26 Linear Classification Models Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

27 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview I Given a $D$-dimensional input vector $\mathbf{x}$, assign it to one, and only one, of $K$ discrete classes $C_k$, $k = 1, \ldots, K$. Equivalently, partition the input space into $K$ decision regions whose boundaries are called decision boundaries. Definition (Decision Boundary) The boundary between two classes where it is equiprobable to belong to either class is called the decision boundary.

28 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview II Here we focus on linear models for classification, by which we mean that the decision surfaces are linear functions of the input vector $\mathbf{x}$ and hence are defined by $(D-1)$-dimensional hyperplanes within the $D$-dimensional input space. Definition (Linear Separability) A set of labelled data $\{(\mathbf{x}^{(n)}, y^{(n)}) \in \mathbb{R}^D \times \{1, \ldots, K\} \mid n = 1, \ldots, N\}$ is said to be linearly separable if each $\mathbf{x}$ can be correctly classified by a set of linear decision boundaries (i.e. hyperplanes).

29 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview III For regression problems, the target variable $\mathbf{y}$ was simply the vector of real numbers whose values we wish to predict. In the case of classification there are various ways of using the target values to represent class labels, depending on the number of classes. For $K = 2$: $y^{(n)} \in \{0, 1\}$. For $K > 2$: $\mathbf{y}^{(n)} \in \mathbb{R}^K$ such that if $\mathbf{x}^{(n)}$ belongs to class $C_k$ then $y_i^{(n)} = 1$ for $i = k$ and $y_i^{(n)} = 0$ for $i \neq k$; e.g. for $K = 3$, if $\mathbf{x}^{(n)}$ belongs to $C_2$ then $\mathbf{y}^{(n)} = (0, 1, 0)^T$.

30 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview IV In the linear regression models considered before, the model $h(\mathbf{x}; \mathbf{w})$ was linear w.r.t. the parameters $\mathbf{w}$. In the simplest case, the model is also linear w.r.t. the input variables and therefore takes the form $\mathbf{w}^T \mathbf{x} + w_0$. For classification problems, however, we wish to predict discrete class labels, or more generally posterior probabilities that lie in $[0, 1]$. To achieve this, we consider a generalisation of this model in which we transform the linear function of $\mathbf{w}$ using a strictly increasing, non-linear function $g(\cdot)$, so that $h(\mathbf{x}; \mathbf{w}) = g(\mathbf{w}^T \mathbf{x} + w_0)$.

31 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview V In the machine learning literature $g(\cdot)$ is known as an activation function, whereas its inverse is called a link function in the statistics literature. The decision surfaces correspond to $h(\mathbf{x}; \mathbf{w}) = \text{constant} \implies \mathbf{w}^T \mathbf{x} + w_0 = \text{constant} \implies$ the decision surfaces are still linear functions of $\mathbf{x}$, even though $g(\cdot)$ is non-linear.

32 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview VI Definition (Generalised Linear Models) The class of models described by $h(\mathbf{x}; \mathbf{w}) = g(\mathbf{w}^T \mathbf{x} + w_0)$, where $g : \mathbb{R} \to \mathbb{R}$ is a strictly increasing, possibly non-linear, function s.t. $a_i < a_j \implies g(a_i) < g(a_j)$, are called generalised linear models.

33 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview VII Definition (Generalised Linear Basis Function Models) The class of models described by $h(\mathbf{x}; \mathbf{w}) = g(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0)$, where $g : \mathbb{R} \to \mathbb{R}$ is a strictly increasing, possibly non-linear, function s.t. $a_i < a_j \implies g(a_i) < g(a_j)$, and $\boldsymbol{\phi} : \mathbb{R}^D \to \mathbb{R}^M$ is a vector function such that $\boldsymbol{\phi}(\mathbf{x})_i = \phi_i(\mathbf{x})$, are called generalised linear basis function models.

34 Linear Classification Models Linear Discriminant Functions Linear Discriminant Functions I Definition (Discriminant Function) A discriminant function is a function that takes an input vector $\mathbf{x}$ and assigns it to one of $K$ classes, denoted $C_k$. Discriminant functions whose decision surfaces are hyperplanes are called linear discriminant functions.

35 Linear Classification Models Linear Discriminant Functions Linear Discriminant Functions II Definition (Linear Discriminant Function) A linear discriminant function for $K$ classes is a discriminant function of the form $h(\mathbf{x}; \mathbf{W}) = \mathbf{W}^T \mathbf{x}$, with $f(\mathbf{x}) = k$ if $h_k(\mathbf{x}) > h_j(\mathbf{x}) \; \forall j \neq k$, where $\mathbf{W}^T$ is a $K \times (D+1)$ matrix of parameters whose $k$-th row vector is $\mathbf{w}_k^T = (w_{k0}, w_{k1}, \ldots, w_{kD})$, and $\mathbf{x}$ is the augmented input vector $(1, x_1, x_2, \ldots, x_D)^T$.

36 Linear Classification Models Generalised Linear Discriminant Functions Generalised Linear Discriminant Functions Definition (Generalised Linear Discriminant Function) A generalised linear discriminant function for $K$ classes is a discriminant function of the form $h(\mathbf{x}; \mathbf{W}) = g(\mathbf{W}^T \mathbf{x})$, with $f(\mathbf{x}) = k$ if $h_k(\mathbf{x}) > h_j(\mathbf{x}) \; \forall j \neq k$, where $\mathbf{W}^T$ is a $K \times (D+1)$ matrix of parameters and $g : \mathbb{R}^K \to \mathbb{R}^K$ is the activation function s.t. $a_i < a_j \implies g(\mathbf{a})_i < g(\mathbf{a})_j$.

37 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions I Definition (Generalised Linear Basis Function Discriminant Function) A generalised linear basis function discriminant function for $K$ classes is a discriminant function of the form $h(\mathbf{x}; \mathbf{W}) = g(\mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}))$, with $f(\mathbf{x}) = k$ if $h_k(\mathbf{x}) > h_j(\mathbf{x}) \; \forall j \neq k$, where $\mathbf{W}^T$ is a $K \times (M+1)$ matrix of parameters, $g : \mathbb{R}^K \to \mathbb{R}^K$ is an activation function, and $\boldsymbol{\phi} : \mathbb{R}^{D+1} \to \mathbb{R}^{M+1}$ is a possibly non-linear function.

38 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions II Key observation: Data that is not linearly separable in input space may be linearly separable in feature space specified by φ.

39 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions III Let us consider an example: suppose that we want to create a model which allows us to classify the observations $\mathbf{x} = (x_1, x_2)$ plotted below into class $C_1$ or $C_2$. This can be done by specifying a generalised linear basis function discriminant function model as follows: $h(\mathbf{x}; \mathbf{w}) = \sigma(w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}))$, where $\phi_1(\mathbf{x}) = e^{-\|\mathbf{x} - (-1, -1)\|^2 / 2}$ and $\phi_2(\mathbf{x}) = e^{-\|\mathbf{x} - (1, 1)\|^2 / 2}$. That is, we perform logistic regression in the feature space.

40 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions IV To find the best set of parameters, we need to assess the likelihood of the model under the i.i.d. assumption: $P(\mathbf{y} \mid \mathbf{w}, \mathbf{X}) = \prod_{n=1}^{N} P(y^{(n)} = 1 \mid \mathbf{w}, \mathbf{x}^{(n)})^{y^{(n)}} \left(1 - P(y^{(n)} = 1 \mid \mathbf{w}, \mathbf{x}^{(n)})\right)^{1 - y^{(n)}}$
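A sketch of maximising this likelihood (equivalently, minimising the negative log-likelihood) by gradient descent in the Gaussian feature space of the previous slide; the two-class data, learning rate and iteration count are assumptions for illustration.
```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def features(X):
    # phi_1(x) = exp(-||x - (-1,-1)||^2 / 2), phi_2(x) = exp(-||x - (1,1)||^2 / 2)
    c1, c2 = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
    phi1 = np.exp(-np.sum((X - c1) ** 2, axis=1) / 2.0)
    phi2 = np.exp(-np.sum((X - c2) ** 2, axis=1) / 2.0)
    return np.column_stack([np.ones(len(X)), phi1, phi2])  # include the bias phi_0 = 1

# Hypothetical two-class data: class 1 around (-1,-1), class 0 around (1,1).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)), rng.normal(1.0, 0.5, size=(50, 2))])
y = np.concatenate([np.ones(50), np.zeros(50)])

Phi = features(X)
w = np.zeros(Phi.shape[1])
eta = 0.1
for _ in range(2000):
    p = sigmoid(Phi @ w)          # P(y = 1 | w, x)
    grad = Phi.T @ (p - y)        # gradient of the negative log-likelihood
    w -= eta * grad / len(y)
```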

41 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions V We can do this because logistic regression is also an example of a probabilistic generative model. Suppose that $P(\mathbf{x} \mid C_k) = \exp(A(\boldsymbol{\theta}_k) + B(\mathbf{x}, \phi) + \boldsymbol{\theta}_k^T \mathbf{x})$ with $P(C_1) = p = 1 - P(C_2)$. That is, we assume that the class-conditional densities are members of the exponential family of distributions, where the parameters $\boldsymbol{\theta}_k$ and $\phi$ control the shape of the distribution. An example would be two Gaussian distributions with different means but a common covariance matrix.

42 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions VI Then $P(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$ exactly, for a suitable choice of $\mathbf{w}$.

43 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions VII Figure: View in the feature space $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}))$. Figure: Data and hyperplane view in the augmented feature space $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), z)$.

44 Feed-forward Neural Networks Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

45 Feed-forward Neural Networks Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

46 Feed-forward Neural Networks Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

47 Feed-forward Neural Networks Feed-forward Neural Networks: An Overview Feed-forward Neural Networks: An Overview I In the previous two sections we considered models for regression and classification that comprised linear combinations of fixed basis functions. We saw how the use of basis functions allows us to get a non-linear response from a linear regression model, and to classify non-linearly separable data by performing classification in the feature space. However, the models we have considered so far have one huge limitation: the parameters characterising the basis functions need to be set a priori and cannot be fitted to the data.

48 Feed-forward Neural Networks Feed-forward Neural Networks: An Overview Feed-forward Neural Networks: An Overview II An alternative approach is to fix the number of basis functions in advance but allow them to be adaptive; that is, we use parametric forms for the basis functions in which the parameter values are adapted during training. One of the most successful models of this type in the context of pattern recognition is the feed-forward neural network, also known as the multilayer perceptron (to pay tribute to the first feed-forward neural network, the Perceptron, introduced by F. Rosenblatt in 1957).

49 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions I The linear models for regression and classification discussed previously are based on linear combinations of fixed non-linear basis functions $\phi_j(\mathbf{x})$ and take the form $h(\mathbf{x}; \mathbf{w}) = g\left(\sum_{j=0}^{M} w_j \phi_j(\mathbf{x})\right)$ (1), where $g(\cdot)$ is a non-linear, strictly increasing activation function in the case of classification and the identity in the case of regression. A feed-forward neural network extends this class of models by making the basis functions $\phi_j(\mathbf{x})$ depend on parameters and then allowing these parameters to be adjusted, along with the coefficients $\{w_j\}$, during training.

50 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions II This leads to the basic neural network model, which can be described as a series of functional transformations. Notation: the connection from unit $g_{(i-1)j}$ to unit $g_{ik}$ is associated with $w_{kj}^{(i)}$, where the superscript $(i)$ indicates that the corresponding parameters are in the $i$-th layer of the network. 1 Construct $D_1$ linear combinations of the augmented input vector $\mathbf{x} = (x_0, x_1, \ldots, x_D)$, known as activations: $a_j^{(1)} = \sum_{i=0}^{D} w_{ji}^{(1)} x_i$

51 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions III 2 Transform the activations using an activation function $g_{1j}(\cdot)$ to give $z_j^{(1)} = g_{1j}\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right) = g_{1j}(a_j^{(1)})$. The resulting quantities correspond to the outputs of the basis functions in (1). 3 Following (1), these values are again linearly combined to give the hidden unit activations $a_j^{(2)} = \sum_{i=0}^{D_1} w_{ji}^{(2)} z_i^{(1)}$

52 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions IV 4 Transform the hidden unit activations using an activation function $g_{2j}(\cdot)$ to give $z_j^{(2)} = g_{2j}\left(\sum_{i=0}^{D_1} w_{ji}^{(2)} z_i^{(1)}\right) = g_{2j}(a_j^{(2)})$. 5 Repeat steps 3-4 as many times as the desired number of hidden layers. 6 Create the output layer so that it has $K$ units, where $K$ is the dimension of the output vector: $h_k(\mathbf{x}; \mathbf{w}) = g_{Lk}\left(\sum_{i=0}^{D_{L-1}} w_{ki}^{(L)} z_i^{(L-1)}\right)$, where $L$ is the number of layers.

53 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions V There is a one-to-one correspondence between feed-forward network diagrams and function composition. Consider a three-layer feed-forward neural network whose topology is specified as follows. Figure: Network diagram with an input layer $(x_0 = 1, x_1, \ldots, x_D)$, a first hidden layer of units $g_{10}, \ldots, g_{1D_1}$, a second hidden layer of units $g_{20}, \ldots, g_{2D_2}$, and an output layer of units $g_{31}, \ldots, g_{3D_3}$ producing the outputs $h_1, \ldots, h_K$.

54 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions VI Then the network corresponds to the following function: $h_k(\mathbf{x}; \mathbf{w}) = g_{3k}\left(\sum_{i=0}^{D_2} w_{ki}^{(3)}\, g_{2i}\left(\sum_{j=0}^{D_1} w_{ij}^{(2)}\, g_{1j}\left(\sum_{l=0}^{D} w_{jl}^{(1)} x_l\right)\right)\right)$
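A minimal NumPy sketch of this function composition (the forward pass); the layer sizes, random weights and tanh activations are illustrative assumptions, not values from the slides.
```python
import numpy as np

def forward(x, weights, activations):
    """Forward-propagate an input x = (x_1, ..., x_D) through the network.

    weights[i] has shape (units in layer i+1, units in layer i + 1 bias unit);
    activations[i] is the activation function g applied element-wise in that layer."""
    z = x
    for W, g in zip(weights, activations):
        z = np.concatenate([[1.0], z])   # prepend the bias unit z_0 = 1
        z = g(W @ z)                     # a^(i) = W^(i) z^(i-1), z^(i) = g(a^(i))
    return z

# Hypothetical 2-4-3-1 architecture with tanh hidden units and a linear output.
rng = np.random.default_rng(3)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 5)), rng.normal(size=(1, 4))]
activations = [np.tanh, np.tanh, lambda a: a]
print(forward(np.array([0.5, -0.2]), weights, activations))
```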

55 Feed-forward Neural Networks Characterisation of a Feed-forward Neural Network Characterisation of a Feed-forward Neural Network I A Neural Network is characterised by Activity rule: local rules defining how the activities of the neurons respond to each other, i.e. activation functions Learning rule: the way in which the parameters change with time (e.g. as more data arrives), i.e. objective function Architecture: the variables involved in the network and their topology.

56 Feed-forward Neural Networks Choice of Activation Functions Choice of Activation Functions I Desiderata for $g(\cdot)$: Non-linear. $g \in C^1$, i.e. $g'$ exists, and $g$ and $g'$ are continuous. Monotonic increasing. Computational simplicity: $g$ and $g'$ should be easy to evaluate. Does not saturate: a function $g(a)$ saturates in one or both tails if $g'(a) \to 0$ as $a \to \pm\infty$. If $g$ is used in the output units then $g(a) \in [0, 1]$ if we want to interpret $h_k(\mathbf{x}; \mathbf{w}) = P(C_k \mid \mathbf{x})$.

57 Feed-forward Neural Networks Choice of Activation Functions Choice of Activation Functions II Commonly used activation functions: Linear: $g(a) = a$. Logistic/Sigmoid: $g(a) = \sigma(a) = \frac{1}{1 + e^{-a}}$. Hyperbolic tangent: $g(a) = \tanh(a) = 2\sigma(2a) - 1$. Threshold: $g(a) = 1$ for $a > 0$, $-1$ for $a \leq 0$. Heaviside: $g(a) = 1$ for $a > 0$, $0$ for $a \leq 0$. Rectified Linear Unit (ReLU): $g(a) = \max\{0, a\}$. Absolute value rectification: $g(a) = |a|$. Leaky / Parametric ReLU: $g(a) = \max\{0, a\} + \alpha \min\{0, a\}$. If we are performing regression, we simply set the activation functions of the output units to $g(a) = a$.
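For concreteness, the listed activations written as NumPy one-liners (a sketch; the value of alpha for the leaky/parametric ReLU is an assumed hyperparameter).
```python
import numpy as np

linear     = lambda a: a
sigmoid    = lambda a: 1.0 / (1.0 + np.exp(-a))
tanh       = lambda a: np.tanh(a)                       # equals 2*sigmoid(2a) - 1
threshold  = lambda a: np.where(a > 0, 1.0, -1.0)
heaviside  = lambda a: np.where(a > 0, 1.0, 0.0)
relu       = lambda a: np.maximum(0.0, a)
abs_rect   = lambda a: np.abs(a)
leaky_relu = lambda a, alpha=0.01: np.maximum(0.0, a) + alpha * np.minimum(0.0, a)
```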

58 Feed-forward Neural Networks Choice of Objective Functions Choice of Objective Functions I Sum of squared errors: $G(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} (h_k(\mathbf{x}^{(n)}; \mathbf{w}) - y_k^{(n)})^2$. Computationally simple to evaluate. Non-negative. Simplifies the proofs of some theorems. However, it saturates easily (see later).

59 Feed-forward Neural Networks Choice of Objective Functions Choice of Objective Functions II Negative log-likelihood: $G(\mathbf{w}) = -\log \prod_{n=1}^{N} \prod_{k=1}^{K} h_k(\mathbf{x}^{(n)}; \mathbf{w})^{y_k^{(n)}} = -\sum_{n=1}^{N} \sum_{k=1}^{K} y_k^{(n)} \log h_k(\mathbf{x}^{(n)}; \mathbf{w})$. Follows naturally from probabilistic discriminative models. The log helps prevent saturation (see later). Most commonly used.
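A sketch of the two objectives in NumPy, assuming H is an N x K array of network outputs and Y the matching N x K array of one-hot targets (both hypothetical names).
```python
import numpy as np

def sum_of_squared_errors(H, Y):
    # G(w) = 1/2 * sum_n sum_k (h_k(x^(n); w) - y_k^(n))^2
    return 0.5 * np.sum((H - Y) ** 2)

def negative_log_likelihood(H, Y, eps=1e-12):
    # G(w) = - sum_n sum_k y_k^(n) * log h_k(x^(n); w); eps guards against log(0)
    return -np.sum(Y * np.log(H + eps))
```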

60 Feed-forward Neural Networks Choice of Architecture Choice of Architecture I Definition (Uniform approximation on compact sets) A family $\{f_w\}$ of functions is said to achieve uniform approximation of $f : \mathbb{R}^D \to \mathbb{R}^K$ on compact sets (with respect to a norm on $\mathbb{R}^K$) if for every compact set $K \subset \mathbb{R}^D$ and every $\epsilon > 0$ we can find a function $f_w$ in the family such that $\|f_w(\mathbf{x}) - f(\mathbf{x})\| < \epsilon$ for all $\mathbf{x} \in K$.

61 Feed-forward Neural Networks Choice of Architecture Choice of Architecture II Theorem (Universal Approximation Theorem) Any continuous function f : R D R K can be approximated uniformly (with respect to the Euclidean norm) on compact sets by the family of feed-forward networks with two layers, with linear activation in the output layer and Heaviside units in the hidden layer.

62 Feed-forward Neural Networks Choice of Architecture Choice of Architecture III Theorem (Kolmogorov-Arnold Representation Theorem: An Extension) Any continuous function $f : [0, 1]^D \to \mathbb{R}^K$ can be represented exactly as $f_k(\mathbf{x}) = \sum_{j=1}^{2D+1} \Phi_{kj}\left(\sum_{i=1}^{D} \lambda_j\, g_j(x_i) + j\right)$, where $\lambda_j \in \mathbb{R}$, the $g_j : \mathbb{R} \to \mathbb{R}$ are continuous and monotonic increasing, and $\Phi_{kj} : \mathbb{R} \to \mathbb{R}$ depends on $f$.

63 Feed-forward Neural Networks Choice of Architecture Choice of Architecture IV Figure: Feed-forward neural network representation of the Kolmogorov-Arnold Representation Theorem.

64 Feed-forward Neural Networks Choice of Architecture Choice of Architecture V So, by our universal approximation theorem, two layers suffice; however, deeper networks might: Require fewer units overall. Have superior generalisation (lower generalisation error). Be easier to train. There are no hard rules. Much is based on experimentation, prior model beliefs and tradition.

65 Feed-forward Neural Networks Choice of Weight Initialization Choice of Weight Initialization I The initialization of w can determine whether the algorithm converges at all. Again, there are no hard rules. However, we might want to take into consideration the following: Avoid symmetric behaviour of different hidden units, e.g. set initial weights randomly. Initial activation of logistic units should be close to 0. Initial activation of ReLU should be small and positive.
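A sketch of one common way to meet these desiderata: small random weights scaled by each layer's fan-in (the $1/\sqrt{\text{fan-in}}$ scaling is an assumption of this example, not prescribed by the slides).
```python
import numpy as np

def init_weights(layer_sizes, seed=0):
    """Return one weight matrix per layer, shape (fan_out, fan_in + 1) to include the bias.

    Random initialisation breaks the symmetry between hidden units; the small scale keeps
    logistic/tanh units near the linear part of their range at the start of training."""
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=1.0 / np.sqrt(fan_in + 1), size=(fan_out, fan_in + 1))
            for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])]

weights = init_weights([2, 4, 3, 1])  # hypothetical 2-4-3-1 architecture
```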

66 Training Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

67 Training Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

68 Training Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

69 Training Gradient Descent Gradient Descent I Question: how do we find the best set of parameters, i.e., how do we minimise $G(\mathbf{w})$? We have to take into consideration that: We have non-linear activations, so there is no analytical (closed-form) solution. However, $G(\mathbf{w})$ is differentiable (almost everywhere). Answer: use a gradient-based algorithm, which will guarantee us a local solution. Notation: write $G(\mathbf{w}) = \sum_{n=1}^{N} G_n(\mathbf{w})$.

70 Training Gradient Descent Gradient Descent II Definition (Gradient Descent) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 Move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G$ decreases most rapidly, $-\nabla G(\mathbf{w})$: $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G(\mathbf{w})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}}$ for each $i, j, l$. 3 Repeat from step 2 until $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\| < \epsilon$.

71 Training Gradient Descent Gradient Descent III Definition (Stochastic Gradient Descent) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 For each $n = 1, 2, \ldots, N$: move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G_n$ decreases most rapidly, $-\nabla G_n(\mathbf{w})$: $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}}$ for each $i, j, l$. 3 Repeat from step 2 until $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\| < \epsilon$.

72 Training Gradient Descent Gradient Descent IV Definition (Stochastic Gradient Descent with Momentum) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 For each $n = 1, 2, \ldots, N$: move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G_n$ decreases most rapidly, $-\nabla G_n(\mathbf{w})$: for each $i, j, l$, $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}} + \alpha\left(w_{jl}^{(i)(t)} - w_{jl}^{(i)(t-1)}\right)$. 3 Repeat from step 2 until $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\| < \epsilon$. Benefits: Passes through flat regions more quickly.

73 Training Gradient Descent Gradient Descent V Averages out stochastic variation of stochastic gradient descent.
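A compact sketch of the stochastic update with momentum, written for a flat parameter vector w; grad_Gn, eta, alpha and the data iterable are assumed to be supplied by surrounding code.
```python
import numpy as np

def sgd_momentum(w, grad_Gn, data, eta=0.01, alpha=0.9, epochs=10):
    """grad_Gn(w, x, y) returns the gradient of the per-example objective G_n at w."""
    velocity = np.zeros_like(w)          # holds w^(t) - w^(t-1), the previous parameter change
    for _ in range(epochs):
        for x, y in data:
            # w^(t+1) = w^(t) - eta * grad G_n(w^(t)) + alpha * (w^(t) - w^(t-1))
            velocity = alpha * velocity - eta * grad_Gn(w, x, y)
            w = w + velocity
    return w
```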

74 Training Choice of Learning Rate Choice of Learning Rate When choosing the learning rate $\eta$, we should consider the following: $\eta$ should be neither too big nor too small. Figure: Path to the global minimum for different values of $\eta$. $\eta$ should aim for uniform learning: all weights reach their final equilibrium values at about the same time.

75 Training Backpropagation: An Overview Backpropagation: An Overview Backpropagation is an efficient way to calculate the partial derivatives of the objective function with respect to the parameters.

76 Training Backpropagation: The Algorithm Backpropagation: The Algorithm I We will focus on stochastic gradient descent, so our aim is to compute $\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}}$. Let $a_j^{(i)}$ denote the input to the $j$-th unit of layer $i$, and let $z_j^{(i)} = \begin{cases} g(a_j^{(i)}) & \text{for } i = 1, 2, \ldots \\ x_j^{(n)} & \text{for } i = 0 \end{cases}$ denote the corresponding output.

77 Training Backpropagation: The Algorithm Backpropagation: The Algorithm II 1 Apply an input vector $\mathbf{x}^{(n)}$ to the network and forward propagate through the network: for each unit or neuron, $a_j^{(i)} = \sum_{l=0}^{D_{i-1}} w_{jl}^{(i)} z_l^{(i-1)}, \quad z_j^{(i)} = \begin{cases} g(a_j^{(i)}) & \text{for } i = 1, 2, \ldots \\ x_j^{(n)} & \text{for } i = 0 \end{cases}$ 2 For each output unit, evaluate $\delta_k^{(i)} = \frac{\partial G_n(\mathbf{h})}{\partial h_k}\, \hat{g}'(a_k^{(i)})$, where $\hat{g}$ denotes the activation function for the output units.

78 Training Backpropagation: The Algorithm Backpropagation: The Algorithm III 3 For each hidden unit, compute the backpropagation formula $\delta_j^{(i)} = g'(a_j^{(i)}) \sum_{l=0}^{D_{i+1}} \delta_l^{(i+1)} w_{lj}^{(i+1)}$ 4 Evaluate each derivative $\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}} = \delta_j^{(i)} z_l^{(i-1)}$
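A minimal sketch of steps 1-4 for a single hidden layer, assuming sigmoid hidden units, linear output units and the sum-of-squares objective; shapes and names are illustrative.
```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single_example(x, y, W1, W2):
    """Return dG_n/dW1 and dG_n/dW2 for one training pair (x already augmented with x_0 = 1)."""
    # 1. Forward propagate.
    a1 = W1 @ x                          # hidden-layer activations
    z1 = np.concatenate([[1.0], sigmoid(a1)])
    a2 = W2 @ z1                         # output activations (linear output units)
    h = a2
    # 2. Output deltas: dG_n/dh_k * g_hat'(a_k) with G_n = 1/2 ||h - y||^2 and g_hat = identity.
    delta2 = h - y
    # 3. Backpropagate to the hidden layer (drop the bias column of W2).
    delta1 = sigmoid(a1) * (1.0 - sigmoid(a1)) * (W2[:, 1:].T @ delta2)
    # 4. Gradients: dG_n/dw_jl = delta_j * z_l of the previous layer.
    return np.outer(delta1, x), np.outer(delta2, z1)
```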

79 Training Backpropagation: Computational complexities Backpropagation: Computational complexities Result (Computational Complexities of Backpropagation) Let the total number of weights be $m$. Then, as $m \to \infty$: Each weight in $a_j^{(i)} = \sum_{l=0}^{D_{i-1}} w_{jl}^{(i)} z_l^{(i-1)}$, $z_j^{(i)} = g(a_j^{(i)})$ appears exactly once; this step is $O(m)$. Assuming that $K$, the number of output units, is fixed, $\delta_k^{(i)} = \frac{\partial G_n(\mathbf{h})}{\partial h_k}\, \hat{g}'(a_k^{(i)})$ has a fixed cost associated with it; this step is $O(1)$. Each weight in $\delta_j^{(i)} = g'(a_j^{(i)}) \sum_{l=0}^{D_{i+1}} \delta_l^{(i+1)} w_{lj}^{(i+1)}$ appears exactly once; this step is $O(m)$. $\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}} = \delta_j^{(i)} z_l^{(i-1)}$ is evaluated exactly once for each weight; this step is $O(m)$. Hence, the total computational cost of backpropagation is $O(m)$.

80 Regularisation Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

81 Regularisation Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

82 Regularisation Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

83 Regularisation Regularisation: An Overview Regularisation: An Overview I Multilayer neural networks can potentially have millions of parameters, meaning that model complexity is through the roof; as a consequence, they are extremely prone to overfitting. Figure: Overfitting versus underfitting.

84 Regularisation Regularisation: An Overview Regularisation: An Overview II We can encode our preference for sparser, and therefore simpler, models by adding a regularisation term to the loss function which bounds the parameter vector in some way. Definition (Regularisation) Regularisation is any modification we make to a learning algorithm that is intended to reduce its generalisation error but not its training error.

85 Regularisation Regularisation: An Overview Regularisation: An Overview III There are many different kinds of regularisation, the most commonly used ones are Weight decay Dataset augmentation Early stopping Dropout Weight sharing

86 Regularisation Weight Decay Weight Decay I Modify the cost function to explicitly penalise complicated models: $\tilde{G}(\mathbf{w}) = G(\mathbf{w}) + \lambda\, \Omega(\mathbf{w})$, where $\Omega(\mathbf{w})$ is a non-negative penalty function and $\lambda \in [0, \infty)$ is a regularisation coefficient.

87 Regularisation Weight Decay Weight Decay II Example ($L_2$ / Tikhonov Regularisation) Let $\Omega(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|_2^2 = \frac{1}{2} \sum_{i,j,l} (w_{jl}^{(i)})^2$, so that $\tilde{G}(\mathbf{w}) = G(\mathbf{w}) + \lambda\, \frac{1}{2} \|\mathbf{w}\|_2^2$.
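A sketch of adding the $L_2$ penalty to an existing objective and its gradient; loss_fn and grad_fn are assumed to be defined elsewhere for a flat parameter vector w.
```python
import numpy as np

def regularised_loss(w, loss_fn, lam):
    # G_tilde(w) = G(w) + lambda * 1/2 * ||w||_2^2
    return loss_fn(w) + lam * 0.5 * np.sum(w ** 2)

def regularised_grad(w, grad_fn, lam):
    # grad G_tilde(w) = grad G(w) + lambda * w  (the "weight decay" term)
    return grad_fn(w) + lam * w
```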

88 Regularisation Dataset Augmentation Dataset Augmentation Generate synthetic data and add it to the training set. This can be done up-front (store the augmented dataset on disk) or on-the-fly during training. Examples for image classification: translating, rotating, cropping, flipping, ... Figure: Dataset augmentation for image classification.

89 Regularisation Early Stopping Early Stopping I Look for the minimum generalisation error calculated on a separate, labelled validation set as training proceeds.

90 Regularisation Early Stopping Early Stopping II Definition (Gradient Descent with Early Stopping) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 Move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G$ decreases most rapidly, $-\nabla G(\mathbf{w})$: $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G(\mathbf{w})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}}$ for each $i, j, l$. 3 Repeat from step 2 until $\arg\min_s \{H(\mathbf{w}^{(s)}) \mid s = 0, \ldots, t\} \leq t - p$ (i.e. the best validation error was observed at least $p$ steps ago), where $H(\mathbf{w}^{(s)})$ is the prediction error.
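A sketch of the same idea in loop form: train while tracking validation error and stop once no improvement has been seen for p steps; train_step and validation_error are assumed callables.
```python
import copy

def train_with_early_stopping(w, train_step, validation_error, patience=10, max_steps=10_000):
    best_err, best_w, best_t = float("inf"), copy.deepcopy(w), 0
    for t in range(max_steps):
        w = train_step(w)                 # one gradient-descent update
        err = validation_error(w)         # H(w^(t)) on the held-out validation set
        if err < best_err:
            best_err, best_w, best_t = err, copy.deepcopy(w), t
        elif t - best_t >= patience:      # no improvement for `patience` steps
            break
    return best_w
```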

91 Regularisation Dropout Dropout Definition (Gradient Descent with Dropout) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 Independently, for each non-output unit $u$, with probability $1 - \rho_u$ set its activation $g$ to $0$. Call the objective function associated with this thinned network $G(\mathbf{w}, \boldsymbol{\mu})$, where $\boldsymbol{\mu}$ is a binary indicator vector of activations. 3 Move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G(\mathbf{w}, \boldsymbol{\mu})$ decreases most rapidly, $-\nabla G(\mathbf{w}, \boldsymbol{\mu})$: $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G(\mathbf{w}, \boldsymbol{\mu})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}}$ for each $i, j, l$. 4 Repeat from step 2 until $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\| < \epsilon$.
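A sketch of dropout applied to one hidden layer during training; the keep probability rho is an assumed hyperparameter, and the retained activations are rescaled ("inverted dropout", an implementation choice not stated on the slide) so no change is needed at test time.
```python
import numpy as np

def dropout(z, rho, rng, training=True):
    """Keep each unit's activation with probability rho; zero it out otherwise."""
    if not training:
        return z                           # use the full network at test time
    mu = rng.random(z.shape) < rho         # binary indicator vector of retained units
    return (z * mu) / rho                  # rescale so the expected activation is unchanged

rng = np.random.default_rng(4)
z_hidden = np.tanh(rng.normal(size=8))     # hypothetical hidden-layer activations
z_dropped = dropout(z_hidden, rho=0.5, rng=rng)
```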

92 Regularisation Weight Sharing Weight Sharing I Assume some subsets of weight parameters are restricted in their values. Soft-weight sharing: encourage similar values by expressing prior beliefs about weights through a mixture distribution. Hard-weight sharing: force identical values for certain subsets of weights.

93 Regularisation Weight Sharing Weight Sharing II Example (Gaussian Mixture Model) Assume that the weights are distributed according to a mixture of $V$ Gaussians, i.e. $P(\mathbf{w}; \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}) = \prod_{u=1}^{U} \sum_{v=1}^{V} \pi_v \frac{1}{\sqrt{2\pi\sigma_v^2}} \exp\left(-\frac{(w_u - \mu_v)^2}{2\sigma_v^2}\right)$, where $U$ is the total number of weights.

94 Regularisation Weight Sharing Weight Sharing III Example (Gaussian Mixture Model (Cont.)) Consider $P(\mathbf{w}; \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma})$ as the prior distribution over the parameters, so that the objective function $\tilde{G}(\mathbf{w})$ corresponds to the negative log posterior of the weights given the training data: $\tilde{G}(\mathbf{w}) = -\log P((\mathbf{x}^{(n)}, y^{(n)}),\, n = 1, 2, \ldots, N \mid \mathbf{w}) - \log P(\mathbf{w}; \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}) + \text{constant} = G(\mathbf{w}) + \Omega(\mathbf{w}) + \text{constant}$.

95 Regularisation Weight Sharing Weight Sharing IV Example (Gaussian Mixture Model (Cont.)) where $G(\mathbf{w})$ is the usual negative log-likelihood objective function and $\Omega(\mathbf{w})$ is the penalty function given by $\Omega(\mathbf{w}) = -\sum_{u=1}^{U} \log\left(\sum_{v=1}^{V} \pi_v \frac{1}{\sqrt{2\pi\sigma_v^2}} \exp\left(-\frac{(w_u - \mu_v)^2}{2\sigma_v^2}\right)\right)$. The constant term is the marginal likelihood, which does not depend on $\mathbf{w}$ and can without loss of generality be ignored.

96 Tensorflow Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

97 Tensorflow Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

98 Tensorflow Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

99 Tensorflow Playground Playground
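To connect the playground back to code, a minimal TensorFlow/Keras sketch of the kind of two-hidden-layer classifier the playground visualises; the architecture, optimiser settings and synthetic data are assumptions for illustration (assuming TensorFlow 2.x with its bundled Keras API).
```python
import numpy as np
import tensorflow as tf

# Hypothetical two-class data: two Gaussian blobs in the plane.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1.0, 0.7, size=(200, 2)), rng.normal(1.0, 0.7, size=(200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="tanh", input_shape=(2,)),
    tf.keras.layers.Dense(8, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # outputs P(C_1 | x)
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=50, batch_size=32, verbose=0)
```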


Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen Lecture 13 Deep Belief Networks Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen}@us.ibm.com 12 December 2012

More information

Clustering web search results

Clustering web search results Clustering K-means Machine Learning CSE546 Emily Fox University of Washington November 4, 2013 1 Clustering images Set of Images [Goldberger et al.] 2 1 Clustering web search results 3 Some Data 4 2 K-means

More information

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty) Supervised Learning (contd) Linear Separation Mausam (based on slides by UW-AI faculty) Images as Vectors Binary handwritten characters Treat an image as a highdimensional vector (e.g., by reading pixel

More information

1 Training/Validation/Testing

1 Training/Validation/Testing CPSC 340 Final (Fall 2015) Name: Student Number: Please enter your information above, turn off cellphones, space yourselves out throughout the room, and wait until the official start of the exam to begin.

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5. More on Neural Networks Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.6 Recall the MLP Training Example From Last Lecture log likelihood

More information

Deep Learning for Computer Vision

Deep Learning for Computer Vision Deep Learning for Computer Vision Lecture 7: Universal Approximation Theorem, More Hidden Units, Multi-Class Classifiers, Softmax, and Regularization Peter Belhumeur Computer Science Columbia University

More information

1) Give decision trees to represent the following Boolean functions:

1) Give decision trees to represent the following Boolean functions: 1) Give decision trees to represent the following Boolean functions: 1) A B 2) A [B C] 3) A XOR B 4) [A B] [C Dl Answer: 1) A B 2) A [B C] 1 3) A XOR B = (A B) ( A B) 4) [A B] [C D] 2 2) Consider the following

More information

A General Greedy Approximation Algorithm with Applications

A General Greedy Approximation Algorithm with Applications A General Greedy Approximation Algorithm with Applications Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, NY 10598 tzhang@watson.ibm.com Abstract Greedy approximation algorithms have been

More information

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples. Supervised Learning with Neural Networks We now look at how an agent might learn to solve a general problem by seeing examples. Aims: to present an outline of supervised learning as part of AI; to introduce

More information

Content-based image and video analysis. Machine learning

Content-based image and video analysis. Machine learning Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all

More information

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 18: Deep learning and Vision: Convolutional neural networks Teacher: Gianni A. Di Caro DEEP, SHALLOW, CONNECTED, SPARSE? Fully connected multi-layer feed-forward perceptrons: More powerful

More information

Neural Networks (Overview) Prof. Richard Zanibbi

Neural Networks (Overview) Prof. Richard Zanibbi Neural Networks (Overview) Prof. Richard Zanibbi Inspired by Biology Introduction But as used in pattern recognition research, have little relation with real neural systems (studied in neurology and neuroscience)

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

Machine Learning. Chao Lan

Machine Learning. Chao Lan Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Neural Networks (pp )

Neural Networks (pp ) Notation: Means pencil-and-paper QUIZ Means coding QUIZ Neural Networks (pp. 106-121) The first artificial neural network (ANN) was the (single-layer) perceptron, a simplified model of a biological neuron.

More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Table of Contents. Recognition of Facial Gestures... 1 Attila Fazekas

Table of Contents. Recognition of Facial Gestures... 1 Attila Fazekas Table of Contents Recognition of Facial Gestures...................................... 1 Attila Fazekas II Recognition of Facial Gestures Attila Fazekas University of Debrecen, Institute of Informatics

More information

CPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017

CPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017 CPSC 340: Machine Learning and Data Mining More Linear Classifiers Fall 2017 Admin Assignment 3: Due Friday of next week. Midterm: Can view your exam during instructor office hours next week, or after

More information

FMA901F: Machine Learning Lecture 6: Graphical Models. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 6: Graphical Models. Cristian Sminchisescu FMA901F: Machine Learning Lecture 6: Graphical Models Cristian Sminchisescu Graphical Models Provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate

More information

Notes on Multilayer, Feedforward Neural Networks

Notes on Multilayer, Feedforward Neural Networks Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book

More information

Pattern Classification Algorithms for Face Recognition

Pattern Classification Algorithms for Face Recognition Chapter 7 Pattern Classification Algorithms for Face Recognition 7.1 Introduction The best pattern recognizers in most instances are human beings. Yet we do not completely understand how the brain recognize

More information

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP

More information

Character Recognition Using Convolutional Neural Networks

Character Recognition Using Convolutional Neural Networks Character Recognition Using Convolutional Neural Networks David Bouchain Seminar Statistical Learning Theory University of Ulm, Germany Institute for Neural Information Processing Winter 2006/2007 Abstract

More information

Multi-Layered Perceptrons (MLPs)

Multi-Layered Perceptrons (MLPs) Multi-Layered Perceptrons (MLPs) The XOR problem is solvable if we add an extra node to a Perceptron A set of weights can be found for the above 5 connections which will enable the XOR of the inputs to

More information

low bias high variance high bias low variance error test set training set high low Model Complexity Typical Behaviour Lecture 11:

low bias high variance high bias low variance error test set training set high low Model Complexity Typical Behaviour Lecture 11: Lecture 11: Overfitting and Capacity Control high bias low variance Typical Behaviour low bias high variance Sam Roweis error test set training set November 23, 4 low Model Complexity high Generalization,

More information

Artificial Neural Networks MLP, RBF & GMDH

Artificial Neural Networks MLP, RBF & GMDH Artificial Neural Networks MLP, RBF & GMDH Jan Drchal drchajan@fel.cvut.cz Computational Intelligence Group Department of Computer Science and Engineering Faculty of Electrical Engineering Czech Technical

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014

More information

Generative and discriminative classification

Generative and discriminative classification Generative and discriminative classification Machine Learning and Object Recognition 2017-2018 Jakob Verbeek Classification in its simplest form Given training data labeled for two or more classes Classification

More information