Neural Networks. Theory And Practice. Marco Del Vecchio 19/07/2017. Warwick Manufacturing Group University of Warwick

1 Neural Networks Theory And Practice Marco Del Vecchio Warwick Manufacturing Group University of Warwick 19/07/2017

2 Outline I 1 Introduction 2 Linear Regression Models 3 Linear Classification Models 4 Feed-forward Neural Networks 5 Training 6 Regularisation

3 Outline II 7 Tensorflow

4 Introduction Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

5 Introduction Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

6 Introduction Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

7 Introduction Training Data Training Data Definition (Training Data) Let the training data be denoted as $\{(\mathbf{x}^{(n)}, y^{(n)}) \in \mathbb{R}^D \times \mathbb{R} \mid n = 1, \ldots, N\}$ in the case where the response variable is a scalar, and $\{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)}) \in \mathbb{R}^D \times \mathbb{R}^K \mid n = 1, \ldots, N\}$ when it is a multidimensional vector, where $N$ denotes the total number of training examples.

8 Introduction Loss Functions Loss Functions Definition (Loss Function) A loss function $L(\mathbf{X}, \mathbf{y}, \mathbf{w})$ is a single, overall measure of the loss incurred in taking any of the available decisions or actions. In particular, in this context, we define a loss function to be a mapping that quantifies how unhappy we would be if we used $\mathbf{w}$ to make a prediction on $\mathbf{X}$ when the correct output is $\mathbf{y}$.

9 Introduction Generalisation And Overfitting Generalisation And Overfitting I Definition (Overfitting) A model is said to overfit the data if it does not generalise to out-of-sample cases even though it fits the training data very well. More specifically, a model which explains the random error or noise in the data instead of the underlying relationship is said to be overfitting.

10 Introduction Generalisation And Overfitting Generalisation And Overfitting II Ideally, we would like to choose a model which performs best (i.e. minimises the loss) on new, unseen data. That is, we would like a model that generalises well beyond the data used during training. However, by the very nature of the problem, unseen data is not available.

11 Introduction Generalisation And Overfitting Generalisation And Overfitting III One solution to this is to train and validate the model on two different datasets. Definition (K-Fold Cross-Validation) In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds are then averaged to produce a single measure of performance.

12 Linear Regression Models Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

13 Linear Regression Models Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

14 Linear Regression Models Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

15 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: An Overview I Given a training set $\{(\mathbf{x}^{(n)}, y^{(n)}) \in \mathbb{R}^D \times \mathbb{R} \mid n = 1, \ldots, N\}$, the goal of regression is to predict the value of $y^{(n)}$ given $\mathbf{x}^{(n)}$. In the simplest approach, this can be done by directly constructing a function $h(\mathbf{x}; \mathbf{w}) \approx y$, where $\mathbf{w}$ is the vector of parameters specifying $h$. The simplest regression models are those which constrain the relationship between the model parameters and $y$ to be linear.

16 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: An Overview II That is, we let $h(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D = w_0 + \sum_{i=1}^{D} w_i x_i$. We can get rid of the special treatment reserved for $w_0$ by augmenting the input vector: $\{(\mathbf{x}^{(n)}, y^{(n)}) \in \mathbb{R}^{D+1} \times \mathbb{R} \mid n = 1, \ldots, N, \; x_0 = 1\}$, so that we can write $h(\mathbf{x}; \mathbf{w}) = \sum_{i=0}^{D} w_i x_i = \mathbf{w}^T \mathbf{x}$. This is commonly known as a linear regression model.

17 Linear Regression Models Linear Regression Models: Training Linear Regression Models: Training I Question: Given some training data and a linear regression model parametrised by $\mathbf{w}$, how can we set $\mathbf{w}$ so that we achieve the lowest error?

18 Linear Regression Models Linear Regression Models: Training Linear Regression Models: Training II Ordinary Least Squares (OLS) solution: Let the error on the training set be captured by the sum-of-squared-errors loss function, i.e. let the loss function be given by $L(\mathbf{X}, \mathbf{y}, \mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (y^{(n)} - h(\mathbf{x}^{(n)}; \mathbf{w}))^2 = \frac{1}{2} \sum_{n=1}^{N} (y^{(n)} - \mathbf{w}^T \mathbf{x}^{(n)})^2$. Then, our goal is to find $\hat{\mathbf{w}}$ s.t. $\hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathbb{R}^{D+1}} L(\mathbf{X}, \mathbf{y}, \mathbf{w}) = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, where the last equality is due to the fact that this specific loss function is convex and smooth, and therefore there exists an analytic solution to the above optimisation problem.
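As a small illustration, a minimal NumPy sketch of the closed-form OLS solution above; the synthetic data and the names (X_aug, true_w) are assumptions made for the example, not part of the slides.
```python
import numpy as np

# Hypothetical training data: N examples, D features (values chosen purely for illustration).
N, D = 100, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
true_w = np.array([1.0, 2.0, -0.5, 0.3])          # w_0 plus D slopes
y = true_w[0] + X @ true_w[1:] + rng.normal(scale=0.1, size=N)

# Augment the inputs with x_0 = 1 so the bias w_0 needs no special treatment.
X_aug = np.hstack([np.ones((N, 1)), X])

# Closed-form OLS solution w_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over an explicit matrix inverse for numerical stability).
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(w_hat)  # should be close to true_w
```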

19 Linear Regression Models Linear Basis Function Models Linear Basis Function Models I When we defined a linear regression model we stated that the relationship between the adjustable parameters $\mathbf{w}$ and $y$ must be linear; however, this does not apply to the relationship between $\mathbf{x}$ and $y$. We can extend this class of models by considering linear combinations of fixed non-linear functions of the input variables.

20 Linear Regression Models Linear Basis Function Models Linear Basis Function Models II Definition (Linear Basis Function Model) $h(\mathbf{x}; \mathbf{w}) = \sum_{i=0}^{M} w_i \phi_i(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$, where $\boldsymbol{\phi} : \mathbb{R}^{D+1} \to \mathbb{R}^{M+1}$ is a vector function such that $\boldsymbol{\phi}(\mathbf{x})_i = \phi_i(\mathbf{x})$, with $\phi_0 \equiv 1$. The $\phi_i(\mathbf{x})$ are known as basis functions.

21 Linear Regression Models Linear Basis Function Models Linear Basis Function Models III Let us consider an example: suppose that we want to create a model which allows us to approximate the relationship between $x$ and $y$ given in the plot below. This can be done easily by specifying a linear basis function model as follows: $h(x; \mathbf{w}) = w_0 + w_1 x + w_2 \sin(x)$, that is, we let $\phi_1(x) = x$ and $\phi_2(x) = \sin(x)$.
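A sketch of fitting this particular basis function model with the same OLS machinery, just applied in feature space; the synthetic data and coefficients are assumptions for illustration.
```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 0.5 * x + 2.0 * np.sin(x) + rng.normal(scale=0.2, size=x.shape)  # hypothetical target

# Design matrix in feature space: phi_0(x) = 1, phi_1(x) = x, phi_2(x) = sin(x)
Phi = np.column_stack([np.ones_like(x), x, np.sin(x)])

# OLS in feature space (lstsq solves the least-squares problem robustly).
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_pred = Phi @ w_hat
```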

22 Linear Regression Models Linear Basis Function Models Linear Basis Function Models IV Figure: Prediction of the linear basis function model specified as $h(x; \mathbf{w}) = w_0 + w_1 x + w_2 \sin(x)$, as given by its Ordinary Least Squares (OLS) solution.

23 Linear Regression Models Linear Basis Function Models Linear Basis Function Models V Figure: Side view in $\{(x, y, z) \in \mathbb{R}^3 \mid y = \sin(x)\}$ space. Figure: Data and hyperplane view in $\{(x, y, z) \in \mathbb{R}^3 \mid y = \sin(x)\}$ space.

24 Linear Classification Models Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

25 Linear Classification Models Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

26 Linear Classification Models Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

27 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview I Given a $D$-dimensional input vector $\mathbf{x}$, assign it to one, and only one, of $K$ discrete classes $C_k$, $k = 1, \ldots, K$. Equivalently, partition the input space into $K$ decision regions whose boundaries are called decision boundaries. Definition (Decision Boundary) The boundary between two classes where it is equiprobable to belong to either class is called the decision boundary.

28 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview II Here we focus on linear models for classification, by which we mean that the decision surfaces are linear functions of the input vector $\mathbf{x}$ and hence are defined by $(D-1)$-dimensional hyperplanes within the $D$-dimensional input space. Definition (Linear Separability) A set of labelled data $\{(\mathbf{x}^{(n)}, y^{(n)}) \in \mathbb{R}^D \times \{1, \ldots, K\} \mid n = 1, \ldots, N\}$ is said to be linearly separable if each $\mathbf{x}$ can be correctly classified by a set of linear decision boundaries (i.e. hyperplanes).

29 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview III For regression problems, the target variable $\mathbf{y}$ was simply the vector of real numbers whose values we wish to predict. In the case of classification there are various ways of using the target values to represent class labels, depending on the number of classes. For $K = 2$: $y^{(n)} \in \{0, 1\}$. For $K > 2$: $\mathbf{y}^{(n)} \in \mathbb{R}^K$ such that if $\mathbf{x}^{(n)}$ belongs to class $C_k$ then $y_i^{(n)} = 1$ for $i = k$ and $y_i^{(n)} = 0$ for $i \neq k$; e.g. for $K = 3$, if $\mathbf{x}^{(n)}$ belongs to $C_2$ then $\mathbf{y}^{(n)} = (0, 1, 0)^T$.

30 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview IV In the linear regression models considered before, the model $h(\mathbf{x}; \mathbf{w})$ was linear w.r.t. the parameters $\mathbf{w}$. In the simplest case, the model is also linear w.r.t. the input variables and therefore takes the form $\mathbf{w}^T \mathbf{x} + w_0$. For classification problems, however, we wish to predict discrete class labels, or more generally posterior probabilities that lie in $[0, 1]$. To achieve this, we consider a generalisation of this model in which we transform the linear function of $\mathbf{w}$ using a strictly increasing, non-linear function $g(\cdot)$, so that $h(\mathbf{x}; \mathbf{w}) = g(\mathbf{w}^T \mathbf{x} + w_0)$.

31 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview V In the machine learning literature $g(\cdot)$ is known as an activation function, whereas its inverse is called a link function in the statistics literature. The decision surfaces correspond to $h(\mathbf{x}; \mathbf{w}) = \text{constant} \implies \mathbf{w}^T \mathbf{x} + w_0 = \text{constant} \implies$ the decision surfaces are still linear functions of $\mathbf{x}$, even though $g(\cdot)$ is non-linear.

32 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview VI Definition (Generalised Linear Models) The class of models described by $h(\mathbf{x}; \mathbf{w}) = g(\mathbf{w}^T \mathbf{x} + w_0)$, where $g : \mathbb{R} \to \mathbb{R}$ is a strictly increasing, possibly non-linear, function s.t. $a_i < a_j \implies g(a_i) < g(a_j)$, are called generalised linear models.

33 Linear Classification Models Linear Classification Models: An Overview Linear Classification Models: An Overview VII Definition (Generalised Linear Basis Function Models) The class of models described by $h(\mathbf{x}; \mathbf{w}) = g(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0)$, where $g : \mathbb{R} \to \mathbb{R}$ is a strictly increasing, possibly non-linear, function s.t. $a_i < a_j \implies g(a_i) < g(a_j)$, and $\boldsymbol{\phi} : \mathbb{R}^D \to \mathbb{R}^M$ is a vector function such that $\boldsymbol{\phi}(\mathbf{x})_i = \phi_i(\mathbf{x})$, are called generalised linear basis function models.

34 Linear Classification Models Linear Discriminant Functions Linear Discriminant Functions I Definition (Discriminant Function) A discriminant function is a function that takes an input vector $\mathbf{x}$ and assigns it to one of $K$ classes, denoted $C_k$. Discriminant functions whose decision surfaces are hyperplanes are called linear discriminant functions.

35 Linear Classification Models Linear Discriminant Functions Linear Discriminant Functions II Definition (Linear Discriminant Function) A linear discriminant function for $K$ classes is a discriminant function of the form $h(\mathbf{x}; \mathbf{W}) = \mathbf{W}^T \mathbf{x}$, with $f(\mathbf{x}) = k$ if $h_k(\mathbf{x}) > h_j(\mathbf{x}) \; \forall j \neq k$, where $\mathbf{W}^T$ is a $K \times (D+1)$ matrix of parameters whose $k$-th row vector is $\mathbf{w}_k^T = (w_{k0}, w_{k1}, \ldots, w_{kD})$, and $\mathbf{x}$ is the augmented input vector $(1, x_1, x_2, \ldots, x_D)^T$.

36 Linear Classification Models Generalised Linear Discriminant Functions Generalised Linear Discriminant Functions Definition (Generalised Linear Discriminant Function) A generalised linear discriminant function for $K$ classes is a discriminant function of the form $h(\mathbf{x}; \mathbf{W}) = g(\mathbf{W}^T \mathbf{x})$, with $f(\mathbf{x}) = k$ if $h_k(\mathbf{x}) > h_j(\mathbf{x}) \; \forall j \neq k$, where $\mathbf{W}^T$ is a $K \times (D+1)$ matrix of parameters and $g : \mathbb{R}^K \to \mathbb{R}^K$ is the activation function s.t. $a_i < a_j \implies g(\mathbf{a})_i < g(\mathbf{a})_j$.

37 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions I Definition (Generalised Linear Basis Function Discriminant Function) A generalised linear basis function discriminant function for $K$ classes is a discriminant function of the form $h(\mathbf{x}; \mathbf{W}) = g(\mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}))$, with $f(\mathbf{x}) = k$ if $h_k(\mathbf{x}) > h_j(\mathbf{x}) \; \forall j \neq k$, where $\mathbf{W}^T$ is a $K \times (M+1)$ matrix of parameters, $g : \mathbb{R}^K \to \mathbb{R}^K$ is an activation function, and $\boldsymbol{\phi} : \mathbb{R}^{D+1} \to \mathbb{R}^{M+1}$ is a possibly non-linear function.

38 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions II Key observation: Data that is not linearly separable in input space may be linearly separable in feature space specified by φ.

39 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions III Let us consider an example: suppose that we want to create a model which allows us to classify the observations $\mathbf{x} = (x_1, x_2)$ plotted below into class $C_1$ or $C_2$. This can be done by specifying a generalised linear basis function discriminant function model as follows: $h(\mathbf{x}; \mathbf{w}) = \sigma(w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}))$, where $\phi_1(\mathbf{x}) = e^{-\|\mathbf{x} - (-1, -1)\|^2 / 2}$ and $\phi_2(\mathbf{x}) = e^{-\|\mathbf{x} - (1, 1)\|^2 / 2}$. That is, we perform logistic regression in the feature space.

40 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions IV To find the best set of parameters, we need to assess the likelihood of the model under the i.i.d. assumption: $P(\mathbf{y} \mid \mathbf{w}, \mathbf{X}) = \prod_{n=1}^{N} P(y^{(n)} = 1 \mid \mathbf{w}, \mathbf{x}^{(n)})^{y^{(n)}} \left(1 - P(y^{(n)} = 1 \mid \mathbf{w}, \mathbf{x}^{(n)})\right)^{1 - y^{(n)}}$
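A sketch of maximising this likelihood (equivalently, minimising the negative log-likelihood) by gradient descent in the Gaussian feature space of the previous slide; the two-class data, learning rate and iteration count are assumptions for illustration.
```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def features(X):
    # phi_1(x) = exp(-||x - (-1,-1)||^2 / 2), phi_2(x) = exp(-||x - (1,1)||^2 / 2)
    c1, c2 = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
    phi1 = np.exp(-np.sum((X - c1) ** 2, axis=1) / 2.0)
    phi2 = np.exp(-np.sum((X - c2) ** 2, axis=1) / 2.0)
    return np.column_stack([np.ones(len(X)), phi1, phi2])  # include the bias phi_0 = 1

# Hypothetical two-class data: class 1 around (-1,-1), class 0 around (1,1).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)), rng.normal(1.0, 0.5, size=(50, 2))])
y = np.concatenate([np.ones(50), np.zeros(50)])

Phi = features(X)
w = np.zeros(Phi.shape[1])
eta = 0.1
for _ in range(2000):
    p = sigmoid(Phi @ w)          # P(y = 1 | w, x)
    grad = Phi.T @ (p - y)        # gradient of the negative log-likelihood
    w -= eta * grad / len(y)
```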

41 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions V We can do this because logistic regression is also an example of a probabilistic generative model. Suppose that $P(\mathbf{x} \mid C_k) = \exp(A(\boldsymbol{\theta}_k) + B(\mathbf{x}, \phi) + \boldsymbol{\theta}_k^T \mathbf{x})$ with $P(C_1) = p = 1 - P(C_2)$. That is, we assume that the class-conditional densities are members of the exponential family of distributions, where the parameters $\boldsymbol{\theta}_k$ and $\phi$ control the shape of the distribution. An example would be two Gaussian distributions with different means but a common covariance matrix.

42 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions VI Then $P(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x})$ exactly, for a suitable choice of $\mathbf{w}$.

43 Linear Classification Models Generalised Linear Basis Function Discriminant Functions Generalised Linear Basis Function Discriminant Functions VII Figure: View in the feature space $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}))$. Figure: Data and hyperplane view in the augmented feature space $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), z)$.

44 Feed-forward Neural Networks Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

45 Feed-forward Neural Networks Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

46 Feed-forward Neural Networks Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

47 Feed-forward Neural Networks Feed-forward Neural Networks: An Overview Feed-forward Neural Networks: An Overview I In the previous two sections we considered models for regression and classification that comprised linear combinations of fixed basis functions. We saw how the use of basis functions allows us to get a non-linear response from a linear regression model, and to classify non-linearly separable data by performing classification in the feature space. However, the models we have considered so far have one huge limitation: the parameters characterising the basis functions need to be set a priori and cannot be fitted to the data.

48 Feed-forward Neural Networks Feed-forward Neural Networks: An Overview Feed-forward Neural Networks: An Overview II An alternative approach is to fix the number of basis functions in advance but allow them to be adaptive; that is, we use parametric forms for the basis functions in which the parameter values are adapted during training. One of the most successful models of this type in the context of pattern recognition is the feed-forward neural network, also known as the multilayer perceptron (to pay tribute to the first feed-forward neural network, the Perceptron, introduced by F. Rosenblatt in 1957).

49 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions I The linear models for regression and classification discussed previously are based on linear combinations of fixed non-linear basis functions $\phi_j(\mathbf{x})$ and take the form $h(\mathbf{x}; \mathbf{w}) = g\left(\sum_{j=0}^{M} w_j \phi_j(\mathbf{x})\right)$ (1), where $g(\cdot)$ is a non-linear, strictly increasing activation function in the case of classification and the identity in the case of regression. A feed-forward neural network extends this class of models by making the basis functions $\phi_j(\mathbf{x})$ depend on parameters and then allowing these parameters to be adjusted, along with the coefficients $\{w_j\}$, during training.

50 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions II This leads to the basic neural network model, which can be described as a series of functional transformations. Notation: the connection from unit $g_{(i-1)j}$ to unit $g_{ik}$ is associated with $w_{kj}^{(i)}$, where the superscript $(i)$ indicates that the corresponding parameters are in the $i$-th layer of the network. 1 Construct $D_1$ linear combinations of the augmented input vector $\mathbf{x} = (x_0, x_1, \ldots, x_D)$, known as activations: $a_j^{(1)} = \sum_{i=0}^{D} w_{ji}^{(1)} x_i$

51 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions III 2 Transform the activations using an activation function $g_{1j}(\cdot)$ to give $z_j^{(1)} = g_{1j}\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right) = g_{1j}(a_j^{(1)})$. The resulting quantities correspond to the outputs of the basis functions in (1). 3 Following (1), these values are again linearly combined to give the hidden unit activations $a_j^{(2)} = \sum_{i=0}^{D_1} w_{ji}^{(2)} z_i^{(1)}$

52 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions IV 4 Transform the hidden unit activations using an activation function $g_{2j}(\cdot)$ to give $z_j^{(2)} = g_{2j}\left(\sum_{i=0}^{D_1} w_{ji}^{(2)} z_i^{(1)}\right) = g_{2j}(a_j^{(2)})$. 5 Repeat steps 3-4 as many times as the desired number of hidden layers. 6 Create the output layer so that it has $K$ units, where $K$ is the dimension of the output vector: $h_k(\mathbf{x}; \mathbf{w}) = g_{Lk}\left(\sum_{i=0}^{D_{L-1}} w_{ki}^{(L)} z_i^{(L-1)}\right)$, where $L$ is the number of layers.

53 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions V There is a one-to-one correspondence between feed-forward network diagrams and function composition. Consider a three-layer feed-forward neural network whose topology is specified as follows. Figure: Network diagram with an input layer $(x_0 = 1, x_1, \ldots, x_D)$, a first hidden layer of units $g_{10}, \ldots, g_{1D_1}$, a second hidden layer of units $g_{20}, \ldots, g_{2D_2}$, and an output layer of units $g_{31}, \ldots, g_{3D_3}$ producing the outputs $h_1, \ldots, h_K$.

54 Feed-forward Neural Networks Feed-forward Neural Network Functions Feed-Forward Neural Network Functions VI Then the network corresponds to the following function: $h_k(\mathbf{x}; \mathbf{w}) = g_{3k}\left(\sum_{i=0}^{D_2} w_{ki}^{(3)}\, g_{2i}\left(\sum_{j=0}^{D_1} w_{ij}^{(2)}\, g_{1j}\left(\sum_{l=0}^{D} w_{jl}^{(1)} x_l\right)\right)\right)$
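A minimal NumPy sketch of this function composition (the forward pass); the layer sizes, random weights and tanh activations are illustrative assumptions, not values from the slides.
```python
import numpy as np

def forward(x, weights, activations):
    """Forward-propagate an input x = (x_1, ..., x_D) through the network.

    weights[i] has shape (units in layer i+1, units in layer i + 1 bias unit);
    activations[i] is the activation function g applied element-wise in that layer."""
    z = x
    for W, g in zip(weights, activations):
        z = np.concatenate([[1.0], z])   # prepend the bias unit z_0 = 1
        z = g(W @ z)                     # a^(i) = W^(i) z^(i-1), z^(i) = g(a^(i))
    return z

# Hypothetical 2-4-3-1 architecture with tanh hidden units and a linear output.
rng = np.random.default_rng(3)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 5)), rng.normal(size=(1, 4))]
activations = [np.tanh, np.tanh, lambda a: a]
print(forward(np.array([0.5, -0.2]), weights, activations))
```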

55 Feed-forward Neural Networks Characterisation of a Feed-forward Neural Network Characterisation of a Feed-forward Neural Network I A Neural Network is characterised by Activity rule: local rules defining how the activities of the neurons respond to each other, i.e. activation functions Learning rule: the way in which the parameters change with time (e.g. as more data arrives), i.e. objective function Architecture: the variables involved in the network and their topology.

56 Feed-forward Neural Networks Choice of Activation Functions Choice of Activation Functions I Desiderata for $g(\cdot)$: Non-linear. $g \in C^1$, i.e. $g'$ exists, and $g$ and $g'$ are continuous. Monotonic increasing. Computational simplicity: $g$ and $g'$ should be easy to evaluate. Does not saturate: a function $g(a)$ saturates in one or both tails if $g'(a) \to 0$ as $a \to \pm\infty$. If $g$ is used in the output units then $g(a) \in [0, 1]$ if we want to interpret $h_k(\mathbf{x}; \mathbf{w}) = P(C_k \mid \mathbf{x})$.

57 Feed-forward Neural Networks Choice of Activation Functions Choice of Activation Functions II Commonly used activation functions: Linear: $g(a) = a$. Logistic/Sigmoid: $g(a) = \sigma(a) = \frac{1}{1 + e^{-a}}$. Hyperbolic tangent: $g(a) = \tanh(a) = 2\sigma(2a) - 1$. Threshold: $g(a) = 1$ for $a > 0$, $-1$ for $a \leq 0$. Heaviside: $g(a) = 1$ for $a > 0$, $0$ for $a \leq 0$. Rectified Linear Unit (ReLU): $g(a) = \max\{0, a\}$. Absolute value rectification: $g(a) = |a|$. Leaky / Parametric ReLU: $g(a) = \max\{0, a\} + \alpha \min\{0, a\}$. If we are performing regression, we simply set the activation functions of the output units to $g(a) = a$.
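For concreteness, the listed activations written as NumPy one-liners (a sketch; the value of alpha for the leaky/parametric ReLU is an assumed hyperparameter).
```python
import numpy as np

linear     = lambda a: a
sigmoid    = lambda a: 1.0 / (1.0 + np.exp(-a))
tanh       = lambda a: np.tanh(a)                       # equals 2*sigmoid(2a) - 1
threshold  = lambda a: np.where(a > 0, 1.0, -1.0)
heaviside  = lambda a: np.where(a > 0, 1.0, 0.0)
relu       = lambda a: np.maximum(0.0, a)
abs_rect   = lambda a: np.abs(a)
leaky_relu = lambda a, alpha=0.01: np.maximum(0.0, a) + alpha * np.minimum(0.0, a)
```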

58 Feed-forward Neural Networks Choice of Objective Functions Choice of Objective Functions I Sum of squared errors: $G(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} (h_k(\mathbf{x}^{(n)}; \mathbf{w}) - y_k^{(n)})^2$. Computationally simple to evaluate. Non-negative. Simplifies the proofs of some theorems. However, it saturates easily (see later).

59 Feed-forward Neural Networks Choice of Objective Functions Choice of Objective Functions II Negative log-likelihood: $G(\mathbf{w}) = -\log \prod_{n=1}^{N} \prod_{k=1}^{K} h_k(\mathbf{x}^{(n)}; \mathbf{w})^{y_k^{(n)}} = -\sum_{n=1}^{N} \sum_{k=1}^{K} y_k^{(n)} \log h_k(\mathbf{x}^{(n)}; \mathbf{w})$. Follows naturally from probabilistic discriminative models. The log helps prevent saturation (see later). Most commonly used.
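A sketch of the two objectives in NumPy, assuming H is an N x K array of network outputs and Y the matching N x K array of one-hot targets (both hypothetical names).
```python
import numpy as np

def sum_of_squared_errors(H, Y):
    # G(w) = 1/2 * sum_n sum_k (h_k(x^(n); w) - y_k^(n))^2
    return 0.5 * np.sum((H - Y) ** 2)

def negative_log_likelihood(H, Y, eps=1e-12):
    # G(w) = - sum_n sum_k y_k^(n) * log h_k(x^(n); w); eps guards against log(0)
    return -np.sum(Y * np.log(H + eps))
```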

60 Feed-forward Neural Networks Choice of Architecture Choice of Architecture I Definition (Uniform approximation on compact sets) A family $\{f_w\}$ of functions is said to achieve uniform approximation of $f : \mathbb{R}^D \to \mathbb{R}^K$ on compact sets (with respect to a norm on $\mathbb{R}^K$) if for every compact set $K \subset \mathbb{R}^D$ and every $\epsilon > 0$ we can find a function $f_w$ in the family such that $\|f_w(\mathbf{x}) - f(\mathbf{x})\| < \epsilon$ for all $\mathbf{x} \in K$.

61 Feed-forward Neural Networks Choice of Architecture Choice of Architecture II Theorem (Universal Approximation Theorem) Any continuous function f : R D R K can be approximated uniformly (with respect to the Euclidean norm) on compact sets by the family of feed-forward networks with two layers, with linear activation in the output layer and Heaviside units in the hidden layer.

62 Feed-forward Neural Networks Choice of Architecture Choice of Architecture III Theorem (Kolmogorov-Arnold Representation Theorem: An Extension) Any continuous function $f : [0, 1]^D \to \mathbb{R}^K$ can be represented exactly as $f_k(\mathbf{x}) = \sum_{j=1}^{2D+1} \Phi_{kj}\left(\sum_{i=1}^{D} \lambda_j\, g_j(x_i) + j\right)$, where $\lambda_j \in \mathbb{R}$, the $g_j : \mathbb{R} \to \mathbb{R}$ are continuous and monotonic increasing, and $\Phi_{kj} : \mathbb{R} \to \mathbb{R}$ depends on $f$.

63 Feed-forward Neural Networks Choice of Architecture Choice of Architecture IV Figure: Feed-forward neural network representation of the Kolmogorov-Arnold Representation Theorem.

64 Feed-forward Neural Networks Choice of Architecture Choice of Architecture V So, by our universal approximation theorem, two layers suffice; however, deeper networks might: Require fewer units overall. Have superior generalisation (lower generalisation error). Be easier to train. There are no hard rules. Much is based on experimentation, prior model beliefs and tradition.

65 Feed-forward Neural Networks Choice of Weight Initialization Choice of Weight Initialization I The initialization of w can determine whether the algorithm converges at all. Again, there are no hard rules. However, we might want to take into consideration the following: Avoid symmetric behaviour of different hidden units, e.g. set initial weights randomly. Initial activation of logistic units should be close to 0. Initial activation of ReLU should be small and positive.
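A sketch of one common way to meet these desiderata: small random weights scaled by each layer's fan-in (the $1/\sqrt{\text{fan-in}}$ scaling is an assumption of this example, not prescribed by the slides).
```python
import numpy as np

def init_weights(layer_sizes, seed=0):
    """Return one weight matrix per layer, shape (fan_out, fan_in + 1) to include the bias.

    Random initialisation breaks the symmetry between hidden units; the small scale keeps
    logistic/tanh units near the linear part of their range at the start of training."""
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=1.0 / np.sqrt(fan_in + 1), size=(fan_out, fan_in + 1))
            for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])]

weights = init_weights([2, 4, 3, 1])  # hypothetical 2-4-3-1 architecture
```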

66 Training Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

67 Training Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

68 Training Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

69 Training Gradient Descent Gradient Descent I Question: how do we find the best set of parameters, i.e., how do we minimise $G(\mathbf{w})$? We have to take into consideration that: We have non-linear activations, so there is no analytical (closed-form) solution. However, $G(\mathbf{w})$ is differentiable (almost everywhere). Answer: use a gradient-based algorithm, which will guarantee us a local solution. Notation: write $G(\mathbf{w}) = \sum_{n=1}^{N} G_n(\mathbf{w})$.

70 Training Gradient Descent Gradient Descent II Definition (Gradient Descent) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 Move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G$ decreases most rapidly, $-\nabla G(\mathbf{w})$: $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G(\mathbf{w})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}}$ for each $i, j, l$. 3 Repeat from step 2 until $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\| < \epsilon$.

71 Training Gradient Descent Gradient Descent III Definition (Stochastic Gradient Descent) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 For each $n = 1, 2, \ldots, N$: move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G_n$ decreases most rapidly, $-\nabla G_n(\mathbf{w})$: $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}}$ for each $i, j, l$. 3 Repeat from step 2 until $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\| < \epsilon$.

72 Training Gradient Descent Gradient Descent IV Definition (Stochastic Gradient Descent with Momentum) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 For each $n = 1, 2, \ldots, N$: move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G_n$ decreases most rapidly, $-\nabla G_n(\mathbf{w})$: for each $i, j, l$, $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}} + \alpha\left(w_{jl}^{(i)(t)} - w_{jl}^{(i)(t-1)}\right)$. 3 Repeat from step 2 until $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\| < \epsilon$. Benefits: Passes through flat regions more quickly.

73 Training Gradient Descent Gradient Descent V Averages out stochastic variation of stochastic gradient descent.
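A compact sketch of the stochastic update with momentum, written for a flat parameter vector w; grad_Gn, eta, alpha and the data iterable are assumed to be supplied by surrounding code.
```python
import numpy as np

def sgd_momentum(w, grad_Gn, data, eta=0.01, alpha=0.9, epochs=10):
    """grad_Gn(w, x, y) returns the gradient of the per-example objective G_n at w."""
    velocity = np.zeros_like(w)          # holds w^(t) - w^(t-1), the previous parameter change
    for _ in range(epochs):
        for x, y in data:
            # w^(t+1) = w^(t) - eta * grad G_n(w^(t)) + alpha * (w^(t) - w^(t-1))
            velocity = alpha * velocity - eta * grad_Gn(w, x, y)
            w = w + velocity
    return w
```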

74 Training Choice of Learning Rate Choice of Learning Rate When choosing the learning rate $\eta$, we should consider the following: $\eta$ should be neither too big nor too small. Figure: Path to the global minimum for different values of $\eta$. $\eta$ should aim for uniform learning: all weights reach their final equilibrium values at about the same time.

75 Training Backpropagation: An Overview Backpropagation: An Overview Backpropagation is an efficient way to calculate the partial derivatives of the objective function with respect to the parameters.

76 Training Backpropagation: The Algorithm Backpropagation: The Algorithm I We will focus on stochastic gradient descent, so our aim is to compute $\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}}$. Let $a_j^{(i)}$ denote the input to the $j$-th unit of layer $i$, and let $z_j^{(i)} = \begin{cases} g(a_j^{(i)}) & \text{for } i = 1, 2, \ldots \\ x_j^{(n)} & \text{for } i = 0 \end{cases}$ denote the corresponding output.

77 Training Backpropagation: The Algorithm Backpropagation: The Algorithm II 1 Apply an input vector $\mathbf{x}^{(n)}$ to the network and forward propagate through the network: for each unit or neuron, $a_j^{(i)} = \sum_{l=0}^{D_{i-1}} w_{jl}^{(i)} z_l^{(i-1)}, \quad z_j^{(i)} = \begin{cases} g(a_j^{(i)}) & \text{for } i = 1, 2, \ldots \\ x_j^{(n)} & \text{for } i = 0 \end{cases}$ 2 For each output unit, evaluate $\delta_k^{(i)} = \frac{\partial G_n(\mathbf{h})}{\partial h_k}\, \hat{g}'(a_k^{(i)})$, where $\hat{g}$ denotes the activation function for the output units.

78 Training Backpropagation: The Algorithm Backpropagation: The Algorithm III 3 For each hidden unit, compute the backpropagation formula $\delta_j^{(i)} = g'(a_j^{(i)}) \sum_{l=0}^{D_{i+1}} \delta_l^{(i+1)} w_{lj}^{(i+1)}$ 4 Evaluate each derivative $\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}} = \delta_j^{(i)} z_l^{(i-1)}$
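A minimal sketch of steps 1-4 for a single hidden layer, assuming sigmoid hidden units, linear output units and the sum-of-squares objective; shapes and names are illustrative.
```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single_example(x, y, W1, W2):
    """Return dG_n/dW1 and dG_n/dW2 for one training pair (x already augmented with x_0 = 1)."""
    # 1. Forward propagate.
    a1 = W1 @ x                          # hidden-layer activations
    z1 = np.concatenate([[1.0], sigmoid(a1)])
    a2 = W2 @ z1                         # output activations (linear output units)
    h = a2
    # 2. Output deltas: dG_n/dh_k * g_hat'(a_k) with G_n = 1/2 ||h - y||^2 and g_hat = identity.
    delta2 = h - y
    # 3. Backpropagate to the hidden layer (drop the bias column of W2).
    delta1 = sigmoid(a1) * (1.0 - sigmoid(a1)) * (W2[:, 1:].T @ delta2)
    # 4. Gradients: dG_n/dw_jl = delta_j * z_l of the previous layer.
    return np.outer(delta1, x), np.outer(delta2, z1)
```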

79 Training Backpropagation: Computational complexities Backpropagation: Computational complexities Result (Computational Complexities of Backpropagation) Let the total number of weights be $m$. Then, as $m \to \infty$: Each weight in $a_j^{(i)} = \sum_{l=0}^{D_{i-1}} w_{jl}^{(i)} z_l^{(i-1)}$, $z_j^{(i)} = g(a_j^{(i)})$ appears exactly once; this step is $O(m)$. Assuming that $K$, the number of output units, is fixed, $\delta_k^{(i)} = \frac{\partial G_n(\mathbf{h})}{\partial h_k}\, \hat{g}'(a_k^{(i)})$ has a fixed cost associated with it; this step is $O(1)$. Each weight in $\delta_j^{(i)} = g'(a_j^{(i)}) \sum_{l=0}^{D_{i+1}} \delta_l^{(i+1)} w_{lj}^{(i+1)}$ appears exactly once; this step is $O(m)$. $\frac{\partial G_n(\mathbf{w})}{\partial w_{jl}^{(i)}} = \delta_j^{(i)} z_l^{(i-1)}$ is evaluated exactly once for each weight; this step is $O(m)$. Hence, the total computational cost of backpropagation is $O(m)$.

80 Regularisation Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

81 Regularisation Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

82 Regularisation Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

83 Regularisation Regularisation: An Overview Regularisation: An Overview I Multilayer neural networks can potentially have millions of parameters, meaning that model complexity is through the roof; as a consequence, they are extremely prone to overfitting. Figure: Overfitting versus underfitting.

84 Regularisation Regularisation: An Overview Regularisation: An Overview II We can encode our preference for sparser, and therefore simpler, models by adding a regularisation term to the loss function which bounds the parameter vector in some way. Definition (Regularisation) Regularisation is any modification we make to a learning algorithm that is intended to reduce its generalisation error but not its training error.

85 Regularisation Regularisation: An Overview Regularisation: An Overview III There are many different kinds of regularisation, the most commonly used ones are Weight decay Dataset augmentation Early stopping Dropout Weight sharing

86 Regularisation Weight Decay Weight Decay I Modify the cost function to explicitly penalise complicated models: $\tilde{G}(\mathbf{w}) = G(\mathbf{w}) + \lambda\, \Omega(\mathbf{w})$, where $\Omega(\mathbf{w})$ is a non-negative penalty function and $\lambda \in [0, \infty)$ is a regularisation coefficient.

87 Regularisation Weight Decay Weight Decay II Example ($L_2$ / Tikhonov Regularisation) Let $\Omega(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|_2^2 = \frac{1}{2} \sum_{i,j,l} (w_{jl}^{(i)})^2$, so that $\tilde{G}(\mathbf{w}) = G(\mathbf{w}) + \lambda\, \frac{1}{2} \|\mathbf{w}\|_2^2$.
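A sketch of adding the $L_2$ penalty to an existing objective and its gradient; loss_fn and grad_fn are assumed to be defined elsewhere for a flat parameter vector w.
```python
import numpy as np

def regularised_loss(w, loss_fn, lam):
    # G_tilde(w) = G(w) + lambda * 1/2 * ||w||_2^2
    return loss_fn(w) + lam * 0.5 * np.sum(w ** 2)

def regularised_grad(w, grad_fn, lam):
    # grad G_tilde(w) = grad G(w) + lambda * w  (the "weight decay" term)
    return grad_fn(w) + lam * w
```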

88 Regularisation Dataset Augmentation Dataset Augmentation Generate synthetic data and add it to the training set. This can be done up-front (store the augmented dataset on disk) or on-the-fly during training. Examples for image classification: translating, rotating, cropping, flipping, ... Figure: Dataset augmentation for image classification.

89 Regularisation Early Stopping Early Stopping I Look for the minimum generalisation error calculated on a separate, labelled validation set as training proceeds.

90 Regularisation Early Stopping Early Stopping II Definition (Gradient Descent with Early Stopping) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 Move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G$ decreases most rapidly, $-\nabla G(\mathbf{w})$: $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G(\mathbf{w})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}}$ for each $i, j, l$. 3 Repeat from step 2 until $\arg\min_s \{H(\mathbf{w}^{(s)}) \mid s = 0, \ldots, t\} \leq t - p$ (i.e. the best validation error was observed at least $p$ steps ago), where $H(\mathbf{w}^{(s)})$ is the prediction error.
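A sketch of the same idea in loop form: train while tracking validation error and stop once no improvement has been seen for p steps; train_step and validation_error are assumed callables.
```python
import copy

def train_with_early_stopping(w, train_step, validation_error, patience=10, max_steps=10_000):
    best_err, best_w, best_t = float("inf"), copy.deepcopy(w), 0
    for t in range(max_steps):
        w = train_step(w)                 # one gradient-descent update
        err = validation_error(w)         # H(w^(t)) on the held-out validation set
        if err < best_err:
            best_err, best_w, best_t = err, copy.deepcopy(w), t
        elif t - best_t >= patience:      # no improvement for `patience` steps
            break
    return best_w
```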

91 Regularisation Dropout Dropout Definition (Gradient Descent with Dropout) 1 Start with an initial guess, $\mathbf{w}^{(0)}$, for $\mathbf{w}$. 2 Independently, for each non-output unit $u$, with probability $1 - \rho_u$ set its activation $g$ to $0$. Call the objective function associated with this thinned network $G(\mathbf{w}, \boldsymbol{\mu})$, where $\boldsymbol{\mu}$ is a binary indicator vector of activations. 3 Move a small distance $\eta$ in $\mathbf{w}$-space, in the direction in which $G(\mathbf{w}, \boldsymbol{\mu})$ decreases most rapidly, $-\nabla G(\mathbf{w}, \boldsymbol{\mu})$: $w_{jl}^{(i)(t+1)} = w_{jl}^{(i)(t)} - \eta \left.\frac{\partial G(\mathbf{w}, \boldsymbol{\mu})}{\partial w_{jl}^{(i)}}\right|_{\mathbf{w} = \mathbf{w}^{(t)}}$ for each $i, j, l$. 4 Repeat from step 2 until $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\| < \epsilon$.
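A sketch of dropout applied to one hidden layer during training; the keep probability rho is an assumed hyperparameter, and the retained activations are rescaled ("inverted dropout", an implementation choice not stated on the slide) so no change is needed at test time.
```python
import numpy as np

def dropout(z, rho, rng, training=True):
    """Keep each unit's activation with probability rho; zero it out otherwise."""
    if not training:
        return z                           # use the full network at test time
    mu = rng.random(z.shape) < rho         # binary indicator vector of retained units
    return (z * mu) / rho                  # rescale so the expected activation is unchanged

rng = np.random.default_rng(4)
z_hidden = np.tanh(rng.normal(size=8))     # hypothetical hidden-layer activations
z_dropped = dropout(z_hidden, rho=0.5, rng=rng)
```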

92 Regularisation Weight Sharing Weight Sharing I Assume some subsets of weight parameters are restricted in their values. Soft-weight sharing: encourage similar values by expressing prior beliefs about weights through a mixture distribution. Hard-weight sharing: force identical values for certain subsets of weights.

93 Regularisation Weight Sharing Weight Sharing II Example (Gaussian Mixture Model) Assume that the weights are distributed according to a mixture of $V$ Gaussians, i.e. $P(\mathbf{w}; \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}) = \prod_{u=1}^{U} \sum_{v=1}^{V} \pi_v \frac{1}{\sqrt{2\pi\sigma_v^2}} \exp\left(-\frac{(w_u - \mu_v)^2}{2\sigma_v^2}\right)$, where $U$ is the total number of weights.

94 Regularisation Weight Sharing Weight Sharing III Example (Gaussian Mixture Model (Cont.)) Consider $P(\mathbf{w}; \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma})$ as the prior distribution over the parameters, so that the objective function $\tilde{G}(\mathbf{w})$ corresponds to the negative log posterior of the weights given the training data: $\tilde{G}(\mathbf{w}) = -\log P((\mathbf{x}^{(n)}, y^{(n)}),\, n = 1, 2, \ldots, N \mid \mathbf{w}) - \log P(\mathbf{w}; \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}) + \text{constant} = G(\mathbf{w}) + \Omega(\mathbf{w}) + \text{constant}$.

95 Regularisation Weight Sharing Weight Sharing IV Example (Gaussian Mixture Model (Cont.)) where $G(\mathbf{w})$ is the usual negative log-likelihood objective function and $\Omega(\mathbf{w})$ is the penalty function given by $\Omega(\mathbf{w}) = -\sum_{u=1}^{U} \log\left(\sum_{v=1}^{V} \pi_v \frac{1}{\sqrt{2\pi\sigma_v^2}} \exp\left(-\frac{(w_u - \mu_v)^2}{2\sigma_v^2}\right)\right)$. The constant term is the marginal likelihood, which does not depend on $\mathbf{w}$ and can without loss of generality be ignored.

96 Tensorflow Outline I 1 Introduction Training Data Loss Functions Generalisation And Overfitting 2 Linear Regression Models Linear Regression Models: An Overview Linear Regression Models: Training Linear Basis Function Models 3 Linear Classification Models Linear Classification Models: An Overview Linear Discriminant Functions Generalised Linear Discriminant Functions Generalised Linear Basis Function Discriminant Functions 4 Feed-forward Neural Networks

97 Tensorflow Outline II Feed-forward Neural Networks: An Overview Feed-forward Neural Network Functions Characterisation of a Feed-forward Neural Network Choice of Activation Functions Choice of Objective Functions Choice of Architecture Choice of Weight Initialization 5 Training Gradient Descent Choice of Learning Rate Backpropagation: An Overview Backpropagation: The Algorithm Backpropagation: Computational complexities 6 Regularisation

98 Tensorflow Outline III Regularisation: An Overview Weight Decay Dataset Augmentation Early Stopping Dropout Weight Sharing 7 Tensorflow Playground

99 Tensorflow Playground Playground
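To connect the playground back to code, a minimal TensorFlow/Keras sketch of the kind of two-hidden-layer classifier the playground visualises; the architecture, optimiser settings and synthetic data are assumptions for illustration (assuming TensorFlow 2.x with its bundled Keras API).
```python
import numpy as np
import tensorflow as tf

# Hypothetical two-class data: two Gaussian blobs in the plane.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1.0, 0.7, size=(200, 2)), rng.normal(1.0, 0.7, size=(200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="tanh", input_shape=(2,)),
    tf.keras.layers.Dense(8, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # outputs P(C_1 | x)
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=50, batch_size=32, verbose=0)
```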


Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen Lecture 13 Deep Belief Networks Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen}@us.ibm.com 12 December 2012

More information

Clustering web search results

Clustering web search results Clustering K-means Machine Learning CSE546 Emily Fox University of Washington November 4, 2013 1 Clustering images Set of Images [Goldberger et al.] 2 1 Clustering web search results 3 Some Data 4 2 K-means

More information

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty) Supervised Learning (contd) Linear Separation Mausam (based on slides by UW-AI faculty) Images as Vectors Binary handwritten characters Treat an image as a highdimensional vector (e.g., by reading pixel

More information

1 Training/Validation/Testing

1 Training/Validation/Testing CPSC 340 Final (Fall 2015) Name: Student Number: Please enter your information above, turn off cellphones, space yourselves out throughout the room, and wait until the official start of the exam to begin.

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5. More on Neural Networks Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.6 Recall the MLP Training Example From Last Lecture log likelihood

More information

Deep Learning for Computer Vision

Deep Learning for Computer Vision Deep Learning for Computer Vision Lecture 7: Universal Approximation Theorem, More Hidden Units, Multi-Class Classifiers, Softmax, and Regularization Peter Belhumeur Computer Science Columbia University

More information

1) Give decision trees to represent the following Boolean functions:

1) Give decision trees to represent the following Boolean functions: 1) Give decision trees to represent the following Boolean functions: 1) A B 2) A [B C] 3) A XOR B 4) [A B] [C Dl Answer: 1) A B 2) A [B C] 1 3) A XOR B = (A B) ( A B) 4) [A B] [C D] 2 2) Consider the following

More information

A General Greedy Approximation Algorithm with Applications

A General Greedy Approximation Algorithm with Applications A General Greedy Approximation Algorithm with Applications Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, NY 10598 tzhang@watson.ibm.com Abstract Greedy approximation algorithms have been

More information

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples. Supervised Learning with Neural Networks We now look at how an agent might learn to solve a general problem by seeing examples. Aims: to present an outline of supervised learning as part of AI; to introduce

More information

Content-based image and video analysis. Machine learning

Content-based image and video analysis. Machine learning Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all

More information

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 18: Deep learning and Vision: Convolutional neural networks Teacher: Gianni A. Di Caro DEEP, SHALLOW, CONNECTED, SPARSE? Fully connected multi-layer feed-forward perceptrons: More powerful

More information

Neural Networks (Overview) Prof. Richard Zanibbi

Neural Networks (Overview) Prof. Richard Zanibbi Neural Networks (Overview) Prof. Richard Zanibbi Inspired by Biology Introduction But as used in pattern recognition research, have little relation with real neural systems (studied in neurology and neuroscience)

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

Machine Learning. Chao Lan

Machine Learning. Chao Lan Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Neural Networks (pp )

Neural Networks (pp ) Notation: Means pencil-and-paper QUIZ Means coding QUIZ Neural Networks (pp. 106-121) The first artificial neural network (ANN) was the (single-layer) perceptron, a simplified model of a biological neuron.

More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Table of Contents. Recognition of Facial Gestures... 1 Attila Fazekas

Table of Contents. Recognition of Facial Gestures... 1 Attila Fazekas Table of Contents Recognition of Facial Gestures...................................... 1 Attila Fazekas II Recognition of Facial Gestures Attila Fazekas University of Debrecen, Institute of Informatics

More information

CPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017

CPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017 CPSC 340: Machine Learning and Data Mining More Linear Classifiers Fall 2017 Admin Assignment 3: Due Friday of next week. Midterm: Can view your exam during instructor office hours next week, or after

More information

FMA901F: Machine Learning Lecture 6: Graphical Models. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 6: Graphical Models. Cristian Sminchisescu FMA901F: Machine Learning Lecture 6: Graphical Models Cristian Sminchisescu Graphical Models Provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate

More information

Notes on Multilayer, Feedforward Neural Networks

Notes on Multilayer, Feedforward Neural Networks Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book

More information

Pattern Classification Algorithms for Face Recognition

Pattern Classification Algorithms for Face Recognition Chapter 7 Pattern Classification Algorithms for Face Recognition 7.1 Introduction The best pattern recognizers in most instances are human beings. Yet we do not completely understand how the brain recognize

More information

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP

More information

Character Recognition Using Convolutional Neural Networks

Character Recognition Using Convolutional Neural Networks Character Recognition Using Convolutional Neural Networks David Bouchain Seminar Statistical Learning Theory University of Ulm, Germany Institute for Neural Information Processing Winter 2006/2007 Abstract

More information

Multi-Layered Perceptrons (MLPs)

Multi-Layered Perceptrons (MLPs) Multi-Layered Perceptrons (MLPs) The XOR problem is solvable if we add an extra node to a Perceptron A set of weights can be found for the above 5 connections which will enable the XOR of the inputs to

More information

low bias high variance high bias low variance error test set training set high low Model Complexity Typical Behaviour Lecture 11:

low bias high variance high bias low variance error test set training set high low Model Complexity Typical Behaviour Lecture 11: Lecture 11: Overfitting and Capacity Control high bias low variance Typical Behaviour low bias high variance Sam Roweis error test set training set November 23, 4 low Model Complexity high Generalization,

More information

Artificial Neural Networks MLP, RBF & GMDH

Artificial Neural Networks MLP, RBF & GMDH Artificial Neural Networks MLP, RBF & GMDH Jan Drchal drchajan@fel.cvut.cz Computational Intelligence Group Department of Computer Science and Engineering Faculty of Electrical Engineering Czech Technical

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014

More information

Generative and discriminative classification

Generative and discriminative classification Generative and discriminative classification Machine Learning and Object Recognition 2017-2018 Jakob Verbeek Classification in its simplest form Given training data labeled for two or more classes Classification

More information