Machine Learning
Chenhao Tan, University of Colorado Boulder
LECTURE 15
Slides adapted from Jordan Boyd-Graber
Logistics
- HW3 available on Github, due on October 19
- First prelim next Monday
- Final project team formation due by October 12
Outline
- Recap
Recap: Logistic Regression as a Single-layer Neural Network
[Figure: input layer x_1, x_2, ..., x_d feeding a single output layer o_1]
What is the activation used in logistic regression?
What is the objective function used in logistic regression?
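As a reminder of the answers, here is a minimal numpy sketch (not the course's reference implementation) of logistic regression viewed as a single layer: a linear transform, a sigmoid activation, and the cross-entropy objective.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    # Single layer: linear transform followed by the sigmoid activation
    return sigmoid(np.dot(w, x) + b)

def cross_entropy(y, y_hat):
    # Objective function of (binary) logistic regression
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```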
Recap: Nonlinearity Options
- Sigmoid: f(x) = 1 / (1 + exp(-x))
- tanh: f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
- ReLU (rectified linear unit): f(x) = max(0, x)
- softmax: softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
https://keras.io/activations/
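The same four options, written out as a small numpy sketch of the formulas above (the stability trick in softmax is an implementation detail, not part of the definition):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    # Subtracting the max before exponentiating avoids overflow;
    # the result is mathematically unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()
```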
Recap: Nonlinearity Options
[Figure: plots of the activation functions above]
Recap: Loss Function Options
- l2 loss: Σ_i (y_i - ŷ_i)^2
- l1 loss: Σ_i |y_i - ŷ_i|
- Cross entropy (logistic regression): -Σ_i y_i log ŷ_i
- Hinge loss (more on this during SVM): max(0, 1 - y ŷ)
https://keras.io/losses/
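A numpy sketch of the same options, assuming y and y_hat are numpy arrays (and, for the hinge loss, that y is a label in {-1, +1} and y_hat a raw score):

```python
import numpy as np

def l2_loss(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def l1_loss(y, y_hat):
    return np.sum(np.abs(y - y_hat))

def cross_entropy(y, y_hat):
    # y is a one-hot vector, y_hat a vector of predicted probabilities
    return -np.sum(y * np.log(y_hat))

def hinge_loss(y, y_hat):
    # y in {-1, +1}, y_hat a raw score (more on this during SVM)
    return np.maximum(0, 1 - y * y_hat)
```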
Recap: Forward Propagation Algorithm
How do we make predictions based on a multi-layer neural network?
Store the biases for layer l in b_l and the weight matrix in W_l.
[Figure: a four-layer network over inputs x_1, ..., x_d with parameters (W_1, b_1), ..., (W_4, b_4) and output o_1]
Recap: Forward Propagation Algorithm
Suppose your network has L layers. To make a prediction for an instance x:
1: Initialize a^0 = x
2: for l = 1 to L do
3:   z^l = W^l a^{l-1} + b^l
4:   a^l = g(z^l)  // g represents the nonlinear activation
5: end for
6: The prediction ŷ is simply a^L
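The algorithm translates almost line for line into numpy. This sketch assumes weights and biases are Python lists holding W_1, ..., W_L and b_1, ..., b_L, and uses tanh as a stand-in for whatever activation g each layer actually uses:

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh):
    """Forward propagation: weights[l] and biases[l] hold W_l and b_l."""
    a = x                                # a^0 = x
    for W, b in zip(weights, biases):    # l = 1 to L
        z = W @ a + b                    # z^l = W^l a^{l-1} + b^l
        a = g(z)                         # a^l = g(z^l)
    return a                             # the prediction y_hat is a^L
```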
Recap: Quiz 1
How many parameters are there in the following feed-forward neural network (assuming no biases)?
[Figure: input layer x_1, ..., x_100; hidden layer h_1, ..., h_50; hidden layer h_1, ..., h_20; output layer o_1, ..., o_5]
A. 100 * 50 + 50 * 20 + 20 * 5
B. 100 * 50 + 50 + 50 * 20 + 20 + 20 * 5 + 5
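To check an answer like this mechanically, a small sketch (layer sizes taken from the figure):

```python
sizes = [100, 50, 20, 5]
# Without biases: one weight per edge between consecutive layers
no_bias = sum(m * n for m, n in zip(sizes, sizes[1:]))
# With biases: add one bias per non-input unit
with_bias = no_bias + sum(sizes[1:])
print(no_bias, with_bias)  # 6100 6175
```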
Recap: Neural Networks in a Nutshell
- Training data: S_train = {(x, y)}
- Network architecture (model): ŷ = f_w(x)
- Loss function (objective function): L(y, ŷ)
- Learning (next week)
Recap: Quiz 2
Which of the following statements is true? (Suppose that the training data is large.)
A. In training, K-nearest neighbors takes less time than neural networks.
B. In training, K-nearest neighbors takes more time than neural networks.
C. In testing, K-nearest neighbors takes less time than neural networks.
D. In testing, K-nearest neighbors takes more time than neural networks.
Outline
- Recap
Structured Data: Spatial Information
[Image: https://www.reddit.com/r/aww/comments/6ip2la/before_and_after_she_was_told_she_was_a_good_girl/]
Convolutional Layers
Sharing parameters across patches: the same k*k filter w is applied to every patch of the input image or input feature map to produce the output feature maps:
a_{ij} = Σ_{s=1}^{k} Σ_{t=1}^{k} w_{st} x_{i+s-1, j+t-1}
Design choices:
- Number of filters
- Filter shape
- Stride size
https://github.com/davidstutz/latex-resources/blob/master/tikz-convolutional-layer/convolutional-layer.tex
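A minimal numpy sketch of a single-channel convolution with stride (no padding, no bias). The key point is parameter sharing: the same filter w is reused at every location.

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Slide one k*k filter w over a single-channel image x (no padding)."""
    k = w.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    a = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            a[i, j] = np.sum(w * patch)  # same filter w at every location
    return a
```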
Quiz 3
A concrete example (assuming no biases, convolution with 4 filters, ReLU, max pooling, and finally a fully-connected softmax layer):
[Figure: input image (10*10) → 4 filters 5*5, stride 1 → 4@6*6 → max pooling 2*2, stride 2 → 4@3*3 → fully-connected softmax → 5*1]
- How many parameters are there in this convolutional neural network?
- How many ReLU operations are performed on the forward pass?
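One way to sanity-check answers is to trace the shapes from the figure in code (a sketch; the layer sizes are those stated above):

```python
# Trace the shapes in the figure
in_size, k, n_filters = 10, 5, 4
conv_out = in_size - k + 1               # 6, so 4@6*6 feature maps
pool_out = conv_out // 2                 # 3, so 4@3*3 after 2*2 max pooling
conv_params = n_filters * k * k          # 4 filters, 5*5 weights each (no biases)
fc_params = n_filters * pool_out**2 * 5  # fully connected to the 5 outputs
relu_ops = n_filters * conv_out**2       # one ReLU per convolution output
print(conv_params + fc_params, relu_ops)
```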
Structured Data: Sequential Information
"My words fly up, my thoughts remain below: / Words without thoughts never to heaven go." — Hamlet
Examples: language, activity history
x = (x_1, ..., x_T)
Recurrent Layers
Sharing parameters along a sequence:
h_t = f(x_t, h_{t-1})
https://colah.github.io/posts/2015-08-understanding-lstms/
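The slide leaves f unspecified; one common choice (the "vanilla" RNN) is a tanh over a linear combination of the current input and the previous hidden state. A minimal sketch under that assumption, with W_x, W_h, and b as the shared parameters:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One common choice of f: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def rnn(xs, h0, W_x, W_h, b):
    # The same parameters (W_x, W_h, b) are shared at every time step
    h = h0
    for x_t in xs:  # x = (x_1, ..., x_T)
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h
```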
Long Short-Term Memory (LSTM)
A commonly used recurrent neural network in natural language processing (⊙ denotes elementwise multiplication):
f_t = σ(W_f [h_{t-1}, x_t] + b_f)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh(C_t)
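A direct numpy transcription of the equations above for a single step (the weight and bias shapes are assumptions; in numpy, * on arrays is elementwise):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_C, b_C):
    """One LSTM step following the equations above."""
    hx = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ hx + b_f)       # forget gate
    i_t = sigmoid(W_i @ hx + b_i)       # input gate
    o_t = sigmoid(W_o @ hx + b_o)       # output gate
    C_tilde = np.tanh(W_C @ hx + b_C)   # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde  # new cell state
    h_t = o_t * np.tanh(C_t)            # new hidden state
    return h_t, C_t
```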
Wrap Up
Neural networks:
- Network architecture (a lot more than fully connected layers)
  - Convolutional layers
  - Recurrent layers
- Loss function
What Is Missing?
- How do we find good weights?
- How do we make the model work (regularization, further architectural choices, etc.)?