Multinomial Regression and the Softmax Activation Function Gary Cottrell

Notation reminder

We have N data points, or patterns, in the training set, with the pattern number as a superscript: {(x^1, t^1), (x^2, t^2), ..., (x^n, t^n), ..., (x^N, t^N)}, where t^n is the target for pattern n. The x's are (usually) vectors of dimension d, so the n-th input pattern is the vector x^n. For the output, the weighted sum of the input (a.k.a. the net input) is written a; if there are multiple outputs, k = 1, ..., c, we write a_k, and then the k-th output is y_k = g(a_k), where g is the activation function.
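Written out in the standard form this notation suggests (a reconstruction, not copied from the slide):

```latex
x^n = (x^n_1, x^n_2, \ldots, x^n_d)
a   = \sum_{j=1}^{d} w_j x_j
a_k = \sum_{j=1}^{d} w_{kj} x_j , \qquad k = 1, \ldots, c
y_k = g(a_k)
```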

What if we have multiple classes?

E.g., 10 classes, as in the MNIST problem (to pick a random example). Then it would be great to force the network to have outputs that are positive and sum to 1, i.e., to represent a probability distribution over the output categories. The softmax activation function does this.

Softmax

Note that since the denominator is a sum over all categories, the outputs sum to 1 over all categories. Softmax is a generalization of the logistic to multiple categories, because for two categories it can be rewritten as a logistic (proof left to the reader).
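The standard definition, and its two-category reduction to the logistic:

```latex
y_k = \frac{e^{a_k}}{\sum_{k'=1}^{c} e^{a_{k'}}}
% For c = 2:
y_1 = \frac{e^{a_1}}{e^{a_1} + e^{a_2}} = \frac{1}{1 + e^{-(a_1 - a_2)}}
```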

What is the right error function?

The network looks like this (network diagram on the slide). We use 1-out-of-c encoding for c categories, meaning the target is 1 for the correct category and 0 everywhere else. So for 10 outputs, the target vector would look like (0,0,1,0,0,0,0,0,0,0) if the target were the third category. Using Kronecker's delta, we can write this compactly.
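With j the index of the correct category for pattern n, the compact form is presumably:

```latex
t_k^n = \delta_{kj} =
\begin{cases}
1 & k = j \\
0 & k \neq j
\end{cases}
```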

What is the right error function?

Now, the conditional probability of the target for the n-th example is given below. Note that this picks out the j-th output if the correct category is j, since t_k will be 1 for the j-th category and 0 for everything else. So the likelihood for the entire data set is the product of these over all patterns.
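In the standard form this argument uses:

```latex
p(\mathbf{t}^n \mid \mathbf{x}^n) = \prod_{k=1}^{c} \left( y_k^n \right)^{t_k^n}
L = \prod_{n=1}^{N} \prod_{k=1}^{c} \left( y_k^n \right)^{t_k^n}
```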

What is the right error function?

So the likelihood for the entire data set is the product above, and the error, the negative log likelihood, is shown below. This is also called the cross-entropy error.
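The standard cross-entropy form:

```latex
E = -\ln L = -\sum_{n=1}^{N} \sum_{k=1}^{c} t_k^n \ln y_k^n
```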

What is the right error function?

The minimum of this error is reached when y = t, where the error takes the value shown below; that value is 0 if every t is 0 or 1. But this still applies if t is an actual probability between 0 and 1, and then that quantity is not 0. So we can subtract it off of the error to get an error function that hits 0 when minimized (i.e., when y = t).
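Presumably the two expressions on the slide are the minimum value and the adjusted error:

```latex
E_{\min} = -\sum_{n=1}^{N} \sum_{k=1}^{c} t_k^n \ln t_k^n
E = -\sum_{n=1}^{N} \sum_{k=1}^{c} t_k^n \ln \frac{y_k^n}{t_k^n}
```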

What is the right error function?

So, we need to go downhill in this error. Recall what gradient descent says: the weight change is proportional to the negative derivative of the error with respect to the weight, which factors into the derivative with respect to the net input times the derivative of the net input with respect to the weight. The second factor is just x_j, as before. For the first factor, for one pattern n, the derivative with respect to the net input has to take into account all of the outputs, because changing the net input to one output changes the activations of all the outputs (the softmax denominator couples them).
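In symbols (a reconstruction in the notation above):

```latex
\Delta w_{kj} = -\eta \, \frac{\partial E^n}{\partial w_{kj}}
             = -\eta \, \frac{\partial E^n}{\partial a_k} \, \frac{\partial a_k}{\partial w_{kj}}
             = -\eta \, \frac{\partial E^n}{\partial a_k} \, x_j^n
\frac{\partial E^n}{\partial a_{k'}}
  = \sum_{k=1}^{c} \frac{\partial E^n}{\partial y_k^n} \, \frac{\partial y_k^n}{\partial a_{k'}}
```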

What is the right error function?

So, in order to do gradient descent, we need this derivative of the error with respect to the net input. The first factor, using the cross-entropy error, is the derivative of the error with respect to the output. The second factor, using the definition of softmax, is the derivative of each output with respect to the net input.
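The two factors, in their standard forms:

```latex
\frac{\partial E^n}{\partial y_k^n} = -\frac{t_k^n}{y_k^n}
\frac{\partial y_k^n}{\partial a_{k'}} = y_k^n \left( \delta_{kk'} - y_{k'}^n \right)
```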

What is the right error function?

This leads to the result below, so we get the delta rule again.
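Putting the two factors together (and using the fact that the targets sum to 1 over k):

```latex
\frac{\partial E^n}{\partial a_{k'}}
  = \sum_{k=1}^{c} \left( -\frac{t_k^n}{y_k^n} \right) y_k^n \left( \delta_{kk'} - y_{k'}^n \right)
  = y_{k'}^n - t_{k'}^n
\Delta w_{k'j} = \eta \left( t_{k'}^n - y_{k'}^n \right) x_j^n
```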

So, the right way to do multinomial regression is:

- Start with a net like this (the single-layer net shown on the slide).
- Initialize the weights to 0.
- Use the softmax activation function (defined above).
- For each pass through the data (an epoch):
  - Randomize the order of the patterns.
  - Present a pattern, compute the output.
  - Update the weights according to the delta rule.
- Repeat until the error stops decreasing (enough) or a maximum number of epochs has been reached.

A minimal sketch of this recipe is given below.
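The following NumPy sketch follows these steps; it is an illustration, not the course's code, and names such as train_softmax_regression, lr, and n_epochs are just illustrative choices.

```python
import numpy as np

def softmax(a):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(a - a.max())
    return e / e.sum()

def train_softmax_regression(X, T, lr=0.1, n_epochs=100):
    """X: (N, d) input patterns; T: (N, c) 1-out-of-c targets. Returns W of shape (c, d)."""
    N, d = X.shape
    c = T.shape[1]
    W = np.zeros((c, d))                        # initialize the weights to 0
    for epoch in range(n_epochs):               # one pass through the data = one epoch
        for n in np.random.permutation(N):      # randomize the order of the patterns
            a = W @ X[n]                        # net inputs a_k = sum_j w_kj * x_j
            y = softmax(a)                      # outputs y_k
            W += lr * np.outer(T[n] - y, X[n])  # delta rule: eta * (t_k - y_k) * x_j
    return W
```

In practice you would also track the cross-entropy error each epoch and stop once it stops decreasing (enough); a bias term, if wanted, can be handled by appending a constant 1 to each input pattern.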

Activation functions and Forward propagation

We've already seen three kinds of activation functions: binary threshold units, logistic units, and softmax units. (The slide shows plots of output vs. net input.)
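Standard definitions of the first two (softmax was defined earlier); the threshold here is taken at 0, with any bias absorbed into the weights:

```latex
g(a) = \begin{cases} 1 & a \ge 0 \\ 0 & a < 0 \end{cases}   % binary threshold
g(a) = \frac{1}{1 + e^{-a}}                                  % logistic
```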

Activation functions and Forward propagation

Some more: rectified linear units (ReLU) and leaky ReLU.
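Their standard definitions; the leaky slope α is a small constant, often something like 0.01:

```latex
g(a) = \max(0, a)                                                  % ReLU
g(a) = \begin{cases} a & a > 0 \\ \alpha a & a \le 0 \end{cases}   % leaky ReLU
```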

Activation functions and Forward propagation

Tanh and linear units.
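Their standard definitions:

```latex
g(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}   % tanh
g(a) = a                                                  % linear
```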

Activation functions and Forward propagation

Stochastic units: the logistic is treated as the probability of the output being 1 (else 0).
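That is, presumably:

```latex
P(y = 1 \mid a) = \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad P(y = 0 \mid a) = 1 - \sigma(a)
```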

Activation functions and Forward propagation

These functions can be applied recursively, layer by layer: this is called forward propagation.

Activation functions and Forward propagation

Different layers need not have the same activation functions. One popular combination (not shown here): ReLU in the hidden layers, softmax at the output.
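A minimal NumPy sketch of forward propagation with that popular combination, for a net with one hidden layer; the sizes, weights, and names here are illustrative, not taken from the slides:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - a.max())            # stabilized exponentials
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Apply the activation functions recursively, layer by layer."""
    h = relu(W1 @ x + b1)              # hidden layer: ReLU of its net input
    y = softmax(W2 @ h + b2)           # output layer: softmax of its net input
    return y

# Illustrative sizes: 4 inputs, 5 hidden units, 3 output categories.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
print(forward(x, W1, b1, W2, b2))      # positive outputs that sum to 1
```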