Lecture 20: Neural Networks for NLP Zubin Pahuja zpahuja2@illinois.edu courses.engr.illinois.edu/cs447

Today's Lecture: Feed-forward neural networks as classifiers, a simple architecture in which computation proceeds from one layer to the next; and their application to language modeling, i.e. assigning probabilities to word sequences and predicting upcoming words.

Supervised Learning: Two kinds of prediction problems. Regression: predict results with a continuous output, e.g. the price of a house from its size, number of bedrooms, zip code, etc. Classification: predict results with a discrete output, e.g. whether a user will click on an ad.

What's a Neural Network?

Why is deep learning taking off? Unprecedented amounts of data: the performance of traditional learning algorithms such as SVMs and logistic regression plateaus as data grows. Faster computation: GPU acceleration, plus algorithms that train faster and deeper, e.g. the ReLU activation instead of sigmoid and gradient descent optimizers like Adam. End-to-end learning: the model directly converts input data into output predictions, bypassing the intermediate steps of a traditional pipeline.

Neural networks are called "neural" because their origins lie in the McCulloch-Pitts neuron, but the modern use in language processing no longer draws on these early biological inspirations.

Neural Units: The building blocks of a neural network. Given a set of inputs x_1...x_n, a unit has a set of corresponding weights w_1...w_n and a bias b, so the weighted sum z can be represented as z = sum_i w_i x_i + b, or z = w · x + b using the dot product.
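As a quick illustration of the weighted sum above, here is a minimal numpy sketch; the input, weight, and bias values are made up for the example:

```python
import numpy as np

# A single unit's weighted sum; the values are illustrative.
x = np.array([0.5, 0.6, 0.1])   # inputs x_1..x_n
w = np.array([0.2, 0.3, 0.9])   # weights w_1..w_n
b = 0.5                         # bias

z = np.dot(w, x) + b            # z = w . x + b
print(z)                        # about 0.87
```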

Neural Units: Apply a non-linear function f (or g) to z to compute the activation a = f(z). Since we are modeling a single unit, the activation a is also the final output y.

Activation Functions: Sigmoid. The sigmoid σ(z) = 1 / (1 + e^-z) maps the output into the range [0, 1] and is differentiable.

Activation Functions: Tanh. Tanh maps the output into the range [-1, 1]; it often works better than sigmoid, is smoothly differentiable, and maps outlier values towards the mean.

Activation Functions: ReLU. The Rectified Linear Unit (ReLU) computes y = max(z, 0). High values of z in sigmoid/tanh saturate at values of y close to 1, which causes problems for learning.
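The three activation functions discussed above can be written in a few lines of numpy; a minimal sketch (the function names are just illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes values into (-1, 1)

def relu(z):
    return np.maximum(z, 0.0)         # max(z, 0)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z))
```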

XOR Problem: Minsky and Papert proved that the perceptron cannot compute the logical XOR operation.

XOR Problem: A perceptron can compute the logical AND and OR functions easily, but it is not possible to build a perceptron that computes logical XOR!

XOR Problem: The perceptron is a linear classifier, but XOR is not linearly separable. For a 2D input x_1, x_2, the perceptron decision boundary w_1 x_1 + w_2 x_2 + b = 0 is the equation of a line.

XOR Problem: Solution. The XOR function can be computed using two layers of ReLU-based units; the XOR problem demonstrates the need for multi-layer networks.

XOR Problem: Solution. The hidden layer forms a linearly separable representation of the input. In this example we stipulated the weights (see the sketch below), but in real applications the weights of a neural network are learned automatically using the error back-propagation algorithm.
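For concreteness, here is a minimal numpy sketch of a two-layer ReLU network that computes XOR. The specific weight values below are one standard hand-stipulated choice and are an assumption, not necessarily the ones shown on the slide:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hand-stipulated weights that compute XOR (illustrative choice).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])     # hidden-layer weights
c = np.array([0.0, -1.0])      # hidden-layer biases
u = np.array([1.0, -2.0])      # output-layer weights
b = 0.0                        # output-layer bias

def xor(x):
    h = relu(W @ x + c)        # hidden layer makes the inputs linearly separable
    return u @ h + b

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor(np.array(x, dtype=float)))   # 0, 1, 1, 0
```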

Why do we need non-linear activation functions? A network of simple linear (perceptron) units cannot solve the XOR problem: a network formed by many layers of purely linear units can always be reduced to a single layer of linear units, since a[1] = z[1] = W[1] x + b[1] and a[2] = z[2] = W[2] a[1] + b[2] = W[2](W[1] x + b[1]) + b[2] = (W[2] W[1]) x + (W[2] b[1] + b[2]) = W' x + b'. This is no more expressive than logistic regression, and we've already shown that a single unit cannot solve the XOR problem.
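A quick numerical check of this collapse, with random illustrative shapes and values:

```python
import numpy as np

# Two purely linear layers collapse into one linear layer.
rng = np.random.default_rng(0)
x  = rng.standard_normal(4)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

two_layers = W2 @ (W1 @ x + b1) + b2          # a[2] with identity activations
one_layer  = (W2 @ W1) @ x + (W2 @ b1 + b2)   # W' x + b'

print(np.allclose(two_layers, one_layer))     # True
```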

Feed-Forward Neural Networks, a.k.a. the multi-layer perceptron (MLP), though that name is a misnomer. Each layer is fully connected. Represent the parameters of the hidden layer by combining the weight vector w_i and bias b_i for each unit i into a single weight matrix W and a single bias vector b for the whole layer: z[1] = W[1] x + b[1], h = a[1] = g(z[1]), where W[1] ∈ R^(n_1 x n_0) and b[1], h ∈ R^(n_1).

Feed-Forward Neural Networks: The output could be a real-valued number (for regression) or a probability distribution across the output nodes (for multinomial classification): z[2] = W[2] h + b[2], such that z[2] ∈ R^(n_2) and W[2] ∈ R^(n_2 x n_1). We apply the softmax function to encode z[2] as a probability distribution. So a neural network is like logistic regression over induced feature representations from the prior layers of the network, rather than forming features using feature templates.
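A minimal sketch of the softmax function used to turn z[2] into a probability distribution (subtracting the max is only for numerical stability and does not change the result):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)      # stabilise: largest exponent becomes exp(0) = 1
    e = np.exp(z)
    return e / e.sum()     # normalise so the outputs sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # a probability distribution
```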

Recap: 2-layer Feed-Forward Neural Network. z[1] = W[1] a[0] + b[1]; a[1] = h = g[1](z[1]); z[2] = W[2] a[1] + b[2]; a[2] = g[2](z[2]); ŷ = a[2]. We use a[0] to stand for the input x, ŷ for the predicted output, y for the ground-truth output, and g(·) for the activation function. g[2] might be softmax for multinomial classification or sigmoid for binary classification, while ReLU or tanh might be the activation function g(·) at the internal layers.
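Putting the recap together, a minimal sketch of the 2-layer forward pass; the layer sizes and random weights are illustrative stand-ins for learned parameters:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # layer 1 parameters
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)   # layer 2 parameters

a0 = np.array([1.0, 0.5, -0.2])     # a[0] = input x
z1 = W1 @ a0 + b1                   # z[1] = W[1] a[0] + b[1]
a1 = relu(z1)                       # a[1] = g[1](z[1])
z2 = W2 @ a1 + b2                   # z[2] = W[2] a[1] + b[2]
y_hat = softmax(z2)                 # ŷ = a[2] = g[2](z[2])
print(y_hat)
```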

N-layer Feed-Forward Neural Network: for i in 1..n: z[i] = W[i] a[i-1] + b[i], a[i] = g[i](z[i]); ŷ = a[n].

Training Neural Nets: Loss Function. The loss models the distance between the system output and the gold output. As in logistic regression, we use the cross-entropy loss: for binary classification, L(ŷ, y) = -[y log ŷ + (1 - y) log(1 - ŷ)]; for multinomial classification, L(ŷ, y) = -Σ_i y_i log ŷ_i.
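Both loss variants in a minimal numpy sketch (the predictions and gold labels are illustrative):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # y is 0/1, y_hat is the predicted probability of class 1
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cross_entropy(y_onehot, y_hat):
    # y_onehot is the one-hot gold vector, y_hat a softmax output
    return -np.sum(y_onehot * np.log(y_hat))

print(binary_cross_entropy(1, 0.9))                                   # about 0.105
print(cross_entropy(np.array([0, 1, 0]), np.array([0.1, 0.7, 0.2])))  # about 0.357
```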

Training Neural Nets: Gradient Descent. To find the parameters that minimize the loss function, we use gradient descent. But it is much harder to see how to compute the partial derivative of some weight in layer 1 when the loss is attached to some much later layer, so we use error back-propagation to partial out the loss over the intermediate layers; this builds on the notion of computation graphs.

Training Neural Nets: Computation Graphs. The computation is broken down into separate operations, each of which is modeled as a node in a graph. Consider L(a, b, c) = c(a + 2b).

Training Neural Nets: Backward Differentiation. Uses the chain rule from calculus: for f(x) = u(v(x)), we have df/dx = du/dv · dv/dx. For our function L = c(a + 2b), we need the derivatives ∂L/∂a, ∂L/∂b, and ∂L/∂c, which require the intermediate derivatives through the nodes of the graph (e.g. ∂L/∂e and ∂e/∂a for the intermediate node e = a + 2b).

Training Neural Nets: Backward Pass. Compute from right to left. For each node: 1. compute the local partial derivative with respect to the parent, 2. multiply it by the partial that is passed down from the parent, 3. then pass it on to the child. This also requires the derivatives of the activation functions.
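A minimal sketch of the forward and backward pass on the computation graph for L(a, b, c) = c(a + 2b); the input values a = 3, b = 1, c = -2 are illustrative:

```python
# Forward pass: each intermediate node is one operation.
a, b, c = 3.0, 1.0, -2.0
d = 2 * b          # d = 2b
e = a + d          # e = a + 2b
L = c * e          # L = c * e

# Backward pass (right to left), applying the chain rule at every node.
dL_dL = 1.0
dL_dc = e * dL_dL          # dL/dc = e
dL_de = c * dL_dL          # dL/de = c
dL_da = dL_de * 1.0        # de/da = 1
dL_dd = dL_de * 1.0        # de/dd = 1
dL_db = dL_dd * 2.0        # dd/db = 2

print(L, dL_da, dL_db, dL_dc)   # -10.0, -2.0, -4.0, 5.0
```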

Training Neural Nets: Best Practices. Training is a non-convex optimization problem: 1. initialize the weights with small random numbers, preferably Gaussian; 2. regularize to prevent over-fitting, e.g. with dropout. Optimization techniques for gradient descent include momentum, RMSProp, Adam, etc.

Parameters vs Hyperparameters. Parameters are learned by gradient descent, e.g. the weight matrices W and biases b. Hyperparameters are set prior to learning and need to be tuned, e.g. the learning rate, mini-batch size, model architecture (number of layers, number of hidden units per layer, choice of activation functions), and regularization technique.

Neural Language Models: Predicting upcoming words from prior word context.

Neural Language Models: A feed-forward neural LM is a standard feed-forward network that takes as input at time t a representation of some number of previous words (w_{t-1}, w_{t-2}, ...) and outputs a probability distribution over possible next words. Advantages: no need for smoothing, can handle much longer histories, generalizes over contexts of similar words, and gives higher predictive accuracy. Uses include machine translation, dialog, and language generation.

Embeddings: A mapping from words in the vocabulary V to vectors of real numbers e. Each word may be represented as a one-hot vector of length |V|; concatenate each of the N context vectors for the preceding words. This representation is long, sparse, and hard to generalize: can we learn a more concise representation?
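A minimal sketch of this sparse input representation, using a tiny illustrative vocabulary:

```python
import numpy as np

# Each context word becomes a one-hot vector of length |V|; the N = 3
# vectors are concatenated. The vocabulary here is a toy example.
vocab = ["the", "cat", "dog", "feed", "to"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

context = ["to", "feed", "the"]               # w_{t-3}, w_{t-2}, w_{t-1}
x = np.concatenate([one_hot(w) for w in context])
print(x.shape)                                # (15,) = N * |V|, long and sparse
```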

Embeddings allow a neural n-gram LM to generalize better to unseen data. "I have to make sure when I get home to feed the cat." If we've never seen the word "dog" after "feed the", an n-gram LM will predict "cat" given the prefix but not "dog", whereas a neural LM makes use of the similarity of the embeddings to assign a reasonably high probability to both "dog" and "cat".

Embeddings: A moving window at time t with a pre-trained embedding vector (say, from word2vec) for each of the three previous words w_{t-1}, w_{t-2}, and w_{t-3}, concatenated to produce the input.

Learning Embeddings for a Neural n-gram LM: The task may place strong constraints on what makes a good representation. To learn the embeddings, add an extra layer to the network and propagate errors all the way back to the embedding vectors. Represent each of the N previous words as a one-hot vector of length |V|, and learn an embedding matrix E ∈ R^(d x |V|) such that for the one-hot column vector x_i for word w_i, the projection layer is E x_i = e_i.
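A minimal sketch of the projection layer, showing that multiplying E by a one-hot column vector is just a column lookup (d = 4 and |V| = 5 are illustrative):

```python
import numpy as np

d, V = 4, 5
rng = np.random.default_rng(2)
E = rng.standard_normal((d, V))       # embedding matrix, learned with the network

x_i = np.zeros(V)
x_i[2] = 1.0                          # one-hot vector for word w_i

e_i = E @ x_i                         # projection E x_i = e_i
print(np.allclose(e_i, E[:, 2]))      # True: equivalent to selecting column i
```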

Learning Embeddings: Forward Pass. a[0] = e, the concatenation of the embeddings e_{w_{t-1}}, e_{w_{t-2}}, e_{w_{t-3}}; z[1] = W a[0] + b[1]; a[1] = h = g[1](z[1]); z[2] = U a[1] + b[2]; ŷ = a[2] = g[2](z[2]). Each node i in ŷ estimates the probability P(w_t = i | w_{t-1}, w_{t-2}, w_{t-3}).
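A minimal end-to-end sketch of this forward pass, with illustrative sizes and random stand-in parameters, following the parameter names E, W, U, b used on the next slide:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d, N, H = 5, 4, 3, 8                    # |V| words, d-dim embeddings, N context words, H hidden units
rng = np.random.default_rng(3)
E = rng.standard_normal((d, V))            # embedding matrix
W, b1 = rng.standard_normal((H, N * d)), np.zeros(H)
U, b2 = rng.standard_normal((V, H)), np.zeros(V)

context = [4, 3, 0]                        # indices of w_{t-3}, w_{t-2}, w_{t-1}
a0 = np.concatenate([E[:, i] for i in context])   # e = concatenated embeddings
h = relu(W @ a0 + b1)                      # hidden layer
y_hat = softmax(U @ h + b2)                # P(w_t = i | context) for each word i
print(y_hat.sum())                         # 1.0 (up to rounding)
```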

Training the Neural Language Model: To set all the parameters θ = E, W, U, b, we do gradient descent, using error back-propagation on the computation graph to compute the gradients. Loss function: cross-entropy (negative log likelihood), L = -log P(w_t | w_{t-1}, w_{t-2}, ..., w_{t-n+1}). Training the parameters to minimize this loss results both in an algorithm for language modeling (a word predictor) and in a new set of embeddings E.

Summary: Neural networks are built out of neural units, which take a weighted sum of their inputs and apply a non-linear activation function such as sigmoid, tanh, or ReLU. In a fully-connected feed-forward network, each unit in layer i is connected to each unit in layer i + 1, and there are no cycles. The power of neural networks comes from the ability of early layers to learn representations that can be utilized by later layers in the network. Neural networks are trained by optimization algorithms like gradient descent, using error back-propagation on a computation graph. Neural language models use a neural network as a probabilistic classifier to compute the probability of the next word given the previous n words. Neural language models can use pretrained embeddings, or can learn embeddings from scratch in the process of language modeling.