Natural Language Processing with Deep Learning CS224N/Ling284

Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 8: Recurrent Neural Networks Christopher Manning and Richard Socher

Organization Extra project office hour today after lecture

Overview
- Traditional language models
- RNNs
- RNN language models
- Important training problems and tricks
- Intuition for the vanishing gradient problem with a toy example
- Vanishing and exploding gradient problems
- RNNs for other sequence tasks
- Bidirectional and deep RNNs

Language Models
A language model computes a probability for a sequence of words: P(w_1, ..., w_T)
Useful for machine translation:
- Word ordering: p(the cat is small) > p(small the is cat)
- Word choice: p(walking home after school) > p(walking house after school)

Traditional Language Models
Probability is usually conditioned on a window of n previous words:
P(w_1, ..., w_m) ≈ ∏_{i=1}^{m} P(w_i | w_{i-n+1}, ..., w_{i-1})
An incorrect but necessary Markov assumption!
To estimate probabilities, compute counts for unigrams and bigrams (conditioning on one/two previous word(s)):
p(w_2 | w_1) = count(w_1, w_2) / count(w_1)
p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2)
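A minimal sketch of this counting, assuming a toy whitespace-tokenized corpus and plain maximum-likelihood estimates (no smoothing); the corpus and names here are illustrative only:

from collections import Counter

corpus = "the cat is small . the cat sat .".split()   # hypothetical toy corpus

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    # Maximum-likelihood estimate: p(w2 | w1) = count(w1, w2) / count(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_bigram("cat", "the"))   # count(the, cat) / count(the) = 2 / 2 = 1.0

Real systems add smoothing and backoff on top of such counts, as the next slide notes.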

Traditional Language Models
Performance improves by keeping counts for higher-order n-grams and doing smoothing and so-called backoff (e.g. if a 4-gram is not found, try the 3-gram, etc.)
There are A LOT of n-grams! → Gigantic RAM requirements!
Recent state of the art: Scalable Modified Kneser-Ney Language Model Estimation by Heafield et al.: "Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens."

Recurrent Neural Networks!
RNNs tie the weights at each time step
Condition the neural network on all previous words
RAM requirement only scales with number of words
[Figure: RNN unrolled over time; inputs x_{t-1}, x_t, x_{t+1} feed hidden states h_{t-1}, h_t, h_{t+1}, connected by the shared weight matrix W, which produce outputs y_{t-1}, y_t, y_{t+1}]

Recurrent Neural Network language model
Given a list of word vectors: x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T
At a single time step:
h_t = σ( W^(hh) h_{t-1} + W^(hx) x_[t] )
ŷ_t = softmax( W^(S) h_t )

Recurrent Neural Network language model
Main idea: we use the same set of W weights at all time steps!
Everything else is the same:
h_0 ∈ R^{D_h} is some initialization vector for the hidden layer at time step 0
x_[t] is the column vector of L at index [t] at time step t
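A minimal NumPy sketch of one time step of this model, assuming toy dimensions and randomly initialized W^(hh), W^(hx), W^(S), and embedding matrix L; everything here is illustrative, not the course's reference implementation:

import numpy as np

rng = np.random.default_rng(0)
V, d, Dh = 10, 4, 5                      # toy vocabulary, embedding, and hidden sizes
L = rng.normal(size=(d, V))              # word embeddings, one column per word
W_hh = rng.normal(size=(Dh, Dh)) * 0.1
W_hx = rng.normal(size=(Dh, d)) * 0.1
W_S = rng.normal(size=(V, Dh)) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_step(h_prev, word_id):
    x_t = L[:, word_id]                            # column vector of L at index [t]
    h_t = sigmoid(W_hh @ h_prev + W_hx @ x_t)      # h_t = sigma(W^(hh) h_{t-1} + W^(hx) x_[t])
    y_hat = softmax(W_S @ h_t)                     # y_hat_t = softmax(W^(S) h_t)
    return h_t, y_hat

h0 = np.zeros(Dh)                                  # initialization vector for time step 0
h1, y_hat1 = rnn_lm_step(h0, word_id=3)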

Recurrent Neural Network language model
ŷ_t ∈ R^{|V|} is a probability distribution over the vocabulary
Same cross entropy loss function, but predicting words instead of classes:
J^(t)(θ) = − Σ_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}

Recurrent Neural Network language model
Evaluation could just be the negative of the average log probability over a dataset of size (number of words) T:
J = − (1/T) Σ_{t=1}^{T} Σ_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}
But more common: Perplexity = 2^J
Lower is better!
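A small sketch of this evaluation, assuming y_hats holds the model's predicted distributions and targets the indices of the actual next words (both hypothetical arrays); base-2 logs are used so that perplexity is 2^J:

import numpy as np

def perplexity(y_hats, targets):
    # J = average negative log2-probability assigned to the correct next word
    J = -np.mean([np.log2(y_hat[t]) for y_hat, t in zip(y_hats, targets)])
    return 2.0 ** J                       # lower is better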

Training RNNs is hard
Multiply the same matrix at each time step during forward prop
[Figure: the same unrolled RNN, with the shared matrix W applied between consecutive hidden states h_{t-1}, h_t, h_{t+1}]
Ideally inputs from many time steps ago can modify output y
Take for example an RNN with 2 time steps! Insightful!

The vanishing/exploding gradient problem
Multiply the same matrix at each time step during backprop
[Figure: the same unrolled RNN; gradients flow backwards through the shared matrix W at every time step]

The vanishing gradient problem - Details
Similar but simpler RNN formulation:
h_t = W f(h_{t-1}) + W^(hx) x_[t]
ŷ_t = W^(S) f(h_t)
Total error is the sum of each error at time steps t:
∂E/∂W = Σ_{t=1}^{T} ∂E_t/∂W
Hardcore chain rule application:
∂E_t/∂W = Σ_{k=1}^{t} (∂E_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)

The vanishing gradient problem - Details
Similar to backprop but a less efficient formulation. Useful for the analysis we'll look at:
∂h_t/∂h_k = ∏_{j=k+1}^{t} ∂h_j/∂h_{j-1}
Remember: h_t = W f(h_{t-1}) + W^(hx) x_[t]
More chain rule: each partial is a Jacobian:
∂h_j/∂h_{j-1} = W^T diag[ f'(h_{j-1}) ]

The vanishing gradient problem - Details
Bounding the norms of the Jacobians with ||W^T|| ≤ β_W and ||diag[f'(h_{j-1})]|| ≤ β_h gives
||∂h_j/∂h_{j-1}|| ≤ β_W β_h, and therefore ||∂h_t/∂h_k|| ≤ (β_W β_h)^(t-k).
This can become very small or very large quickly as t − k grows: the gradient vanishes or explodes.
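A tiny numerical illustration of this effect under the simplified recurrence above (random W, sigmoid f; all values are toy assumptions): multiplying one Jacobian per step makes the product's norm shrink, or with a larger W explode, roughly geometrically in the number of steps:

import numpy as np

rng = np.random.default_rng(1)
Dh = 20
W = rng.normal(size=(Dh, Dh)) * 0.3       # try 0.3 (vanishing) vs. 3.0 (exploding)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

J = np.eye(Dh)                            # running product of Jacobians dh_t/dh_k
h = rng.normal(size=Dh)
for step in range(1, 21):
    fprime = sigmoid(h) * (1 - sigmoid(h))        # f'(h_{j-1}) for the sigmoid
    J = W.T @ np.diag(fprime) @ J                 # multiply on one more W^T diag[f'(h_{j-1})]
    h = W @ sigmoid(h)                            # simplified recurrence h_j = W f(h_{j-1})
    if step % 5 == 0:
        print(step, np.linalg.norm(J))            # norm decays (or grows) geometrically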

Why is the vanishing gradient a problem?
The error at a time step ideally can tell a previous time step from many steps away to change during backprop
[Figure: the same unrolled RNN; the error signal at y_{t+1} has to travel back through many applications of W to reach x_{t-1}]

The vanishing gradient problem for language models
In the case of language modeling or question answering, words from time steps far away are not taken into consideration when training to predict the next word.
Example: Jane walked into the room. John walked in too. It was late in the day. Jane said hi to

Structured Training for Neural Network Transition-Based Parsing David Weiss, Chris Alberti, Michael Collins, Slav Petrov Presented by: Shayne Longpre

What is SyntaxNet?
2016/5: Google announces "The World's Most Accurate Parser Goes Open Source"
SyntaxNet (2016): new, fast, performant TensorFlow framework for syntactic parsing. Now supports 40 languages -- Parsey McParseface's 40 cousins.
Lineage: Chen & Manning (2014) → + unlabelled data, + tuned model → Weiss et al. (2015) → + structured perceptron & beam search, + global normalization → Andor et al. (2016) = SyntaxNet

3 New Contributions (since Q2 in Assignment 2)
1. Leverage Unlabelled Data -- Tri-Training
2. Tuned Neural Network Model
3. Final Layer: Structured Perceptron w/ Beam Search

1. Tri-Training: Leverage Unlabelled Data
[Figure: Tri-training (Li et al., 2014): two high-performance parsers (A) and (B) parse the unlabelled data; sentences on which they agree on the dependency parse are added to the labelled data, disagreements are discarded]

2. Model Changes
[Figure: the Chen & Manning (2014) architecture alongside the Weiss et al. (2015) architecture: 2x hidden layers, ReLU!]

3. Structured Perceptron Training + Beam Search
Problem: Greedy algorithms are unable to look beyond one step ahead, or recover from incorrect decisions.
Solution: Look forward -- search the tree of possible transition sequences.
- Keep track of the K top partial transition sequences up to depth m.
- Score each transition sequence with a perceptron: score = w · φ(d), where φ(d) is the feature vector of a possible transition sequence d and w is the perceptron parameter vector.
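A schematic sketch of beam search over transition sequences; transitions(state), which yields (action, next_state) pairs, and score_fn(state, action), which returns the perceptron score w · φ, are hypothetical helpers, and handling of already-finished sequences is omitted:

import heapq

def beam_search(initial_state, transitions, score_fn, K=8, depth=16):
    # Keep the K top-scoring partial transition sequences up to the given depth.
    beam = [(0.0, initial_state, [])]              # (cumulative score, state, action sequence)
    for _ in range(depth):
        candidates = [
            (total + score_fn(state, action), next_state, seq + [action])
            for total, state, seq in beam
            for action, next_state in transitions(state)
        ]
        if not candidates:                         # nothing left to expand
            break
        beam = heapq.nlargest(K, candidates, key=lambda c: c[0])
    return max(beam, key=lambda b: b[0])           # highest-scoring sequence found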

Conclusions
Identify specific flaws in existing models (greedy algorithms) and solve them. In this case, with:
- More data
- Better tuning
- Structured perceptron and beam search
Final step to SyntaxNet: Andor et al. (2016) solve the Label Bias Problem using Global Normalization

IPython Notebook with vanishing gradient example
- Example of a simple and clean NNet implementation
- Comparison of sigmoid and ReLU units
- A little bit of vanishing gradient

Trick for exploding gradient: clipping trick
The solution, first introduced by Mikolov, is to clip gradients to a maximum value. Makes a big difference in RNNs.
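A minimal sketch of the clipping step in NumPy: if the gradient's norm exceeds a threshold, rescale it to that norm (an element-wise variant would instead cap each entry at ±threshold); the threshold value is an illustrative assumption:

import numpy as np

def clip_gradient(grad, threshold=5.0):
    # Rescale the gradient so its norm never exceeds the threshold.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad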

Gradient clipping intuition
Figure from the paper On the difficulty of training Recurrent Neural Networks, Pascanu et al. 2013: error surface of a single-hidden-unit RNN, with high-curvature walls. Solid lines: standard gradient descent trajectories. Dashed lines: gradients rescaled to a fixed size.

For vanishing gradients: Initialization + ReLUs!
Initialize the W^(*)'s to the identity matrix I and use f(z) = rect(z) = max(z, 0) → Huge difference!
Initialization idea first introduced in Parsing with Compositional Vector Grammars, Socher et al. 2013
New experiments with recurrent neural nets in A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, Le et al. 2015
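A toy sketch of this recipe (an IRNN-style cell in the spirit of Le et al. 2015): recurrent weights start as the identity and the nonlinearity is a ReLU; sizes and the input-weight scale are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
Dh, d = 64, 32                           # toy hidden and input sizes
W_hh = np.eye(Dh)                        # recurrent weights initialized to the identity matrix I
W_hx = rng.normal(size=(Dh, d)) * 0.01   # small random input weights

def relu(z):
    return np.maximum(z, 0)              # f(z) = rect(z) = max(z, 0)

def irnn_step(h_prev, x_t):
    return relu(W_hh @ h_prev + W_hx @ x_t)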

Perplexity Results
KN5 = count-based language model with Kneser-Ney smoothing and 5-grams. Table from the paper Extensions of recurrent neural network language model by Mikolov et al. 2011.

Problem: Softmax is huge and slow
Trick: Class-based word prediction
p(w_t | history) = p(c_t | history) · p(w_t | c_t) = p(c_t | h_t) · p(w_t | c_t)
The more classes, the better the perplexity but also the worse the speed:
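A sketch of the factorization, assuming each word belongs to exactly one class; W_class, W_words_per_class, and class_word_index are hypothetical names, and the point is simply that both softmaxes are much smaller than one over the full vocabulary:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_based_word_prob(h_t, class_id, class_word_index, W_class, W_words_per_class):
    # p(w_t | history) = p(c_t | h_t) * p(w_t | c_t)
    p_c = softmax(W_class @ h_t)[class_id]                               # small softmax over classes
    p_w = softmax(W_words_per_class[class_id] @ h_t)[class_word_index]   # small softmax within the class
    return p_c * p_w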

One last implementation trick
You only need to pass backwards through your sequence once and accumulate all the deltas from each E_t
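A schematic sketch of that single backward sweep for the simplified recurrence h_t = W f(h_{t-1}) + W^(hx) x_[t]: carry one running delta for the hidden state and add each E_t's contribution to dW as you pass it (dE_dh, hs, f, f_prime are hypothetical inputs from the forward pass):

import numpy as np

def bptt_accumulate(dE_dh, hs, W, f, f_prime):
    # dE_dh[t]: gradient of the per-step error E_t w.r.t. h_t (from the output layer)
    # hs[t]: hidden states from the forward pass, hs[0] being the initial state
    T = len(hs) - 1
    dW = np.zeros_like(W)
    delta = np.zeros(hs[0].shape)
    for t in range(T, 0, -1):                       # one backward pass over the sequence
        delta = delta + dE_dh[t]                    # pick up this time step's own error signal
        dW += np.outer(delta, f(hs[t - 1]))         # accumulate E_t's contribution to dW
        delta = (W.T @ delta) * f_prime(hs[t - 1])  # propagate one more step back in time
    return dW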

Sequence modeling for other tasks
Classify each word into:
- NER
- Entity-level sentiment in context
- Opinionated expressions
Example application and slides from the paper Opinion Mining with Deep Recurrent Nets by Irsoy and Cardie 2014

Opinion Mining with Deep Recurrent Nets
Goal: Classify each word as a direct subjective expression (DSE) or an expressive subjective expression (ESE).
DSE: Explicit mentions of private states or speech events expressing private states
ESE: Expressions that indicate sentiment, emotion, etc. without explicitly conveying them.

Example Annotation In BIO notation (tags either begin-of-entity (B_X) or continuation-of-entity (I_X)): The committee, [as usual] ESE, [has refused to make any statements] DSE.

Approach: Recurrent Neural Network
Notation from the paper (so you get used to different ones):
x represents a token (word) as a vector.
y represents the output label (B, I or O). g = softmax!
h is the memory, computed from the past memory and the current word; it summarizes the sentence up to that time:
h_t = f(W x_t + V h_{t-1} + b)
y_t = g(U h_t + c)

Bidirectional RNNs Problem: For classification you want to incorporate information from words both preceding and following Ideas?

Deep Bidirectional RNNs
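A minimal sketch of the bidirectional idea: one RNN reads the sentence left-to-right, another right-to-left, and the two hidden states at each position are concatenated before classification; all weights and sizes below are toy, randomly initialized assumptions (the deep variant simply stacks such layers):

import numpy as np

rng = np.random.default_rng(0)
d, Dh, n_labels = 8, 6, 3                # toy input, hidden, and label sizes (e.g. B, I, O)
Wf, Vf = rng.normal(size=(Dh, d)) * 0.1, rng.normal(size=(Dh, Dh)) * 0.1
Wb, Vb = rng.normal(size=(Dh, d)) * 0.1, rng.normal(size=(Dh, Dh)) * 0.1
U = rng.normal(size=(n_labels, 2 * Dh)) * 0.1

def f(z):
    return np.tanh(z)

def bidirectional_rnn(xs):
    T = len(xs)
    h_fwd, h_bwd = [None] * T, [None] * T
    h = np.zeros(Dh)
    for t in range(T):                               # left-to-right pass
        h = f(Wf @ xs[t] + Vf @ h)
        h_fwd[t] = h
    h = np.zeros(Dh)
    for t in reversed(range(T)):                     # right-to-left pass
        h = f(Wb @ xs[t] + Vb @ h)
        h_bwd[t] = h
    # concatenate both directions at each position, then score the labels
    return [U @ np.concatenate([h_fwd[t], h_bwd[t]]) for t in range(T)]

label_scores = bidirectional_rnn([rng.normal(size=d) for _ in range(5)])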

Data
MPQA 1.2 corpus (Wiebe et al., 2005): 535 news articles (11,111 sentences), manually labeled with DSEs and ESEs at the phrase level
Evaluation: F1, the harmonic mean of precision and recall

Evaluation

Recap
- Recurrent Neural Networks are one of the best deep NLP model families
- Training them is hard because of the vanishing and exploding gradient problems
- They can be extended in many ways and their training improved with many tricks (more to come)
Next week: the most important and powerful RNN extensions, LSTMs and GRUs

Problem with Softmax: No Zero-Shot Word Predictions
Answers can only be predicted if they were seen during training and are part of the softmax.
But it's natural to learn new words in an active conversation, and systems should be able to pick them up.

Tackling the Obstacle by Predicting Unseen Words
Idea: a mixture model of softmax and pointers:
p(y_i | x_i) = g · p_vocab(y_i | x_i) + (1 − g) · p_ptr(y_i | x_i)
Pointer Sentinel Mixture Models by Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher

Pointer-Sentinel Model - Details
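A schematic sketch of the mixture (not the full model from the paper): attention scores over the context words plus a sentinel score are normalized together; the sentinel's mass acts as the gate g on the softmax distribution, and the rest forms the pointer distribution over words that actually appeared in the context (p_vocab, attn_scores, sentinel_score, context_word_ids are hypothetical inputs):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pointer_sentinel_mixture(p_vocab, attn_scores, sentinel_score, context_word_ids):
    a = softmax(np.append(attn_scores, sentinel_score))      # joint softmax over context + sentinel
    g = a[-1]                                                # gate: probability mass on the sentinel
    p_ptr = np.zeros_like(p_vocab)
    np.add.at(p_ptr, context_word_ids, a[:-1] / (1.0 - g))   # pointer distribution over vocab ids
    # p(w) = g * p_vocab(w) + (1 - g) * p_ptr(w)
    return g * p_vocab + (1.0 - g) * p_ptr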

Pointer Sentinel for Language Modeling