Recurrent Neural Nets II


Recurrent Neural Nets II. Steven Spielberg Pon Kumar, Tingke (Kevin) Shen. Machine Learning Reading Group, Fall 2016. 9 November 2016

Outline 1 Introduction 2 Problem Formulations with RNNs 3 LSTM for Optimization 4 Seq2Seq Learning

Introduction Feed-Forward Neural Networks (NN)

Introduction Rolling NN over time

Introduction Computation Flow in RNN

Introduction RNN Representation

Problem Formulations with RNNs Time Series Prediction Given x_{t-3}, x_{t-2}, x_{t-1}, x_t, find x_{t+1}

Problem Formulations with RNNs Time Series Prediction - Learning Error: e = Σ_{i=1}^{n} (x̂_{t+1}^{i} − x_{t+1}^{i})². Update the weights θ with error e using Backpropagation Through Time. E.g.: weather forecasting, stock prediction, etc.

Problem Formulations with RNNs Implementation
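
The implementation slide shows code on screen; as a stand-in, here is a minimal numpy sketch of the formulation above: a vanilla RNN rolled over a short window of a 1-D series to predict the next value, with the squared-error loss from the previous slide. All names, sizes, and the toy sine-wave data are illustrative, not the presenters' code.

```python
import numpy as np

# Minimal sketch (illustrative names/sizes): a vanilla RNN that reads a window
# x_{t-3}, x_{t-2}, x_{t-1}, x_t of a 1-D series and predicts x_{t+1}.
rng = np.random.default_rng(0)
hidden = 16
W_xh = rng.normal(0, 0.1, (hidden, 1))      # input -> hidden
W_hh = rng.normal(0, 0.1, (hidden, hidden)) # hidden -> hidden (recurrence)
W_hy = rng.normal(0, 0.1, (1, hidden))      # hidden -> prediction

def predict_next(window):
    """Roll the RNN over the window and read the prediction off the last state."""
    h = np.zeros((hidden, 1))
    for x_t in window:
        h = np.tanh(W_xh * x_t + W_hh @ h)   # same weights reused at every step
    return (W_hy @ h).item()

# Toy data: predict the next sample of a sine wave from the previous four.
series = np.sin(np.linspace(0, 6 * np.pi, 200))
window, target = series[50:54], series[54]
x_hat = predict_next(window)

# Squared error e = sum_i (x_hat_i - x_i)^2 (a "batch" of one example here);
# during training, e is backpropagated through time (BPTT) to update all weights.
e = (x_hat - target) ** 2
print("prediction:", x_hat, "error:", e)
```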

Problem Formulations with RNNs Sentence Classification - RNN Error: e = − Σ_{i=1}^{n} y_i log(ŷ_i). Update the weights θ with error e using Backpropagation Through Time.

Problem Formulations with RNNs Sentence Classification - Bidirectional RNN Error: e = − Σ_{i=1}^{n} y_i log(ŷ_i). Update the weights θ with error e using Backpropagation Through Time.
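
A rough numpy sketch of the bidirectional variant, assuming token embeddings are already available: one RNN runs left-to-right, one right-to-left, their final hidden states are concatenated and fed to a softmax, and the cross-entropy loss above is computed. Every name and size below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
emb_dim, hidden, n_classes, seq_len = 8, 12, 3, 5
tokens = rng.normal(size=(seq_len, emb_dim, 1))          # pretend word embeddings

def run_rnn(inputs, W_x, W_h):
    h = np.zeros((hidden, 1))
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

W_xf, W_hf = rng.normal(0, 0.1, (hidden, emb_dim)), rng.normal(0, 0.1, (hidden, hidden))
W_xb, W_hb = rng.normal(0, 0.1, (hidden, emb_dim)), rng.normal(0, 0.1, (hidden, hidden))
W_out = rng.normal(0, 0.1, (n_classes, 2 * hidden))

h_fwd = run_rnn(tokens, W_xf, W_hf)                      # forward direction
h_bwd = run_rnn(tokens[::-1], W_xb, W_hb)                # backward direction
logits = W_out @ np.concatenate([h_fwd, h_bwd])          # bidirectional summary
y_hat = np.exp(logits) / np.exp(logits).sum()            # softmax over classes

y = np.array([[0.0], [1.0], [0.0]])                      # one-hot label
e = -float(np.sum(y * np.log(y_hat)))                    # cross-entropy, as on the slide
print("loss:", e)                                        # minimized via BPTT in training
```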

Problem Formulations with RNNs Character Level RNN

Problem Formulations with RNNs Sampled Examples from Character level RNN
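
To make the sampling procedure behind these examples concrete, here is a hedged numpy sketch of how text is drawn from a character-level RNN: at each step the network emits a distribution over characters, one character is sampled and fed back in as the next input. The weights below are random (untrained), so the printed text is gibberish; only the loop structure is the point.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = list("helo wrd")                                  # toy character vocabulary
V, H = len(vocab), 20
W_xh, W_hh, W_hy = (rng.normal(0, 0.1, s) for s in [(H, V), (H, H), (V, H)])

def sample(seed_char, n_chars):
    h = np.zeros((H, 1))
    x = np.zeros((V, 1)); x[vocab.index(seed_char)] = 1   # one-hot seed character
    out = [seed_char]
    for _ in range(n_chars):
        h = np.tanh(W_xh @ x + W_hh @ h)
        p = np.exp(W_hy @ h); p = (p / p.sum()).ravel()   # softmax over characters
        idx = rng.choice(V, p=p)                          # draw the next character
        x = np.zeros((V, 1)); x[idx] = 1                  # feed it back in
        out.append(vocab[idx])
    return "".join(out)

print(sample("h", 20))
```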

Problem Formulations with RNNs Dynamic Systems

Problem Formulations with RNNs Dynamic Systems Example https://youtu.be/meyqhkfwupg

LSTM for Optimization Optimization with LSTM Gradient Descent: θ_{t+1} = θ_t + α·g(∇f(θ)), where g(∇f(θ)) is a handcrafted update rule. Learning the Gradient Descent Update Rule: θ_{t+1} = θ_t + g(∇f(θ), φ), where φ are the parameters of an LSTM.
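
A toy numpy contrast between the two update rules, on a quadratic f(θ) = ||Aθ − b||². The "learned" rule here is only a stand-in (a linear map with parameters φ applied to the gradient); in the paper discussed next, g is an LSTM that also carries state between steps, and φ is trained by backpropagating the optimizee's loss through the unrolled optimization.

```python
import numpy as np

rng = np.random.default_rng(3)
A, b = rng.normal(size=(5, 5)), rng.normal(size=(5, 1))
f      = lambda th: float(np.sum((A @ th - b) ** 2))   # optimizee objective
grad_f = lambda th: 2 * A.T @ (A @ th - b)             # its gradient

theta = np.zeros((5, 1))
for _ in range(100):                       # hand-crafted rule: theta += alpha * g(grad)
    theta = theta - 0.01 * grad_f(theta)   # here g is just the (negated, scaled) gradient
print("hand-crafted GD:", f(theta))

phi = -0.01 * np.eye(5)                    # parameters of the "learned" update rule
theta = np.zeros((5, 1))
for _ in range(100):                       # learned rule: theta += g(grad_f(theta), phi)
    theta = theta + phi @ grad_f(theta)    # g(grad, phi) = phi @ grad in this sketch
print("learned-style update:", f(theta))
```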

LSTM for Optimization Learning to Learn Gradient Descent by Gradient Descent

LSTM for Optimization Learning to Learn Gradient Descent by Gradient Descent [Figure 4: Comparisons between learned and hand-crafted optimizers' performance (LSTM, LSTM+GAC, NTM-BFGS vs. ADAM, RMSprop, Rprop, Adadelta, Adagrad, SGD). Learned optimizers are shown with solid lines and hand-crafted optimizers with dashed lines. Units for the y axis in the MNIST plots are logits. Left: performance of different optimizers on randomly sampled 10-dimensional quadratic functions. Center: the LSTM optimizer outperforms standard methods training the base network on MNIST. Right: learning curves for steps 100-200 by an optimizer trained to optimize for 100 steps (continuation of center plot).]

LSTM for Optimization Learning to Learn Gradient Descent by Gradient Descent [Figure 5: Comparisons between learned and hand-crafted optimizers' performance. Units for the y axis are logits. Left: generalization to a different number of hidden units (40 instead of 20). Center: generalization to a different number of hidden layers (2 instead of 1); this optimization problem is very hard, because the hidden layers are very narrow. Right: training curves for an MLP with 20 hidden units using ReLU activations; the LSTM optimizer was trained on an MLP with sigmoid activations.] [Figure 6: Examples of images styled using the LSTM optimizer. Each triple consists of the content image (left), style (right), and image generated by the LSTM optimizer (center). Left: the result of applying the training style at the training resolution to a test image. Right: the result of applying a new style to a test image at double the resolution on which the optimizer was trained.] For hyperparameters other than the learning rate (e.g. decay coefficients for ADAM), the default values from the optim package in Torch7 are used. Initial values of all optimizee parameters were sampled from an IID Gaussian distribution.

Seq2Seq Learning Sequence to Sequence Learning We covered how to use an LSTM with fixed-length inputs and fixed-length outputs. Sequence-to-sequence learning uses two RNNs to solve the general sequence-to-sequence problem, where input and output lengths can differ. E.g.: machine translation, chatbots, image captioning, etc.
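
A minimal numpy sketch of the two-RNN idea: an encoder compresses a variable-length input into a fixed-size vector, and a decoder unrolls that vector into an output of a different length. Real systems use LSTMs, word embeddings, and an end-of-sequence token; everything below (names, sizes, random weights) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
D, H = 6, 10
We_x, We_h = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))   # encoder weights
Wd_h, Wd_y = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (D, H))   # decoder weights

def encode(xs):
    h = np.zeros((H, 1))
    for x in xs:                       # consume the whole input first ...
        h = np.tanh(We_x @ x + We_h @ h)
    return h                           # ... and summarize it in one vector

def decode(h, n_steps):
    ys = []
    for _ in range(n_steps):           # output length need not equal input length
        h = np.tanh(Wd_h @ h)          # (real decoders also feed the previous output back in)
        ys.append(Wd_y @ h)
    return ys

source = [rng.normal(size=(D, 1)) for _ in range(7)]   # e.g. 7 source tokens
target = decode(encode(source), n_steps=4)             # e.g. 4 target tokens
print(len(source), "->", len(target))
```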

Seq2Seq Learning Machine Translation

Seq2Seq Learning Machine Translation Error: e = − Σ_{i=1}^{n} Σ_{t=1}^{T} y_{i,t} log(ŷ_{i,t}). Update the weights θ with error e using Backpropagation Through Time.

Seq2Seq Learning GPU Implementation [Diagram of the multi-GPU layout: the 80k softmax by 1000 dims is very big, so the softmax is split across 4 GPUs; the recurrent layers use 1000 LSTM cells with 2000 dims per timestep, i.e. 2000 × 4 = 8k dims per sentence; the input language has a 160k-word vocabulary.]
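
A plain-numpy sketch of the "split the softmax" idea: the output projection is sharded into 4 blocks, each block computes its slice of the logits, and the slices are concatenated before normalizing. On the slide each shard lives on its own GPU; here the shards are just arrays and the sizes are scaled down.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab, hidden, n_shards = 8000, 100, 4
W = rng.normal(0, 0.01, (vocab, hidden))        # full output projection
shards = np.split(W, n_shards, axis=0)          # 4 blocks of 2000 output rows each

h = rng.normal(size=(hidden, 1))                # decoder hidden state for one timestep
partial_logits = [Wi @ h for Wi in shards]      # each "GPU" computes its slice
logits = np.concatenate(partial_logits)         # gather the slices
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax over the full vocabulary
print(probs.shape, float(probs.sum()))
```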

Seq2Seq Learning Projection of the Encoder LSTM. [Text excerpt from the seq2seq paper: … by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is within 0.5 BLEU points of the previous state of the art by rescoring the 1000-best list of the baseline system. 3.7 Performance on long sentences: we were surprised to discover that the LSTM did well on long sentences, which is shown quantitatively in Figure 3; Table 3 presents several examples of long sentences and their translations. 3.8 Model Analysis.] [Figure 2: a 2-dimensional PCA projection of the LSTM hidden states obtained after processing phrases such as "Mary admires John", "John is in love with Mary", "I was given a card by her in the garden", and "In the garden, she gave me a card". The phrases are clustered by meaning.]

Seq2Seq Learning Seq2Seq Application: Video to Text Input: https://www.youtube.com/watch?v=iidyue5qm0a Output: A monkey is pulling a dog's tail and is chased by the dog. Supervised problem: Given video and sentence pairs

Seq2Seq Learning Why Describe Videos? Robotics applications: human to robot interaction Describing videos for the blind Video indexing Just because

Seq2Seq Learning Alternative Models Holistic video representation: train classifiers to suggest subject, object, and actions; combine objects/actions with a language model using a graphical model and real-world knowledge; pick the most probable subject-action-object triplet; insert the triplet into a sentence template. Alternatively: fix the video sequence length; encode the video with a vanilla NN; output the sentence using an LSTM. Holistic video representation: http://www.aclweb.org/anthology/w13-1302

Seq2Seq Learning Recall: why Sequence to Sequence? The model sees the entire input sequence before starting to output. The output sequence length is not fixed to equal the input sequence length.

Seq2Seq Learning Model

Seq2Seq Learning Predictions We want to pick the most probable sequence of words: either generate sentences greedily, picking the highest softmax probability at each time step, or use beam search, e.g. try the top 3 most probable words at each time step and keep only the top 3 most probable sequences so far.
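
To make the two decoding strategies concrete, here is a small self-contained sketch comparing greedy decoding with beam search of width 3. The "model" is a fixed random conditional distribution over a toy vocabulary standing in for the decoder softmax; it is not the presenters' model.

```python
import numpy as np

rng = np.random.default_rng(6)
vocab = 10
probs = rng.dirichlet(np.ones(vocab), size=vocab)   # p(next word | previous word)

def greedy(start, length):
    seq, logp = [start], 0.0
    for _ in range(length):
        nxt = int(np.argmax(probs[seq[-1]]))         # always take the top word
        logp += np.log(probs[seq[-1]][nxt])
        seq.append(nxt)
    return seq, logp

def beam_search(start, length, width=3):
    beams = [([start], 0.0)]
    for _ in range(length):
        candidates = []
        for seq, logp in beams:
            top = np.argsort(probs[seq[-1]])[-width:]            # top-k next words
            candidates += [(seq + [int(w)], logp + np.log(probs[seq[-1]][w]))
                           for w in top]
        beams = sorted(candidates, key=lambda c: c[1])[-width:]  # keep best k sequences
    return max(beams, key=lambda c: c[1])

print("greedy:", greedy(0, 5))
print("beam  :", beam_search(0, 5))
```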

Seq2Seq Learning Examples Demo: https://www.youtube.com/watch?v=per0mjzsyam Paper, code, examples: https://www.cs.utexas.edu/~vsub/s2vt.html

Seq2Seq Learning Seq2Seq Models and Attention Problem: Basic seq2seq RNN models cannot handle very long sequences NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE Authors: Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio

Seq2Seq Learning Basic Seq2Seq Model p(y_i | {y_1, ..., y_{i−1}}, x) = g(y_{i−1}, s_i, c)

Seq2Seq Learning Attention Seq2Seq Model p(y_i | {y_1, ..., y_{i−1}}, x) = g(y_{i−1}, s_i, c_i), where c_i = Σ_{j=1}^{T} α_{ij} h_j

Seq2Seq Learning Encoder: Bidirectional RNN Two independent RNNs, one for each direction. The overall hidden state is the concatenation of the two independent hidden states. The hidden state at each time step contains information from the entire input; h_j is influenced most by the inputs around x_j.

Seq2Seq Learning Decoder: Attention c_i = Σ_{j=1}^{T} α_{ij} h_j, α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T} exp(e_{ik}), e_{ij} = a(s_{i−1}, h_j)
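
A numpy sketch of one decoder step of this attention mechanism, with the alignment model a(s_{i−1}, h_j) taken to be a small feed-forward scorer of the form v^T tanh(W_s s_{i−1} + W_h h_j), as in the Bahdanau et al. paper. All sizes and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
T, enc_dim, dec_dim, att_dim = 6, 8, 8, 5
H = rng.normal(size=(T, enc_dim))        # encoder states h_1 .. h_T (bidirectional concat)
s_prev = rng.normal(size=dec_dim)        # previous decoder state s_{i-1}
W_s = rng.normal(0, 0.1, (att_dim, dec_dim))
W_h = rng.normal(0, 0.1, (att_dim, enc_dim))
v   = rng.normal(0, 0.1, att_dim)

e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in H])  # e_ij = a(s_{i-1}, h_j)
alpha = np.exp(e) / np.exp(e).sum()                                 # alpha_ij, softmax over j
c = alpha @ H                                                       # c_i = sum_j alpha_ij h_j
print("alpha:", np.round(alpha, 3), " context shape:", c.shape)
```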

Seq2Seq Learning Improvements

Seq2Seq Learning Attention Visualization The picture shows the matrix of α_{ij}; α_{ij} is the relevance of input word j to output word i. White is 1, black is 0.
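
For completeness, a small sketch of how such a matrix can be rendered, assuming matplotlib is available; the α matrix below is random rather than taken from a trained model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Render the matrix of attention weights alpha_ij as a grayscale image
# (white = 1, black = 0): rows are output words, columns are input words.
rng = np.random.default_rng(8)
alpha = rng.dirichlet(np.ones(7), size=5)   # 5 output words attending over 7 input words

plt.imshow(alpha, cmap="gray", vmin=0.0, vmax=1.0)
plt.xlabel("input word j")
plt.ylabel("output word i")
plt.title(r"attention weights $\alpha_{ij}$")
plt.colorbar()
plt.show()
```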

Seq2Seq Learning End Thank you!