Plan, Attend, Generate: Planning for Sequence-to-Sequence Models

Francis Dutil, Caglar Gulcehre, Adam Trischler, Yoshua Bengio
Presented by Xinyuan Zhang
April 26, 2018

Introduction: Motivation
Although many natural sequences are produced step by step because of constraints on the output process, they are not necessarily conceived and ordered according to only local, step-by-step interactions. Planning is one way to induce coherence in sequential outputs such as language. In this paper, planning is performed over the input sequence by searching for alignments.

Introduction: Sequence-to-Sequence Models with Attention
Existing sequence-to-sequence models with attention have focused on generating the target sequence by aligning each generated output token to another token in the input sequence. In general, these models construct alignments using a simple MLP that conditions on the decoder's internal state.

Model: PAG
The PAG (Plan-Attend-Generate) model:
1. creates a plan;
2. computes a soft alignment based on the plan;
3. generates at each time-step in the decoder.
The goal is a mechanism that plans which parts of the input sequence to focus on for the next k time-steps of decoding. Concretely, the decoder's internal state is augmented with:
- an alignment plan matrix, and
- a commitment plan vector.

Model: Alignment Plan Matrix
The alignment plan matrix $A_t \in \mathbb{R}^{k \times T}$ stores the alignments for the current and the next $k-1$ time steps. The context $\psi_t$ is obtained as a weighted sum of the encoder annotations,
$$\psi_t = \sum_{i=1}^{T} \alpha_{ti} h_i, \qquad (1)$$
where the soft-alignment vector $\alpha_t = \mathrm{softmax}(A_t[0]) \in \mathbb{R}^{T}$.
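As a concrete illustration of Eq. (1), the short NumPy sketch below builds a toy alignment-plan matrix and derives the soft alignment and context from its first row; the dimensions k, T, d and the random values are assumptions for illustration only, not taken from the paper's setup.

import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
k, T, d = 5, 7, 16                  # plan horizon, source length, annotation size (toy values)
A_t = rng.normal(size=(k, T))       # alignment plan: row i holds the alignment planned for step t+i
H = rng.normal(size=(T, d))         # encoder annotations h_1, ..., h_T

alpha_t = softmax(A_t[0])           # soft alignment for the current step: softmax(A_t[0])
psi_t = alpha_t @ H                 # context psi_t = sum_i alpha_ti * h_i, shape (d,)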

Model: Commitment Plan Vector
The commitment plan vector $c_t$ governs whether to follow the existing alignment plan or to recompute it. Let $\bar{c}_t$ be the discretized commitment plan, obtained by setting $c_t$'s largest element to 1 and all other elements to 0, and let $g_t = \bar{c}_t[0]$ be the binary indicator variable. If $g_t = 0$, the shift function $\rho(\cdot)$ shifts the commitment vector and the alignment plan matrix forward from $t-1$. If $g_t = 1$, the model recomputes the commitment vector and the alignment plan matrix.
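A small sketch of the discretization step, assuming a toy commitment length k = 3 (illustrative values only):

import numpy as np

c_t = np.array([0.1, 0.7, 0.2])     # toy soft commitment plan (k = 3)
c_bar = np.zeros_like(c_t)
c_bar[np.argmax(c_t)] = 1.0         # keep only the largest element: [0., 1., 0.]
g_t = c_bar[0]                      # binary switch: 1 -> recompute the plans, 0 -> shift them forward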

Model: PAG Algorithm

Algorithm 1: Pseudocode for updating the alignment plan and commitment vector.

for j in {1, ..., |X|} do
    for t in {1, ..., |Y|} do
        if g_t = 1 then
            c_t = softmax(f_c(s_{t-1}))
            β_t^j = f_r(A_{t-1}[j])                         {Read alignment plan}
            Ā_t[i] = f_align(s_{t-1}, h_j, β_t^j, y_t)      {Compute candidate alignment plan}
            u_{tj} = f_up(h_j, s_{t-1}, ψ_{t-1})            {Compute update gate}
            A_t = (1 − u_{tj}) ⊙ A_{t-1} + u_{tj} ⊙ Ā_t     {Update alignment plan}
        else
            A_t = ρ(A_{t-1})                                {Shift alignment plan}
            c_t = ρ(c_{t-1})                                {Shift commitment plan}
        end if
        Compute the alignment as α_t = softmax(A_t[0])
    end for
end for
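For concreteness, here is a minimal NumPy sketch of one decoder step of this recompute-vs-shift logic. The learned functions f_c, f_align, and f_up are replaced by simple stand-in projections, and the zero-padding behaviour of the shift function ρ is an assumption; this only illustrates the control flow of Algorithm 1, not the authors' implementation.

import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rho(x):
    # Shift one step forward along the first axis; zero-pad the freed slot (boundary handling assumed).
    out = np.zeros_like(x)
    out[:-1] = x[1:]
    return out

def pag_step(A_prev, c_prev, s_prev, H):
    # One decoder step: recompute the plans if the commitment switch fires, otherwise shift them.
    k, T = A_prev.shape
    g_t = c_prev[0]                                       # switch read from the discretized commitment plan
    if g_t == 1:
        c_t = softmax(s_prev[:k])                         # stand-in for softmax(f_c(s_{t-1}))
        A_cand = np.tanh(H @ s_prev)[None, :] + A_prev    # stand-in candidate plan (f_align)
        u_t = 1.0 / (1.0 + np.exp(-(H @ s_prev)))         # stand-in per-token update gate (f_up), in (0, 1)
        A_t = (1.0 - u_t) * A_prev + u_t * A_cand         # interpolate old and candidate plans
    else:
        A_t = rho(A_prev)                                 # shift alignment plan
        c_t = rho(c_prev)                                 # shift commitment plan
    alpha_t = softmax(A_t[0])                             # alignment for the current step
    psi_t = alpha_t @ H                                   # context vector
    return A_t, c_t, alpha_t, psi_t

# Toy usage: discretized commitment plans are passed in explicitly here.
rng = np.random.default_rng(0)
k, T, d = 5, 7, 16
H = rng.normal(size=(T, d))
A, s = rng.normal(size=(k, T)), rng.normal(size=d)
A, c, alpha, psi = pag_step(A, np.eye(k)[0], s, H)        # g_t = 1: recompute the plans
A, c, alpha, psi = pag_step(A, np.eye(k)[2], s, H)        # g_t = 0: shift the plans forward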

Model: rPAG Algorithm
To reduce the model's computational cost, rPAG (repeat, plan, attend, and generate) is proposed: it reuses the alignment vector from the previous time-step until the commitment switch activates, at which point the model computes a new alignment vector. rPAG can be viewed as learning an explicit segmentation with an implicit planning mechanism in an unsupervised fashion. Repetition reduces the computational complexity of the alignment mechanism drastically; it also eliminates the need for an explicit alignment-plan matrix, which reduces memory consumption as well.

Algorithm 2: Pseudocode for updating the repeat alignment and commitment vector.

for j in {1, ..., |X|} do
    for t in {1, ..., |Y|} do
        if g_t = 1 then
            c_t = softmax(f_c(s_{t-1}, ψ_{t-1}))
            α_t = softmax(f_align(s_{t-1}, h_j, y_t))
        else
            c_t = ρ(c_{t-1})       {Shift the commitment vector c_{t-1}}
            α_t = α_{t-1}          {Reuse the old alignment}
        end if
    end for
end for
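Under the same stand-in assumptions as the previous sketch, the corresponding rPAG step keeps only the current alignment vector and the commitment plan, with no alignment-plan matrix; again this sketches the control flow of Algorithm 2, not the authors' code.

import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rho(x):
    # Shift one step forward, zero-padding the freed slot (boundary handling assumed).
    out = np.zeros_like(x)
    out[:-1] = x[1:]
    return out

def rpag_step(alpha_prev, c_prev, s_prev, H):
    # One rPAG decoder step: recompute the alignment only when the commitment switch fires.
    k = c_prev.shape[0]
    if c_prev[0] == 1:                     # g_t = 1: recompute
        c_t = softmax(s_prev[:k])          # stand-in for softmax(f_c(s_{t-1}, psi_{t-1}))
        alpha_t = softmax(H @ s_prev)      # stand-in for softmax(f_align(s_{t-1}, h_j, y_t))
    else:                                  # g_t = 0: repeat
        c_t = rho(c_prev)                  # shift the commitment vector
        alpha_t = alpha_prev               # reuse the previous alignment
    psi_t = alpha_t @ H                    # context vector
    return alpha_t, c_t, psi_t

# Toy usage with a commitment switch that fires on the first step.
rng = np.random.default_rng(0)
k, T, d = 5, 7, 16
alpha, c, psi = rpag_step(softmax(rng.normal(size=T)), np.eye(k)[0], rng.normal(size=d), rng.normal(size=(T, d)))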

Experiments
Baseline: the encoder-decoder architecture with attention described in Chung et al. (2016).
Tasks:
- Algorithmic task: the PAG model solves the Eulerian-circuits problem with 100% absolute accuracy, while the baseline achieves 90.4% accuracy on the test set.
- Question generation: models were tested on SQuAD (Rajpurkar et al., 2016), a question-answering (QA) corpus in which each sample is a (document, question, answer) triple.
- Character-level neural machine translation: models were tested on WMT'15 for the English-to-German, English-to-Czech, and English-to-Finnish language pairs.

Experiments Cont'd
Question generation: Examples from SQuAD's training set were held out for validation, and the official development set was used as the test set. The pointer-softmax uses the alignments to predict the location of each output word, so the planning mechanism has a direct influence on the decoder's predictions. On the test set the baseline achieved 66.25 NLL while PAG achieved 65.45 NLL.
Character-level NMT: BLEU scores for each model are computed with the multi-bleu.perl script, using newstest2013 as the development set and newstest2014 ("Test 2014") and newstest2015 ("Test 2015") as the test sets. The baseline was trained with the hyperparameters from Chung et al. (2016) and the code provided with that paper; a single run is reported per model. On WMT'14 and WMT'15, Kalchbrenner et al. (2016) have reported better results with deeper convolutional models (ByteNets): 23.75 and 26.26 BLEU, respectively.
Figure: Learning curves for the question-generation models on the development set (left) and for character-level neural machine translation, English to German on WMT'15 (right). Both models were trained with the same hyperparameters; PAG converges faster than the baseline, which has larger capacity.
Conclusion: This work addresses a fundamental issue in the neural generation of long sequences by integrating planning into the alignment mechanism of sequence-to-sequence architectures.

Experiments Cont'd
Figure 2: Visualization of the alignments learned by PAG (a), rPAG (b), and the baseline model with a 2-layer GRU decoder using h2 for the attention (c). As depicted, the alignments learned by PAG and rPAG are smoother than those of the baseline. The baseline tends to put too much attention on the ...