Plan, Attend, Generate: Planning for Sequence-to-Sequence Models
Francis Dutil, Caglar Gulcehre, Adam Trischler, Yoshua Bengio
Presented by Xinyuan Zhang
April 26, 2018
Introduction: Motivation

Although many natural sequences are produced step by step because of constraints on the output process, they are not necessarily conceived and ordered according to only local, step-by-step interactions. Planning is one way to induce coherence in sequential outputs such as language. In this paper, planning is performed over the input sequence by searching for alignments.
Introduction: Sequence-to-Sequence Models with Attention

Existing sequence-to-sequence models with attention generate the target sequence by aligning each generated output token to a token in the input sequence. In general, these models construct alignments using a simple MLP that conditions on the decoder's internal state.
Model: PAG

The PAG (Plan-Attend-Generate) model:
1. creates a plan;
2. computes a soft alignment based on the plan;
3. generates an output at each time step in the decoder.

The goal is a mechanism that plans which parts of the input sequence to focus on for the next $k$ time steps of decoding. Concretely, the decoder's internal state is augmented with:
- an alignment plan matrix;
- a commitment plan vector.
Model: Alignment Plan Matrix

The alignment plan matrix $A_t \in \mathbb{R}^{k \times T}$ stores the alignments for the current and the next $k-1$ time steps. The context $\psi_t$ is obtained as a weighted sum of the encoder annotations,
$$\psi_t = \sum_{i=1}^{T} \alpha_{ti} h_i, \qquad (1)$$
where the soft-alignment vector is $\alpha_t = \mathrm{softmax}(A_t[0]) \in \mathbb{R}^T$.
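A minimal NumPy sketch of this context computation. The sizes k, T, d and the random matrices standing in for the learned plan and the encoder annotations are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes: plan horizon k, source length T, annotation dim d.
k, T, d = 5, 10, 8
rng = np.random.default_rng(0)

A_t = rng.standard_normal((k, T))   # alignment plan matrix (k x T)
H = rng.standard_normal((T, d))     # encoder annotations h_1 .. h_T

alpha_t = softmax(A_t[0])           # soft alignment from the first plan row
psi_t = alpha_t @ H                 # context: weighted sum of annotations
```

Only the first row of the plan, A_t[0], drives the current step's attention; the remaining k−1 rows hold the planned alignments for future steps.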
Model: Commitment Plan Vector

The commitment plan vector $c_t$ governs whether to follow the existing alignment plan or to recompute it. Let $\bar{c}_t$ be the discretized commitment plan, obtained by setting the largest element of $c_t$ to 1 and all other elements to 0, and let $g_t = \bar{c}_t[0]$ be the binary indicator variable.
- If $g_t = 0$, the shift function $\rho(\cdot)$ shifts the commitment vector and the alignment plan matrix forward from time $t-1$.
- If $g_t = 1$, the model recomputes the commitment vector and the alignment plan matrix.
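The discretization and the shift function $\rho(\cdot)$ can be sketched as follows; the example vector is an assumed toy input:

```python
import numpy as np

def discretize(c):
    # Set the largest element of c to 1 and all others to 0.
    out = np.zeros_like(c)
    out[np.argmax(c)] = 1.0
    return out

def shift(x):
    # rho(.): shift entries forward by one step, padding the tail with zeros.
    # Works for the commitment vector (1-D) and the plan matrix rows (2-D).
    out = np.zeros_like(x)
    out[:-1] = x[1:]
    return out

c = np.array([0.1, 0.7, 0.2])   # toy commitment plan (assumption)
c_bar = discretize(c)           # -> [0., 1., 0.]
g = c_bar[0]                    # binary indicator: 0 here, so follow the plan
c_next = shift(c_bar)           # -> [1., 0., 0.]: commit fires at the next step
```

After one shift the 1 reaches position 0, so $g_{t+1} = 1$ and the model recomputes its plan, which is how the commitment vector schedules the next replanning step.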
Model: PAG Algorithm

Algorithm 1: Pseudocode for updating the alignment plan and commitment vector.

for j ∈ {1, ..., |X|} do
  for t ∈ {1, ..., |Y|} do
    if g_t = 1 then
      c_t = softmax(f_c(s_{t-1}))
      β_t^j = f_r(A_{t-1}[j])                        {Read alignment plan}
      Ā_t[i] = f_align(s_{t-1}, h_j, β_t^j, y_t)     {Compute candidate alignment plan}
      u_{tj} = f_up(h_j, s_{t-1}, ψ_{t-1})           {Compute update gate}
      A_t = (1 − u_{tj}) ⊙ A_{t-1} + u_{tj} ⊙ Ā_t    {Update alignment plan}
    else
      A_t = ρ(A_{t-1})                               {Shift alignment plan}
      c_t = ρ(c_{t-1})                               {Shift commitment plan}
    end if
    Compute the alignment as α_t = softmax(A_t[0])
  end for
end for
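A runnable sketch of the decoding loop above. The learned networks f_c, f_align, and f_up are replaced by random stand-ins, the update gate is applied per source position rather than per (j, t) pair, and all sizes are assumptions; the point is only the control flow of committing versus shifting:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shift(x):
    # rho(.): shift forward by one step, zero-padding the tail.
    out = np.zeros_like(x)
    out[:-1] = x[1:]
    return out

k, T, d = 5, 10, 8
rng = np.random.default_rng(1)
H = rng.standard_normal((T, d))            # encoder annotations

# Random stand-ins for the learned networks (assumptions, not the real MLPs).
f_c = lambda s: rng.standard_normal(k)
f_align = lambda s, H: rng.standard_normal((k, T))
f_up = lambda s: 1.0 / (1.0 + np.exp(-rng.standard_normal(T)))  # gate in (0, 1)

A = np.zeros((k, T))                       # alignment plan matrix
c = np.zeros(k); c[0] = 1.0                # commitment plan: recompute first
s = np.zeros(d)                            # dummy decoder state

for t in range(6):                         # a few decoding steps
    if c[0] == 1:                          # g_t = 1: recompute the plan
        c_soft = softmax(f_c(s))
        c = np.zeros(k); c[np.argmax(c_soft)] = 1.0   # discretize c_t
        A_cand = f_align(s, H)             # candidate alignment plan
        u = f_up(s)                        # update gate
        A = (1 - u) * A + u * A_cand       # interpolate old and candidate plan
    else:                                  # g_t = 0: follow the existing plan
        A = shift(A)
        c = shift(c)
    alpha = softmax(A[0])                  # alignment used at step t
    psi = alpha @ H                        # context vector fed to the decoder
```

The gating update lets the model revise its plan smoothly instead of overwriting it outright, while the shift branch makes a committed plan cheap to follow.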
Model: rPAG Algorithm

To reduce the model's computational cost, rPAG (repeat, plan, attend, and generate) is proposed: the alignment vector from the previous time step is reused until the commitment switch activates, at which point the model computes a new alignment vector. rPAG can be viewed as learning an explicit segmentation with an implicit planning mechanism in an unsupervised fashion. Repetition reduces the computational complexity of the alignment mechanism drastically; it also eliminates the need for an explicit alignment-plan matrix, which reduces memory consumption as well.

Algorithm 2: Pseudocode for updating the repeat alignment and commitment vector.

for j ∈ {1, ..., |X|} do
  for t ∈ {1, ..., |Y|} do
    if g_t = 1 then
      c_t = softmax(f_c(s_{t-1}, ψ_{t-1}))
      α_t = softmax(f_align(s_{t-1}, h_j, y_t))
    else
      c_t = ρ(c_{t-1})        {Shift the commitment vector}
      α_t = α_{t-1}           {Reuse the old alignment}
    end if
  end for
end for
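The repeat variant can be sketched in the same style; again f_c and f_align are random stand-ins for the learned networks and the sizes are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def shift(x):
    # rho(.): shift forward by one step, zero-padding the tail.
    out = np.zeros_like(x)
    out[:-1] = x[1:]
    return out

k, T = 5, 10
rng = np.random.default_rng(2)

# Random stand-ins for the learned networks (assumptions).
f_c = lambda: rng.standard_normal(k)
f_align = lambda: rng.standard_normal(T)

c = np.zeros(k); c[0] = 1.0          # commitment plan: recompute first
alpha = None
history = []
for t in range(8):
    if c[0] == 1:                    # switch active: compute a new alignment
        c_soft = softmax(f_c())
        c = np.zeros(k); c[np.argmax(c_soft)] = 1.0   # discretize c_t
        alpha = softmax(f_align())
    else:                            # otherwise reuse the old alignment
        c = shift(c)
    history.append(alpha.copy())
```

Note that no k×T plan matrix is kept here: a single alignment vector is either recomputed or repeated, which is exactly the memory saving rPAG claims over PAG.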
Experiments

Baseline: an encoder-decoder architecture with attention, as described in Chung et al. (2016).

Tasks:
- Algorithmic task: the PAG model solves the Eulerian-circuits problem with 100% accuracy, while the baseline achieves 90.4% accuracy on the test set.
- Question generation: models were tested on SQuAD (Rajpurkar et al., 2016), a question answering (QA) corpus in which each sample is a (document, question, answer) triple.
- Character-level neural machine translation: models were tested on WMT'15 for the English-to-German, English-to-Czech, and English-to-Finnish language pairs.
Experiments Cont'd

Question generation: the pointer-softmax uses the alignments to predict the location of each word, so the planning mechanism has a direct influence on the decoder's predictions. Part of SQuAD's training set was used for validation and the official development set for evaluation. On the test set the baseline achieved 66.25 NLL while PAG achieved 65.45 NLL.

Character-level NMT: BLEU scores are computed with the multi-bleu.perl script, using newstest2013 as the development set and newstest2014 and newstest2015 as the "Test 2014" and "Test 2015" sets. (†) denotes the baseline trained with the hyperparameters from Chung et al. (2016) and the code provided with that paper; only one run per model is reported. Kalchbrenner et al. (2016) reported better results, 23.75 and 26.26, with deep convolutional models (ByteNets).

Figure: Learning curves for question generation on the development set (left) and for character-level NMT on WMT'15 En→De (right). Both models were trained with the same hyperparameters; PAG converges faster than the baseline (which has larger capacity).
Experiments Cont'd

Figure 2: Visualization of the alignments learned by PAG in (a), rPAG in (b), and the baseline model with a 2-layer GRU decoder using h² for the attention in (c). As depicted, the alignments learned by PAG and rPAG are smoother than those of the baseline. The baseline tends to put too much attention on the