Advanced RNN (GRU and LSTM) for Machine Translation. Dr. Kira Radinsky, CTO SalesPredict, Visiting Professor/Scientist, Technion


1 Advanced RNN (GRU and LSTM) for Machine Translation. Dr. Kira Radinsky, CTO SalesPredict, Visiting Professor/Scientist, Technion. Slides were adapted from lectures by Richard Socher.

2 Overview Machine translation. RNN models tackling MT: Gated Recurrent Units by Cho et al. (2014), Long Short-Term Memories by Hochreiter and Schmidhuber (1997).

3 Machine Translation Methods are statistical. Use parallel corpora (e.g., European Parliament proceedings). First parallel corpus: the Rosetta Stone. Traditional systems are very complex.

4 Current statistical machine translation systems Source language f, e.g. French. Target language e, e.g. English. Probabilistic formulation (using Bayes' rule): a translation model p(f|e) trained on the parallel corpus, and a language model p(e) trained on an English-only corpus (lots of it, and free!). Pipeline: French → Translation Model p(f|e) → pieces of English → Language Model p(e) → Decoder argmax p(f|e)p(e) → proper English.
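Written out, this is the standard noisy-channel decomposition the slide is pointing at (the argmax notation below is a reconstruction, not copied from the slide):

```latex
\hat{e} = \arg\max_{e} \, p(e \mid f)
        = \arg\max_{e} \, \frac{p(f \mid e)\, p(e)}{p(f)}
        = \arg\max_{e} \, p(f \mid e)\, p(e)
```

The denominator p(f) is dropped because it does not depend on the candidate translation e.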

5 Phrase-based decoder

6 Step 1: Alignment Goal: know which words or phrases in the source language translate to which words or phrases in the target language. Already hard! Example: "Japan shaken by two new quakes" / "Le Japon secoué par deux nouveaux séismes", where one of the French words is spurious (it has no English counterpart). Alignment examples from Chris Manning/CS224n.

7 Step 1: Alignment Zero-fertility words are not translated, and some words need one-to-many alignments. Example: "And the program has been implemented" / "Le programme a été mis en application" ("And" is a zero-fertility word; "implemented" aligns one-to-many to "mis en application").

8 Step 1: Alignment Really hard :/ Some sentences need many-to-one alignments. Example: "The balance was the territory of the aboriginal people" / "Le reste appartenait aux autochtones".

9 Step 1: Alignment Many-to-many (phrase) alignment. Example: "The poor don't have any money" / "Les pauvres sont démunis" ("don't have any money" aligns as a phrase to "sont démunis").

10 Step 1: Alignment We could spend an entire lecture on alignment models. Not only single words but also phrases and syntax can be aligned. Then consider reordering of the translated phrases. Example: "er geht ja nicht nach hause" / "he does not go home". Example from Philipp Koehn.

11 Phrase-Based Statistical MT: The Pharaoh/Moses Model Foreign input is segmented into phrases (a "phrase" is any subsequence of words, not a linguistic phrase). Each phrase is probabilistically translated into English, e.g. P(to the conference | zur Konferenz), P(into the meeting | zur Konferenz). Phrases are probabilistically re-ordered. (See J&M or Lopez 2008 for an intro.) This is still pretty much the state of the art!

12 After many steps Each phrase in the source language has many possible translations, resulting in a large search space. [The slide shows the table of translation options for "er geht ja nicht nach hause": each German word or phrase has many candidate English translations, e.g. he / it / goes / go / is / yes / of course / not / does not / do not / is not / after / to / according to / house / home / chamber / at home.]

13 Decode: Search for best of many hypotheses A hard search problem that also includes the language model. [The slide shows the search graph over partial hypotheses for "er geht ja nicht nach hause", e.g. "he", "does not", "go home", building toward "he does not go home".]

14 Traditional MT Skipped hundreds of important details. A lot of human feature engineering. Very complex systems. Many different, independent machine learning problems.

15 Deep learning to the rescue!? Maybe we could translate directly with an RNN? Encoder-Decoder: the encoder reads the source sentence (inputs x_1, x_2, x_3 = "Echt dicke Kiste" into hidden states h_1, h_2, h_3 with recurrent weights W), and the decoder generates the target words (y_1 = "Awesome", y_2 = "sauce"). The last encoder hidden state needs to capture the entire phrase!

16 MT with RNNs Simplest Model Encoder: the hidden state is updated from the previous hidden state and the current source word. Decoder: the hidden state is updated from the previous hidden state alone, and each target word is predicted from it (equations reconstructed below). Minimize the cross-entropy error for all target words conditioned on the source words. It's not quite that simple ;)
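The equations on this slide are images in the original; the following is a reconstruction of the standard simplest encoder-decoder formulation in the notation of Socher's lecture notes, so treat the exact matrix names as assumed:

```latex
\begin{align*}
\text{Encoder:}\quad & h_t = \phi(h_{t-1}, x_t) = f\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right) \\
\text{Decoder:}\quad & h_t = \phi(h_{t-1}) = f\!\left(W^{(hh)} h_{t-1}\right), \qquad
  y_t = \operatorname{softmax}\!\left(W^{(S)} h_t\right) \\
\text{Objective:}\quad & \max_{\theta} \; \frac{1}{N}\sum_{n=1}^{N} \log p_{\theta}\!\left(y^{(n)} \mid x^{(n)}\right)
\end{align*}
```

The decoder is initialized with the last encoder hidden state, which is why that single vector has to capture the entire source phrase.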

17 RNN Translation Model Extensions 1. Train different RNN weights for encoding and decoding (same picture as before: x_1, x_2, x_3 = "Echt dicke Kiste" encoded into h_1, h_2, h_3; decoder emits y_1 = "Awesome", y_2 = "sauce"). This means the φ() functions in the encoder and decoder would have different W^(hh) matrices.

18 RNN Translation Model Extensions Notation: each input of φ has its own linear transformation matrix. 2. Compute every hidden state in the decoder from: the previous hidden state (standard), the last hidden vector of the encoder c = h_T, and the previous predicted output word y_{t-1}. This gives a language model with three inputs to each decoder neuron: (h_{t-1}, c, y_{t-1}). Cho et al. 2014.
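Written out, the decoder update with these three inputs looks roughly as follows (a sketch of the formulation in Cho et al. 2014; the matrix names are assumed notation):

```latex
h_t = \phi\!\left(h_{t-1},\, c,\, y_{t-1}\right)
    = f\!\left(W^{(hh)} h_{t-1} + W^{(hc)} c + W^{(hy)} y_{t-1}\right),
\qquad c = h_T^{\text{enc}}
```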

19 Different picture, same idea Kyunghyun Cho et al. 2014

20 RNN Translation Model Extensions 3. Train stacked/deep RNNs with multiple layers (hidden layers h^(1), h^(2), h^(3) above the input x). 4. Potentially train a bidirectional encoder. 5. Train on the input sequence in reverse order for a simpler optimization problem: instead of A B C → X Y, train with C B A → X Y.

21 6. Main Improvement: Better Units More complex hidden unit computation in the recurrence! Gated Recurrent Units (GRU) introduced by Cho et al. 2014 (see reading list). Main ideas: keep around memories to capture long-distance dependencies, and allow error messages to flow at different strengths depending on the inputs.

22 GRUs A standard RNN computes the hidden layer at the next time step directly from the input and the previous hidden state. A GRU first computes an update gate (another layer) based on the current input word vector and the hidden state, and computes a reset gate similarly but with different weights.

23 GRUs Update gate. Reset gate. New memory content. Final memory at the time step, which combines the current and previous time steps. (The equations are images on the original slide; they are reconstructed below.)
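A reconstruction of the standard GRU equations these labels refer to (Cho et al. 2014), using the convention from slide 26 where z_t close to 1 copies the previous state; the exact matrix names are assumed notation:

```latex
\begin{align*}
z_t &= \sigma\!\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right) && \text{update gate} \\
r_t &= \sigma\!\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right) && \text{reset gate} \\
\tilde{h}_t &= \tanh\!\left(W x_t + r_t \circ U h_{t-1}\right) && \text{new memory content} \\
h_t &= z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t && \text{final memory}
\end{align*}
```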

24 GRUs Intuitively, the update gate defines how much of the previous memory to keep around.

25 GRUs Intuitively, the reset gate determines how to combine the new input with the previous memory. If we set the reset gate to all 1's and the update gate to all 0's, we again arrive at our plain RNN model.
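Plugging r_t = 1 and z_t = 0 into the reconstructed equations above makes this concrete:

```latex
h_t = \tilde{h}_t = \tanh\!\left(W x_t + U h_{t-1}\right)
```

which is exactly the vanilla RNN update.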

26 GRUs Update gate, reset gate, new memory content, final memory (same equations as above). If a reset gate unit is ~0, the new memory content ignores the previous memory and only stores the new word information: if the i-th element of r_t is 0, only the current word is taken into account for that unit. The final memory at each time step combines the current and previous time steps: if the i-th element of z_t is 1, we copy the previous state and ignore the current one (including the current word); otherwise we take the new memory content, which mixes the current word with its connection to previous words as determined by the reset gate.

27 Attempt at a clean illustration [Diagram showing, bottom to top, the quantities at time steps t-1 and t: input x_t, reset gate r_t, update gate z_t (has to be a sigmoid to illustrate the on/off switch better), new memory ~h_t (with reset applied), and final memory h_t.]

28 GRU intuition If the reset gate is close to 0, ignore the previous hidden state → allows the model to drop information that is irrelevant in the future. The update gate z controls how much of the past state should matter now. If z is close to 1, then we can copy information in that unit through many time steps: less vanishing gradient! Units with short-term dependencies often have very active reset gates.

29 GRU intuition Units with long-term dependencies have active update gates z. [Illustration of the GRU unit: x feeds r and ~h, which together with z produce h.] Derivatives? The rest is the same chain rule as before, but implement it with modularization or automatic differentiation (e.g. Theano).

30 GRU Python Implementation A GRU layer is just another way of computing the hidden state, so all we really need to do is change the hidden-state computation in our forward propagation function. In our implementation we also added bias units; it's quite typical that these are not shown in the equations. I also added a word embedding layer E.
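The actual code is not reproduced on the slide; the following is a minimal NumPy sketch of what such a forward step could look like, with the bias units and the embedding layer E the slide mentions (the parameter shapes and names here are my assumptions, not the original implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_forward_step(x_t, h_prev, E, U, W, b):
    """One GRU step for word index x_t given the previous hidden state h_prev.

    E: word embedding matrix (embed_dim x vocab_size)
    U: input-to-hidden weights, 3 matrices (hidden_dim x embed_dim) for z, r, h~
    W: hidden-to-hidden weights, 3 matrices (hidden_dim x hidden_dim)
    b: 3 bias vectors of length hidden_dim
    """
    x_e = E[:, x_t]                                                   # embedding lookup
    z = sigmoid(U[0].dot(x_e) + W[0].dot(h_prev) + b[0])              # update gate
    r = sigmoid(U[1].dot(x_e) + W[1].dot(h_prev) + b[1])              # reset gate
    h_tilde = np.tanh(U[2].dot(x_e) + W[2].dot(r * h_prev) + b[2])    # new memory content
    return z * h_prev + (1.0 - z) * h_tilde                           # final memory
```

Looping this step over a sentence gives the sequence of hidden states from which the output layer (a softmax over the vocabulary) is computed.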

31 GRU Python Implementation: Gradients We could derive the gradients for E, W, U, b (and the remaining parameters) by hand using the chain rule, just like we did before. But in practice most people use libraries like Theano that support automatic differentiation of expressions.
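As a toy illustration of what Theano's automatic differentiation buys us (a generic example, not the lecture's actual code):

```python
import numpy as np
import theano
import theano.tensor as T

# A GRU-gate-like symbolic expression; Theano builds the graph and derives gradients.
W = theano.shared(np.random.randn(4, 4), name='W')
x = T.dvector('x')
h = T.dvector('h')
z = T.nnet.sigmoid(T.dot(W, x) + h)   # update-gate-style expression
cost = T.sum(z ** 2)                  # any scalar cost
dW = T.grad(cost, W)                  # gradient w.r.t. W, no hand-written chain rule
grad_fn = theano.function([x, h], dW)
print(grad_fn(np.ones(4), np.zeros(4)))
```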

32 Adding a second GRU layer (a sketch of what stacking looks like is shown below).
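The slide's code is an image; as a rough sketch of the idea, reusing the numpy import, sigmoid helper, and parameter shapes from the gru_forward_step sketch above (again assumed, not the original code), the second layer simply treats the first layer's hidden state as its input:

```python
def gru_step(x_vec, h_prev, U, W, b):
    """GRU step on an already-embedded input vector (same gates as gru_forward_step)."""
    z = sigmoid(U[0].dot(x_vec) + W[0].dot(h_prev) + b[0])
    r = sigmoid(U[1].dot(x_vec) + W[1].dot(h_prev) + b[1])
    h_tilde = np.tanh(U[2].dot(x_vec) + W[2].dot(r * h_prev) + b[2])
    return z * h_prev + (1.0 - z) * h_tilde

def two_layer_gru_step(x_t, h1_prev, h2_prev, E, params1, params2):
    h1_t = gru_step(E[:, x_t], h1_prev, *params1)  # layer 1 reads the word embedding
    h2_t = gru_step(h1_t, h2_prev, *params2)       # layer 2 reads layer 1's hidden state
    return h1_t, h2_t
```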

33 Results Here are a few good examples of the network output (capitalization added by me): "I am a bot, and this action was performed automatically." "I enforce myself ridiculously well enough to just youtube." "I've got a good rhythm going!" "There is no problem here, but at least still wave!" "It depends on how plausible my judgement is." "( with the constitution which makes it impossible )" Our network was able to learn semantic dependencies! For example, "bot" and "automatically" are clearly related, as are the opening and closing brackets.

34 Long Short-Term Memories (LSTMs) We can make the units even more complex. Allow each time step to modify: an input gate (high if the current cell matters), a forget gate (if 0, forget the past), an output gate (how much of the cell is exposed), a new memory cell, the final memory cell, and the final hidden state. Many variations exist: see "LSTM: A Search Space Odyssey".
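The slide lists the gates without showing the equations (they are images in the original); the following is the standard LSTM formulation they refer to, again with assumed matrix names and without biases or peephole connections:

```latex
\begin{align*}
i_t &= \sigma\!\left(W^{(i)} x_t + U^{(i)} h_{t-1}\right) && \text{input gate} \\
f_t &= \sigma\!\left(W^{(f)} x_t + U^{(f)} h_{t-1}\right) && \text{forget gate} \\
o_t &= \sigma\!\left(W^{(o)} x_t + U^{(o)} h_{t-1}\right) && \text{output gate} \\
\tilde{c}_t &= \tanh\!\left(W^{(c)} x_t + U^{(c)} h_{t-1}\right) && \text{new memory cell} \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t && \text{final memory cell} \\
h_t &= o_t \circ \tanh(c_t) && \text{final hidden state}
\end{align*}
```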

35 Long Short-Term Memories (LSTMs) The candidate hidden state (new memory cell) is computed from the current input and the previous hidden state. It is exactly the same equation we had in our vanilla RNN! However, instead of taking it directly as the new hidden state as we did in the RNN, we use the input gate from above to pick only some of it.

36 Long Short-Term Memories (LSTMs) The internal memory of the unit combines the previous memory, multiplied by the forget gate, with the newly computed candidate state, multiplied by the input gate. We could choose to ignore the old memory completely (forget gate all 0's) or ignore the newly computed state completely (input gate all 0's), but most likely we want something in between these two extremes.

37 Long Short-Term Memories (LSTMs) Given the memory c_t, we finally compute the output hidden state h_t by multiplying the memory with the output gate. Not all of the internal memory may be relevant to the hidden state used by other units in the network.

38 Illustrations a bit overwhelming ;) [Original LSTM cell diagram from "Long Short-Term Memory" by Hochreiter and Schmidhuber (1997), showing the input, input gate, output gate, and the memory cell. See http://people.idsia.ch/~juergen/lstm/sld017.htm and http://deeplearning.net/tutorial/lstm.html] Intuition: memory cells can keep information intact, unless the inputs make them forget it or overwrite it with new input. The cell can decide to output this information or just store it.

39 LSTMs are currently very hip! En vogue default model for most sequence labeling tasks. Very powerful, especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network). Most useful if you have lots and lots of data.

40 Deep LSTMs don't outperform traditional MT yet
Table 1 (performance of the LSTM on the WMT'14 English-to-French test set, ntst14), methods compared by test BLEU score: Bahdanau et al. [2]; Baseline System [29]; single forward LSTM with beam search; single reversed LSTM with beam search; ensembles of 2 and 5 reversed LSTMs with various beam sizes. Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.
Table 2 (rescoring, test BLEU score on ntst14): Baseline System [29]; Cho et al. [5]; best WMT'14 result [9]: 37.0; rescoring the baseline 1000-best list with a single forward LSTM; with a single reversed LSTM; with an ensemble of 5 reversed LSTMs: 36.5; oracle rescoring of the baseline 1000-best lists: 45.
Sequence to Sequence Learning by Sutskever et al. 2014.

41 Deep LSTM for Machine Translation PCA of vectors from the last time step's hidden layer (layer 4). [2-D projection in which sentences with similar meanings appear close together; the plotted sentences include "I was given a card by her in the garden", "In the garden, she gave me a card", "She gave me a card in the garden", "I gave her a card in the garden", "In the garden, I gave her a card", "She was given a card by me in the garden", and "Mary admires John", "Mary is in love with John", "Mary respects John", "John admires Mary", "John is in love with Mary", "John respects Mary".] Sequence to Sequence Learning by Sutskever et al. 2014.

42 Further Improvements: More Gates! Gated Feedback Recurrent Neural Networks, Chung et al. 2015. (a) Conventional stacked RNN (b) Gated Feedback RNN.

43 Summary LSTMs/GRUs were designed to combat vanishing gradients through a gating mechanism. LSTM (1997), GRU (2014). An LSTM/GRU layer is just another way to compute a hidden state that was previously computed directly from the input and the previous hidden state.

44 Summary Recurrent Neural Networks are powerful. A lot of ongoing work right now. Gated Recurrent Units are even better. LSTMs maybe even better (jury's still out). This was an advanced lecture → gain intuition, encourage exploration. Next up: Recursive Neural Networks, simpler and also powerful :)
