Slide credit from Hung-Yi Lee & Richard Socher

1 Slide credit from Hung-Yi Lee & Richard Socher 1

2 Review Word Vector 2

3 Word2Vec Variants Skip-gram: predicting surrounding words given the target word (Mikolov+, 2013) CBOW (continuous bag-of-words): predicting the target word given the surrounding words (Mikolov+, 2013) LM (Language modeling): predicting the next words given the preceding contexts (Mikolov+, 2013) Mikolov et al., Efficient estimation of word representations in vector space, in ICLR Workshop, 2013. Mikolov et al., Linguistic regularities in continuous space word representations, in NAACL HLT, 2013.

4 Word2Vec LM Goal: predicting the next words given the preceding contexts 4

5 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 5

6 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 6

7 Language Modeling Goal: estimate the probability of a word sequence Example task: determine whether a sequence is grammatical or makes more sense: "recognize speech" or "wreck a nice beach" If P(recognize speech) > P(wreck a nice beach), output = "recognize speech" 7

8 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 8

9 N-Gram Language Modeling Goal: estimate the probability of a word sequence N-gram language model: the probability is conditioned on a window of (n-1) previous words Estimate the probability from the training data: P(beach | nice) = C(nice beach) / C(nice), where C(nice beach) is the count of "nice beach" in the training data and C(nice) is the count of "nice" Issue: some sequences may not appear in the training data 9
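
A minimal sketch (not from the slides) of the count-based estimate above; the toy corpus and function name are invented for illustration.

```python
from collections import Counter

corpus = ["the dog ran", "the cat jumped"]          # toy training data
tokens = [s.split() for s in corpus]

unigram = Counter(w for sent in tokens for w in sent)
bigram = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))

def p_mle(word, prev):
    """P(word | prev) = C(prev word) / C(prev), the maximum-likelihood estimate."""
    return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0

print(p_mle("ran", "dog"))     # 1.0 -- "dog ran" seen once, "dog" seen once
print(p_mle("jumped", "dog"))  # 0.0 -- never seen: the issue on the next slide
```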

10 N-Gram Language Modeling Training data: "The dog ran", "The cat jumped" P(jumped | dog) = 0, P(ran | cat) = 0 → give some small probability (smoothing) The probability is not accurate. The phenomenon happens because we cannot collect all the possible text in the world as training data. 10
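
One common remedy is add-one (Laplace) smoothing; a small sketch under the assumption of a fixed vocabulary — the slides do not prescribe a particular smoothing method, so the choice here is illustrative.

```python
from collections import Counter

unigram = Counter({"the": 2, "dog": 1, "cat": 1, "ran": 1, "jumped": 1})
bigram = Counter({("the", "dog"): 1, ("dog", "ran"): 1,
                  ("the", "cat"): 1, ("cat", "jumped"): 1})

def p_add_one(word, prev, vocab_size):
    """Add-one smoothing: every unseen bigram gets a small non-zero probability."""
    return (bigram[(prev, word)] + 1) / (unigram[prev] + vocab_size)

print(p_add_one("jumped", "dog", vocab_size=len(unigram)))  # 1/6, no longer zero
```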

11 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 11

12 Neural Language Modeling Idea: estimate the probability not from counts, but from the NN prediction P("wreck a nice beach") = P(wreck | START) P(a | wreck) P(nice | a) P(beach | nice) (figure: four copies of the neural network, each taking the vector of START / wreck / a / nice as input and outputting P(next word is "wreck"), P(next word is "a"), P(next word is "nice"), P(next word is "beach")) 12
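
To make the product concrete, a hypothetical sketch that scores a sentence by multiplying per-step predictions from any next-word model; `next_word_prob` is a stand-in for the neural network, not code from the slides.

```python
import math

def sentence_log_prob(words, next_word_prob):
    """log P(w_1 ... w_T) = sum_t log P(w_t | history), history starts at <START>."""
    history, total = ["<START>"], 0.0
    for w in words:
        total += math.log(next_word_prob(w, history))  # model predicts P(next word | context)
        history.append(w)
    return total

# Toy stand-in model: uniform over a 10-word vocabulary.
uniform = lambda word, history: 0.1
print(sentence_log_prob("wreck a nice beach".split(), uniform))  # 4 * log(0.1)
```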

13 Neural Language Modeling (figure: input word vectors / context vector → hidden layer → output: probability distribution of the next word) Issue: fixed context window for conditioning Bengio et al., A Neural Probabilistic Language Model, in JMLR, 2003.
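
A rough numpy sketch of the fixed-window feed-forward language model in the spirit of Bengio et al. (2003): concatenate the embeddings of the previous n-1 words, pass them through one hidden layer, and output a softmax over the vocabulary. Dimensions, initialization, and names are arbitrary assumptions.

```python
import numpy as np

V, d, h, context = 1000, 64, 128, 3          # vocab size, embedding dim, hidden dim, window
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, d))       # word embeddings
W1 = rng.normal(scale=0.1, size=(context * d, h))
W2 = rng.normal(scale=0.1, size=(h, V))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_word_dist(prev_word_ids):
    """P(next word | fixed window of previous words)."""
    x = E[prev_word_ids].reshape(-1)          # concatenate the context embeddings
    hidden = np.tanh(x @ W1)
    return softmax(hidden @ W2)

p = next_word_dist([5, 42, 7])               # the window is fixed: exactly `context` words
print(p.shape, p.sum())                       # (1000,) 1.0
```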

14 Neural Language Modeling The input-layer (or hidden-layer) representations of related words are close (figure: dog, cat, rabbit nearby in the (h1, h2) space) If P(jump | dog) is large, P(jump | cat) increases accordingly (even if "cat jump" never appears in the data) Smoothing is automatically done 14

15 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 15

16 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step Assumption: temporal information matters 16

17 RNN Language Modeling (figure: at each step the input is the vector of the current word — START, wreck, a, nice — plus the context carried in the hidden layer; the output is the probability distribution of the next word: P(next word is "wreck"), P(next word is "a"), P(next word is "nice"), P(next word is "beach")) Idea: pass the information from the previous hidden layer to leverage all contexts 17

18 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 18

19 RNNLM Formulation At each time step, the input is the vector of the current word and the output is the probability distribution of the next word 19
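
The slide's equations did not survive the transcription. A common way to write the RNNLM, using the x_t / s_t / o_t notation of the later BPTT slides (a reconstruction; the exact parameterization on the original slide may differ):

```latex
\begin{aligned}
s_t &= \sigma\left(W s_{t-1} + U x_t\right) && \text{hidden state, } \sigma \in \{\tanh, \mathrm{ReLU}\}\\
o_t &= \operatorname{softmax}\left(V s_t\right) && \text{distribution over the next word}\\
P(w_{t+1} = j \mid w_1, \dots, w_t) &= o_{t,j} && \text{training minimizes } -\textstyle\sum_t \log o_{t,\,w_{t+1}}
\end{aligned}
```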

20 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 20

21 Recurrent Neural Network Definition s_t = σ(W s_{t-1} + U x_t), o_t = softmax(V s_t); σ: tanh, ReLU 21
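
A minimal numpy sketch of one recurrent step matching the definition above; weight shapes and names are illustrative assumptions.

```python
import numpy as np

d_in, d_hid = 50, 100
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d_hid, d_in))    # input -> hidden
W = rng.normal(scale=0.1, size=(d_hid, d_hid))   # hidden -> hidden (tied across time)

def rnn_step(s_prev, x_t):
    """s_t = tanh(W s_{t-1} + U x_t); ReLU would work the same way."""
    return np.tanh(W @ s_prev + U @ x_t)

s = np.zeros(d_hid)                               # init state
for x_t in rng.normal(size=(8, d_in)):            # a sequence of 8 input vectors
    s = rnn_step(s, x_t)
print(s.shape)                                     # (100,)
```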

22 Model Training All model parameters can be updated by backpropagating the per-step errors between the predicted outputs and the targets y_{t-1}, y_t, y_{t+1} (the actual next words) 22

23 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 23

24 Backpropagation (figure: the forward pass carries activations a_j from layer l-1 through the weights w_ij^l into layer l; the backward pass propagates the error signal δ_i^l in the opposite direction) 24

25 Backpropagation (figure: the error signals δ^l at layer l are obtained from the output-layer derivatives ∂C/∂y_1, ..., ∂C/∂y_n by propagating backward through the transposed weight matrices and the layer activations z) 25

26 Backpropagation through Time (BPTT) Unfold the recurrent network through time Input: init, x_1, x_2, ..., x_t; Output: o_t; Target: y_t (figure: the unfolded network init → s_1 → ... → s_{t-2} → s_{t-1} → s_t → o_t, with the cost C computed from o_t and y_t) 26

27 Backpropagation through Time (BPTT) Unfold Input: init, x_1, x_2, ..., x_t; Output: o_t; Target: y_t (figure: the gradient of the cost C flows from the output back into the last hidden layer s_t) 27

28 Backpropagation through Time (BPTT) Unfold Input: init, x_1, x_2, ..., x_t; Output: o_t; Target: y_t (figure: the gradient keeps propagating backward through s_{t-1}, s_{t-2}, ..., s_1 down to the initial state) 28

29 Backpropagation through Time (BPTT) Unfold Input: init, x_1, x_2, ..., x_t; Output: o_t; Target: y_t (figure: every unfolded copy points to the same memory and the same parameters) Weights are tied together 29

30 Backpropagation through Time (BPTT) Unfold Input: init, x_1, x_2, ..., x_t; Output: o_t; Target: y_t (figure: the unfolded copies at every time step share one set of weight parameters) Weights are tied together 30

31 BPTT Forward Pass: compute s_1, s_2, s_3, s_4 (and the outputs o_1, ..., o_4) Backward Pass: compute the gradients for C^(4), C^(3), C^(2), C^(1) (figure: unfolded network with inputs x_1..x_4, hidden states s_1..s_4, outputs o_1..o_4, targets y_1..y_4, and per-step costs C^(1)..C^(4)) 31
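
A numpy sketch of the forward/backward passes described here, for a vanilla RNN LM with tanh hidden units and a softmax output. It is a didactic reconstruction under assumed shapes and names, not code from the lecture.

```python
import numpy as np

def bptt(xs, ys, U, W, V):
    """Forward pass over the whole sequence, then backprop through time.

    xs: list of input vectors, ys: list of target word ids (the next words).
    Returns the total cross-entropy loss and gradients for U, W, V.
    """
    T, d_hid = len(xs), W.shape[0]
    s = [np.zeros(d_hid)]                     # s[0] is the initial state
    os, loss = [], 0.0
    for t in range(T):                        # forward: compute s_1..s_T and o_1..o_T
        s.append(np.tanh(U @ xs[t] + W @ s[t]))
        z = V @ s[t + 1]
        o = np.exp(z - z.max()); o /= o.sum()
        os.append(o)
        loss -= np.log(o[ys[t]])              # cross-entropy against the target next word

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros(d_hid)
    for t in reversed(range(T)):              # backward: from C^(T) down to C^(1)
        dz_out = os[t].copy(); dz_out[ys[t]] -= 1.0   # d loss / d output logits
        dV += np.outer(dz_out, s[t + 1])
        ds = V.T @ dz_out + ds_next                   # gradient flowing into s_t
        dz = ds * (1.0 - s[t + 1] ** 2)               # through tanh
        dU += np.outer(dz, xs[t])
        dW += np.outer(dz, s[t])                      # same (tied) W at every step
        ds_next = W.T @ dz
    return loss, dU, dW, dV

rng = np.random.default_rng(0)
d_in, d_hid, vocab = 8, 16, 20
U = rng.normal(scale=0.1, size=(d_hid, d_in))
W = rng.normal(scale=0.1, size=(d_hid, d_hid))
V = rng.normal(scale=0.1, size=(vocab, d_hid))
xs = [rng.normal(size=d_in) for _ in range(4)]
ys = [3, 7, 1, 12]
loss, dU, dW, dV = bptt(xs, ys, U, W, V)
print(loss, dW.shape)
```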

32 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 32

33 RNN Training Issue The gradient is a product of Jacobian matrices, each associated with a step in the forward computation Multiplying by the same matrix at each time step during backprop, the gradient becomes very small or very large quickly: vanishing or exploding gradient Bengio et al., Learning long-term dependencies with gradient descent is difficult, IEEE Trans. on Neural Networks, 1994. [link] Pascanu et al., On the difficulty of training recurrent neural networks, in ICML, 2013. [link] 33
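
Spelled out, the product-of-Jacobians statement (notation assumed here, in the spirit of Pascanu et al., 2013):

```latex
\frac{\partial C_t}{\partial s_k}
  = \frac{\partial C_t}{\partial s_t}
    \prod_{i=k+1}^{t} \frac{\partial s_i}{\partial s_{i-1}},
\qquad
\frac{\partial s_i}{\partial s_{i-1}}
  = \operatorname{diag}\!\left(\sigma'(z_i)\right) W,
\quad z_i = W s_{i-1} + U x_i .
```

Because the same W enters every factor, the norm of this product tends to shrink (vanish) or grow (explode) exponentially in t - k.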

34 Rough Error Surface (figure: the cost surface over two weights w_1 and w_2) The error surface is either very flat or very steep Bengio et al., Learning long-term dependencies with gradient descent is difficult, IEEE Trans. on Neural Networks, 1994. [link] Pascanu et al., On the difficulty of training recurrent neural networks, in ICML, 2013. [link] 34

35 Vanishing/Exploding Gradient Example (figure: example gradient magnitudes after 2 steps, 5 steps, 20 steps, and 50 steps)
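
A tiny numeric illustration (invented, not from the slides) of why repeatedly multiplying by the same matrix makes gradients vanish or explode:

```python
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.5, 1.5):                       # largest eigenvalue below vs. above 1
    W = scale * np.eye(4)                      # simplest possible recurrent Jacobian
    g = np.ones(4)                             # some gradient arriving at the last step
    for steps in (2, 5, 20, 50):
        print(scale, steps, np.linalg.norm(np.linalg.matrix_power(W, steps) @ g))
# scale 0.5: the norm shrinks toward 0 (vanishing); scale 1.5: it blows up (exploding)
```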

36 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 36

37 How to Frame the Learning Problem? The learning algorithm f maps the input domain X to the output domain Y: f : X → Y Input domain: word, word sequence, audio signal, click logs Output domain: single label, sequence tags, tree structure, probability distribution Network design should leverage input and output domain properties 37

38 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 38

39 Input Domain Sequence Modeling Idea: aggregate the meaning from all words into a vector Method: Basic combination: average, sum Neural combination: Recursive neural network (RvNN), Recurrent neural network (RNN), Convolutional neural network (CNN) How to compute the N-dim vector for 這 (this) 規格 (specification) 有 (have) 誠意 (sincerity)? 39
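
A sketch of the "basic combination" option: average the word vectors into one N-dim sentence vector. The toy embedding table is invented for illustration.

```python
import numpy as np

N = 4
embeddings = {                                # toy word vectors, N-dim each
    "這": np.array([0.1, 0.0, 0.2, 0.3]),
    "規格": np.array([0.5, 0.1, 0.0, 0.2]),
    "有": np.array([0.0, 0.3, 0.1, 0.0]),
    "誠意": np.array([0.2, 0.4, 0.3, 0.1]),
}

sentence = ["這", "規格", "有", "誠意"]
sentence_vector = np.mean([embeddings[w] for w in sentence], axis=0)  # average combination
print(sentence_vector)                         # one N-dim vector for the whole sentence
```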

40 Sentiment Analysis Encode the sequential input into a vector using an RNN (figure: the words of 這規格有誠意 "this spec has sincerity" enter as inputs x_1, x_2, ...; the final hidden state h_4 is fed to a classifier producing outputs y_1, ..., y_M) The RNN considers temporal information to learn sentence vectors as the input of classification tasks 40
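
A hedged numpy sketch of the idea on this slide: run an RNN over the word vectors and feed the final hidden state to a softmax classifier. All shapes, weights, and the two-class setup are illustrative assumptions.

```python
import numpy as np

d_in, d_hid, n_classes = 16, 32, 2             # e.g. positive / negative sentiment
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d_hid, d_in))
W = rng.normal(scale=0.1, size=(d_hid, d_hid))
C = rng.normal(scale=0.1, size=(n_classes, d_hid))

def encode(word_vectors):
    """Return the last hidden state as the sentence representation."""
    h = np.zeros(d_hid)
    for x in word_vectors:
        h = np.tanh(U @ x + W @ h)
    return h

words = rng.normal(size=(4, d_in))             # vectors for 這 / 規格 / 有 / 誠意
logits = C @ encode(words)
probs = np.exp(logits - logits.max()); probs /= probs.sum()
print(probs)                                    # class distribution for the sentence
```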

41 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 41

42 Output Domain Sequence Prediction POS Tagging: 推薦我台大後門的餐廳 ("recommend me a restaurant by NTU's back gate") → 推薦/VV 我/PN 台大/NR 後門/NN 的/DEG 餐廳/NN; Speech Recognition: (audio) → 大家好 ("hello everyone"); Machine Translation: How are you doing today? → 你好嗎? The output can be viewed as a sequence of classifications 42

43 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 43

44 POS Tagging Tag a word at each time step Input: word sequence Output: corresponding POS tag sequence (example: 四樓好專業 "the fourth floor is very professional", tagged with N, VA, AD) 44
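
For aligned tagging, the same recurrence is used but a tag distribution is emitted at every time step instead of only at the end; a rough sketch with invented shapes and untrained weights.

```python
import numpy as np

d_in, d_hid, n_tags = 16, 32, 5                # e.g. N, VA, AD, ...
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d_hid, d_in))
W = rng.normal(scale=0.1, size=(d_hid, d_hid))
V = rng.normal(scale=0.1, size=(n_tags, d_hid))

def tag_sequence(word_vectors):
    """One output per input word: aligned sequence labeling."""
    h, tags = np.zeros(d_hid), []
    for x in word_vectors:
        h = np.tanh(U @ x + W @ h)
        tags.append(int(np.argmax(V @ h)))     # predicted tag id at this time step
    return tags

print(tag_sequence(rng.normal(size=(3, d_in))))  # one tag id per word, e.g. for 四樓 / 好 / 專業
```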

45 Natural Language Understanding (NLU) Tag a word at each time step Input: word sequence Output: IOB-format slot tags and an intent tag Example: <START> just sent to bob about fishing this weekend <END> → slot tags O O O O O O B-contact_name B-subject I-subject I-subject; intent: send_ (contact_name = bob, subject = fishing this weekend) Temporal orders for input and output are the same 45

46 Outline Language Modeling N-gram Language Model Feed-Forward Neural Language Model Recurrent Neural Network Language Model (RNNLM) Recurrent Neural Network Definition Training via Backpropagation through Time (BPTT) Training Issue Applications Sequential Input Sequential Output Aligned Sequential Pairs (Tagging) Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder) 46

47 Machine Translation Cascade two RNNs, one for encoding and one for decoding Input: word sequences in the source language Output: word sequences in the target language (figure: encoder RNN → decoder RNN, example output 超棒的醬汁 "awesome sauce") 47
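
A simplified numpy sketch of cascading two RNNs: the encoder's final state initializes the decoder, which emits target words greedily until an end symbol or a length limit. The greedy decoding, parameter names, and shapes are all assumptions for illustration.

```python
import numpy as np

d_emb, d_hid, vocab_tgt, max_len, EOS = 16, 32, 50, 10, 0
rng = np.random.default_rng(0)
U_e = rng.normal(scale=0.1, size=(d_hid, d_emb))        # encoder input weights
W_e = rng.normal(scale=0.1, size=(d_hid, d_hid))        # encoder recurrent weights
E_t = rng.normal(scale=0.1, size=(vocab_tgt, d_emb))    # target-side embeddings
U_d = rng.normal(scale=0.1, size=(d_hid, d_emb))        # decoder input weights
W_d = rng.normal(scale=0.1, size=(d_hid, d_hid))        # decoder recurrent weights
V_d = rng.normal(scale=0.1, size=(vocab_tgt, d_hid))    # decoder output weights

def encode(source_vectors):
    h = np.zeros(d_hid)
    for x in source_vectors:                   # encoder RNN reads the source sentence
        h = np.tanh(U_e @ x + W_e @ h)
    return h

def decode(h):
    out, prev = [], EOS                        # start from an end/start symbol
    for _ in range(max_len):                   # decoder RNN generates the target greedily
        h = np.tanh(U_d @ E_t[prev] + W_d @ h)
        prev = int(np.argmax(V_d @ h))
        if prev == EOS:
            break
        out.append(prev)
    return out

print(decode(encode(rng.normal(size=(5, d_emb)))))  # target word ids, e.g. for 超棒的醬汁
```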

48 Chit-Chat Dialogue Modeling Cascade two RNNs, one for encoding and one for decoding Input: word sequences in the question Output: word sequences in the response Temporal ordering for input and output may be different 48

49 Concluding Remarks Language Modeling RNNLM Recurrent Neural Networks Definition Backpropagation through Time (BPTT) Vanishing/Exploding Gradient Applications Sequential Input: Sequence-Level Embedding Sequential Output: Tagging / Seq2Seq (Encoder-Decoder) 49
