16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text


1 16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning Spring 2018 Lecture 14. Image to Text

2 Input Output Classification tasks 4/1/18 CMU : Integrated Intelligence in Robotics 2

3 Input Output
o Classification tasks
o Structured input to structured output tasks
  o Machine translation or other NLP tasks
  o Image captioning
4/1/18 CMU : Integrated Intelligence in Robotics (jeanoh@cmu.edu)

4 Language modeling using RNN
Compute the probability of a sentence $s = (w_1, w_2, \ldots, w_T)$:
$$p(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1})$$
Each conditional probability $p(w_t \mid w_1, \ldots, w_{t-1})$ is computed by the RNN.
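
The chain-rule factorization above can be sketched in a few lines of Python. This is only an illustration: `sentence_log_prob`, `toy_cond_prob`, and `TOY_TABLE` are hypothetical names, and the toy conditional model looks only at the previous word, whereas an RNN would summarize the whole history in its hidden state.

```python
import math

def sentence_log_prob(words, cond_prob):
    """Chain rule: log p(w_1..w_T) = sum over t of log p(w_t | w_1..w_{t-1})."""
    total = 0.0
    for t, word in enumerate(words):
        history = words[:t]  # w_1, ..., w_{t-1}
        total += math.log(cond_prob(word, history))
    return total

# Toy stand-in for the RNN (assumption for illustration only):
# the conditional probability depends on the previous word alone.
TOY_TABLE = {
    None: {"a": 0.6, "dog": 0.4},   # distribution for the first word
    "a": {"dog": 0.9, "a": 0.1},
    "dog": {"a": 0.5, "dog": 0.5},
}

def toy_cond_prob(word, history):
    prev = history[-1] if history else None
    return TOY_TABLE[prev][word]

log_p = sentence_log_prob(["a", "dog"], toy_cond_prob)
print(math.exp(log_p))  # product of the two conditionals, 0.6 and 0.9
```

Summing log-probabilities rather than multiplying raw probabilities is a common choice because it avoids numerical underflow on long sentences.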

5 Recap: Forward propagation in RNN
Recurrent connections between hidden units; output at every time step [Fig 10.3]
$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$$
$$h^{(t)} = \tanh(a^{(t)})$$
$$o^{(t)} = c + V h^{(t)}$$
$$\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)})$$
Softmax to get normalized probabilities
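
As a rough sketch of the forward-propagation equations on this slide, the following pure-Python code runs two time steps of a tiny RNN. The weight values and dimensions (2 inputs, 2 hidden units, 3 outputs) are made up for illustration, and `matvec`, `softmax`, and `rnn_step` are hypothetical helper names.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(o):
    """Normalize scores into probabilities (shifted by the max for stability)."""
    m = max(o)
    exps = [math.exp(x - m) for x in o]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x, h_prev, U, W, V, b, c):
    """One step: a = b + W h_prev + U x; h = tanh(a); o = c + V h; y_hat = softmax(o)."""
    Wh = matvec(W, h_prev)
    Ux = matvec(U, x)
    a = [b_i + wh_i + ux_i for b_i, wh_i, ux_i in zip(b, Wh, Ux)]
    h = [math.tanh(a_i) for a_i in a]
    o = [c_i + vh_i for c_i, vh_i in zip(c, matvec(V, h))]
    return h, softmax(o)

# Made-up parameters: input size 2, hidden size 2, output size 3.
U = [[0.5, -0.3], [0.1, 0.8]]
W = [[0.2, 0.0], [-0.4, 0.3]]
V = [[1.0, -1.0], [0.5, 0.5], [-0.2, 0.7]]
b = [0.0, 0.1]
c = [0.0, 0.0, 0.0]

h = [0.0, 0.0]  # initial hidden state
for x in ([1.0, 0.0], [0.0, 1.0]):
    h, y_hat = rnn_step(x, h, U, W, V, b, c)
    print(y_hat)
```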

6 Language modeling using RNN
Compute the probability of a sentence $s = (w_1, w_2, \ldots, w_T)$:
$$p(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1})$$
The RNN models each conditional probability, i.e., the probability of the next word being $w$:
$$p(w_{t+1} = w \mid w_1, \ldots, w_t) = g_{\theta_w}(h_t, w_t)$$

7 Conditional language model
$$p(w_{t+1} = w \mid w_1, \ldots, w_t) = g_{\theta_w}(h_t, w_t)$$
$$h_t = \varphi_\theta(h_{t-1}, w_t, c)$$
where $c$ is the context.

8 Recap: Encoder-Decoder Sequence-to-Sequence Architecture
o Map variable-length input sequence to variable-length output sequence
o Machine translation [Cho et al., 2014] [Sutskever et al., 2014]

9 Encoder-Decoder Sequence-to-Sequence Architecture
o Encoder (reader or input) RNN processes input sequence $x = (x^{(1)}, \ldots, x^{(n_x)})$ and emits context $C$

10 Encoder-Decoder Sequence-to-Sequence Architecture
o Encoder (reader or input) RNN processes input sequence $x = (x^{(1)}, \ldots, x^{(n_x)})$ and emits context $C$
o Decoder (writer or output) RNN is conditioned on the context $C$ to generate output sequence $y = (y^{(1)}, \ldots, y^{(n_y)})$

11 Encoder-Decoder [Grid]-to-[Sequence] Architecture
o Encoder (reader or input) [CNN] processes input image $x$ and emits context $C$
o Decoder (writer or output) RNN is conditioned on the context $C$ to generate output sequence $y = (y^{(1)}, \ldots, y^{(n_y)})$

12 Encoder-Decoder [Grid]-to-[Sequence] Architecture
o Encoder (reader or input) [CNN] processes input image $x$ and emits context $C$
o The context model is too simple to guarantee that spatial, temporal, or spatio-temporal structures of the input are preserved.
o Decoder (writer or output) RNN is conditioned on the context $C$ to generate output sequence $y = (y^{(1)}, \ldots, y^{(n_y)})$

13 Attention mechanisms allow the system to sequentially focus on different subsets of the input (Cho et al., 2015).

14 Attention mechanism
o A structured representation of input, e.g., a set of fixed-size vectors known as the context set $C = \{c_1, c_2, \ldots, c_M\}$
o Attention model: another neural network to map hidden state to context vector

15 Attention model
o $e_i^t$: score of context $c_i$ at time $t$, computed from the hidden state $z_{t-1}$ and the previous attention weights:
$$e_i^t = f_{Att}\left(z_{t-1}, c_i, \{\alpha_j^{t-1}\}_{j=1}^{M}\right)$$
o Attention weight $\alpha$:
$$\alpha_i^t = \frac{\exp(e_i^t)}{\sum_{j=1}^{M} \exp(e_j^t)}$$
o Natural for gradient back-propagation
o Context vector:
$$c^t = \phi\left(\{c_i\}_{i=1}^{M}, \{\alpha_i^t\}_{i=1}^{M}\right) = \sum_{i=1}^{M} \alpha_i^t c_i$$
o Soft attention: softmax over context vectors in context set
o Hard attention: one best match; MC sampling [Xu et al., 2015]

16 Attention model
o $e_i^t$: score of context $c_i$ at time $t$, computed from the hidden state $z_{t-1}$ and the previous attention weights:
$$e_i^t = f_{Att}\left(z_{t-1}, c_i, \{\alpha_j^{t-1}\}_{j=1}^{M}\right)$$
o Attention weight $\alpha$:
$$\alpha_i^t = \frac{\exp(e_i^t)}{\sum_{j=1}^{M} \exp(e_j^t)}$$
o Natural for gradient back-propagation
o Context vector, e.g., a weighted sum:
$$c^t = \phi\left(\{c_i\}_{i=1}^{M}, \{\alpha_i^t\}_{i=1}^{M}\right) = \sum_{i=1}^{M} \alpha_i^t c_i$$
o Soft attention: softmax over context vectors in context set
o Hard attention: one best match; MC sampling
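
The soft and hard attention variants on this slide can be sketched as follows. The slide leaves the score function $f_{Att}$ unspecified, so this sketch assumes a plain dot product between the hidden state and each context vector and ignores the previous attention weights; `soft_attention` and `hard_attention` are hypothetical names, and the vectors are toy values.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_attention(z_prev, contexts):
    """Soft attention: softmax over scores, then a weighted sum of contexts."""
    # Assumed f_Att: dot product (the slide does not fix a particular form).
    scores = [dot(z_prev, c_i) for c_i in contexts]
    alphas = softmax(scores)
    dim = len(contexts[0])
    c_t = [sum(a * c_i[d] for a, c_i in zip(alphas, contexts)) for d in range(dim)]
    return alphas, c_t

def hard_attention(alphas, contexts, rng):
    """Hard attention: sample a single context vector according to the weights."""
    i = rng.choices(range(len(contexts)), weights=alphas, k=1)[0]
    return i, contexts[i]

contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy context set, M = 3
z_prev = [2.0, 0.0]                              # toy hidden state

alphas, c_t = soft_attention(z_prev, contexts)
idx, c_hard = hard_attention(alphas, contexts, random.Random(0))
print(alphas, c_t, idx)
```

The weighted sum is differentiable in the weights, which is why soft attention fits ordinary back-propagation, whereas the sampling step in hard attention does not.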

17 Conditional RNN language model
o Computing a context vector at every time step instead of using a fixed-length context vector:
$$c^t = \phi\left(\{c_i\}_{i=1}^{M}, \{\alpha_i^t\}_{i=1}^{M}\right) = \sum_{i=1}^{M} \alpha_i^t c_i$$
$$h_t = \varphi_\theta(h_{t-1}, x_t, c^t)$$
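
To illustrate how the context vector is recomputed at every time step, here is a small decoding loop. It is a sketch under strong assumptions: the recurrence `decoder_step` is a hypothetical, weight-free stand-in for $\varphi_\theta$, the attention score is an assumed dot product, and all vectors are toy values.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(h_prev, contexts):
    """Compute attention weights and the context vector c^t from the previous hidden state."""
    scores = [sum(h * c for h, c in zip(h_prev, c_i)) for c_i in contexts]
    alphas = softmax(scores)
    c_t = [sum(a * c_i[d] for a, c_i in zip(alphas, contexts))
           for d in range(len(h_prev))]
    return alphas, c_t

def decoder_step(h_prev, x_t, c_t):
    """Hypothetical phi_theta: elementwise tanh of the sum, with no learned weights."""
    return [math.tanh(h + x + c) for h, x, c in zip(h_prev, x_t, c_t)]

contexts = [[1.0, 0.0], [0.0, 1.0]]               # toy context set
inputs = [[0.5, 0.0], [0.0, 0.5], [-0.5, 0.5]]    # toy inputs x_1..x_3

h = [0.0, 0.0]
history = []  # attention weights used at each step
for x_t in inputs:
    alphas, c_t = attend(h, contexts)   # context depends on the current state
    h = decoder_step(h, x_t, c_t)
    history.append(alphas)

for step, alphas in enumerate(history, start=1):
    print(step, alphas)
```

Because the weights are derived from the evolving hidden state, each step can attend to a different part of the input, unlike a single fixed context vector.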

18 Image captioning
Representation of input image:
o Activation of the last fully-connected hidden layer as the context vector in a simple encoder-decoder model
o Activation of the last convolutional layer to use an attention mechanism

19 [Karpathy & Fei-Fei 2015] Generate dense descriptions of images using multimodal embedding

20 Representing images
o Bounding box detection using R-CNN [Girshick CVPR 14]: pretraining on ImageNet + finetuning on 200 classes of the ImageNet Detection Challenge
o Activations of the fully connected layer right before classification

21 Representing images
o Top 19 bounding boxes + entire input image = 20 regions
o 1 image → 20 h-dimensional vectors
$$v = W_m \left[\mathrm{CNN}_{\theta_c}(I_b)\right] + b_m$$
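
The projection of CNN features into the h-dimensional embedding space can be sketched as a plain affine map. The CNN itself is not implemented here: `region_features` holds made-up stand-ins for $\mathrm{CNN}_{\theta_c}(I_b)$, the dimensions (3-d features, h = 2) are toy values, and `embed_region` is a hypothetical name.

```python
def embed_region(cnn_features, W_m, b_m):
    """v = W_m * CNN(I_b) + b_m for a single region."""
    return [sum(w * f for w, f in zip(row, cnn_features)) + b
            for row, b in zip(W_m, b_m)]

# Toy parameters: 3-dimensional CNN features projected to h = 2 dimensions.
W_m = [[1.0, 0.0, -1.0],
       [0.5, 0.5, 0.5]]
b_m = [0.1, -0.1]

# Made-up features for 20 regions (19 bounding boxes + the whole image).
region_features = [[float(i), 1.0, 0.0] for i in range(20)]

embeddings = [embed_region(f, W_m, b_m) for f in region_features]
print(len(embeddings), embeddings[3])
```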

22 Recap: Bidirectional RNNs
Backward in time; forward in time [Fig ]

23 Representing sentences
o Bidirectional RNN: left-to-right & right-to-left context
o Each input word → 1-of-k vector
o Encode into an h-dimensional vector (the same embedding space as images)

24 Alignment
o Training set: $k$: image index, $l$: sentence index
o Multimodal h-dimensional embedding
  o Image → $v_1, \ldots, v_{20}$
  o Sentence of $n$ words → $s_1, \ldots, s_n$
o Similarity between image region and word based on the dot product $v_i^T s_t$:
$$S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^T s_t \quad \text{(Eq. 8)}$$
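
The image-sentence score of Eq. 8 can be computed directly: each word is matched to its best-scoring region, and the per-word maxima are summed. The 2-dimensional vectors below are toy values, and `image_sentence_score` is a hypothetical name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def image_sentence_score(regions, words):
    """S_kl = sum over words t of max over regions i of v_i . s_t."""
    return sum(max(dot(v_i, s_t) for v_i in regions) for s_t in words)

regions = [[1.0, 0.0], [0.0, 1.0]]             # toy region embeddings v_i
words = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # toy word embeddings s_t

score = image_sentence_score(regions, words)
print(score)  # best matches per word are 0.9, 0.8 and 0.5
```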

25 Multimodal RNN for text generation
o Image CNN at $t_0$
o START & END: special tokens
o Each word encoded into a vector
o Predict next word as a probability distribution over dictionary + END
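
The generation loop with START and END tokens can be sketched as below. Two assumptions are made for illustration: the next word is chosen greedily (the slide only says the model predicts a distribution), and the RNN and image input are replaced by a hypothetical lookup table `TOY` keyed on the previous word.

```python
START, END = "<START>", "<END>"

def generate_caption(next_word_probs, max_len=20):
    """Feed START, then repeatedly pick the most probable next word until END."""
    words = [START]
    while len(words) - 1 < max_len:
        probs = next_word_probs(words)      # distribution over dictionary + END
        word = max(probs, key=probs.get)    # greedy choice (an assumption)
        if word == END:
            break
        words.append(word)
    return words[1:]  # drop the START token

# Toy stand-in for the multimodal RNN: depends only on the previous word.
TOY = {
    START: {"man": 0.7, "guitar": 0.2, END: 0.1},
    "man": {"playing": 0.8, END: 0.2},
    "playing": {"guitar": 0.9, END: 0.1},
    "guitar": {END: 0.95, "man": 0.05},
}

def toy_next_word_probs(words):
    return TOY[words[-1]]

caption = generate_caption(toy_next_word_probs)
print(" ".join(caption))
```

Sampling from the distribution or using beam search are common alternatives to the greedy choice used here.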

26 Qualitative result
o 20 occurrences of "man in black shirt"
o 60 occurrences of "is playing guitar"

27 Additional sample results

28 Show & Tell [Vinyals et al., 2015]
Simple encoder-decoder model using a fixed-length context vector C

29 Show & Tell [Vinyals et al., 2015]
CNN: Inception V1-3, Batch Normalization

30 Show & Tell [Vinyals et al., 2015]

31 Show, Attend, & Tell [Xu et al., 2015]

32 Show, Attend, & Tell [Xu et al., 2015]

33 Show, Attend, & Tell [Xu et al., 2015]

34 Image captioning with attributes (LSTM-A) [Yao et al., 2017]
o CNN-RNN encoder-decoder model
o Predefined set of high-level attributes
o Multiple instance learning with inter-attribute correlations

35 Image captioning with attributes (LSTM-A) [Yao et al., 2017]

36 Image captioning with attributes (LSTM-A) [Yao et al., 2017]

37 How much data do we need to achieve decent performance in image captioning?

38 [Young et al., TACL 2014] 30K images + 150K captions
P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.

39 MS COCO K images x 5 captions

40 Testing on images outside datasets [Google's Show & Tell] Courtesy: Andy Tsai

41 Testing on images outside datasets [Show, Attend, & Tell] Courtesy: Junjiao Tian

42 There's a lot of room to improve

43 Next
o Project presentation: Afshaan
o Wednesday papers: Word2vec (Krishna), Skip-thought vector (Satyen)
o Project midterm report
