CS839: Probabilistic Graphical Models. Lecture 22: The Attention Mechanism. Theo Rekatsinas

Size: px

Start display at page:

Download "CS839: Probabilistic Graphical Models. Lecture 22: The Attention Mechanism. Theo Rekatsinas"

Alban Gibbs
5 years ago
Views:

1 CS839: Probabilistic Graphical Models Lecture 22: The Attention Mechanism Theo Rekatsinas 1

2 Why Attention? Consider machine translation: We need to pay attention to the word we are currently translating. Is the entire sequence needed as context? The cat is black -> Le chat est noir 2

3 Why Attention? Consider machine translation: We need to pay attention to the word we are currently translating. Is the entire sequence needed as context? The cat is black -> Le chat est noir RNNs are the de-facto standard for machine translation Problem: translation relies on reading a complete sentence and compresses all information into a fixed-length vector a sentence with hundreds of words represented by several words will surely lead to information loss, inadequate translation, etc. Long-range dependencies are tricky. 3

4 Basic encoder - decoder 4

5 Soft Attention for Translation 5

6 Soft Attention for Translation Soft Attention for Translation Distribution over input words I love coffee -> Me gusta el café Bahdanau et al, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015

7 Soft Attention for Translation Soft Attention for Translation Distribution over input words I love coffee -> Me gusta el café Bahdanau et al, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015

8 Soft Attention for Translation Soft Attention for Translation Distribution over input words I love coffee -> Me gusta el café Bahdanau et al, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015

9 Soft Attention for Translation Soft Attention for Translation Distribution over input words I love coffee -> Me gusta el café Bahdanau et al, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015

10 Soft Attention Decoder RNN Attention Model Bidirectional encoder RNN From Y. Bengio CVPR 2015 Tutorial

11 Soft Attention Context vector (input to decoder): Mixture weights: Alignment score (how well do input words near j match output words at position i):

12 Soft Attention Luong, Pham and Manning s Translation System (2015): Translation Error Rate vs Human Luong and Manning IWSLT 2015

13 Hard Attention

14 Monotonic Attention

15 Global Attention Blue = encoder Red = decoder Attend to a context vector. Decoder captures global information not only the information from one hidden state. Context vector takes all cell s outputs as input and computes a probability distribution for each token the decoder wants to generate

16 Local Attention Compute a best aligned position first Then compute a context vector centered at that position

17 RNN for Captioning Distribution over vocab RNN only looks at whole image, once d1 d2 CNN h0 h1 h2 Image: H x W x 3 Features: D Hidden state: H y1 First word y2 Second word What if the RNN looks at different parts of the image at each timestep?

18 CNN Image: H x W x 3 Features: L x D Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015

19 CNN h0 Image: H x W x 3 Features: L x D Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015

20 Distribution over L locations a1 CNN h0 Image: H x W x 3 Features: L x D Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015

21 Distribution over L locations a1 CNN h0 Image: H x W x 3 Features: L x D Weighted combination of features Weighted features: D z1

22 Distribution over L locations a1 CNN h0 h1 Image: H x W x 3 Features: L x D Weighted combination of features Weighted features: D z1 y1 First word

23 Distribution over L locations Distribution over vocab a1 a2 d1 CNN h0 h1 Image: H x W x 3 Features: L x D Weighted combination of features Weighted features: D z1 y1 First word

24 Distribution over L locations Distribution over vocab a1 a2 d1 CNN h0 h1 Image: H x W x 3 Features: L x D Weighted combination of features Weighted features: D z1 y1 First word z2

25 Distribution over L locations Distribution over vocab a1 a2 d1 CNN h0 h1 h2 Image: H x W x 3 Features: L x D Weighted combination of features Weighted features: D z1 y1 First word z2 y2

26 Distribution over L locations Distribution over vocab a1 a2 d1 a3 d2 CNN h0 h1 h2 Image: H x W x 3 Features: L x D Weighted combination of features Weighted features: D z1 y1 First word z2 y2

27 Soft vs Hard Attention Soft vs Hard Attention CNN a c b d Image: H x W x 3 Grid of features (Each D-dimensional) From RNN: p a p c p b p d Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015 Distribution over grid locations p a + p b + p c + p c = 1

28 Soft vs Hard Attention Soft vs Hard Attention CNN a c b d Image: H x W x 3 Grid of features (Each D-dimensional) Context vector z (D-dimensional) From RNN: p a p c p b p d Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015 Distribution over grid locations p a + p b + p c + p c = 1

29 Soft vs Hard Attention Soft vs Hard Attention Image: H x W x 3 CNN a c b d Grid of features (Each D-dimensional) Context vector z (D-dimensional) Soft attention: Summarize ALL locations z = p a a+ p b b + p c c + p d d Derivative dz/dp is nice! Train with gradient descent From RNN: p a p c p b p d Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015 Distribution over grid locations p a + p b + p c + p c = 1

30 Soft vs Hard Attention Soft vs Hard Attention Image: H x W x 3 From RNN: CNN Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015 a c b d Grid of features (Each D-dimensional) p a p c p b p d Distribution over grid locations p a + p b + p c + p c = 1 Context vector z (D-dimensional) Soft attention: Summarize ALL locations z = p a a+ p b b + p c c + p d d Derivative dz/dp is nice! Train with gradient descent Hard attention: Sample ONE location according to p, z = that vector With argmax, dz/dp is zero almost everywhere Can t use gradient descent; need reinforcement learning

31 Multi-headed Attention

32 Attention is all you need

33 Attention tricks

34 Soft vs Hard Attention Attention Takeaways Performance: Attention models can improve accuracy and reduce computation at the same time. Complexity: There are many design choices. Those choices have a big effect on performance. Ensembling has unusually large benefits. Simplify where possible!

35 Soft vs Hard Attention Attention Takeaways Explainability: Attention models encode explanations. Both locus and trajectory help understand what s going on. Hard vs. Soft: Soft models are easier to train, hard models require reinforcement learning. They can be combined, as in Luong et al.

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning Spring 2018 Lecture 14. Image to Text Input Output Classification tasks 4/1/18 CMU 16-785: Integrated Intelligence in Robotics