Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Transcription:

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio
Presented by Kathy Ge

Motivation: Attention
Attention allows salient features to dynamically come to the forefront as needed, rather than compressing the entire image into a single static representation.

Image Caption Generation with Attention Mechanism
Encoder: lower convolutional layer of a CNN
Decoder: LSTM that generates the caption one word at a time
Attention mechanism: either a deterministic soft mechanism or a stochastic hard mechanism
Output: the generated caption, a sequence of words

Encoder: CNN
A lower convolutional layer of the CNN is used (rather than a final fully connected layer) so that the spatial information encoded in the image is preserved; each spatial location yields an annotation vector a_i.
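As a rough illustration, the sketch below extracts annotation vectors from a pretrained VGG19 via torchvision; the choice of backbone, the layer cut, and the input size are assumptions for illustration only (the paper uses the Oxford VGGnet's 14x14x512 convolutional feature map).

```python
# Illustrative sketch (not the authors' code): extract annotation vectors
# a_1..a_L from a lower convolutional layer of a pretrained VGG19.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Drop the final max-pool so the feature map stays 14x14 for a 224x224 input.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
conv_features = nn.Sequential(*list(vgg.features.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def annotation_vectors(image: Image.Image) -> torch.Tensor:
    """Return an (L, D) matrix of annotation vectors, one per spatial location."""
    x = preprocess(image).unsqueeze(0)            # (1, 3, 224, 224)
    with torch.no_grad():
        feats = conv_features(x)                  # (1, 512, 14, 14)
    _, D, H, W = feats.shape
    return feats.view(D, H * W).t()               # (L, D) with L = H * W = 196
```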

Decoder: LSTM
Here i_t, f_t, c_t, o_t, h_t are the input gate, forget gate, memory cell, output gate, and hidden state of the LSTM at time t; z_t is the context vector, which captures the visual information associated with a particular input location; E is the word embedding matrix.
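For reference, the LSTM update the slide refers to can be written gate by gate (an expansion of the paper's compact matrix notation; E y_{t-1} is the embedded previous word and ẑ_t the context vector):

```latex
\begin{aligned}
 i_t &= \sigma\!\left(W_i E y_{t-1} + U_i h_{t-1} + Z_i \hat{z}_t + b_i\right) \\
 f_t &= \sigma\!\left(W_f E y_{t-1} + U_f h_{t-1} + Z_f \hat{z}_t + b_f\right) \\
 o_t &= \sigma\!\left(W_o E y_{t-1} + U_o h_{t-1} + Z_o \hat{z}_t + b_o\right) \\
 g_t &= \tanh\!\left(W_g E y_{t-1} + U_g h_{t-1} + Z_g \hat{z}_t + b_g\right) \\
 c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
 h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```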

Learning: Stochastic Hard vs. Deterministic Soft Attention
Given annotation vectors a_i, i = 1, ..., L, one per location, the attention mechanism generates a positive weight for each location at each time step. The weight of each annotation vector is computed by an attention model f_att, a multilayer perceptron conditioned on the previous hidden state h_{t-1}, and the weights are normalized with a softmax. A function then computes the context vector z_t from the annotation vectors and their corresponding weights. Given the previous word, the previous hidden state, and the context vector, the model computes the output word probability.
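A minimal PyTorch sketch of an attention MLP of this kind is shown below; the layer sizes and names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionMLP(nn.Module):
    """f_att: scores each annotation vector a_i against the previous hidden state,
    then softmax-normalizes the scores into positive weights that sum to 1."""
    def __init__(self, annot_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.proj_a = nn.Linear(annot_dim, attn_dim)
        self.proj_h = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, a: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # a: (batch, L, annot_dim), h_prev: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_a(a) + self.proj_h(h_prev).unsqueeze(1)))
        return torch.softmax(e.squeeze(-1), dim=1)   # (batch, L) attention weights
```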

Deterministic Soft Attention
The expectation of the context vector is computed directly: z_t is a soft-attention-weighted sum of the annotation vectors, z_t = Σ_i α_{t,i} a_i. This model is smooth and differentiable, so it can be trained with standard backpropagation.
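Continuing the illustrative sketch above, the soft context vector is just the attention-weighted sum of the annotation vectors:

```python
import torch

def soft_context(a: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Expected context vector z_t = sum_i alpha_i * a_i.
    a: (batch, L, annot_dim), alpha: (batch, L) -> returns (batch, annot_dim)."""
    return (alpha.unsqueeze(-1) * a).sum(dim=1)
```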

Doubly Stochastic Attention
When training the deterministic version of the model, a doubly stochastic regularization can be introduced that encourages Σ_t α_{t,i} ≈ 1. This encourages the model to pay roughly equal attention to every part of the image over the course of caption generation. In experiments it improved the overall BLEU score and led to richer, more descriptive captions. The model is trained by minimizing the negative log likelihood plus this penalty.
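In code (a sketch; lambda_c is a hyperparameter, and alphas collects the attention weights from every decoding step):

```python
import torch

def doubly_stochastic_loss(nll: torch.Tensor, alphas: torch.Tensor,
                           lambda_c: float = 1.0) -> torch.Tensor:
    """Negative log likelihood plus the doubly stochastic penalty
    lambda_c * sum_i (1 - sum_t alpha_{t,i})^2.
    alphas: (batch, T, L) attention weights over all T decoding steps."""
    penalty = ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1)   # (batch,)
    return nll + lambda_c * penalty.mean()
```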

Stochastic Hard Attention
Let s_t be the random variable indicating the location where the model decides to focus attention when generating the t-th word. The context vector z_t is then itself a random variable, and the s_t are intermediate latent variables.
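Concretely, in the paper's notation (reconstructed here), s_t is drawn from a multinoulli distribution parameterized by the attention weights, and the hard context vector selects a single annotation vector:

```latex
\begin{aligned}
 p(s_{t,i} = 1 \mid s_{j<t}, \mathbf{a}) &= \alpha_{t,i} \\
 \hat{z}_t &= \sum_{i} s_{t,i}\, \mathbf{a}_i
\end{aligned}
```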

Stochastic Hard Attention
Define the objective function L_s, a variational lower bound on the marginal log-likelihood log p(y | a), and take its gradient with respect to the model parameters W.
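Written out (following the paper), the lower bound and its gradient are:

```latex
\begin{aligned}
 L_s &= \sum_{s} p(s \mid \mathbf{a}) \log p(\mathbf{y} \mid s, \mathbf{a})
      \;\le\; \log \sum_{s} p(s \mid \mathbf{a})\, p(\mathbf{y} \mid s, \mathbf{a})
      \;=\; \log p(\mathbf{y} \mid \mathbf{a}) \\
 \frac{\partial L_s}{\partial W} &= \sum_{s} p(s \mid \mathbf{a})
   \left[ \frac{\partial \log p(\mathbf{y} \mid s, \mathbf{a})}{\partial W}
        + \log p(\mathbf{y} \mid s, \mathbf{a})\,
          \frac{\partial \log p(s \mid \mathbf{a})}{\partial W} \right]
\end{aligned}
```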

Stochastic Hard Attention
Estimator variance is reduced by using a moving-average baseline and by introducing an entropy term H[s]. The final learning rule is a Monte Carlo estimate of the gradient with respect to the model parameters W, where λ_r and λ_e are hyperparameters and b is a baseline computed as an exponentially decaying moving average of the log-likelihood. At each time step, the model samples an attention location (a sampled a_i) from a multinoulli distribution parameterized by the attention weights. The resulting update is similar to the REINFORCE rule.
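The sketch below (illustrative, assuming PyTorch and a single sampled trajectory per caption) shows how such a REINFORCE-style surrogate loss could be written; the 0.9/0.1 baseline update in the docstring is an assumption consistent with the exponentially decaying moving average described above.

```python
import torch
from torch.distributions import Categorical

def hard_attention_loss(alpha: torch.Tensor, s: torch.Tensor, log_p_y: torch.Tensor,
                        baseline: float, lambda_r: float = 1.0,
                        lambda_e: float = 0.01) -> torch.Tensor:
    """One-sample REINFORCE-style surrogate loss for hard attention (illustrative).

    alpha:    (T, L) attention weights at each decoding step
    s:        (T,)   locations sampled from Multinoulli(alpha) in the forward pass
    log_p_y:  scalar log p(y | s, a) computed with those sampled locations
    baseline: moving-average baseline b, updated elsewhere, e.g.
              b <- 0.9 * b + 0.1 * log_p_y (assumed update schedule)
    """
    dist = Categorical(probs=alpha)
    log_p_s = dist.log_prob(s).sum()                 # log p(s | a)
    entropy = dist.entropy().sum()                   # H[s]
    reward = log_p_y.detach() - baseline             # no gradient through the reward
    # Minimizing this loss ascends log p(y|s,a) plus the REINFORCE and entropy terms.
    return -(log_p_y + lambda_r * reward * log_p_s + lambda_e * entropy)
```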

Experiments
Performance was evaluated on Flickr8k, Flickr30k, and MS COCO. Training used RMSProp for Flickr8k and Adam for Flickr30k and MS COCO, with the Oxford VGGnet pretrained on ImageNet as the encoder. Quantitative results were measured with the BLEU and METEOR metrics.

Qualitative Results

Mistakes

Soft attention model A woman is throwing a frisbee in a park.

Hard attention model A man and a woman playing frisbee in a field.

Soft attention model A woman holding a clock in her hand.

Hard attention model A woman is holding a donut in his hand.

Conclusion
Xu et al. introduce an attention-based model that is able to describe the contents of an image. The model learns to fix its gaze on salient objects while generating the words of the caption. They compare a stochastic hard attention mechanism, trained by maximizing a variational lower bound, with a deterministic soft attention mechanism trained by standard backpropagation. The learned attention lends interpretability to the generation process, and qualitative analysis shows that the alignments between words and image locations correspond well to human intuition.

Thanks! Any questions?