ABC-CNN: Attention Based CNN for Visual Question Answering

Size: px

Start display at page:

Download "ABC-CNN: Attention Based CNN for Visual Question Answering"

Gregory Hoover
5 years ago
Views:

1 ABC-CNN: Attention Based CNN for Visual Question Answering CIS 601 PRESENTED BY: MAYUR RUMALWALA GUIDED BY: DR. SUNNIE CHUNG

2 AGENDA Ø Introduction Ø Understanding CNN Ø Framework of ABC-CNN Ø Datasets used Ø Conclusion

Introduction Understanding VQA Visual Question Answering (VQA) is the task of answering questions, posed in natural language, about the semantic content in an image VQA is a highly challenging

3 Introduction Understanding VQA Visual Question Answering (VQA) is the task of answering questions, posed in natural language, about the semantic content in an image VQA is a highly challenging problem as it requires the machine to understand natural language queries, extract semantic contents from images, and relate them in a unified framework. Most of VQA model contains three parts: Vision part Question understanding part Answer generation part

4 Introduction VQA Task AI System bananas What is the mustache made of?

5 Introduction Human VQA What is the mustache made of? Is this a mustache? Are you kidding me?

6 Convolutional Neural Network Convolutional Neural Network Image Representation in Pixels How image looks like

7 Convolutional Neural Network Image processing using CNN How filter is used

8 Convolutional Neural Network Overview

ABC-CNN Framework Architecture Green box denotes the image feature extraction part using CNN Blue box is question understanding part using LSTM Yellow box illustrates the

9 ABC-CNN Framework Architecture Green box denotes the image feature extraction part using CNN Blue box is question understanding part using LSTM Yellow box illustrates the attention extraction part with configurable convolution Red box is the answer generation part using multi-class classification based on attention weighted image feature maps

10 ABC-CNN Framework Architecture

11 ABC-CNN Framework Image Feature Extraction The visual information in each image is represented as an N x N x D image feature map. The feature map is generated by dividing an image into an N x N grid, and extracting a D - dimensional feature vector f in each cell of the grid. The D - dimensional feature vector for each cell is the average of all the 10 D - dimensional feature vectors. The final N x N x D image feature map is the concatenation of N x N x D - dimensional feature vectors.

ABC-CNN Framework Question Understanding LSTM model generates a dense question embedding to characterize the semantic meaning of questions. A question is first tokenized into word sequence.

12 ABC-CNN Framework Question Understanding LSTM model generates a dense question embedding to characterize the semantic meaning of questions. A question is first tokenized into word sequence. We convert all the upper case characters to lower case characters, and remove all the punctuations. The words that appear in training set but are absent in test set are replaced with a special symbol #OOV #

ABC-CNN Framework Attention Extraction Question-guided attention map (QAM) is the core of the ABC-CNN framework QAM is generated by searching for visual features that correspond to

13 ABC-CNN Framework Attention Extraction Question-guided attention map (QAM) is the core of the ABC-CNN framework QAM is generated by searching for visual features that correspond to the input query s semantics in the spatial image feature map Convolutional kernel is generated by transforming the question embeddings from the semantic space into the visual space

14 ABC-CNN Framework Answer Generation Answer generation part is a multi-class classifier based on the original image feature map, the dense question embedding, and the attention weighted feature map. Method can be extended to generate full sentences by using an RNN decoder

15 Datasets

16 Datasets DAQUAR o 894 object classes o 654 Test images with 5674 QA pairs o 795 Training images with 6974 QA pairs MS-COCO VQA o training images, validation images, and testing images o Each image is annotated with 3 questions, and each question has 10 candidate answers

17 Examples

18 Try it yourself Reference: Simple Baseline for Visual Question Answering (Massachusetts Institute of Technology, Facebook AI Research)

19 Conclusion This model unifies the visual feature extraction and semantic question understanding via question-guided attention map. The attention map is generated by a configurable convolution network that is adaptively determined by the meaning of questions.

20 Thank you

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction by Noh, Hyeonwoo, Paul Hongsuck Seo, and Bohyung Han.[1] Presented : Badri Patro 1 1 Computer Vision Reading