VISION & LANGUAGE From Captions to Visual Concepts and Back


VISION & LANGUAGE From Captions to Visual Concepts and Back Brady Fowler & Kerry Jones Tuesday, February 28th 2017 CS 6501-004 VICENTE

Agenda Problem Domain Object Detection Language Generation Sentence Re-Ranking Results & Comparisons

Problem & Goal
Goal: generate image captions that are on par with human descriptions.
Previous approaches to generating image captions relied on object, attribute, and relation detectors learned from separately hand-labeled training data. This approach uses only images and their captions, without any human-engineered features.
Benefits of using captions:
1. Caption structure inherently reflects object importance.
2. It is possible to infer broader concepts (beautiful, flying, open) not directly tied to objects tagged in the image.
3. Learning a joint multimodal representation allows global semantic similarity to be measured for re-ranking.

Related Work
Two major approaches to automatic image captioning, with examples of each:
Retrieval of human captions:
- Socher et al. used dependency trees to embed sentences into a vector space in order to retrieve images described by those sentences.
- Karpathy et al. embedded image fragments (objects) and sentence fragments into a common vector space.
Generation of new captions based on detected objects:
- Mitchell et al. developed the Midge system, which integrates word co-occurrence statistics to filter out noise during generation.
- The BabyTalk system inserts detected words into template slots.

Captioning Pipeline
1. Detect words: woman, crowd, cat, camera, holding, purple
2. Generate sequences: "A purple camera with a woman." / "A woman holding a camera in a crowd." / "A woman holding a cat."
3. Re-rank sequences: "A woman holding a camera in a crowd."

OBJECT DETECTION: Apply a CNN to image regions with Multiple Instance Learning

Word Detection Approach
Input is raw images without bounding boxes; output is a probability distribution over the word vocabulary.
Vocabulary = the 1,000 most frequent caption words, covering 92% of word occurrences.
Instead of using the entire image, they densely scan the image:*
- Each region of the image is converted into features with a CNN.
- Features are mapped to the vocabulary words with the highest probability of appearing in the caption.
- A multiple instance learning setup then learns a visual signature for each word.
*An early version of the system used Edge Boxes region proposals.

Word Detection Approach (from the paper)
"When this fully convolutional network is run over the image, we obtain a coarse spatial response map. Each location in this response map corresponds to the response obtained by applying the original CNN to overlapping shifted regions of the input image (thereby effectively scanning different locations in the image for possible objects). We up-sample the image to make the longer side 565 pixels, which gives us a 12x12 response map at fc8 for both [21, 42] and corresponds to sliding a 224x224 bounding box over the up-sampled image with a stride of 32. The noisy-OR version of MIL is then implemented on top of this response map to generate a single probability p^w_i for each word for each image. We use a cross-entropy loss and optimize the CNN end-to-end for this task with stochastic gradient descent."

Word Detection: CNN + MIL
fc8 run as fully convolutional layers → spatial class probability maps p^w_ij → multiple instance learning → per-class (per-word) probability for the image. (Architecture layout: Saurabh Gupta)

Word Detection
For a given word w:
- Divide images into positive and negative bags of bounding boxes (each image = a bag).
- Pass the image through the CNN and retrieve the response map over regions b_ij; j indexes the region within image i, so there are as many b_ij as there are regions.
- For every region b_ij, compute the probability that it expresses word w:
  p^w_ij = sigmoid(v_w · phi(b_ij) + u_w)
  where phi(b_ij) is the CNN feature for the region and v_w, u_w are per-word weights.
- To calculate the probability that word w appears anywhere in image i (i.e., in bag b_i), combine the region probabilities with a noisy-OR:
  p^w_i = 1 - prod_j (1 - p^w_ij)
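As a concrete illustration, here is a minimal pure-Python sketch of this noisy-OR MIL combination: a per-region sigmoid gives the region probability, and the image-level probability is one minus the product of the per-region complements. The function names, feature vectors, and weights below are hypothetical toy values, not the trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def region_word_prob(features, v_w, u_w):
    """Per-region probability for word w: sigmoid(v_w . phi(b_ij) + u_w)."""
    return sigmoid(sum(f * v for f, v in zip(features, v_w)) + u_w)

def noisy_or(region_probs):
    """Image-level (bag) probability that word w appears anywhere:
    1 - prod_j (1 - p^w_ij)."""
    prod = 1.0
    for p in region_probs:
        prod *= (1.0 - p)
    return 1.0 - prod
```

Note the MIL-friendly behavior of the noisy-OR: a single confident region drives the image-level probability toward 1, while an image whose regions are all near 0 stays near 0.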

Loss
After all this we are left with a vector of word probabilities for the image, which we can compare to the ground truth:
Estimate: [ .01, .03, .01, .90, .01, ..., .10, .80, .60, .01 ]
Truth:    [   0,   0,   0,   1,   0, ...,   0,   1,   1,   0 ]
(the 1-entries correspond to caption words such as woman, crowd, and camera)
Cross-entropy loss is used to optimize the CNN end-to-end, along with the v_w and u_w weights used in calculating the by-region word probability p^w_ij.
Once trained, a global threshold τ is selected: the detected words are those whose image-level probability p^w_i exceeds τ.
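A minimal sketch of this step, with a toy vocabulary and made-up probabilities (the function names are illustrative, not from the paper's code): the per-word binary cross-entropy compares the predicted vector to the 0/1 truth vector, and a global threshold picks the detected words.

```python
import math

def word_cross_entropy(probs, truth):
    """Sum over the vocabulary of the binary cross-entropy between the
    predicted word probabilities and the 0/1 ground-truth vector."""
    eps = 1e-12  # guard against log(0)
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, truth))

def detected_words(vocab, probs, tau):
    """Keep the words whose image-level probability exceeds threshold tau."""
    return [w for w, p in zip(vocab, probs) if p > tau]
```

A confident, correct prediction yields a much lower loss than a uniform one, which is what drives the end-to-end training.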

Word Probability Maps

Word Detection Results: the biggest improvement from MIL comes on concrete objects.

Language Generation & Sentence Re-Ranking

Language Generation
Maximum entropy language model: generates novel image descriptions from a bag of likely words. Trained on 400,000 image descriptions. A search over word sequences is used to find high-likelihood sentences.
Sentence re-ranking: re-ranks the set of candidate sentences by a linear weighting of sentence features. Trained using Minimum Error Rate Training (MERT). One of the features is the Deep Multimodal Similarity Model score.

Maximum Entropy LM
The maximum entropy LM is conditioned on the words generated so far and on which detected words remain unused; each detected word is used at most once. To train the model, the objective function is the log-likelihood of the captions conditioned on the corresponding sets of detected objects. Sentences are generated with a beam search.
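The generation step can be sketched as a left-to-right beam search over the bag of detected words, with each word used at most once. The scoring function passed in below is a stand-in toy scorer; in the paper it would be the trained maximum entropy LM's conditional log-probability.

```python
def beam_search(words, score_fn, beam_width=3, max_len=4):
    """Left-to-right beam search: extend each partial sequence with an
    unused word from the bag, keep the beam_width best partial sequences
    by total log-likelihood, as scored by score_fn(sequence, next_word)."""
    beams = [([], 0.0)]  # (sequence, accumulated log-likelihood)
    for _ in range(max_len):
        candidates = []
        for seq, ll in beams:
            for w in words:
                if w in seq:  # each detected word is used at most once
                    continue
                candidates.append((seq + [w], ll + score_fn(seq, w)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]
```

Keeping several hypotheses alive at each step is what lets the search recover high-likelihood sentences that a purely greedy expansion would miss.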

Sentence Re-Ranking
MERT is used to rank the candidate sentences. It uses a linear combination of features computed over the whole sentence:
- log-likelihood of the sequence
- length of the sequence
- log-probability per word of the sequence
- logarithm of the sequence's rank in the log-likelihood
- 11 binary features indicating how many of the detected objects were mentioned (0 through 10)
- DMSM score between the word sequence and the image
The Deep Multimodal Similarity Model (DMSM) is a feature used by MERT that measures similarity between images and text.

Deep Multimodal Similarity Model (DMSM)
The DMSM is used to improve the quality of the selected sentences. It jointly trains two neural networks that map images and text fragments to a common vector representation: an image vector x_d and a text vector y_d.

Deep Multimodal Similarity Model (DMSM)
Relevance is the cosine similarity between the mapped vectors:
  R(image, text) = cosine(x_d, y_d) = (x_d · y_d) / (||x_d|| ||y_d||)
For every text-image pair, a posterior probability of the text given the image is computed with a softmax over the relevance scores against contrastive (negative) captions, with smoothing factor γ:
  P(y | x) = exp(γ R(x, y)) / Σ_{y'} exp(γ R(x, y'))
The loss function is the negative log posterior of the matching captions.
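A sketch of the relevance and loss computation, assuming the two networks have already produced an image vector and caption vectors (all vectors below are toy values; `gamma` is the softmax smoothing factor, with 10.0 as an illustrative default):

```python
import math

def cosine(x, y):
    """Relevance R: cosine similarity between image and text vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def posterior(image_vec, captions, gamma=10.0):
    """Softmax over smoothed relevance scores: P(caption | image),
    normalized against the other (negative) captions in the list."""
    sims = [math.exp(gamma * cosine(image_vec, c)) for c in captions]
    z = sum(sims)
    return [s / z for s in sims]

def dmsm_loss(image_vec, pos_caption, neg_captions, gamma=10.0):
    """Negative log posterior of the matching caption."""
    probs = posterior(image_vec, [pos_caption] + neg_captions, gamma)
    return -math.log(probs[0])
```

Minimizing this loss pushes matching image-caption pairs toward high cosine similarity relative to the negatives, which is exactly what makes the score useful as a re-ranking feature.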

Results

Questions?