Multimodal Learning. Victoria Dean. MIT 6.S191 Intro to Deep Learning IAP 2017


Multimodal Learning Victoria Dean

Talk outline What is multimodal learning and what are the challenges? Flickr example: joint learning of images and tags Image captioning: generating sentences from images SoundNet: learning sound representation from videos

Talk outline What is multimodal learning and what are the challenges? Flickr example: joint learning of images and tags Image captioning: generating sentences from images SoundNet: learning sound representation from videos

Deep learning success in single modalities

What is multimodal learning? In general, learning that involves multiple modalities. This can manifest itself in different ways: the input is one modality and the output is another; multiple modalities are learned jointly; one modality assists in the learning of another; ...

Data is usually a collection of modalities: multimedia web content, product recommendation systems, robotics.

Why is multimodal learning hard? Different representations: text is discrete and sparse, while images are real-valued and dense.

Why is multimodal learning hard? Beyond the different representations, data is often noisy and modalities can be missing.
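To make the representation contrast concrete, here is a toy NumPy sketch (my own illustration, not from the talk): a one-hot word vector next to a dense image array.

```python
import numpy as np

# Text: discrete and sparse. A word is an index into a vocabulary,
# so its one-hot vector has exactly one nonzero entry.
vocab = ["cat", "dog", "sat", "mat"]
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("cat")] = 1.0

# Images: real-valued and dense. Every pixel carries a continuous value.
image = np.random.rand(64, 64, 3)

print(one_hot)      # [1. 0. 0. 0.]  -- mostly zeros
print(image.size)   # 12288 continuous values, all "active"
```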

How can we solve these problems? Combine separate models for single modalities at a higher level, and pre-train those models on single-modality data. How do we combine these models? Embeddings!

Pretraining: initialize with the weights from another network (instead of random). Even if the task is different, low-level features will still be useful, such as edge and shape filters for images. Example: take the first 5 convolutional layers from a network trained on the ImageNet classification task.
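A minimal PyTorch sketch of that idea, using torchvision's pretrained AlexNet (whose feature extractor happens to contain exactly five conv layers); the output size of 24 is an arbitrary placeholder for the new task:

```python
import torch.nn as nn
from torchvision import models

# Initialize from ImageNet weights instead of random.
backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Freeze the convolutional features learned on ImageNet (edges, shapes, ...).
for p in backbone.features.parameters():
    p.requires_grad = False

# Replace only the final classifier layer for the new task
# (24 output classes here is a made-up placeholder).
backbone.classifier[-1] = nn.Linear(backbone.classifier[-1].in_features, 24)
```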

Embeddings: a way to represent data. In deep learning, this is usually a high-dimensional vector. A neural network can take a piece of data and create a corresponding vector in an embedding space, and it can also take an embedding vector as an input. Example: word embeddings.

Word embeddings: a word embedding maps a word to a high-dimensional vector, and these vectors have interesting properties.
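One common way to see those properties (a sketch using gensim's downloadable GloVe vectors, not anything shown in the lecture; it fetches the vectors over the network on first run):

```python
import gensim.downloader as api

# 50-dimensional GloVe word vectors trained on Wikipedia + Gigaword.
vecs = api.load("glove-wiki-gigaword-50")

# Analogy as vector arithmetic: king - man + woman lands near "queen".
print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```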

Embeddings: we can use embeddings to switch between modalities! In sequence modeling, we saw a sentence embedding used to switch between languages for translation. Similarly, we can have embeddings for images, sound, etc. that allow us to transfer meaning and concepts across modalities.
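A minimal PyTorch sketch of the shared-space idea (toy encoders of my own, not the lecture's models): two encoders map different modalities into the same vector space, where cosine similarity can stand in for shared meaning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 128  # shared embedding dimension

# Stand-in encoders: any per-modality network works,
# as long as each ends in a d-dimensional vector.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, d))
text_encoder = nn.EmbeddingBag(10_000, d)   # averages word-id embeddings

image = torch.rand(1, 3, 64, 64)
words = torch.randint(0, 10_000, (1, 6))    # six word ids

z_img = F.normalize(image_encoder(image), dim=-1)
z_txt = F.normalize(text_encoder(words), dim=-1)
print((z_img * z_txt).sum(-1))              # cosine similarity in the space
```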

Talk outline: What is multimodal learning and what are the challenges? Flickr example: joint learning of images and tags. Image captioning: generating sentences from images. SoundNet: learning sound representation from videos.

Flickr tagging: the task is to map between text (tags) and images, using 1 million images from Flickr, 25,000 of which have tags.

Flickr tagging: model. Pretrain unimodal models and combine them at a higher level.
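As a rough sketch of "combine at a higher level" (a simplified feed-forward stand-in I wrote for illustration; the actual model in the talk may be structured and trained quite differently): each modality gets its own pretrained pathway, and a joint layer sits on top of both unimodal codes.

```python
import torch
import torch.nn as nn

class LateFusionTagger(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=2000, hidden=1024, n_tags=2000):
        super().__init__()
        # Unimodal pathways (pretrained separately in the lecture's recipe).
        self.img_path = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_path = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        # The "higher level": a joint layer over both unimodal codes.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_tags)
        )

    def forward(self, img_feats, tag_counts):
        joint_in = torch.cat(
            [self.img_path(img_feats), self.txt_path(tag_counts)], dim=-1
        )
        return self.joint(joint_in)  # one logit per candidate tag

tagger = LateFusionTagger()
logits = tagger(torch.rand(8, 4096), torch.rand(8, 2000))  # batch of 8
```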

Flickr tagging: example outputs

Flickr tagging: visualization


Talk outline: What is multimodal learning and what are the challenges? Flickr example: joint learning of images and tags. Image captioning: generating sentences from images. SoundNet: learning sound representation from videos.

Example: image captioning

Example: image captioning. The model generates the caption word by word: "A", "A young", "A young girl", "A young girl asleep", ...
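A sketch of that word-by-word generation loop (toy stand-ins of my own for the CNN, vocabulary size, and start token; not the lecture's exact captioner): the image is encoded once to condition the LSTM, which then greedily emits one word at a time.

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 512
cnn = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, d))  # stand-in CNN
lstm = nn.LSTMCell(d, d)
embed = nn.Embedding(vocab_size, d)
to_vocab = nn.Linear(d, vocab_size)

image = torch.rand(1, 3, 224, 224)
h = c = torch.zeros(1, d)
h, c = lstm(cnn(image), (h, c))    # condition the LSTM on the image encoding

token = torch.tensor([0])          # assumed <start> token id
caption = []
for _ in range(20):                # greedy decoding, fixed max length
    h, c = lstm(embed(token), (h, c))
    token = to_vocab(h).argmax(dim=-1)   # most likely next word
    caption.append(token.item())
```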

Human: A young girl asleep on the sofa cuddling a stuffed bear. Computer: A close up of a child holding a stuffed animal. Computer: A baby is asleep next to a teddy bear.

Human: A close up of two bananas with bottles in the background. Computer: A bunch of bananas and a bottle of wine.

Human: A view of inside of a car where a cat is laying down. Computer: A cat sitting on top of a black car.

Human: A green monster kite soaring in a sunny sky. Computer: A man flying through the air while riding a snowboard.

Caption model for neural storytelling: We were barely able to catch the breeze at the beach, and it felt as if someone stepped out of my mind. She was in love with him for the first time in months, so she had no intention of escaping. The sun had risen from the ocean, making her feel more alive than normal. She's beautiful, but the truth is that I don't know what to do. The sun was just starting to fade away, leaving people scattered around the Atlantic Ocean. I'd seen the men in his life, who guided me at the beach once more. (Jamie Kiros, github.com/ryankiros/neural-storyteller)

Talk outline: What is multimodal learning and what are the challenges? Flickr example: joint learning of images and tags. Image captioning: generating sentences from images. SoundNet: learning sound representation from videos.

SoundNet. Idea: learn a sound representation from unlabeled video. We have good vision models that can provide information about unlabeled videos. Can we train a network that takes sound as an input and learns object and scene information? This sound representation could then be used for sound classification tasks.

SoundNet training

SoundNet training. Loss for the sound CNN: D_KL( g(y) || f(x; θ) ), where x is the raw waveform, y is the RGB frames, g(y) is the object or scene distribution from the vision networks, and f(x; θ) is the output from the sound CNN.
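In PyTorch, that teacher-student objective could look like the following sketch (assuming the KL-divergence formulation above; variable names are mine):

```python
import torch.nn.functional as F

def soundnet_loss(sound_logits, vision_probs):
    # D_KL( g(y) || f(x; theta) ): push the sound network's distribution
    # over objects/scenes toward the fixed vision network's prediction.
    log_f = F.log_softmax(sound_logits, dim=-1)      # log of f(x; theta)
    return F.kl_div(log_f, vision_probs, reduction="batchmean")
```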

SoundNet visualization: what audio inputs evoke the maximum output from a given neuron?

SoundNet: visualization of hidden units https://projects.csail.mit.edu/soundnet/

Conclusion. Multimodal tasks are hard: differences in data representation, noisy and missing data. What types of models work well? Composition of unimodal models, with unimodal pretraining. Examples of multimodal tasks: modeling two modalities jointly (Flickr tagging), generating one modality from another (image captioning), and using one modality as labels for the other (SoundNet).