Multimodal topic model for texts and images utilizing their embeddings

Similar documents
Multimodal Medical Image Retrieval based on Latent Topic Modeling

Automatic Image Annotation and Retrieval Using Hybrid Approach

Introduction to Information Retrieval

Semantic image search using queries

3. Replace any row by the sum of that row and a constant multiple of any other row.

PLSA DRIVEN IMAGE ANNOTATION, CLASSIFICATION, AND TOURISM RECOMMENDATION. Konstantinos Pliakos and Constantine Kotropoulos

Semantic Image Retrieval Using Correspondence Topic Model with Background Distribution

Gradient of the lower bound

A Novel Model for Semantic Learning and Retrieval of Images

Metric Learning for Large Scale Image Classification:

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

A Deep Relevance Matching Model for Ad-hoc Retrieval

TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation

CS145: INTRODUCTION TO DATA MINING

Autoencoder. Representation learning (related to dictionary learning) Both the input and the output are x

Using Maximum Entropy for Automatic Image Annotation

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*

CS230: Lecture 3 Various Deep Learning Topics

Deep Learning and Its Applications

Metric Learning for Large-Scale Image Classification:

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

ON-LINE GENRE CLASSIFICATION OF TV PROGRAMS USING AUDIO CONTENT

Word2vec and beyond. presented by Eleni Triantafillou. March 1, 2016

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Mining Web Data. Lijun Zhang

Parallelism for LDA Yang Ruan, Changsi An

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

arxiv: v1 [cs.cv] 1 Feb 2017

Mining Human Trajectory Data: A Study on Check-in Sequences. Xin Zhao Renmin University of China,

Probabilistic Generative Models for Machine Vision

IPL at ImageCLEF 2017 Concept Detection Task

Integrating Visual and Textual Cues for Query-by-String Word Spotting

Continuous Visual Vocabulary Models for plsa-based Scene Recognition

Detecting Thoracic Diseases from Chest X-Ray Images Binit Topiwala, Mariam Alawadi, Hari Prasad { topbinit, malawadi, hprasad

Supervised classification of law area in the legal domain

A bipartite graph model for associating images and text

Text Modeling with the Trace Norm

Machine Learning / Jan 27, 2010

What was Monet seeing while painting? Translating artworks to photo-realistic images M. Tomei, L. Baraldi, M. Cornia, R. Cucchiara

Unsupervised Outlier Detection and Semi-Supervised Learning

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Recognition. Topics that we will try to cover:

Non-negative Matrix Factorization for Multimodal Image Retrieval

Using Semantic Similarity in Crawling-based Web Application Testing. (National Taiwan Univ.)

Multilabel Learning for Automatic Web Services Tagging

Face detection and recognition. Detection Recognition Sally

Paper2vec: Combining Graph and Text Information for Scientific Paper Representation

Impact of Term Weighting Schemes on Document Clustering A Review

Grounded Compositional Semantics for Finding and Describing Images with Sentences

Mining Web Data. Lijun Zhang

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Rich feature hierarchies for accurate object detection and semant

Weakly Supervised Object Recognition with Convolutional Neural Networks

Large Scale Data Analysis Using Deep Learning

Clustering and The Expectation-Maximization Algorithm

Neural Networks. Neural Network. Neural Network. Neural Network 2/21/2008. Andrew Kusiak. Intelligent Systems Laboratory Seamans Center

Recurrent Convolutional Neural Networks for Scene Labeling

Entity and Knowledge Base-oriented Information Retrieval

Modeling semantic aspects for cross-media image indexing

Semi-Supervised Learning of Visual Classifiers from Web Images and Text

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

Knowledge Discovery and Data Mining 1 (VO) ( )

Image-Sentence Multimodal Embedding with Instructive Objectives

CS249: ADVANCED DATA MINING

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba

A Neuro Probabilistic Language Model Bengio et. al. 2003

Feature selection. LING 572 Fei Xia

Spatial Latent Dirichlet Allocation

Deep Learning for Computer Vision

Facial expression recognition using shape and texture information

Supplementary A. Overview. C. Time and Space Complexity. B. Shape Retrieval. D. Permutation Invariant SOM. B.1. Dataset

Supervised Hashing for Image Retrieval via Image Representation Learning

Final Report: Keyword extraction from text

jldadmm: A Java package for the LDA and DMM topic models

Classify My Social Contacts into Circles Stanford University CS224W Fall 2014

Exploring archives with probabilistic models: Topic modelling for the European Commission Archives

Bilevel Sparse Coding

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Machine Learning Part 1

Recommendation System Using Yelp Data CS 229 Machine Learning Jia Le Xu, Yingran Xu

CMPSCI 646, Information Retrieval (Fall 2003)

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 2013 ISSN:

Supervised vs unsupervised clustering

Describable Visual Attributes for Face Verification and Image Search

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

Detecting Bone Lesions in Multiple Myeloma Patients using Transfer Learning

PROBLEM 4

All You Want To Know About CNNs. Yukun Zhu

CSE 258. Web Mining and Recommender Systems. Advanced Recommender Systems

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

Combining Neural Networks and Log-linear Models to Improve Relation Extraction

Perceptron: This is convolution!

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text

ImageCLEF 2011

Improving Recognition through Object Sub-categorization

RSDC 09: Tag Recommendation Using Keywords and Association Rules

Building and Using a Semantivisual Image Hierarchy

Combining Selective Search Segmentation and Random Forest for Image Classification

WEKA homepage.

Transcription:

Multimodal topic model for texts and images utilizing their embeddings Nikolay Smelik, smelik@rain.ifmo.ru Andrey Filchenkov, afilchenkov@corp.ifmo.ru Computer Technologies Lab IDP-16. Barcelona, Spain, 10 14 Oct 2016

Topic model Topic model is defined as a model of document collection that put into correspondence each document in the collection and a topic. A document is not only a textual document, but any structured object represented by a set of elements, such as, for instance, user described with preferences. Illustration is taken from Blei, D. (2012) Probabilistic Topic Models 2 / 25

Multimodal topic models for text and images Multimodal topic models for text and images are useful for Text annotating and image illustrating Text search given image and image search given description Topic clustering Image synthesis from textual description 3 / 25

Research goal The goal of this research is to increase quality of image annotating and search given textual description. We achieve this goal by creating a multimodal topic model that utilize word embedding; convolutional neural network image representation. 4 / 25

Related works CorLDA 1 Image feature are extracted from segments Correspondence LDA as a topic model MixLDA 2 Image feature are extraction with SIFT and clustered Multimodal LDA as a topic model slda 3 Image feature are extraction with SIFT and clustered Supervised LDA as a topic model 1 BleiD. M., Jordan M.I. (2003) Modeling annotated data // SIGIR. 127-134. 2 Feng Y., Lapata M. (2010) Topic models for image annotation and text illustration // Human Language Technologies. 831-839. 3 Wang C., Blei D., Li F.F. (2009) Simultaneous image classification and annotation // CVPR. 1903-1910. 5 / 25

Steps to build the model Image preprocessing Last convolutional layer vector extraction (image vector) Text preprocessing Word embedding vector extraction (word vector) Topic model learning given set of image vectors described by word vectors 6 / 25

Topic model learning step Image vector i is considered as a pseudo document, in which words are word vectors w of words from the image description TF-IDF is used to describe probability of words in a psedodocument A matrix F = p wi W I is build, where p wi = tfidf w, i, I F is represented as a multiplication of matrices Φ and Θ: F ΦΘ. 7 / 25

Topic model learning step: F decomposition (1/2) Representing F as F ΦΘ is matrix decomposition, where Φ represents conditional distributions on words given topic Θ represents conditional distributions on topics given images This problem is solved by maximizing log-likelihood given constraints on normalization and non-negativity of rows: 8 / 25

Topic model learning step: F decomposition (2/2) The problem met by the described approach is that there are many (maybe, infinite number of) solutions to F ΦΘ. This problem can be handled by adding model regularization for Θ and Φ: R i Φ, Θ. The problem is thus reduced to the problem of maximizing a linear combination of L and all R i under the same restrictions: 9 / 25

Scheme for image annotating Input is an image Image vector i input is evaluated A closest image vector i NN from all the known image vectors is found Most probable topics are evaluated for i NN Words, relevant to the found topics are extracted Output are these words 10 / 25

Scheme for image search given annotations Input is an annotation Word vector w input is evaluated A closest word vector w NN from all the known word vectors is found Most probable topics are evaluated for w NN Image vectors, relevant to the found topics are extracted Output are these images 11 / 25

Dataset Microsoft Common Object in Context 21000 images At least five annotation for each image Vocabulary size is 6000 12 / 25

Dataset Microsoft Common Object in Context 21000 images At least five annotation for each image Vocabulary size is 6000 13 / 25

Example 14 / 25

Dataset Microsoft Common Object in Context 21000 images At least five annotation for each image Vocabulary size is 6000 15 / 25

Image preprocessing step Convolutional neural network without fully-connected layers We used a learnt network VGG-16 All the images are compressed to size 224 224 As a result, we obtain an image vector 16 / 25

Text preprocessing step All symbols that are not letters/digits are filtered out Words are extracted Stop-words are eliminated Normal form of words is used A learnt Word2Vec from Gensim library is used For each word, word vector is obtained. 17 / 25

Model quality evaluation Perplexity: P = exp 1 n σ i I σ w i n iw ln p w i ; Matrices sparsity S Φ and S Θ ; Purity: K p = σ w Wt p w t ; Contrast: K c = 1 W t σ w W t p(t w). Model P S Φ S Θ K p K c ARTM 70.312 96.5 88.6 0.889 0.831 PLSA 84.596 82.1 84.6 0.461 0.656 18 / 25

Image annotating results Data: MS COCO dataset Data split: 80/20 Method of evaluation: comparison of top 10 words predicted by a model with top 10 words from description ranked with tf Model Recall Precision F 1 -measure CorrLDA 34.83 37.85 36.27 MixLDA 35.20 37.98 36.54 slda 35.63 38.46 36.99 PLSA 35.94 38.02 36.92 ARTM 40.43 43.37 41.85 19 / 25

Example of image annotating Model Recall Precision F 1 -measure CorrLDA 34.83 37.85 36.27 MixLDA 35.20 37.98 36.54 slda 35.63 38.46 36.99 PLSA 35.94 38.02 36.92 ARTM 40.43 43.37 41.85 20 / 25

Image search results Data: MS COCO dataset Image pool size is 5 Method of evaluation: percentage of found pairs (image description) Model Accuracy MixLDA 43.5 PLSA 55.8 ARTM 60.4 21 / 25

Image search example Input description: Men wearing baseball equipment on a baseball field 22 / 25

Image search example Input description: Men wearing baseball equipment on a baseball field 23 / 25

Conclusion We suggested a multimodal topic model for texts and images utilizing CNNs and Word2Vec We showed that the model outperforms state-of-the-art approaches in image annotating and image search Our future work is no use word vectors and image vectors as frequency vectors and build a topic model for contexts 24 / 25

Thank you! Multimodal topic model for texts and images utilizing their embeddings Nikolay Smelik and Andrey Filchenkov, smelik@rain.ifmo.ru, afilchenkov@corp.ifmo.ru