Image-Sentence Multimodal Embedding with Instructive Objectives
Jianhao Wang, Shunyu Yao
IIIS, Tsinghua University
{jh-wang15,

Abstract

Encoding images and sentences into a joint vector space is the first step of the encoder-decoder model for image captioning, and also an important and natural problem in itself. With image and word features pretrained, we present some novel ideas for learning a multimodal embedding space with images and sentences. Setting image feature vectors as the initial hidden layer of an LSTM provides combined embeddings of images and sentences, and therefore new training objectives based on them. Two different layers of image features from a CNN are combined to better represent the information in the image. In experiments, the encoder enables us to rank sentences for a specific image and vice versa, and we show quantitative improvements in the results. Finally, we show some interesting multimodal linguistic regularities in the embedding space.

1 Introduction

Image captioning is an important yet challenging task, involving both image recognition and description generation. It serves as a bridge between visual and language information and is therefore the foundation of search, retrieval and QA systems for images.

As an instance of the general translation problem, image captioning can be cast into the framework of encoder-decoder models. For the encoder, a joint vector space for embedding images and sentences is needed.

On the other hand, embedding is also a popular and important approach to multimodal representation learning, with the following advantages. The representations of texts and images are independent, which allows unimodal uses. The embedding approach can further motivate exploration of the embedding space, such as linguistic regularities and other operations on vectors. Embedding is relatively simple and flexible to extend.
Various applications can be built upon it, such as the bidirectional sentence and image retrieval discussed later. These advantages make embedding a favorable approach for us to work on.

In this paper, we present some new ideas for training the multimodal embedding based on the use of LSTM. Two main observations lead to a refined model:

1. Setting image feature vectors as the initial hidden layer of the LSTM and sentences as the input of the LSTM provides combined embeddings of images and sentences, which can instruct the learning of image embeddings and sentence embeddings.
2. Different layers of a CNN contain different levels of knowledge of the input image.

The experimentation framework introduced in [3] provides evaluation metrics for image retrieval and sentence retrieval. Our new ideas improve these quantitative scores over the benchmarks in [8]. Also, to qualitatively analyze properties of the embedding space trained on images and sentences, we look for linguistic regularities, e.g. *image of man in red* - red + blue ≈ *image of man in blue*. This indicates that the multimodal embedding vector space has a good intrinsic structure left to explore and make use of.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Backgrounds

Generally, for image-sentence multimodal embedding, image features are extracted by a deep convolutional neural network (CNN), while word representations are obtained by the skip-gram model or the continuous bag-of-words model [12]. Sentences can be encoded as the final hidden states of an LSTM, and image feature vectors can be embedded into the space of LSTM hidden states via an affine transformation. The pairwise rank loss, formulated below, is generally used as the objective.

2.1 Pairwise rank loss

For training the multimodal embedding, image-description pairs are given. Images are represented as the top layer (before the softmax) of a CNN trained on the ImageNet classification task, e.g. OxfordNet. Let $D$ be the dimensionality of an image feature vector, $K$ the dimensionality of the embedding space and $V$ the number of words in the vocabulary. Let the image embedding matrix be $W_I \in \mathbb{R}^{K \times D}$ and the word embedding matrix be $W_T \in \mathbb{R}^{K \times V}$. For an image embedding vector $x \in \mathbb{R}^K$ and a sentence embedding vector $v \in \mathbb{R}^K$, define a scoring function (cosine similarity)

$$s(x, v) = \frac{x \cdot v}{\|x\| \, \|v\|} \tag{1}$$

and a contrastive scoring function

$$g(a, b, c) = \max\{0,\ \alpha + s(a, c) - s(a, b)\} \tag{2}$$

where $b$ is a matching term for $a$, $c$ is a contrastive term for $a$, and $\alpha \in [0, 1]$ is a constant threshold. During training, we optimize

$$\min_\theta \sum_{(x,v)} \sum_k \left[ g(x, v, v_k) + g(v, x, x_k) \right] \tag{3}$$

where $\{(x_k, v_k)\}$ is a small set of randomly chosen contrastive terms, resampled every epoch, and $\theta$ denotes all the parameters to learn. Minimizing (3) encourages high cosine similarity between corresponding pairs and discourages it between non-corresponding pairs. The contrastive design of the rank loss is vital because it keeps different information apart in the embedding space.
For example, an objective built only from the matching-pair score $s(x, v)$, with no contrastive terms, will not work, because in practice all the embedding vectors become identical.

2.2 Basic model via LSTM

[Figure 1 diagram. Labels: Image feature, FC, x, Sentence, Sequence of word features, LSTM, v, Loss.]

Figure 1: Basic model MNLM [8]. Image features are from the FC7 layer of vgg19. FC denotes a fully-connected layer. x and v denote image and sentence embeddings respectively. Loss is (3).
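The pairwise rank loss of Section 2.1 can be sketched in a few lines. This is a minimal illustration of (1), (2) and (3) on embeddings that are already computed, not the training code behind the reported results. The function names, the default `alpha=0.2` and the toy vectors are assumptions made for the example, and vectors are assumed to be non-zero.

```python
import math

def cos_sim(a, b):
    """Score (1): cosine similarity of two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def g(a, b, c, alpha=0.2):
    """Hinge term (2): b is the matching term for a, c the contrastive term."""
    return max(0.0, alpha + cos_sim(a, c) - cos_sim(a, b))

def pairwise_rank_loss(pairs, contrastive, alpha=0.2):
    """Objective (3): sum over pairs (x, v) and contrastive terms (x_k, v_k)."""
    total = 0.0
    for x, v in pairs:
        for x_k, v_k in contrastive:
            total += g(x, v, v_k, alpha) + g(v, x, x_k, alpha)
    return total

# Toy 2-d embeddings (assumed values, for illustration only).
aligned = [([1.0, 0.0], [1.0, 0.0])]   # image and sentence point the same way
swapped = [([1.0, 0.0], [0.0, 1.0])]   # image and sentence are orthogonal
contrast = [([0.0, 1.0], [0.0, 1.0])]  # one contrastive pair (x_k, v_k)

loss_aligned = pairwise_rank_loss(aligned, contrast)
loss_swapped = pairwise_rank_loss(swapped, contrast)
```

A well-aligned pair incurs no loss once its score beats every contrastive score by at least the margin, while a misaligned pair is penalized in both directions.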
Long short-term memory (LSTM) [2] is a recurrent neural network that incorporates built-in memory cells to store information and exploit long-range context. As shown above, the basic model for sentence-image multimodal embedding uses an LSTM to take in a sequence of word features. The embedding vector for a sentence is then the hidden state of the LSTM at the last time step. This is the encoder model applied in [8], which achieved state-of-the-art performance in bidirectional sentence and image retrieval tasks at that time.

2.3 Related Works

A large amount of work has been devoted to learning multimodal representations of text and images. Popular approaches include word-image embedding [16, 1] and sentence-image embedding [4, 14, 8]. These are the foundation of our work. Other than embedding approaches, deep Boltzmann machines [15], log-bilinear neural language models [7], autoencoders [13] and other approaches have also been proposed.

For the problem of sentence-image multimodal embedding, different architectures have been applied, and (3) appears to be the general choice of objective function. The deep visual-semantic embedding model [1] embeds sentences as the mean of their word embeddings. The semantic dependency tree recursive neural network [14] uses a tree structure to embed at the sentence level. In deep fragment embeddings [6], descriptions are represented as a bag of dependency parses; fragment objectives are added and object detection is integrated into the framework. [8] introduced the basic LSTM model (MNLM(-vgg)) illustrated in Figure 1, which has been our main reference.

3 Method

3.1 Instructive pairwise rank loss

Based on the LSTM architecture, a natural way to embed an image $I$ and a sentence $S$ together is to set the image embedding vector $x$ as the initial hidden layer $h_0$ of the LSTM and take the word features of $S$ as the input of the LSTM. The final hidden layer is then an output carrying combined information of $I$ and $S$, denoted by $v^{(x)}$.
However, time and space costs would grow quadratically if every image were combined with every sentence, and this combined embedding has no straightforward meaning on its own. (Footnote 1, on the notation $v^{(x)}$: we will use such notation from here on, where the main body and the superscript represent the input and $h_0$ of the LSTM respectively.)

[Figure 2 diagram. Labels: Image feature, x, h_0, LSTM 1, LSTM 2, <ĥ>, v, v^{(x)}, Sentence, Loss 1, Loss 2, Loss, +0.6.]

Figure 2: Model after applying the instructive objective (Model 1). Loss 1 is (3) and Loss 2 is (4). The FC layer for image features and the word embedding step are omitted. <ĥ> denotes the whole sequence of hidden layer vectors produced by LSTM 1, of which v is the last vector.
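The notation $v^{(x)}$ can be made concrete with a tiny LSTM forward pass whose initial hidden state is either zero (giving $v$) or the image embedding $x$ (giving $v^{(x)}$). This is only a sketch under stated assumptions: it uses a single LSTM over the word features as in the text of Section 3.1, whereas Figure 2 feeds the hidden sequence of LSTM 1 into a second LSTM. The weights are random, the cell state starts at zero, and all names (`init_lstm`, `lstm_encode`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_lstm(input_dim, hidden_dim, seed=0):
    """Random weights for the input (i), forget (f), output (o) and candidate (g) gates."""
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {gate: {"W": matrix(hidden_dim, input_dim),
                   "U": matrix(hidden_dim, hidden_dim),
                   "b": [0.0] * hidden_dim}
            for gate in ("i", "f", "o", "g")}

def affine(p, w, h):
    """Compute W w + U h + b for one gate."""
    return [sum(a * wi for a, wi in zip(W_row, w))
            + sum(u * hi for u, hi in zip(U_row, h)) + b
            for W_row, U_row, b in zip(p["W"], p["U"], p["b"])]

def lstm_encode(params, word_vectors, h0=None):
    """Return the last hidden state; h0 defaults to the zero vector."""
    hidden_dim = len(params["i"]["b"])
    h = list(h0) if h0 is not None else [0.0] * hidden_dim
    c = [0.0] * hidden_dim  # assumption: the cell state always starts at zero
    for w in word_vectors:
        i = [sigmoid(z) for z in affine(params["i"], w, h)]
        f = [sigmoid(z) for z in affine(params["f"], w, h)]
        o = [sigmoid(z) for z in affine(params["o"], w, h)]
        g = [math.tanh(z) for z in affine(params["g"], w, h)]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h

params = init_lstm(input_dim=3, hidden_dim=4)
sentence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]  # two toy word feature vectors
x = [0.5, -0.5, 0.25, 0.0]                      # toy image embedding, K = 4

v = lstm_encode(params, sentence)               # plain sentence embedding v
v_x = lstm_encode(params, sentence, h0=x)       # combined embedding v^{(x)}
```

Both outputs live in the same $K$-dimensional space, which is what allows $v^{(x)}$ to be compared with $x$ and $v$ through the score $s$.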
We turn this idea into a new, instructive way of constructing the objective (loss) function:

$$\min_\theta \sum_{(x,v)} \left[ g\left(x, v^{(x)}, v_k^{(x)}\right) + g\left(v, v^{(x)}, v^{(x_k)}\right) \right] \tag{4}$$

where exactly one contrastive term $(x_k, v_k)$ is randomly chosen for each $(x, v)$ pair and is resampled every epoch, whereas in (3) a set of contrastive terms is shared by the whole training set. The change saves computational resources and time.

3.2 Combination of two layers of image features

Different layers of a CNN are believed to store different layers of features and hence different levels of knowledge about an image [18]. Employing two different layers of the CNN as image features should therefore give a better representation for images and consequently for sentences. Our final model is illustrated below.

[Figure 3 diagram. Labels: Image feature 1, Image feature 2, x_1, x_2, h_0, h_0, LSTM 1, LSTM 2, LSTM 3, <ĥ>_1, <ĥ>_2, v, v^{(x)}, Sentence, Loss 1, Loss 2, Loss, +0.6.]

Figure 3: Final model (Model 2). Notations are similar to Model 1. FC layers and the word embedding step are omitted.

In detail, image features 1 and 2 are extracted from the FC6 and FC7 layers of vgg19 respectively. <ĥ>_1 and <ĥ>_2 represent the whole sequences of hidden layers of LSTM 1 and LSTM 2 respectively. $v^{(x)}$ is actually $v^{(x_1,x_2)}$, the instructive vector with features from $v$, $x_1$ and $x_2$. Final image embedding vectors are taken as the concatenation of $x_1$ and $x_2$, and final sentence embedding vectors are taken as the concatenation of $v$ and $v$. Loss 1 is (3) for the final embedding vectors and Loss 2 is a straightforward extension of (4), namely

$$\begin{aligned}
\min_\theta \sum_{(x_1,x_2,v)} \Big[ \; & g\left(x_1, v^{(x_1,x_2)}, v_k^{(x_1,x_2)}\right) + g\left(x_1, v^{(x_1,x_2)}, v^{(x_1,x_{2k})}\right) \\
+ \; & g\left(x_2, v^{(x_1,x_2)}, v_k^{(x_1,x_2)}\right) + g\left(x_2, v^{(x_1,x_2)}, v^{(x_{1k},x_2)}\right) \\
+ \; & g\left(v, v^{(x_1,x_2)}, v^{(x_1,x_{2k})}\right) + g\left(v, v^{(x_1,x_2)}, v^{(x_{1k},x_2)}\right) \Big]
\end{aligned} \tag{5}$$

where exactly one contrastive triple $(x_{1k}, x_{2k}, v_k)$ is randomly chosen for each triple $(x_1, x_2, v)$ from the training set and is resampled every epoch.
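As a rough illustration of the instructive objective (4), the sketch below evaluates it for a small dataset; the extension (5) adds further hinge terms of the same form. Everything here is an assumption made for the example: `toy_encode` is a hypothetical stand-in for the LSTM encoder (a squashed sum of word vectors plus $h_0$), the margin is `alpha=0.2`, and the data are toy values with non-zero embeddings.

```python
import math
import random

def cos_sim(a, b):
    """Cosine similarity (1) of two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def g(a, b, c, alpha=0.2):
    """Hinge term (2): b matches a, c is contrastive for a."""
    return max(0.0, alpha + cos_sim(a, c) - cos_sim(a, b))

def toy_encode(sentence, h0=None):
    """Hypothetical stand-in for the LSTM: tanh(h0 + sum of word vectors)."""
    dim = len(sentence[0])
    start = h0 if h0 is not None else [0.0] * dim
    return [math.tanh(s + sum(w[d] for w in sentence)) for d, s in enumerate(start)]

def instructive_loss(data, encode, rng, alpha=0.2):
    """Objective (4) with exactly one contrastive pair drawn per (x, sentence)."""
    total = 0.0
    for idx, (x, sentence) in enumerate(data):
        others = [j for j in range(len(data)) if j != idx]
        x_k, sentence_k = data[rng.choice(others)]
        v = encode(sentence)            # v
        v_x = encode(sentence, x)       # v^{(x)}
        v_k_x = encode(sentence_k, x)   # v_k^{(x)}: contrastive sentence, same image
        v_xk = encode(sentence, x_k)    # v^{(x_k)}: same sentence, contrastive image
        total += g(x, v_x, v_k_x, alpha) + g(v, v_x, v_xk, alpha)
    return total

data = [
    ([0.5, -0.2], [[0.1, 0.3], [0.2, -0.1]]),
    ([-0.4, 0.6], [[-0.3, 0.2], [0.1, 0.4]]),
]
loss = instructive_loss(data, toy_encode, random.Random(0))

# Degenerate check: if the contrastive pair equals the true pair, each hinge term equals alpha.
duplicated = [data[0], data[0]]
loss_duplicated = instructive_loss(duplicated, toy_encode, random.Random(0))
```

Drawing a single contrastive pair per training pair keeps the number of encoder calls linear in the dataset size, which matches the stated motivation of saving computation.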
The added complexity of Model 2, as shown in the next section, does lead to better performance in retrieval experiments.

4 Experiments

As mentioned in the introduction, bidirectional sentence and image retrieval experiments provide quantitative comparisons, while the exploration of linguistic regularities provides insight into the properties of the embedding space.
4.1 Datasets

Our models are trained and evaluated on the public image-sentence dataset Flickr8K [3], consisting of 8000 images collected from Flickr. Each image is accompanied by 5 descriptive sentences. A standard training, validation and test split is provided. We attempted to employ Flickr30K [17] but were hindered by hardware limits.

4.2 Sentence retrieval and image retrieval

The problem of sentence retrieval is to rank the sentences in a given dataset according to their similarity to a given image, while the problem of image retrieval is to rank the images in a given dataset according to their similarity to a given sentence [3]. They are sometimes alternatively referred to as image annotation and image retrieval, respectively.

Two types of quantitative evaluation are performed. Recall@K (K = 1, 5, 10) is the fraction of images for which a correct description is ranked within the top K retrieved results (and vice versa for sentences; higher is better), and Med r is the median rank of the closest ground-truth result in the ranked list (lower is better).

Sentence and image retrieval were not designed specifically for the multimodal embedding problem, and in fact several different frameworks suit the experiments. For example, [10] tackled the problem as a multimodal matching one and used a CNN-based pipeline, with state-of-the-art performance on the Flickr30K dataset. [9] employed Fisher vectors and other advanced approaches, with state-of-the-art performance on the Flickr8K dataset. So it is appropriate to focus our comparisons on embedding models. More specifically, our main comparison is MNLM-vgg [8], whose architecture is our basic model in 2.2. Our Model 2 outperforms all other embedding models listed in the first part of the table below.

Table 1: Sentence retrieval and image retrieval experiments.
Columns: Sentence Retrieval (R@1, R@5, R@10, Med r), Image Retrieval (R@1, R@5, R@10, Med r).
Rows, in order: Random ranking; DeViSE [1]; SDT-RNN [14]; DeFrag [6]; MNLM [8]; MNLM-vgg [8]; Model 1; Model 2; m-RNN [11]; FV (GMM + HGLMM) [9]; m-CNN_ENG-vgg [10].
[The numeric entries of Table 1 are missing from this transcription.]

4.3 Linguistic regularities

[8] noted that the LSTM encoder is less suited to discovering linguistic regularities than a linear representation of sentences, i.e. $v = \sum_{i=1}^{N} w_i$ where the $w_i$ are the word embeddings of the sentence. However, we still find certain linguistic regularities in the trained multimodal embedding space. Formally, given a query image $x$, a negative word $w_n$ and a positive word $w_p$ (all with unit norm), we seek an image

$$x^* = \arg\max_{x'} \; s(x', x - w_n + w_p). \tag{6}$$

By retrieving the top-5 nearest images and picking out the most reasonable one by hand, we select some typical examples below. The only result not within the top-5 retrievals is at row 3, indicating the difficulty of action detection. The successful extraction at row 1 is not easy given the fairly small set of images containing cars. Row 4 is interesting as it preserves the scene of the images (snow). The last row demonstrates a retrieval based on a single word.

Table 2: Linguistic regularities. Columns: x, x*, type, w_n, w_p (the image columns x and x* are not preserved in this transcription).
Row 1: color, -yellow, +blue
Row 2: place, -grass, +snow
Row 3: action, -riding, +walking
Row 4: object, -people, +dog
Row 5: sitting

It is observed that the original image $x$ often resides within the top-5 nearest images to $x - w_n + w_p$; the reason might be that $x$ is distinctive in its own direction. More work is needed in this area.

5 Discussion

Two observations lead to an instructive objective as well as a refined model for sentence-image embedding based on LSTM. We argue that this instructive idea, i.e. a combined embedding of an image and a sentence, may be useful in other situations, which we leave for interested readers to explore.
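Both the retrieval evaluation of Section 4.2 and the regularity query (6) of Section 4.3 reduce to ranking candidates by cosine similarity. The sketch below shows one way to compute Recall@K, Med r and the top-5 candidates for (6) from precomputed embeddings. The helper names and toy vectors are assumptions made for the example, and this is not the evaluation script behind Table 1.

```python
import math

def cos_sim(a, b):
    """Cosine similarity (1) of two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def ranked_indices(query, candidates):
    """Candidate indices sorted from most to least similar to the query."""
    return sorted(range(len(candidates)),
                  key=lambda j: cos_sim(query, candidates[j]), reverse=True)

def best_rank(query, candidates, correct):
    """1-based rank of the closest ground-truth candidate (correct is a set of indices)."""
    order = ranked_indices(query, candidates)
    return min(order.index(j) for j in correct) + 1

def recall_at_k(ranks, k):
    """Fraction of queries whose closest ground truth is within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    """Med r: median of the per-query ranks."""
    s = sorted(ranks)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def analogy_query(x, w_neg, w_pos, images, top=5):
    """Query (6): indices of the images closest to x - w_n + w_p."""
    target = [xi - ni + pi for xi, ni, pi in zip(x, w_neg, w_pos)]
    return ranked_indices(target, images)[:top]

# Toy data (assumed values).
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rank_single = best_rank([1.0, 0.1], candidates, {1})
rank_multi = best_rank([1.0, 0.1], candidates, {1, 2})

ranks = [1, 3, 2, 7]
r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
med_r = median_rank(ranks)

images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
nearest = analogy_query([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], images)
```

Passing a set of ground-truth indices to `best_rank` mirrors the definition of Med r as the rank of the closest ground-truth result, which matters on Flickr8K because each image has 5 sentences.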
From word embedding to sentence embedding, a two-step embedding method is applied. In fact, it is popular to first parse words into semantic fragments and then into a sentence [14]. Embedding on different levels is an interesting topic to explore.

Regarding image features, attention-based models and object detection models can be integrated into the framework, enriching the quality of the features captured. Alignments between parts of captions and images can significantly empower the model [5].

Image captioning and sentence-image embedding are two important problems, where embedding can be viewed as the first part of an encoder-decoder model for generating captions. In return, image captioning equipped with one unimodal embedding might be able to learn a multimodal embedding. More thorough investigation is still needed.

References

[1] FROME, A., CORRADO, G. S., SHLENS, J., BENGIO, S., DEAN, J., MIKOLOV, T., ET AL. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems (2013).
[2] HOCHREITER, S., AND SCHMIDHUBER, J. Long short-term memory. Neural Computation 9, 8 (1997).
[3] HODOSH, M., YOUNG, P., AND HOCKENMAIER, J. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Int. Res. 47, 1 (May 2013).
[4] KARPATHY, A., AND FEI-FEI, L. Deep visual-semantic alignments for generating image descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015).
[5] KARPATHY, A., AND FEI-FEI, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
[6] KARPATHY, A., JOULIN, A., AND LI, F. F. F. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems (2014).
[7] KIROS, R., SALAKHUTDINOV, R., AND ZEMEL, R. Multimodal neural language models. Eprint Arxiv (2014).
[8] KIROS, R., SALAKHUTDINOV, R., AND ZEMEL, R. S. Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs/ (2014).
[9] KLEIN, B., LEV, G., SADEH, G., AND WOLF, L. Associating neural word embeddings with deep image representations using fisher vectors.
[10] MA, L., LU, Z., SHANG, L., AND LI, H. Multimodal convolutional neural networks for matching image and sentence. Computer Science (2015).
[11] MAO, J., XU, W., YANG, Y., WANG, J., AND YUILLE, A. L. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv: (2014).
[12] MIKOLOV, T., CHEN, K., CORRADO, G., AND DEAN, J. Efficient estimation of word representations in vector space. CoRR abs/ (2013).
[13] NGIAM, J., KHOSLA, A., KIM, M., NAM, J., LEE, H., AND NG, A. Y. Multimodal deep learning. In International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July (2011).
[14] SOCHER, R., KARPATHY, A., LE, Q. V., MANNING, C. D., AND NG, A. Y. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2 (2014).
[15] SRIVASTAVA, N., AND SALAKHUTDINOV, R. Multimodal learning with deep boltzmann machines. Journal of Machine Learning Research 15, 8 (2014).
[16] WESTON, J., BENGIO, S., AND USUNIER, N. Large scale image annotation: Learning to rank with joint word-image embeddings. In European Conference on Machine Learning (2010).
[17] YOUNG, P., LAI, A., HODOSH, M., AND HOCKENMAIER, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Nlp.cs.illinois.edu (2014).
[18] YU, W., YANG, K., YAO, H., SUN, X., AND XU, P. Exploiting the complementary strengths of multi-layer cnn features for image retrieval. Neurocomputing (2016).
More informationSentiment Classification of Food Reviews
Sentiment Classification of Food Reviews Hua Feng Department of Electrical Engineering Stanford University Stanford, CA 94305 fengh15@stanford.edu Ruixi Lin Department of Electrical Engineering Stanford
More informationPart Localization by Exploiting Deep Convolutional Networks
Part Localization by Exploiting Deep Convolutional Networks Marcel Simon, Erik Rodner, and Joachim Denzler Computer Vision Group, Friedrich Schiller University of Jena, Germany www.inf-cv.uni-jena.de Abstract.
More informationBidirectional Recurrent Convolutional Networks for Video Super-Resolution
Bidirectional Recurrent Convolutional Networks for Video Super-Resolution Qi Zhang & Yan Huang Center for Research on Intelligent Perception and Computing (CRIPAC) National Laboratory of Pattern Recognition
More informationHierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network
Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network Feng Mao [0000 0001 6171 3168], Xiang Wu [0000 0003 2698 2156], Hui Xue, and Rong Zhang Alibaba Group, Hangzhou, China
More informationGenerating Natural Video Descriptions via Multimodal Processing
INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Generating Natural Video Descriptions via Multimodal Processing Qin Jin 1,2, Junwei Liang 3, Xiaozhu Lin 1 1 Multimedia Computing Lab, School of
More informationCOMPUTER vision is moving from predicting discrete,
1 Learning Two-Branch Neural Networks for Image-Text Matching Tasks Liwei Wang, Yin Li, Jing Huang, Svetlana Lazebnik arxiv:1704.03470v3 [cs.cv] 28 Dec 2017 Abstract Image-language matching tasks have
More informationCEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015
CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015 Etienne Gadeski, Hervé Le Borgne, and Adrian Popescu CEA, LIST, Laboratory of Vision and Content Engineering, France
More informationDeep Learning for Program Analysis. Lili Mou January, 2016
Deep Learning for Program Analysis Lili Mou January, 2016 Outline Introduction Background Deep Neural Networks Real-Valued Representation Learning Our Models Building Program Vector Representations for
More informationMachine Learning 13. week
Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of
More informationContextual Dropout. Sam Fok. Abstract. 1. Introduction. 2. Background and Related Work
Contextual Dropout Finding subnets for subtasks Sam Fok samfok@stanford.edu Abstract The feedforward networks widely used in classification are static and have no means for leveraging information about
More informationGenerating Images from Captions with Attention. Elman Mansimov Emilio Parisotto Jimmy Lei Ba Ruslan Salakhutdinov
Generating Images from Captions with Attention Elman Mansimov Emilio Parisotto Jimmy Lei Ba Ruslan Salakhutdinov Reasoning, Attention, Memory workshop, NIPS 2015 Motivation To simplify the image modelling
More informationRecovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform. Xintao Wang Ke Yu Chao Dong Chen Change Loy
Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform Xintao Wang Ke Yu Chao Dong Chen Change Loy Problem enlarge 4 times Low-resolution image High-resolution image Previous
More informationContent-Based Image Recovery
Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose
More informationDeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material
DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material Yi Li 1, Gu Wang 1, Xiangyang Ji 1, Yu Xiang 2, and Dieter Fox 2 1 Tsinghua University, BNRist 2 University of Washington
More informationCOMPUTER vision is moving from predicting discrete,
1 Learning Two-Branch Neural Networks for Image-Text Matching Tasks Liwei Wang, Yin Li, Svetlana Lazebnik arxiv:1704.03470v2 [cs.cv] 18 Apr 2017 Abstract This paper investigates two-branch neural networks
More informationCombining Selective Search Segmentation and Random Forest for Image Classification
Combining Selective Search Segmentation and Random Forest for Image Classification Gediminas Bertasius November 24, 2013 1 Problem Statement Random Forest algorithm have been successfully used in many
More informationLayerwise Interweaving Convolutional LSTM
Layerwise Interweaving Convolutional LSTM Tiehang Duan and Sargur N. Srihari Department of Computer Science and Engineering The State University of New York at Buffalo Buffalo, NY 14260, United States
More informationYOLO9000: Better, Faster, Stronger
YOLO9000: Better, Faster, Stronger Date: January 24, 2018 Prepared by Haris Khan (University of Toronto) Haris Khan CSC2548: Machine Learning in Computer Vision 1 Overview 1. Motivation for one-shot object
More informationWord2vec and beyond. presented by Eleni Triantafillou. March 1, 2016
Word2vec and beyond presented by Eleni Triantafillou March 1, 2016 The Big Picture There is a long history of word representations Techniques from information retrieval: Latent Semantic Analysis (LSA)
More informationECCV Presented by: Boris Ivanovic and Yolanda Wang CS 331B - November 16, 2016
ECCV 2016 Presented by: Boris Ivanovic and Yolanda Wang CS 331B - November 16, 2016 Fundamental Question What is a good vector representation of an object? Something that can be easily predicted from 2D
More informationCROSS-MODAL TRANSFER WITH NEURAL WORD VECTORS FOR IMAGE FEATURE LEARNING
CROSS-MODAL TRANSFER WITH NEURAL WORD VECTORS FOR IMAGE FEATURE LEARNING Go Irie, Taichi Asami, Shuhei Tarashima, Takayuki Kurozumi, Tetsuya Kinebuchi NTT Corporation ABSTRACT Neural word vector (NWV)
More informationarxiv: v1 [cs.cv] 2 Sep 2018
Natural Language Person Search Using Deep Reinforcement Learning Ankit Shah Language Technologies Institute Carnegie Mellon University aps1@andrew.cmu.edu Tyler Vuong Electrical and Computer Engineering
More informationarxiv: v2 [cs.cv] 23 Mar 2017
Recurrent Memory Addressing for describing videos Arnav Kumar Jain Abhinav Agarwalla Kumar Krishna Agrawal Pabitra Mitra Indian Institute of Technology Kharagpur {arnavkj95, abhinavagarawalla, kumarkrishna,
More informationA FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen
A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS Kuan-Chuan Peng and Tsuhan Chen School of Electrical and Computer Engineering, Cornell University, Ithaca, NY
More informationExploiting noisy web data for largescale visual recognition
Exploiting noisy web data for largescale visual recognition Lamberto Ballan University of Padova, Italy CVPRW WebVision - Jul 26, 2017 Datasets drive computer vision progress ImageNet Slide credit: O.
More informationDeep Learning for Computer Vision II
IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L
More informationChannel Locality Block: A Variant of Squeeze-and-Excitation
Channel Locality Block: A Variant of Squeeze-and-Excitation 1 st Huayu Li Northern Arizona University Flagstaff, United State Northern Arizona University hl459@nau.edu arxiv:1901.01493v1 [cs.lg] 6 Jan
More informationConditional Random Fields as Recurrent Neural Networks
BIL722 - Deep Learning for Computer Vision Conditional Random Fields as Recurrent Neural Networks S. Zheng, S. Jayasumana, B. Romera-Paredes V. Vineet, Z. Su, D. Du, C. Huang, P.H.S. Torr Introduction
More informationarxiv: v1 [cs.mm] 12 Jan 2016
Learning Subclass Representations for Visually-varied Image Classification Xinchao Li, Peng Xu, Yue Shi, Martha Larson, Alan Hanjalic Multimedia Information Retrieval Lab, Delft University of Technology
More informationGradient of the lower bound
Weakly Supervised with Latent PhD advisor: Dr. Ambedkar Dukkipati Department of Computer Science and Automation gaurav.pandey@csa.iisc.ernet.in Objective Given a training set that comprises image and image-level
More informationHide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization
Hide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization Krishna Kumar Singh and Yong Jae Lee University of California, Davis ---- Paper Presentation Yixian
More informationImage Caption with Global-Local Attention
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) Image Caption with Global-Local Attention Linghui Li, 1,2 Sheng Tang, 1, Lixi Deng, 1,2 Yongdong Zhang, 1 Qi Tian 3
More informationREGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION
REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION Kingsley Kuan 1, Gaurav Manek 1, Jie Lin 1, Yuan Fang 1, Vijay Chandrasekhar 1,2 Institute for Infocomm Research, A*STAR, Singapore 1 Nanyang Technological
More informationMultimodal topic model for texts and images utilizing their embeddings
Multimodal topic model for texts and images utilizing their embeddings Nikolay Smelik, smelik@rain.ifmo.ru Andrey Filchenkov, afilchenkov@corp.ifmo.ru Computer Technologies Lab IDP-16. Barcelona, Spain,
More informationCS231N Section. Video Understanding 6/1/2018
CS231N Section Video Understanding 6/1/2018 Outline Background / Motivation / History Video Datasets Models Pre-deep learning CNN + RNN 3D convolution Two-stream What we ve seen in class so far... Image
More informationBinary Convolutional Neural Network on RRAM
Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua
More informationarxiv: v1 [cs.cv] 31 Mar 2016
Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arxiv:1603.09742v1 [cs.cv] 31 Mar 2016 University of Southern California Abstract.
More informationRecurrent Neural Networks
Recurrent Neural Networks Javier Béjar Deep Learning 2018/2019 Fall Master in Artificial Intelligence (FIB-UPC) Introduction Sequential data Many problems are described by sequences Time series Video/audio
More informationRUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation
RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation Xirong Li 1, Qin Jin 1, Shuai Liao 1, Junwei Liang 1, Xixi He 1, Yujia Huo 1, Weiyu Lan 1, Bin Xiao 2, Yanxiong Lu
More informationDeep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia
Deep learning for dense per-pixel prediction Chunhua Shen The University of Adelaide, Australia Image understanding Classification error Convolution Neural Networks 0.3 0.2 0.1 Image Classification [Krizhevsky
More informationInternet of things that video
Video recognition from a sentence Cees Snoek Intelligent Sensory Information Systems Lab University of Amsterdam The Netherlands Internet of things that video 45 billion cameras by 2022 [LDV Capital] 2
More informationLecture 7: Semantic Segmentation
Semantic Segmentation CSED703R: Deep Learning for Visual Recognition (207F) Segmenting images based on its semantic notion Lecture 7: Semantic Segmentation Bohyung Han Computer Vision Lab. bhhanpostech.ac.kr
More information