Image-Sentence Multimodal Embedding with Instructive Objectives


Jianhao Wang, Shunyu Yao
IIIS, Tsinghua University
{jh-wang15,

Abstract

Encoding images and sentences into a joint vector space is the first step of the encoder-decoder model for image captioning, and an important and natural problem in itself. With image and word features pretrained, we present some novel ideas for learning a multimodal embedding space for images and sentences. Setting image feature vectors as the initial hidden state of an LSTM provides combined embeddings of images and sentences, and therefore new training objectives based on them. Two different layers of image features from a CNN are combined to better represent the information in the image. In experiments, the encoder lets us rank sentences for a given image and vice versa, and we report quantitative improvements in the results. Finally, we show some interesting multimodal linguistic regularities in the embedding space.

1 Introduction

Image captioning is an important yet challenging task that involves both image recognition and description generation. It serves as a bridge between visual and language information, and is therefore a foundation of search, retrieval and QA systems for images. As an instance of the general translation problem, image captioning can be cast into the framework of encoder-decoder models. For the encoder, a joint vector space for embedding images and sentences is needed.

On the other hand, embedding is also a popular and important approach to multimodal representation learning, with the following advantages. The representations of texts and images are independent, which allows unimodal use. The embedding approach also motivates exploration of the embedding space itself, such as linguistic regularities and other operations on vectors. Embedding is relatively simple and flexible to extend.
Various applications can be built upon it, such as the bidirectional sentence and image retrieval discussed later. These advantages put embedding in a favorable position for us to work on.

In this paper, we present some new ideas for training a multimodal embedding based on LSTM. Two main observations lead to a refined model:

1. Setting image feature vectors as the initial hidden state of an LSTM and sentences as its input provides combined embeddings of images and sentences, which can instruct the learning of image embeddings and sentence embeddings.
2. Different layers of a CNN contain different levels of knowledge about the input image.

The experimental framework introduced in [3] provides evaluation metrics for image retrieval and sentence retrieval. Our new ideas improve these quantitative scores over the benchmarks in [8]. Also, to qualitatively analyze the embedding space trained on images and sentences, we look for linguistic regularities, e.g. *image of man in red* - red + blue ≈ *image of man in blue*. This indicates that the multimodal embedding space has a good intrinsic structure left to explore and exploit.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Backgrounds

For image-sentence multimodal embedding, image features are generally extracted by a deep convolutional neural network (CNN), while word representations are obtained by the skip-gram model or the continuous bag-of-words model [12]. Sentences can be encoded as the final hidden state of an LSTM, and image feature vectors can be embedded into the space of LSTM hidden states via an affine transformation. The pairwise rank loss, formulated below, is generally used as the objective.

2.1 Pairwise rank loss

To train the multimodal embedding, image-description pairs are given. Images are represented by the top layer (before the softmax) of a CNN trained on the ImageNet classification task, e.g. OxfordNet. Let $D$ be the dimensionality of an image feature vector, $K$ the dimensionality of the embedding space and $V$ the number of words in the vocabulary. Let the image embedding matrix be $W_I \in \mathbb{R}^{K \times D}$ and the word embedding matrix be $W_T \in \mathbb{R}^{K \times V}$. For an image embedding vector $x \in \mathbb{R}^K$ and a sentence embedding vector $v \in \mathbb{R}^K$, define a scoring function (cosine similarity)

$$ s(x, v) = \frac{x \cdot v}{\|x\| \, \|v\|} \tag{1} $$

and a contrastive scoring function

$$ g(a, b, c) = \max\{0, \; \alpha + s(a, c) - s(a, b)\} \tag{2} $$

where $b$ is a matching term for $a$, $c$ is a contrastive term for $a$, and $\alpha \in [0, 1]$ is a constant threshold. During training, we optimize

$$ \min_\theta \sum_{(x, v)} \sum_k g(x, v, v_k) + g(v, x, x_k) \tag{3} $$

where $\{(x_k, v_k)\}$ is a small set of randomly chosen contrastive terms, resampled every epoch, and $\theta$ denotes all the parameters to learn. This objective encourages cosine similarity between corresponding pairs and discourages it between non-corresponding pairs. The contrastive design of the rank loss is vital because it keeps the embeddings of different content apart.
For example, $\min_\theta \sum_{(x, v)} \sum_k -s(x, v)$, which only rewards similarity between matching pairs, will not work, because in practice all the embedding vectors become identical.

2.2 Basic model via LSTM

[Figure 1 diagram; labels: Image feature, FC, x, Sequence of word features, Sentence, LSTM, v, Loss]

Figure 1: Basic model MNLM [8]. Image features are from the FC7 layer of vgg19. FC denotes a fully-connected layer. x and v denote the image and sentence embeddings respectively. Loss is (3).
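As a minimal sketch of the pairwise rank loss of Section 2.1, the NumPy code below evaluates (1) to (3) on toy embeddings. The function names, the margin `alpha=0.2` and the toy data are assumptions made for illustration; they are not taken from the paper, and no gradient step is shown.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity s(a, b), as in equation (1)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def g(a, b, c, alpha=0.2):
    """Contrastive score of equation (2): b matches a, c is a contrast for a.
    alpha=0.2 is an assumed margin, not a value taken from the text."""
    return max(0.0, alpha + cos_sim(a, c) - cos_sim(a, b))

def pairwise_rank_loss(X, V, contrast_idx, alpha=0.2):
    """Equation (3). Row i of X (images) is paired with row i of V (sentences);
    contrast_idx[i] lists the indices k of the contrastive pairs for pair i."""
    total = 0.0
    for i, (x, v) in enumerate(zip(X, V)):
        for k in contrast_idx[i]:
            total += g(x, v, V[k], alpha) + g(v, x, X[k], alpha)
    return total

# Toy data: three pairs, each contrasted against the other two.
contrast_idx = [[1, 2], [0, 2], [0, 1]]

# Well-separated embeddings: matching pairs coincide, others are orthogonal.
separated = np.eye(3)
loss_separated = pairwise_rank_loss(separated, separated, contrast_idx)

# Collapsed embeddings: every image and sentence maps to the same vector.
collapsed = np.ones((3, 3))
loss_collapsed = pairwise_rank_loss(collapsed, collapsed, contrast_idx)

print(loss_separated, loss_collapsed)
```

In this toy setting the separated embeddings incur zero loss, while the collapsed embeddings pay the margin on each of the 12 terms (about 2.4 in total). This is the behavior the contrastive term is meant to provide: identical embeddings are penalized rather than rewarded.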

Long short-term memory (LSTM) [2] is a recurrent neural network that incorporates built-in memory cells to store information and exploit long-range context. As shown above, the basic model for sentence-image multimodal embedding uses an LSTM to take in a sequence of word features. The embedding vector of a sentence is then the hidden state of the LSTM at the last time step. This is the encoder model applied in [8], which achieved state-of-the-art performance in bidirectional sentence and image retrieval tasks at that time.

2.3 Related Works

A large amount of work has been devoted to learning multimodal representations of text and images. Popular approaches include word-image embedding [16, 1] and sentence-image embedding [4, 14, 8]. These are the foundation of our work. Besides embedding approaches, deep Boltzmann machines [15], log-bilinear neural language models [7], autoencoders [13] and other approaches have also been proposed.

For the problem of sentence-image multimodal embedding, different architectures have been applied, and (3) appears to be the general choice of objective function. The deep visual-semantic embedding model [1] embeds sentences as the mean of their word embeddings. The semantic dependency tree recursive neural network [14] uses a tree structure to embed at the sentence level. In deep fragment embeddings [6], descriptions are represented as a bag of dependency parses; fragment objectives are added and object detection is integrated into the framework. [8] introduced the basic LSTM model (MNLM(-vgg)) illustrated in Figure 1, which has been our main reference.

3 Method

3.1 Instructive pairwise rank loss

Based on the LSTM architecture, a natural way to embed an image I and a sentence S together is to set the image embedding vector $x$ as the initial hidden state $h_0$ of the LSTM and take the word features of S as its input. The final hidden state then carries combined information of I and S, and is denoted by $v^{(x)}$ (see footnote 1). However, used directly as an embedding, it squares the time and space requirements, and the combined embedding has no straightforward meaning of its own.

[Figure 2 diagram; labels: Image feature, x, h_0, LSTM 1, LSTM 2, $\langle \hat{h} \rangle$, v, $v^{(x)}$, Sentence, Loss 1, Loss 2, Loss, +0.6]

Figure 2: Model after applying the instructive objective (Model 1). Loss 1 is (3) and loss 2 is (4). The FC layer for image features and the word embedding step are omitted. $\langle \hat{h} \rangle$ denotes the whole sequence of hidden state vectors of LSTM 1, of which v is the last.

Footnote 1: We use this notation from here on, where the main symbol and the superscript represent the input and the $h_0$ of the LSTM respectively.
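To make the combined embedding concrete, here is a small NumPy sketch of an LSTM whose initial hidden state can be set to an image embedding. It follows one possible reading of Figure 2, in which LSTM 2 consumes the hidden-state sequence of LSTM 1; that wiring is an assumption, as are the helper names (`init_lstm`, `run_lstm`), the random initialisation, the zero initial cell state and all dimensions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(input_dim, hidden_dim, rng):
    """Random parameters for one LSTM layer (hypothetical initialisation)."""
    return {
        "W": rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim)),   # input weights
        "U": rng.normal(0.0, 0.1, (4 * hidden_dim, hidden_dim)),  # recurrent weights
        "b": np.zeros(4 * hidden_dim),
    }

def run_lstm(params, inputs, h0=None):
    """Run the LSTM over `inputs` (T x input_dim) and return all hidden
    states (T x K). `h0` is the initial hidden state; zeros if omitted."""
    K = params["U"].shape[1]
    h = np.zeros(K) if h0 is None else h0.copy()
    c = np.zeros(K)  # assumed: the cell state always starts at zero
    hidden = []
    for x_t in inputs:
        z = params["W"] @ x_t + params["U"] @ h + params["b"]
        i = sigmoid(z[:K])           # input gate
        f = sigmoid(z[K:2 * K])      # forget gate
        o = sigmoid(z[2 * K:3 * K])  # output gate
        cand = np.tanh(z[3 * K:])    # candidate cell update
        c = f * c + i * cand
        h = o * np.tanh(c)
        hidden.append(h)
    return np.stack(hidden)

rng = np.random.default_rng(0)
word_dim, K, T = 8, 6, 4
lstm1 = init_lstm(word_dim, K, rng)
lstm2 = init_lstm(K, K, rng)

words = rng.normal(size=(T, word_dim))  # word features of one sentence
x = rng.normal(size=K)                  # image embedding, i.e. after the FC layer

h_seq = run_lstm(lstm1, words)          # whole hidden sequence of LSTM 1
v = h_seq[-1]                           # sentence embedding v
v_x = run_lstm(lstm2, h_seq, h0=x)[-1]  # combined embedding v^(x)
```

Here `v` depends on the sentence only, while `v_x` changes whenever the image embedding passed as `h0` changes, which is what lets it serve as a shared target for both modalities.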

We turn this idea into a new instructive way of constructing the objective (loss) function:

$$ \min_\theta \sum_{(x, v)} g\left(x, v^{(x)}, v_k^{(x)}\right) + g\left(v, v^{(x)}, v^{(x_k)}\right) \tag{4} $$

where exactly one contrastive term $(x_k, v_k)$ is randomly chosen for each $(x, v)$ pair and is resampled every epoch, whereas in (3) a set of contrastive terms is shared by the training set. The change saves computational resources and time.

3.2 Combination of two layers of image features

Different layers of a CNN are believed to store different levels of features, and hence different levels of knowledge about an image [18]. Using two different layers of the CNN as image features should therefore give a better representation of images, and consequently of sentences. Our final model is illustrated below.

[Figure 3 diagram; labels: Image feature 1, Image feature 2, $x_1$, $x_2$, h_0, LSTM 1, LSTM 2, LSTM 3, $\langle \hat{h} \rangle_1$, $\langle \hat{h} \rangle_2$, v, $v^{(x)}$, Sentence, Loss 1, Loss 2, Loss, +0.6]

Figure 3: Final model (Model 2). Notations are similar to Model 1. FC layers and the word embedding step are omitted.

In detail, image features 1 and 2 are extracted from the FC6 and FC7 layers of vgg19 respectively. $\langle \hat{h} \rangle_1$ and $\langle \hat{h} \rangle_2$ represent the whole sequences of hidden states of LSTM 1 and LSTM 2 respectively. $v^{(x)}$ is actually $v^{(x_1, x_2)}$, the instructive vector with features from $v$, $x_1$ and $x_2$. Final image embedding vectors are taken as the concatenation of $x_1$ and $x_2$, and final sentence embedding vectors are taken as the concatenation of $v$ and $v$. Loss 1 is (3) for the final embedding vectors, and loss 2 is a straightforward extension of (4), namely

$$
\begin{aligned}
\min_\theta \sum_{(x_1, x_2, v)} \;
& g\left(x_1, v^{(x_1, x_2)}, v_k^{(x_1, x_2)}\right) + g\left(x_1, v^{(x_1, x_2)}, v^{(x_1, x_{2k})}\right) \\
+\; & g\left(x_2, v^{(x_1, x_2)}, v_k^{(x_1, x_2)}\right) + g\left(x_2, v^{(x_1, x_2)}, v^{(x_{1k}, x_2)}\right) \\
+\; & g\left(v, v^{(x_1, x_2)}, v^{(x_1, x_{2k})}\right) + g\left(v, v^{(x_1, x_2)}, v^{(x_{1k}, x_2)}\right)
\end{aligned}
\tag{5}
$$

where exactly one contrastive triple $(x_{1k}, x_{2k}, v_k)$ is randomly chosen for each triple $(x_1, x_2, v)$ from the training set and is resampled every epoch.
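The instructive objective (4) can be sketched as follows; the two-layer objective repeats the same pattern with more terms. The encoder is abstracted as a callable `combine(x, s)` that returns $v^{(x)}$. The `toy_combine` stand-in below is purely hypothetical and is not an LSTM; the margin `alpha=0.2` and all data are likewise assumptions for illustration.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity, equation (1)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def g(a, b, c, alpha=0.2):
    """Contrastive score, equation (2); alpha is an assumed margin."""
    return max(0.0, alpha + cos_sim(a, c) - cos_sim(a, b))

def instructive_loss(X, V, S, combine, rng, alpha=0.2):
    """Equation (4), with one contrastive pair k drawn per training pair i.

    X[i], V[i]: image and sentence embeddings of pair i.
    S[i]: the sentence input (word features) of pair i.
    combine(x, s): returns v^(x), i.e. sentence s encoded with h_0 = x.
    """
    n = len(X)
    total = 0.0
    for i in range(n):
        k = int(rng.choice([j for j in range(n) if j != i]))
        v_x = combine(X[i], S[i])    # v^(x): the matching combined embedding
        vk_x = combine(X[i], S[k])   # v_k^(x): contrastive sentence, same image
        v_xk = combine(X[k], S[i])   # v^(x_k): same sentence, contrastive image
        total += g(X[i], v_x, vk_x, alpha) + g(V[i], v_x, v_xk, alpha)
    return total

def toy_combine(x, s):
    """Hypothetical stand-in for the LSTM encoder: mixes x with the mean word."""
    return np.tanh(x + s.mean(axis=0))

rng = np.random.default_rng(0)
n, T, K = 4, 3, 5
S = rng.normal(size=(n, T, K))  # toy word features, already K-dimensional
X = rng.normal(size=(n, K))     # toy image embeddings
V = S.mean(axis=1)              # toy sentence embeddings
loss = instructive_loss(X, V, S, toy_combine, rng)
print(loss)
```

Because only one contrastive index is drawn per pair, the cost per epoch is three encoder calls per training pair, which matches the stated motivation of saving computation relative to a shared contrastive set.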
This added complexity, as shown in the next section, does lead to better performance in retrieval experiments.

4 Experiments

As mentioned in the introduction, bidirectional sentence and image retrieval experiments provide quantitative comparisons, while the exploration of linguistic regularities provides insight into the properties of the embedding space.

4.1 Datasets

Our models are trained and evaluated on the public image-sentence dataset Flickr8K [3], which consists of 8000 images collected from Flickr. Each image is accompanied by 5 descriptive sentences. A standard training, validation and test split is provided. We attempted to use Flickr30K [17] but were hindered by hardware limits.

4.2 Sentence retrieval and image retrieval

Sentence retrieval requires ranking the sentences in a given dataset according to their similarity to a given image, while image retrieval requires ranking the images in a given dataset according to their similarity to a given sentence [3]. They are sometimes alternatively referred to as image annotation and image retrieval, respectively.

Two types of quantitative evaluation are performed. Recall@K (K = 1, 5, 10) is the fraction of images for which the correct description is ranked within the top K retrieved results (and vice versa for sentences; higher is better), and Med r is the median rank of the closest ground-truth result in the ranked list (lower is better).

Sentence and image retrieval were not designed specifically for the multimodal embedding problem, and in fact several different frameworks suit these experiments. For example, [10] treated the problem as multimodal matching and used a CNN-based pipeline, with state-of-the-art performance on the Flickr30K dataset. [9] employed Fisher vectors and other advanced approaches, with state-of-the-art performance on the Flickr8K dataset. It is therefore appropriate to focus our comparison on embedding models. More specifically, our main comparison is MNLM-vgg [8], whose architecture is our basic model in Section 2.2. Our Model 2 outperforms all other embedding models listed in the first part of the table below.

Table 1: Sentence retrieval and image retrieval experiments.

[Table 1 layout. Columns: Sentence Retrieval (R@1, R@5, R@10, Med r) and Image Retrieval (R@1, R@5, R@10, Med r). Rows, first part: Random ranking; DeViSE [1]; SDT-RNN [14]; DeFrag [6]; MNLM [8]; MNLM-vgg [8]; Model 1; Model 2. Rows, second part: m-RNN [11]; FV (GMM + HGLMM) [9]; m-CNN_ENG-vgg [10].]

4.3 Linguistic regularities

[8] noted that the LSTM encoder is less apt for discovering linguistic regularities than a linear representation of sentences, i.e. $v = \sum_{i=1}^{N} w_i$, where $w_i$ are the word embeddings of the sentence. However, we still find certain linguistic regularities in the trained multimodal embedding space. Formally, given a query image $x$, a negative word $w_n$ and a positive word $w_p$ (all with unit norm), we seek an image

$$ x^* = \arg\max_{x'} s(x', \; x - w_n + w_p). \tag{6} $$

By retrieving the top-5 nearest images and picking out the most reasonable one by hand, we select some typical examples below. The only result not within the top-5 retrievals is at row 3, indicating the difficulty of action detection. The successful retrieval at row 1 is not easy, given the fairly small set of car images in the dataset. Row 4 is interesting as it preserves the scene of the image (snow). The last row demonstrates a retrieval based on a single word.

Table 2: Linguistic regularities. Each row takes a query image x and returns an image x*.

row  w_n (-)   w_p (+)   type
1    yellow    blue      color
2    grass     snow      place
3    riding    walking   action
4    people    dog       object
5    (single word: sitting)

It is observed that the original image $x$ often resides within the top-5 nearest images to $x - w_n + w_p$; the reason might be that $x$ is distinctive in its own direction. More work is needed in this area.

5 Discussion

Two observations lead to an instructive objective as well as a refined LSTM-based model for sentence-image embedding. We argue that this instructive idea, i.e. a combined embedding of an image and a sentence, is potentially useful in other settings, which we leave for interested readers to explore.
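The evaluation protocol of Section 4 can be sketched with a few NumPy helpers: Recall@K and Med r from Section 4.2, and the regularity query (6) from Section 4.3. All function names and the toy vectors are assumptions made for illustration. The sketch also assumes that word vectors live in the same K-dimensional space as the image embeddings.

```python
import numpy as np

def normalize(M):
    """Scale each row to unit L2 norm, so dot products are cosine similarities."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def ranks_of_truth(queries, candidates, truth):
    """1-based rank of the best-ranked ground-truth candidate for each query.
    truth[i] lists the candidate indices that are correct for query i."""
    sims = normalize(queries) @ normalize(candidates).T
    order = np.argsort(-sims, axis=1)  # best candidate first
    ranks = []
    for i, correct in enumerate(truth):
        positions = np.nonzero(np.isin(order[i], correct))[0]
        ranks.append(int(positions.min()) + 1)
    return np.array(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose ground truth appears within the top k."""
    return float(np.mean(ranks <= k))

def median_rank(ranks):
    """Med r: median rank of the closest ground-truth result."""
    return float(np.median(ranks))

def regularity_query(images, x, w_n, w_p, top=5):
    """Indices of the `top` images nearest to x - w_n + w_p, as in (6)."""
    target = x - w_n + w_p
    sims = normalize(images) @ (target / np.linalg.norm(target))
    return np.argsort(-sims)[:top]

# Toy sentence retrieval: three images, one correct sentence each.
images = np.eye(3)
sentences = np.array([[1.0, 0.0, 0.0],
                      [0.0, 0.5, 0.9],
                      [0.0, 0.8, 0.6]])
truth = [[0], [1], [2]]
ranks = ranks_of_truth(images, sentences, truth)
r_at_1 = recall_at_k(ranks, 1)
med_r = median_rank(ranks)

# Toy regularity query: remove direction e0 from image 0 and add direction e1.
nearest = regularity_query(images, images[0], images[0], images[1], top=1)
```

In the toy data only the first image retrieves its own sentence first, so Recall@1 is one third and the median rank is 2. Swapping the roles of `queries` and `candidates` gives the image retrieval direction.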

Going from word embedding to sentence embedding, a two-step embedding method is applied. In fact, it is popular to first parse words into semantic fragments and then compose them into a sentence [14]. Embedding at different levels is an interesting topic to explore.

Regarding image features, attention-based models and object detection models can be integrated into the framework, enriching the features captured. Alignments between parts of captions and images can significantly strengthen the model [5].

Image captioning and sentence-image embedding are two important problems, where embedding can be viewed as the first part of an encoder-decoder model that generates captions. In return, image captioning equipped with one unimodal embedding might be able to learn a multimodal embedding. This needs more thorough consideration.

References

[1] FROME, A., CORRADO, G. S., SHLENS, J., BENGIO, S., DEAN, J., MIKOLOV, T., ET AL. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (2013).
[2] HOCHREITER, S., AND SCHMIDHUBER, J. Long short-term memory. Neural Computation 9, 8 (1997).
[3] HODOSH, M., YOUNG, P., AND HOCKENMAIER, J. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Int. Res. 47, 1 (May 2013).
[4] KARPATHY, A., AND FEI-FEI, L. Deep visual-semantic alignments for generating image descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015).
[5] KARPATHY, A., AND FEI-FEI, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
[6] KARPATHY, A., JOULIN, A., AND LI, F. F. F. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems (2014).
[7] KIROS, R., SALAKHUTDINOV, R., AND ZEMEL, R. Multimodal neural language models. Eprint Arxiv (2014).
[8] KIROS, R., SALAKHUTDINOV, R., AND ZEMEL, R. S. Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs/ (2014).
[9] KLEIN, B., LEV, G., SADEH, G., AND WOLF, L. Associating neural word embeddings with deep image representations using Fisher vectors.
[10] MA, L., LU, Z., SHANG, L., AND LI, H. Multimodal convolutional neural networks for matching image and sentence. Computer Science (2015).
[11] MAO, J., XU, W., YANG, Y., WANG, J., AND YUILLE, A. L. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv: (2014).
[12] MIKOLOV, T., CHEN, K., CORRADO, G., AND DEAN, J. Efficient estimation of word representations in vector space. CoRR abs/ (2013).
[13] NGIAM, J., KHOSLA, A., KIM, M., NAM, J., LEE, H., AND NG, A. Y. Multimodal deep learning. In International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July (2011).
[14] SOCHER, R., KARPATHY, A., LE, Q. V., MANNING, C. D., AND NG, A. Y. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2 (2014).
[15] SRIVASTAVA, N., AND SALAKHUTDINOV, R. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research 15, 8 (2014).
[16] WESTON, J., BENGIO, S., AND USUNIER, N. Large scale image annotation: Learning to rank with joint word-image embeddings. In European Conference on Machine Learning (2010).
[17] YOUNG, P., LAI, A., HODOSH, M., AND HOCKENMAIER, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Nlp.cs.illinois.edu (2014).
[18] YU, W., YANG, K., YAO, H., SUN, X., AND XU, P. Exploiting the complementary strengths of multi-layer CNN features for image retrieval. Neurocomputing (2016).

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Zelun Luo Department of Computer Science Stanford University zelunluo@stanford.edu Te-Lin Wu Department of

More information

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction by Noh, Hyeonwoo, Paul Hongsuck Seo, and Bohyung Han.[1] Presented : Badri Patro 1 1 Computer Vision Reading

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text 16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning Spring 2018 Lecture 14. Image to Text Input Output Classification tasks 4/1/18 CMU 16-785: Integrated Intelligence in Robotics

More information

Semantic image search using queries

Semantic image search using queries Semantic image search using queries Shabaz Basheer Patel, Anand Sampat Department of Electrical Engineering Stanford University CA 94305 shabaz@stanford.edu,asampat@stanford.edu Abstract Previous work,

More information

Image Captioning with Object Detection and Localization

Image Captioning with Object Detection and Localization Image Captioning with Object Detection and Localization Zhongliang Yang, Yu-Jin Zhang, Sadaqat ur Rehman, Yongfeng Huang, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

More information

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Boya Peng Department of Computer Science Stanford University boya@stanford.edu Zelun Luo Department of Computer

More information

arxiv: v1 [cs.cv] 17 Nov 2016

arxiv: v1 [cs.cv] 17 Nov 2016 Instance-aware Image and Sentence Matching with Selective Multimodal LSTM arxiv:1611.05588v1 [cs.cv] 17 Nov 2016 An old man with his bag and dog is sitting on the bench beside the road and grass Yan Huang

More information

Novel Image Captioning

Novel Image Captioning 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Grounded Compositional Semantics for Finding and Describing Images with Sentences

Grounded Compositional Semantics for Finding and Describing Images with Sentences Grounded Compositional Semantics for Finding and Describing Images with Sentences R. Socher, A. Karpathy, V. Le,D. Manning, A Y. Ng - 2013 Ali Gharaee 1 Alireza Keshavarzi 2 1 Department of Computational

More information

End-To-End Spam Classification With Neural Networks

End-To-End Spam Classification With Neural Networks End-To-End Spam Classification With Neural Networks Christopher Lennan, Bastian Naber, Jan Reher, Leon Weber 1 Introduction A few years ago, the majority of the internet s network traffic was due to spam

More information

CAP 6412 Advanced Computer Vision

CAP 6412 Advanced Computer Vision CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong Feb 04, 2016 Today Administrivia Attention Modeling in Image Captioning, by Karan Neural networks & Backpropagation

More information

A PARALLEL-FUSION RNN-LSTM ARCHITECTURE FOR IMAGE CAPTION GENERATION

A PARALLEL-FUSION RNN-LSTM ARCHITECTURE FOR IMAGE CAPTION GENERATION A PARALLEL-FUSION RNN-LSTM ARCHITECTURE FOR IMAGE CAPTION GENERATION Minsi Wang, Li Song, Xiaokang Yang, Chuanfei Luo Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University

More information

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio Presented

More information

Instance-aware Image and Sentence Matching with Selective Multimodal LSTM

Instance-aware Image and Sentence Matching with Selective Multimodal LSTM Instance-aware Image and Sentence Matching with Selective Multimodal LSTM Yan Huang 1,3 Wei Wang 1,3 Liang Wang 1,2,3 1 Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

arxiv: v1 [cs.cv] 14 Jul 2017

arxiv: v1 [cs.cv] 14 Jul 2017 Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding Fu Li, Chuang Gan, Xiao Liu, Yunlong Bian, Xiang Long, Yandong Li, Zhichao Li, Jie Zhou, Shilei Wen Baidu IDL & Tsinghua University

More information

Autoencoder. Representation learning (related to dictionary learning) Both the input and the output are x

Autoencoder. Representation learning (related to dictionary learning) Both the input and the output are x Deep Learning 4 Autoencoder, Attention (spatial transformer), Multi-modal learning, Neural Turing Machine, Memory Networks, Generative Adversarial Net Jian Li IIIS, Tsinghua Autoencoder Autoencoder Unsupervised

More information

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong TABLE I CLASSIFICATION ACCURACY OF DIFFERENT PRE-TRAINED MODELS ON THE TEST DATA

More information

Image Captioning with Attention

Image Captioning with Attention ing with Attention Blaine Rister (blaine@stanford.edu), Dieterich Lawson (jdlawson@stanford.edu) 1. Introduction In the past few years, neural networks have fueled dramatic advances in image classication.

More information

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 1 LSTM for Language Translation and Image Captioning Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 2 Part I LSTM for Language Translation Motivation Background (RNNs, LSTMs) Model

More information

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University LSTM and its variants for visual recognition Xiaodan Liang xdliang328@gmail.com Sun Yat-sen University Outline Context Modelling with CNN LSTM and its Variants LSTM Architecture Variants Application in

More information

Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks

Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks Alberto Montes al.montes.gomez@gmail.com Santiago Pascual TALP Research Center santiago.pascual@tsc.upc.edu Amaia Salvador

More information

VISION & LANGUAGE From Captions to Visual Concepts and Back

VISION & LANGUAGE From Captions to Visual Concepts and Back VISION & LANGUAGE From Captions to Visual Concepts and Back Brady Fowler & Kerry Jones Tuesday, February 28th 2017 CS 6501-004 VICENTE Agenda Problem Domain Object Detection Language Generation Sentence

More information

Multi-Glance Attention Models For Image Classification

Multi-Glance Attention Models For Image Classification Multi-Glance Attention Models For Image Classification Chinmay Duvedi Stanford University Stanford, CA cduvedi@stanford.edu Pararth Shah Stanford University Stanford, CA pararth@stanford.edu Abstract We

More information

Multimodal Learning. Victoria Dean. MIT 6.S191 Intro to Deep Learning IAP 2017

Multimodal Learning. Victoria Dean. MIT 6.S191 Intro to Deep Learning IAP 2017 Multimodal Learning Victoria Dean Talk outline What is multimodal learning and what are the challenges? Flickr example: joint learning of images and tags Image captioning: generating sentences from images

More information

Empirical Evaluation of RNN Architectures on Sentence Classification Task

Empirical Evaluation of RNN Architectures on Sentence Classification Task Empirical Evaluation of RNN Architectures on Sentence Classification Task Lei Shen, Junlin Zhang Chanjet Information Technology lorashen@126.com, zhangjlh@chanjet.com Abstract. Recurrent Neural Networks

More information

LSTM: An Image Classification Model Based on Fashion-MNIST Dataset

LSTM: An Image Classification Model Based on Fashion-MNIST Dataset LSTM: An Image Classification Model Based on Fashion-MNIST Dataset Kexin Zhang, Research School of Computer Science, Australian National University Kexin Zhang, U6342657@anu.edu.au Abstract. The application

More information

ABC-CNN: Attention Based CNN for Visual Question Answering

ABC-CNN: Attention Based CNN for Visual Question Answering ABC-CNN: Attention Based CNN for Visual Question Answering CIS 601 PRESENTED BY: MAYUR RUMALWALA GUIDED BY: DR. SUNNIE CHUNG AGENDA Ø Introduction Ø Understanding CNN Ø Framework of ABC-CNN Ø Datasets

More information

Object Detection Based on Deep Learning

Object Detection Based on Deep Learning Object Detection Based on Deep Learning Yurii Pashchenko AI Ukraine 2016, Kharkiv, 2016 Image classification (mostly what you ve seen) http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-detection.pdf

More information

MoonRiver: Deep Neural Network in C++

MoonRiver: Deep Neural Network in C++ MoonRiver: Deep Neural Network in C++ Chung-Yi Weng Computer Science & Engineering University of Washington chungyi@cs.washington.edu Abstract Artificial intelligence resurges with its dramatic improvement

More information

Image Captioning and Generation From Text

Image Captioning and Generation From Text Image Captioning and Generation From Text Presented by: Tony Zhang, Jonathan Kenny, and Jeremy Bernstein Mentor: Stephan Zheng CS159 Advanced Topics in Machine Learning: Structured Prediction California

More information

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification Xiaodong Yang, Pavlo Molchanov, Jan Kautz INTELLIGENT VIDEO ANALYTICS Surveillance event detection Human-computer interaction

More information

Deep MIML Network. Abstract

Deep MIML Network. Abstract Deep MIML Network Ji Feng and Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China Collaborative Innovation Center of Novel Software Technology

More information

SocialML: machine learning for social media video creators

SocialML: machine learning for social media video creators SocialML: machine learning for social media video creators Tomasz Trzcinski a,b, Adam Bielski b, Pawel Cyrta b and Matthew Zak b a Warsaw University of Technology b Tooploox firstname.lastname@tooploox.com

More information

arxiv:submit/ [cs.cv] 13 Jan 2018

arxiv:submit/ [cs.cv] 13 Jan 2018 Benchmark Visual Question Answer Models by using Focus Map Wenda Qiu Yueyang Xianzang Zhekai Zhang Shanghai Jiaotong University arxiv:submit/2130661 [cs.cv] 13 Jan 2018 Abstract Inferring and Executing

More information

Deep Learning Applications

Deep Learning Applications October 20, 2017 Overview Supervised Learning Feedforward neural network Convolution neural network Recurrent neural network Recursive neural network (Recursive neural tensor network) Unsupervised Learning

More information

Multimodal Medical Image Retrieval based on Latent Topic Modeling

Multimodal Medical Image Retrieval based on Latent Topic Modeling Multimodal Medical Image Retrieval based on Latent Topic Modeling Mandikal Vikram 15it217.vikram@nitk.edu.in Suhas BS 15it110.suhas@nitk.edu.in Aditya Anantharaman 15it201.aditya.a@nitk.edu.in Sowmya Kamath

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information

arxiv: v1 [cs.cv] 20 Dec 2016

arxiv: v1 [cs.cv] 20 Dec 2016 End-to-End Pedestrian Collision Warning System based on a Convolutional Neural Network with Semantic Segmentation arxiv:1612.06558v1 [cs.cv] 20 Dec 2016 Heechul Jung heechul@dgist.ac.kr Min-Kook Choi mkchoi@dgist.ac.kr

END-TO-END CHINESE TEXT RECOGNITION

END-TO-END CHINESE TEXT RECOGNITION. Jie Hu 1, Tszhang Guo 1, Ji Cao 2, Changshui Zhang 1 1 Department of Automation, Tsinghua University 2 Beijing SinoVoice Technology November 15, 2017 Presentation at

On the Efficiency of Recurrent Neural Network Optimization Algorithms

On the Efficiency of Recurrent Neural Network Optimization Algorithms. Ben Krause, Liang Lu, Iain Murray, Steve Renals University of Edinburgh Department of Informatics s17005@sms.ed.ac.uk, llu@staffmail.ed.ac.uk,

Automatic Video Description Generation via LSTM with Joint Two-stream Encoding

Automatic Video Description Generation via LSTM with Joint Two-stream Encoding. Chenyang Zhang and Yingli Tian Department of Electrical Engineering The City College of New York New York, New York 10031

Learning Deep Structure-Preserving Image-Text Embeddings

Learning Deep Structure-Preserving Image-Text Embeddings. Liwei Wang Yin Li Svetlana Lazebnik lwang97@illinois.edu yli440@gatech.edu University of Illinois at Urbana-Champaign slazebni@illinois.edu Georgia

Improving Face Recognition by Exploring Local Features with Visual Attention

Improving Face Recognition by Exploring Local Features with Visual Attention. Yichun Shi and Anil K. Jain Michigan State University Difficulties of Face Recognition Large variations in unconstrained face

DCU-UvA Multimodal MT System Report

DCU-UvA Multimodal MT System Report. Iacer Calixto ADAPT Centre School of Computing Dublin City University Dublin, Ireland iacer.calixto@adaptcentre.ie Desmond Elliott ILLC University of Amsterdam Science

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv:1411.2539v1 [cs.LG] 10 Nov 2014 Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel University of Toronto Canadian Institute

A Novel Representation and Pipeline for Object Detection

A Novel Representation and Pipeline for Object Detection. Vishakh Hegde Stanford University vishakh@stanford.edu Manik Dhar Stanford University dmanik@stanford.edu Abstract Object detection is an important

Detecting Bone Lesions in Multiple Myeloma Patients using Transfer Learning

Detecting Bone Lesions in Multiple Myeloma Patients using Transfer Learning. Matthias Perkonigg 1, Johannes Hofmanninger 1, Björn Menze 2, Marc-André Weber 3, and Georg Langs 1 1 Computational Imaging Research

Towards Building Large-Scale Multimodal Knowledge Bases

Towards Building Large-Scale Multimodal Knowledge Bases. Dihong Gong Advised by Dr. Daisy Zhe Wang Knowledge Itself is Power --Francis Bacon Analytics Social Robotics Knowledge Graph Nodes Represent entities

Generative Adversarial Text to Image Synthesis

Generative Adversarial Text to Image Synthesis. Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee Presented by: Jingyao Zhan Contents Introduction Related Work Method

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks. Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

Deep Similarity Learning for Multimodal Medical Images

Deep Similarity Learning for Multimodal Medical Images. Xi Cheng, Li Zhang, and Yefeng Zheng Siemens Corporation, Corporate Technology, Princeton, NJ, USA Abstract. An effective similarity measure for multi-modal

Project Final Report

Project Final Report. Ye Tian Stanford University yetian@stanford.edu Tianlun Li Stanford University tianlunl@stanford.edu Abstract We plan to do image-to-sentence generation. This application bridges vision

Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning

Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning. V. Zhong, C. Xiong, R. Socher Salesforce Research arXiv:1709.00103 Reviewed by: Bill Zhang University of Virginia

Towards End-to-End Audio-Sheet-Music Retrieval

Towards End-to-End Audio-Sheet-Music Retrieval. Matthias Dorfer, Andreas Arzt and Gerhard Widmer Department of Computational Perception Johannes Kepler University Linz Altenberger Str. 69, A-4040 Linz matthias.dorfer@jku.at

FastText. Jon Koss, Abhishek Jindal

FastText. Jon Koss, Abhishek Jindal. FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy, but it is way faster: FastText can train on more than one billion words

Learning Image Conditioned Label Space for Multilabel Classification

Learning Image Conditioned Label Space for Multilabel Classification (arXiv v1 [cs.CV], 21 Feb 2018). Yi-Nan Li and Mei-Chen Yeh Department of Computer Science and Information Engineering National Taiwan Normal University myeh@csie.ntnu.edu.com

Sentiment Classification of Food Reviews

Sentiment Classification of Food Reviews. Hua Feng Department of Electrical Engineering Stanford University Stanford, CA 94305 fengh15@stanford.edu Ruixi Lin Department of Electrical Engineering Stanford

Part Localization by Exploiting Deep Convolutional Networks

Part Localization by Exploiting Deep Convolutional Networks. Marcel Simon, Erik Rodner, and Joachim Denzler Computer Vision Group, Friedrich Schiller University of Jena, Germany www.inf-cv.uni-jena.de Abstract.

Bidirectional Recurrent Convolutional Networks for Video Super-Resolution

Bidirectional Recurrent Convolutional Networks for Video Super-Resolution. Qi Zhang & Yan Huang Center for Research on Intelligent Perception and Computing (CRIPAC) National Laboratory of Pattern Recognition

Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network

Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network. Feng Mao [0000-0001-6171-3168], Xiang Wu [0000-0003-2698-2156], Hui Xue, and Rong Zhang Alibaba Group, Hangzhou, China

Generating Natural Video Descriptions via Multimodal Processing

Generating Natural Video Descriptions via Multimodal Processing. INTERSPEECH 2016, September 8-12, 2016, San Francisco, USA. Qin Jin 1,2, Junwei Liang 3, Xiaozhu Lin 1 1 Multimedia Computing Lab, School of

Learning Two-Branch Neural Networks for Image-Text Matching Tasks

Learning Two-Branch Neural Networks for Image-Text Matching Tasks. Liwei Wang, Yin Li, Jing Huang, Svetlana Lazebnik arXiv:1704.03470v3 [cs.CV] 28 Dec 2017 Abstract Image-language matching tasks have

CEA LIST's participation to the Scalable Concept Image Annotation task of ImageCLEF 2015

CEA LIST's participation to the Scalable Concept Image Annotation task of ImageCLEF 2015. Etienne Gadeski, Hervé Le Borgne, and Adrian Popescu CEA, LIST, Laboratory of Vision and Content Engineering, France

Deep Learning for Program Analysis. Lili Mou January, 2016

Deep Learning for Program Analysis. Lili Mou, January 2016. Outline Introduction Background Deep Neural Networks Real-Valued Representation Learning Our Models Building Program Vector Representations for

Machine Learning 13. week

Machine Learning 13. week. Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

Contextual Dropout. Sam Fok

Contextual Dropout: Finding subnets for subtasks. Sam Fok samfok@stanford.edu Abstract The feedforward networks widely used in classification are static and have no means for leveraging information about

Generating Images from Captions with Attention. Elman Mansimov Emilio Parisotto Jimmy Lei Ba Ruslan Salakhutdinov

Generating Images from Captions with Attention. Elman Mansimov Emilio Parisotto Jimmy Lei Ba Ruslan Salakhutdinov Reasoning, Attention, Memory workshop, NIPS 2015 Motivation To simplify the image modelling

Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform. Xintao Wang Ke Yu Chao Dong Chen Change Loy

Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform. Xintao Wang Ke Yu Chao Dong Chen Change Loy Problem enlarge 4 times Low-resolution image High-resolution image Previous

Content-Based Image Recovery

Content-Based Image Recovery. Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material. Yi Li 1, Gu Wang 1, Xiangyang Ji 1, Yu Xiang 2, and Dieter Fox 2 1 Tsinghua University, BNRist 2 University of Washington

Learning Two-Branch Neural Networks for Image-Text Matching Tasks

Learning Two-Branch Neural Networks for Image-Text Matching Tasks. Liwei Wang, Yin Li, Svetlana Lazebnik arXiv:1704.03470v2 [cs.CV] 18 Apr 2017 Abstract This paper investigates two-branch neural networks

Combining Selective Search Segmentation and Random Forest for Image Classification

Combining Selective Search Segmentation and Random Forest for Image Classification. Gediminas Bertasius November 24, 2013 1 Problem Statement Random Forest algorithm have been successfully used in many

Layerwise Interweaving Convolutional LSTM

Layerwise Interweaving Convolutional LSTM. Tiehang Duan and Sargur N. Srihari Department of Computer Science and Engineering The State University of New York at Buffalo Buffalo, NY 14260, United States

YOLO9000: Better, Faster, Stronger

YOLO9000: Better, Faster, Stronger. Date: January 24, 2018 Prepared by Haris Khan (University of Toronto) Haris Khan CSC2548: Machine Learning in Computer Vision 1 Overview 1. Motivation for one-shot object

Word2vec and beyond. presented by Eleni Triantafillou. March 1, 2016

Word2vec and beyond. Presented by Eleni Triantafillou, March 1, 2016. The Big Picture There is a long history of word representations Techniques from information retrieval: Latent Semantic Analysis (LSA)

ECCV Presented by: Boris Ivanovic and Yolanda Wang CS 331B - November 16, 2016

ECCV 2016. Presented by: Boris Ivanovic and Yolanda Wang, CS 331B - November 16, 2016. Fundamental Question What is a good vector representation of an object? Something that can be easily predicted from 2D

CROSS-MODAL TRANSFER WITH NEURAL WORD VECTORS FOR IMAGE FEATURE LEARNING

CROSS-MODAL TRANSFER WITH NEURAL WORD VECTORS FOR IMAGE FEATURE LEARNING. Go Irie, Taichi Asami, Shuhei Tarashima, Takayuki Kurozumi, Tetsuya Kinebuchi NTT Corporation ABSTRACT Neural word vector (NWV)

Natural Language Person Search Using Deep Reinforcement Learning

Natural Language Person Search Using Deep Reinforcement Learning (arXiv v1 [cs.CV], 2 Sep 2018). Ankit Shah Language Technologies Institute Carnegie Mellon University aps1@andrew.cmu.edu Tyler Vuong Electrical and Computer Engineering

Recurrent Memory Addressing for describing videos

Recurrent Memory Addressing for describing videos (arXiv v2 [cs.CV], 23 Mar 2017). Arnav Kumar Jain Abhinav Agarwalla Kumar Krishna Agrawal Pabitra Mitra Indian Institute of Technology Kharagpur {arnavkj95, abhinavagarawalla, kumarkrishna,

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

Exploiting noisy web data for large-scale visual recognition

Exploiting noisy web data for large-scale visual recognition. Lamberto Ballan University of Padova, Italy CVPRW WebVision - Jul 26, 2017 Datasets drive computer vision progress ImageNet Slide credit: O.

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

Channel Locality Block: A Variant of Squeeze-and-Excitation

Channel Locality Block: A Variant of Squeeze-and-Excitation. 1st Huayu Li Northern Arizona University Flagstaff, United States Northern Arizona University hl459@nau.edu arXiv:1901.01493v1 [cs.LG] 6 Jan

Conditional Random Fields as Recurrent Neural Networks

Conditional Random Fields as Recurrent Neural Networks. BIL722 - Deep Learning for Computer Vision. S. Zheng, S. Jayasumana, B. Romera-Paredes V. Vineet, Z. Su, D. Du, C. Huang, P.H.S. Torr Introduction

Learning Subclass Representations for Visually-varied Image Classification

Learning Subclass Representations for Visually-varied Image Classification (arXiv v1 [cs.MM], 12 Jan 2016). Xinchao Li, Peng Xu, Yue Shi, Martha Larson, Alan Hanjalic Multimedia Information Retrieval Lab, Delft University of Technology

Gradient of the lower bound

Gradient of the lower bound Weakly Supervised with Latent PhD advisor: Dr. Ambedkar Dukkipati Department of Computer Science and Automation gaurav.pandey@csa.iisc.ernet.in Objective Given a training set that comprises image and image-level

Hide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization

Hide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization. Krishna Kumar Singh and Yong Jae Lee University of California, Davis ---- Paper Presentation Yixian

Image Caption with Global-Local Attention

Image Caption with Global-Local Attention. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). Linghui Li, 1,2 Sheng Tang, 1, Lixi Deng, 1,2 Yongdong Zhang, 1 Qi Tian 3

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION. Kingsley Kuan 1, Gaurav Manek 1, Jie Lin 1, Yuan Fang 1, Vijay Chandrasekhar 1,2 Institute for Infocomm Research, A*STAR, Singapore 1 Nanyang Technological

Multimodal topic model for texts and images utilizing their embeddings

Multimodal topic model for texts and images utilizing their embeddings. Nikolay Smelik, smelik@rain.ifmo.ru Andrey Filchenkov, afilchenkov@corp.ifmo.ru Computer Technologies Lab IDP-16. Barcelona, Spain,

CS231N Section. Video Understanding 6/1/2018

CS231N Section. Video Understanding 6/1/2018. Outline Background / Motivation / History Video Datasets Models Pre-deep learning CNN + RNN 3D convolution Two-stream What we've seen in class so far... Image

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM. Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

Object Boundary Guided Semantic Segmentation

Object Boundary Guided Semantic Segmentation. Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arXiv:1603.09742v1 [cs.CV] 31 Mar 2016 University of Southern California Abstract.

Recurrent Neural Networks

Recurrent Neural Networks. Javier Béjar Deep Learning 2018/2019 Fall Master in Artificial Intelligence (FIB-UPC) Introduction Sequential data Many problems are described by sequences Time series Video/audio

RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation

RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation. Xirong Li 1, Qin Jin 1, Shuai Liao 1, Junwei Liang 1, Xixi He 1, Yujia Huo 1, Weiyu Lan 1, Bin Xiao 2, Yanxiong Lu

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia Image understanding Classification error Convolution Neural Networks 0.3 0.2 0.1 Image Classification [Krizhevsky

Internet of things that video

Internet of things that video Video recognition from a sentence Cees Snoek Intelligent Sensory Information Systems Lab University of Amsterdam The Netherlands Internet of things that video 45 billion cameras by 2022 [LDV Capital] 2

Lecture 7: Semantic Segmentation

Lecture 7: Semantic Segmentation Semantic Segmentation CSED703R: Deep Learning for Visual Recognition (207F) Segmenting images based on its semantic notion Lecture 7: Semantic Segmentation Bohyung Han Computer Vision Lab. bhhanpostech.ac.kr
