Image-Sentence Multimodal Embedding with Instructive Objectives


Jianhao Wang, Shunyu Yao
IIIS, Tsinghua University
{jh-wang15,

Abstract

Encoding images and sentences into a joint vector space is the first step of the encoder-decoder model for image captioning, and also an important and natural problem in itself. With image and word features pretrained, we present some novel ideas for learning a multimodal embedding space for images and sentences. Setting image feature vectors as the initial hidden layer of an LSTM provides combined embeddings of images and sentences, and therefore new training objectives based on them. Two different layers of image features from a CNN are combined to better represent the information in an image. In experiments, the encoder enables us to rank sentences for a given image and vice versa, and some quantitative improvements in the results are shown. Finally, some interesting multimodal linguistic regularities in the embedding space are shown.

1 Introduction

Image captioning is an important yet challenging task that involves both image recognition and description generation. It serves as a bridge between visual and language information, and is therefore a foundation of search, retrieval and QA systems for images. As an instance of the general translation problem, image captioning can be cast into the framework of encoder-decoder models. For the encoder, a joint vector space for embedding images and sentences is needed.

Embedding is also a popular and important approach to multimodal representation learning in its own right, with the following advantages. First, the representations of texts and images are independent, which allows unimodal use. Second, the embedding approach motivates further exploration of the embedding space, such as linguistic regularities and other operations on vectors. Third, embedding is relatively simple and flexible to extend.
Various applications can be built upon it, such as the bidirectional sentence and image retrieval discussed later. These advantages put embedding in a favorable position for us to work on.

In this paper, we aim to present some new ideas for training the multimodal embedding based on the use of LSTM. Two main observations lead to a refined model:

1. Setting image feature vectors as the initial hidden layer of the LSTM and sentences as the input of the LSTM provides combined embeddings of images and sentences, which can instruct the learning of image embeddings and sentence embeddings.
2. Different layers of a CNN contain different levels of knowledge about the input image.

The experimentation framework introduced in [3] provides evaluation metrics for image retrieval and sentence retrieval. Our new ideas improve these quantitative measures over the benchmarks in [8]. In addition, to qualitatively analyze the properties of the embedding space trained on images and sentences, we look for linguistic regularities, e.g. *image of man in red* - red + blue ≈ *image of man in blue*. This indicates that the multimodal embedding vector space has a good intrinsic structure left to explore and make use of.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Backgrounds

In image-sentence multimodal embedding, image features are generally extracted by a deep convolutional network (CNN), while word representations are obtained by the skip-gram model or the continuous bag-of-words model [12]. Sentences can be encoded as the final hidden states of an LSTM, and image feature vectors can be embedded into the space of LSTM hidden states via an affine transformation. The pairwise rank loss, formulated below, is generally used as the objective.

2.1 Pairwise rank loss

For training the multimodal embedding, image-description pairs are given. Images are represented by the top layer (before the softmax) of a CNN trained on the ImageNet classification task, e.g. OxfordNet. Let $D$ be the dimensionality of an image feature vector, $K$ the dimensionality of the embedding space and $V$ the number of words in the vocabulary. Let the image embedding matrix be $W_I \in \mathbb{R}^{K \times D}$ and the word embedding matrix be $W_T \in \mathbb{R}^{K \times V}$. For an image embedding vector $x \in \mathbb{R}^K$ and a sentence embedding vector $v \in \mathbb{R}^K$, define a scoring function (cosine similarity)

$$ s(x, v) = \frac{x \cdot v}{\|x\| \, \|v\|} \tag{1} $$

and a contrastive scoring function

$$ g(a, b, c) = \max\{0,\ \alpha + s(a, c) - s(a, b)\} \tag{2} $$

where $b$ is a matching term for $a$, $c$ is a contrastive term for $a$, and $\alpha \in [0, 1]$ is a constant threshold. During training, we optimize

$$ \min_{\theta} \sum_{(x, v)} \sum_{k} g(x, v, v_k) + g(v, x, x_k) \tag{3} $$

where $\{(x_k, v_k)\}$ is a small set of randomly chosen contrastive terms, resampled every epoch, and $\theta$ denotes all the parameters to learn. It is easy to see that cosine similarity between corresponding pairs is encouraged and cosine similarity between non-corresponding pairs is discouraged. The contrastive design of the rank loss is vital because it keeps different information apart.
For example, the objective $\min_{\theta} \sum_{(x, v)} \sum_{k} -s(x, v)$, which only rewards matching pairs, will not work because in practice all the embedding vectors become identical.

2.2 Basic model via LSTM

[Figure 1: Basic model MNLM [8]. Diagram: Image feature -> FC -> x; Sentence -> sequence of word features -> LSTM -> v; x and v -> Loss.] Image features are from the FC7 layer of vgg19. FC denotes a fully-connected layer. x and v denote the image and sentence embeddings respectively. Loss is (3).
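As a concrete illustration of the pairwise rank loss of Section 2.1, the following is a minimal sketch of equations (1) to (3) in plain Python. The function names, the toy vectors and the margin value `alpha = 0.2` are assumptions made for illustration only; the paper states only that α lies in [0, 1] and does not prescribe an implementation.

```python
import math

def cos_sim(a, b):
    """Scoring function s(a, b) of equation (1): cosine similarity."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def contrastive(a, b, c, alpha=0.2):
    """g(a, b, c) of equation (2): b matches a, c is a contrastive term.

    alpha=0.2 is an assumed margin, not a value from the paper.
    """
    return max(0.0, alpha + cos_sim(a, c) - cos_sim(a, b))

def pairwise_rank_loss(pairs, contrast, alpha=0.2):
    """Objective (3): sum over matching pairs (x, v) and contrastive (x_k, v_k)."""
    total = 0.0
    for x, v in pairs:
        for x_k, v_k in contrast:
            total += contrastive(x, v, v_k, alpha)  # wrong sentence for image x
            total += contrastive(v, x, x_k, alpha)  # wrong image for sentence v
    return total

# Toy 2-dimensional embeddings (hypothetical values).
pairs = [([1.0, 0.0], [0.9, 0.1])]
contrast_far = [([0.0, 1.0], [0.1, 0.9])]    # clearly different content
contrast_near = [([1.0, 0.1], [0.9, 0.0])]   # almost the same content

loss_far = pairwise_rank_loss(pairs, contrast_far)
loss_near = pairwise_rank_loss(pairs, contrast_near)
print(loss_far, loss_near)
```

In this toy setting the loss is zero when the contrastive terms are already far from the matching pair, and positive when they are nearly as similar as the true match, which is the behaviour the margin in (2) is meant to produce.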

Long short-term memory [2] is a recurrent neural network that incorporates built-in memory cells to store information and exploit long-range context. As shown above, the basic model for sentence-image multimodal embedding uses an LSTM to take in a sequence of word features. The embedding vector for a sentence is then the hidden state of the LSTM at the last time step. This is the encoder model applied in [8], which achieved state-of-the-art performance in bidirectional sentence and image retrieval tasks at that time.

2.3 Related Works

A large amount of work has been devoted to learning multimodal representations of text and images. Popular approaches include word-image embedding [16, 1] and sentence-image embedding [4, 14, 8]. These are the foundation of our work. Other than embedding approaches, deep Boltzmann machines [15], log-bilinear neural language models [7], autoencoders [13] and other approaches have also been proposed.

For the problem of sentence-image multimodal embedding, different architectures have been applied, and (3) appears to be the general choice of objective function. The deep visual-semantic embedding model [1] embeds sentences as the mean of their word embeddings. The semantic dependency tree recursive neural network [14] uses a tree structure to embed at the sentence level. In deep fragment embeddings [6], descriptions are represented as a bag of dependency parses; fragment objectives are added and object detection is integrated into the framework. [8] introduced the basic LSTM model (MNLM(-vgg)) illustrated in Figure 1, which has been our main reference.

3 Method

3.1 Instructive pairwise rank loss

Based on the LSTM architecture, a natural way to embed an image $I$ and a sentence $S$ together is to set the image embedding vector $x$ as the initial hidden layer $h_0$ of the LSTM and take the word features of $S$ as the input of the LSTM, yielding the final hidden layer as an output with combined information of $I$ and $S$, denoted by $v^{(x)}$ (see footnote 1). However, computing such a combined embedding for every image-sentence pair makes the time and space cost quadratic, and the combined embedding has no straightforward meaning on its own.

[Figure 2: Model after applying the instructive objective (Model 1). Diagram labels: Image feature, x, h_0, LSTM 1, LSTM 2, <ĥ>, v, v^(x), Loss 1, Loss 2, +0.6, Loss, Sentence.] Loss 1 is (3) and Loss 2 is (4). The FC layer for image features and the word embedding step are omitted. <ĥ> denotes the whole sequence of hidden layer vectors produced by LSTM 1, of which v is the last vector.

Footnote 1: We will use such notation from here on, where the main body and the superscript represent the input and the $h_0$ of the LSTM respectively.
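To make the notation $v^{(x)}$ concrete, here is a minimal sketch of an LSTM encoder whose initial hidden layer can be set from outside. It follows the textual description in Section 3.1 (word features as input, image embedding as $h_0$) rather than the exact two-LSTM wiring of Figure 2. Everything here is an assumption for illustration: the helper names `init_lstm` and `lstm_encode`, the random untrained weights, the toy dimensions, and the choice to start the memory cell at zero, which the paper does not specify.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, u):
    """Return W u + b for a matrix W stored as a list of rows."""
    return [sum(w * ui for w, ui in zip(row, u)) + bi for row, bi in zip(W, b)]

def init_lstm(input_dim, hidden_dim, seed=0):
    """Random (untrained) parameters for the gates i, f, o and candidate g."""
    rng = random.Random(seed)
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
    return {name: (mat(), [0.0] * hidden_dim) for name in "ifog"}

def lstm_encode(params, words, h0):
    """Run the LSTM over word vectors; return the final hidden layer."""
    h = list(h0)
    c = [0.0] * len(h0)  # assumption: the memory cell starts at zero
    for w in words:
        u = list(w) + h  # concatenate current input and previous hidden layer
        i = [sigmoid(z) for z in affine(*params["i"], u)]
        f = [sigmoid(z) for z in affine(*params["f"], u)]
        o = [sigmoid(z) for z in affine(*params["o"], u)]
        g = [math.tanh(z) for z in affine(*params["g"], u)]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h

K = 3                                            # toy embedding dimension
params = init_lstm(input_dim=2, hidden_dim=K)
sentence = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]]  # toy word features
x = [0.3, -0.2, 0.1]                             # toy image embedding

v = lstm_encode(params, sentence, [0.0] * K)     # sentence embedding v
v_x = lstm_encode(params, sentence, x)           # combined embedding v^(x)
print(v, v_x)
```

The only difference between `v` and `v_x` is the initial hidden layer, which is how the superscript in $v^{(x)}$ is meant to be read.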

We turn this idea into a new instructive way of constructing the objective (loss) function:

$$ \min_{\theta} \sum_{(x, v)} g\left(x, v^{(x)}, v_k^{(x)}\right) + g\left(v, v^{(x)}, v^{(x_k)}\right) \tag{4} $$

where exactly one contrastive term $(x_k, v_k)$ is randomly chosen for each $(x, v)$ pair and is resampled every epoch, whereas in (3) a set of contrastive terms is shared by the whole training set. This change saves computational resources and time.

3.2 Combination of two layers of image features

Different layers of a CNN are believed to store different layers of features and hence different levels of knowledge about an image [18]. Employing two different layers of the CNN as image features should therefore result in a better representation for images and consequently for sentences. Our final model is illustrated below.

[Figure 3: Final model (Model 2). Diagram labels: Image feature 1, Image feature 2, x_1, x_2, h_0, LSTM 1, LSTM 2, LSTM 3, <ĥ>_1, <ĥ>_2, v, v^(x), Loss 1, Loss 2, +0.6, Loss, Sentence.] Notations are similar to Model 1. FC layers and the word embedding step are omitted.

In detail, image features 1 and 2 are extracted from the FC6 and FC7 layers of vgg19 respectively. <ĥ>_1 and <ĥ>_2 represent the whole sequences of hidden layers of LSTM 1 and LSTM 2 respectively. $v^{(x)}$ is actually $v^{(x_1, x_2)}$, the instructive vector with features from $v$, $x_1$ and $x_2$. Final image embedding vectors are taken as the concatenation of $x_1$ and $x_2$, and final sentence embedding vectors are taken as the concatenation of v and v. Loss 1 is (3) applied to the final embedding vectors, and Loss 2 is a straightforward extension of (4), namely

$$
\begin{aligned}
\min_{\theta} \sum_{(x_1, x_2, v)} \Big(\; & g\left(x_1, v^{(x_1, x_2)}, v_k^{(x_1, x_2)}\right) + g\left(x_1, v^{(x_1, x_2)}, v^{(x_1, x_{2k})}\right) \\
+\; & g\left(x_2, v^{(x_1, x_2)}, v_k^{(x_1, x_2)}\right) + g\left(x_2, v^{(x_1, x_2)}, v^{(x_{1k}, x_2)}\right) \\
+\; & g\left(v, v^{(x_1, x_2)}, v^{(x_1, x_{2k})}\right) + g\left(v, v^{(x_1, x_2)}, v^{(x_{1k}, x_2)}\right) \Big)
\end{aligned}
\tag{5}
$$

where exactly one contrastive triple $(x_{1k}, x_{2k}, v_k)$ is randomly chosen for each triple $(x_1, x_2, v)$ from the training set and is resampled every epoch. This complication, as shown in the next section, does lead to better performance in retrieval experiments.

4 Experiments

As mentioned in the introduction, bidirectional sentence and image retrieval experiments provide quantitative comparisons, while the exploration of linguistic regularities provides insight into the properties of the embedding space.
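The quantitative comparisons in this section rely on Recall@K and the median rank (Med r), both defined in Section 4.2. The following minimal sketch shows one way to compute them from a matrix of similarity scores. The function names, the score matrix and the ground-truth sets are hypothetical and chosen only to make the arithmetic easy to follow; they are not the paper's data or evaluation code.

```python
from statistics import median

def rank_of_truth(scores, truth):
    """1-based rank of the best-placed ground-truth candidate for one query.

    scores[j] is the similarity between the query and candidate j;
    truth is the set of indices of the correct candidates.
    """
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return min(order.index(j) for j in truth) + 1

def recall_at_k(ranks, k):
    """Fraction of queries whose closest ground truth is within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Med r: median rank of the closest ground-truth result (low is good)."""
    return median(ranks)

# Toy example: 3 queries scored against 4 candidates (hypothetical values).
scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.5, 0.6, 0.1, 0.7],
]
truth = [{0}, {1}, {2}]

ranks = [rank_of_truth(s, t) for s, t in zip(scores, truth)]
print(ranks)                  # [1, 2, 4]
print(recall_at_k(ranks, 1))  # 1 of 3 queries has the truth at rank 1
print(median_rank(ranks))     # 2
```

Using a set for `truth` covers the sentence retrieval case, where each image has several correct descriptions and only the best-ranked one counts.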

4.1 Datasets

Our models are trained and evaluated on the public image-sentence dataset Flickr8K [3], which consists of 8000 images collected from Flickr. Each image is accompanied by 5 descriptive sentences. A standard training, validation and test split is provided. We attempted to use Flickr30K [17] but were hindered by hardware limits.

4.2 Sentence retrieval and image retrieval

The problem of sentence retrieval is to rank the sentences in a given dataset according to their similarity to a given image, while the problem of image retrieval is to rank the images in a given dataset according to their similarity to a given sentence [3]. They are sometimes alternatively referred to as image annotation and image retrieval, respectively.

Two types of quantitative evaluation are performed. Recall@K (K = 1, 5, 10) is the fraction of images for which a correct description is ranked within the top K retrieved results (and vice versa for sentences; high is good), and Med r is the median rank of the closest ground-truth result in the ranked list (low is good).

Sentence and image retrieval were not designed specifically for the multimodal embedding problem, and in fact several different frameworks suit these experiments. For example, [10] tackled the problem as a multimodal matching one and used a CNN-based pipeline, with state-of-the-art performance on the Flickr30K dataset. [9] employed Fisher vectors and other advanced approaches, with state-of-the-art performance on the Flickr8K dataset. It is therefore appropriate to focus our comparisons on embedding models. More specifically, our main comparison is MNLM-vgg [8], whose architecture is our basic model in Section 2.2. Our Model 2 outperforms all other embedding models listed in the first part of the table below.

Table 1: Sentence retrieval and image retrieval experiments.
                          Sentence Retrieval           Image Retrieval
                          R@1  R@5  R@10  Med r        R@1  R@5  R@10  Med r
Random ranking
DeViSE [1]
SDT-RNN [14]
DeFrag [6]
MNLM [8]
MNLM-vgg [8]
Model 1
Model 2

m-RNN [11]
FV (GMM + HGLMM) [9]
m-CNN_ENG-vgg [10]

(The numeric entries of Table 1 are not preserved in this copy.)

4.3 Linguistic regularities

[8] noted that an LSTM encoder is less suited to discovering linguistic regularities than a linear representation of sentences, i.e. $v = \sum_{i=1}^{N} w_i$, where the $w_i$ are the word embeddings of the sentence. However, we still find certain linguistic regularities in the trained multimodal embedding space. Formally, given a query image $x$, a negative word $w_n$ and a positive word $w_p$ (all with unit norm), we seek an image

$$ x^{*} = \arg\max_{x_1} s\left(x_1,\ x - w_n + w_p\right). \tag{6} $$

By retrieving the top-5 nearest images and picking out the most reasonable one by hand, we select some typical examples below.

Table 2: Linguistic regularities. (The query image x and the retrieved image x* of each row are not reproduced.)

Row   type     w_n       w_p
1     color    -yellow   +blue
2     place    -grass    +snow
3     action   -riding   +walking
4     object   -people   +dog
5                        sitting

The only result not within the top-5 retrievals is in row 3, indicating the difficulty of action detection. The successful retrieval in row 1 is not easy given the fairly small number of images containing cars in the dataset. Row 4 is interesting as it preserves the scene of the images (snow). The last row demonstrates a retrieval based on a single word. We observe that the original image $x$ often appears among the top-5 nearest images to $x - w_n + w_p$; the reason might be that $x$ is distinctive in its own direction. More work is needed in this area.

5 Discussion

Two observations lead to an instructive objective as well as a refined model for sentence-image embedding based on LSTM. We argue that this instructive idea, i.e. a combined embedding of an image and a sentence, may be useful in other situations, which we leave for interested readers to explore.

From word embedding to sentence embedding, a two-step embedding method is applied. In fact, it is popular to first parse words into semantic fragments and then into a sentence [14]. Embedding on different levels is an interesting topic to explore.

Regarding image features, attention-based models and object detection models can be integrated into the framework, enriching the quality of the features captured. Alignments between parts of captions and images can significantly empower the model [5].

Image captioning and sentence-image embedding are two important problems, where embedding can be viewed as the first part of an encoder-decoder model to generate captions. In return, image captioning equipped with one unimodal embedding might be able to learn a multimodal embedding. Thorough considerations are still needed.

References

[1] FROME, A., CORRADO, G. S., SHLENS, J., BENGIO, S., DEAN, J., MIKOLOV, T., ET AL. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (2013), pp
[2] HOCHREITER, S., AND SCHMIDHUBER, J. Long short-term memory. Neural Computation 9, 8 (1997),
[3] HODOSH, M., YOUNG, P., AND HOCKENMAIER, J. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Int. Res. 47, 1 (May 2013),
[4] KARPATHY, A., AND FEI-FEI, L. Deep visual-semantic alignments for generating image descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015).
[5] KARPATHY, A., AND FEI-FEI, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp
[6] KARPATHY, A., JOULIN, A., AND LI, F. F. F. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems (2014), pp
[7] KIROS, R., SALAKHUTDINOV, R., AND ZEMEL, R. Multimodal neural language models. Eprint Arxiv (2014),
[8] KIROS, R., SALAKHUTDINOV, R., AND ZEMEL, R. S. Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs/ (2014).
[9] KLEIN, B., LEV, G., SADEH, G., AND WOLF, L. Associating neural word embeddings with deep image representations using Fisher vectors
[10] MA, L., LU, Z., SHANG, L., AND LI, H. Multimodal convolutional neural networks for matching image and sentence. Computer Science (2015),
[11] MAO, J., XU, W., YANG, Y., WANG, J., AND YUILLE, A. L. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv: (2014).
[12] MIKOLOV, T., CHEN, K., CORRADO, G., AND DEAN, J. Efficient estimation of word representations in vector space. CoRR abs/ (2013).
[13] NGIAM, J., KHOSLA, A., KIM, M., NAM, J., LEE, H., AND NG, A. Y. Multimodal deep learning. In International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July (2011), pp
[14] SOCHER, R., KARPATHY, A., LE, Q. V., MANNING, C. D., AND NG, A. Y. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2 (2014),
[15] SRIVASTAVA, N., AND SALAKHUTDINOV, R. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research 15, 8 (2014),
[16] WESTON, J., BENGIO, S., AND USUNIER, N. Large scale image annotation: Learning to rank with joint word-image embeddings. In European Conference on Machine Learning (2010).
[17] YOUNG, P., LAI, A., HODOSH, M., AND HOCKENMAIER, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Nlp.cs.illinois.edu (2014).
[18] YU, W., YANG, K., YAO, H., SUN, X., AND XU, P. Exploiting the complementary strengths of multi-layer CNN features for image retrieval. Neurocomputing (2016).
