Temporal-difference Learning with Sampling Baseline for Image Captioning

Hui Chen, Guiguang Ding, Sicheng Zhao, Jungong Han
School of Software, Tsinghua University, Beijing 100084, China
School of Computing and Communications, Lancaster University, Lancaster, LA1 4YW, UK

Abstract

The existing methods for image captioning usually train the language model under the cross-entropy loss, which results in exposure bias and an inconsistency with the evaluation metric. Recent research has shown that these two issues can be well addressed by the policy gradient method from the reinforcement learning domain, attributable to its unique capability of directly optimizing discrete and non-differentiable evaluation metrics. In this paper, we utilize a reinforcement learning method to train the image captioning model. Specifically, we train our image captioning model to maximize the overall reward of the sentences by adopting the temporal-difference (TD) learning method, which takes the correlation between temporally successive actions into account. In this way, we assign different values to different words in one sampled sentence through a discount coefficient when back-propagating the gradient with the REINFORCE algorithm, enabling the correlation between actions to be learned. Besides, instead of estimating a baseline to normalize the rewards with another network, we utilize the reward of another Monte-Carlo sample as the baseline to avoid high variance. We show that our proposed method can improve the quality of generated captions and outperforms the state-of-the-art methods on the benchmark MS COCO dataset in terms of seven evaluation metrics.

Introduction

Scene understanding is one of the ultimate goals of computer vision. Image captioning aims at automatically generating reasonable captions for images, which is of great importance to scene understanding. It is a challenging task, not only because the captioning model must be capable of recognizing which objects are in the image, but also because it must be powerful enough to understand the semantic relationships among the objects and describe them properly in natural language. It is also of great significance to enable machines to mimic the human ability to express rich visual information in descriptive language, and the task thus attracts much attention from academic researchers and industrial companies.

This research was supported by the National Natural Science Foundation of China (Grant Nos. , ), the Royal Society Newton Mobility Grant (IE150997) and the Project Funded by China Postdoctoral Science Foundation (No. 2017M610897). Corresponding authors: Guiguang Ding and Jungong Han. Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Inspired by the machine translation domain, recent works focus on deep-network-based, end-to-end methods, mainly under the encoder-decoder framework. In general, recurrent neural networks (RNN), especially long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997), are employed as the decoder to generate captions (Vinyals et al. 2015; Jin et al. 2015; Xu et al. 2015; You et al. 2016; Zhao et al. 2017) on the basis of visual features of the image extracted by a CNN. These models are usually trained to maximize the likelihood of the next ground-truth word given the previous ground-truth words. However, this method leads to a problem called exposure bias (Ranzato et al.
2015), since at test time the model uses the word sampled from its own predictions as the next input, instead of the ground-truth word. The second problem is the inconsistency between the objective optimized at training time and the evaluation metrics used at test time: the training procedure lowers the cross-entropy loss, while a generated sentence is evaluated with discrete and non-differentiable NLP metrics such as BLEU, ROUGE, CIDEr, and METEOR. These two problems limit the ability of the model to understand the image and describe it with descriptive sentences. It has been shown that reinforcement learning (RL) can provide a solution to the two issues identified above, and several works have explored incorporating reinforcement learning into image captioning. (Ranzato et al. 2015) proposed a novel sequence-level training procedure using the policy gradient method. (Rennie et al. 2017) adopted the same loss function as (Ranzato et al. 2015) but modelled the baseline slightly differently, proposing a self-critical training method that uses the caption generated by the test-time inference algorithm as the baseline. (Liu et al. 2016) employed the same method as (Ranzato et al. 2015) to produce the baseline, and their main contribution lies in using Monte Carlo rollouts to approximate the value function. Despite their better performance, especially compared to the non-RL approaches, these works still have some shortcomings. For example, (Rennie et al. 2017) and (Ranzato et al. 2015) both implicitly assume that every word in one sampled sequence makes the same contribution to the reward, which is clearly not reasonable in general. (Liu et al. 2016) estimated the baseline reward by simply adopting an MLP to learn it from the state vector of the RNN, as Ranzato et al. did.

This method usually exhibits high variance, thus making the training unstable. In this paper, we apply the temporal-difference method (Sutton 1988) to model the RL value function, instead of Monte Carlo rollouts, because the Monte Carlo rollout method only learns from observed values, meaning that the value cannot be obtained until the sequence is finished. In contrast, the temporal-difference method assumes that there are correlations between temporally successive actions, so it can estimate the value of an action from the previously learned estimates of the successive actions, in the spirit of dynamic programming. Since the context of a sentence is strongly correlated, we assume that temporal-difference learning is more appropriate for modelling the value function. Besides, to reduce the variance during model training, we also use a baseline, as suggested by (Rennie et al. 2017), where the caption generated by the test-time inference algorithm serves as the baseline caption. However, we notice that the baseline of (Rennie et al. 2017) cannot approximate the value function well, because the test-time inference algorithm tends to pick a fairly good sentence that is, in most cases, better than a sentence sampled from the model distribution. Instead, we generate two sentences, both sampled from the model distribution, based on the idea that the qualities of actions sampled from the same distribution under a multinomial sampling policy are close in terms of probability. Therefore, we adopt one of the two sentences as the baseline sequence and apply the temporal-difference method. Overall, the contributions of this paper are three-fold: (1) we directly optimize the evaluation metrics during training through a temporal-difference method in reinforcement learning, where actions at different time steps have different impacts on the model; (2) to avoid high variance during training, we employ a novel baseline modelling method that computes the baseline from a sequence sampled from the same distribution as the sequence used for the gradient; (3) we conduct extensive experiments and comparisons with other methods, and the results demonstrate that the proposed method has a significant superiority over the state-of-the-art methods.

Related Work

The literature on image captioning can be divided into three categories based on the way sequences are generated (Jia et al. 2015): template-based methods (Farhadi et al. 2010; Kulkarni et al. 2011; Elliott and Keller 2013), transfer-based methods (Gong et al. 2014; Devlin et al. 2015; Mao et al. 2015) and neural network-based methods. Since the proposed method adopts the same framework as the neural network-based methods, we mainly introduce the related work on image captioning with them. The neural network-based methods draw inspiration from machine translation (Schwenk 2012; Cho et al. 2014), where two RNNs are used as the encoder and the decoder respectively. Vinyals et al. (2015) replaced the RNN encoder with a deep CNN and adopted an LSTM to decode the image vector into a sentence. This work achieved a reasonable result, and many subsequent works followed this idea and studied it further. Xu et al. (2015) applied the attention mechanism to the image captioning task, in which the decoder functions like the human eye, focusing its attention on different regions of the image at each time step. Lu et al.
(2017) improved the attention model by introducing a visual sentinel, allowing the attention module to adaptively attend to the visual regions. You et al. (2016) proposed a semantic attention model which selectively attends to semantic concept regions by fusing the global image feature and the semantic attribute features from an attribute detector. Chen et al. (2017a) proposed a spatial and channel-wise attention model to attend to both image features and visual regions adaptively. Recently, researchers have made efforts to incorporate reinforcement learning into the standard encoder-decoder framework to address the exposure bias and non-differentiable metric issues. Specifically, (Ranzato et al. 2015) used the REINFORCE algorithm (Williams 1992) and proposed a novel sequence-level training method that directly optimizes the non-differentiable test metric. (Liu et al. 2016) applied the policy gradient algorithm in the training procedure for image captioning models, in which the words sampled from the current model at each time step were awarded different future rewards obtained by averaging the rewards of several Monte-Carlo samples. A simple MLP was used to produce an estimate of the future reward, and this estimate was in turn treated as the baseline to reduce the variance. Self-critical sequence training (SCST) (Rennie et al. 2017) also adopted the policy gradient algorithm, but differs from (Liu et al. 2016) in that SCST simply runs the forward process twice and obtains two sequences, one generated by running the inference algorithm used at test time and the other sampled with the multinomial strategy. SCST takes the reward of the sequence from the inference algorithm as a baseline to reduce the training variance. (Ranzato et al. 2015; Rennie et al. 2017) simply assume that each word is equally important to the reward of the sentence, so that each word obtains the same gradient during back-propagation; this assumption is not reasonable in general. Lu et al. (2017) find, by applying an adaptive attention model, that the model tends to attend to visual words like "red", "horse" and "bus" more than to non-visual words such as "of" and "a", which is indeed in accordance with the human attention schema. Chen et al. (2017c) show that assigning different weights to different words helps the model be aware of the different importance of words in a sentence and enhances the model's ability to generate high-quality captions. (Liu et al. 2016) trains an extra MLP based on the output of the LSTM units to estimate the baseline, turning the MLP into an estimator over the action space. However, an MLP does not seem to be a good estimator, since the action space can be enormous; it may cause high variance and thus make the training unstable. In contrast, our method allows the captioning model to learn different values for different words through temporal-difference learning. Besides, we employ a sampling baseline strategy to keep the training stable with low variance.

Figure 1: The framework of the proposed model, including two parts: the encoder (in the blue rectangle) and the decoder (in the red rectangle). The top and bottom LSTMs share the same parameters. The right arrows denote the forward operation and the left arrows denote the backward operation. $W^s = (w^s_1, w^s_2, \ldots, w^s_T)$ and $W^{s'} = (w^{s'}_1, w^{s'}_2, \ldots, w^{s'}_T)$ are two sequences sampled from the model under the multinomial policy. $r^s$ and $r^{s'}$ are the rewards of $W^s$ and $W^{s'}$, respectively. $\gamma$ is the discount coefficient in the temporal-difference method. $s_t$ is the output of the softmax function.

Methodology

Encoder-Decoder framework

Given an image $I$, the image captioning model needs to generate a caption sequence $W = \{w_1, w_2, \ldots, w_T\}$, $w_t \in D$, where $D$ is the vocabulary dictionary. We adopt the standard CNN-RNN architecture for image captioning. The CNN, which can be seen as an encoder, encodes an input image into a vector. The RNN functions as a decoder aiming to generate the captions given the image feature. Here, we use an LSTM (Hochreiter and Schmidhuber 1997) as the decoder. During generation, the LSTM generates a word at each time step conditioned on the previously generated word $w_{t-1}$, the previous hidden state $h_{t-1}$ and the previous cell state $c_{t-1}$, which contains the context information the LSTM has seen. The LSTM updates its hidden units and cells as follows:

$$
\begin{aligned}
x_{-1} &= \mathrm{CNN}(I), \qquad x_0 = E(w_0), \qquad x_t = E(w_t) \\
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) && \text{(output gate)} \\
c_t &= i_t \odot \phi(W_{zx} x_t + W_{zh} h_{t-1} + b_c) + f_t \odot c_{t-1} \\
h_t &= o_t \odot \tanh(c_t) \\
q_t &= W_{qh} h_t
\end{aligned}
\qquad (1)
$$

where $w_0$ is a special token indicating the start of the sequence, $\mathrm{CNN}(I)$ is the feature extractor for image $I$, and $E(\cdot)$ is the embedding function which maps the one-hot representation of a word into the embedding semantic space. We initialize $c_0$ and $h_0$ to the zero vector. A distribution over the next word $w_t$ is then produced by the softmax function:

$$w_t \sim \mathrm{Softmax}(q_t) \qquad (2)$$

The likelihood of a word $w_t$ at time step $t$ is given by the conditional probability conditioned on the input image $I$ and the previous words $w_0, w_1, \ldots, w_{t-1}$: $p(w_t \mid I, w_0, w_1, \ldots, w_{t-1})$. So the probability of a generated sequence $W = (w_0, w_1, w_2, \ldots, w_T)$ given the input image $I$ is the product of the conditional probabilities of the words:

$$p(W \mid I) = \prod_{t=1}^{T} p(w_t \mid I, w_0, w_1, \ldots, w_{t-1}) \qquad (3)$$

The Show and Tell paper (Vinyals et al. 2015) uses the cross-entropy loss (XENT) to train the whole network. The XENT loss maximizes the probability of the description $W$ generated by the model, i.e., it minimizes:

$$L = -\sum_{t=1}^{T} \log p(w_t \mid I, w_0, w_1, \ldots, w_{t-1}) \qquad (4)$$

The XENT loss leads the model to generate the word with the highest posterior probability at each time step $t$, without considering the quality of the whole sequence at test time, causing a phenomenon called search error (Ranzato et al. 2015).
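
To make the decoding step and the XENT objective concrete, below is a minimal PyTorch-style sketch of the decoder described by Eqs. (1)-(4). The class name CaptionDecoder, the dimensions, and the use of nn.LSTMCell in place of the explicit gate equations are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptionDecoder(nn.Module):
    """Minimal sketch of the CNN-LSTM decoder of Eqs. (1)-(4); names and sizes are illustrative."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # maps CNN(I) to the step -1 input x_{-1}
        self.embed = nn.Embedding(vocab_size, embed_dim)  # E(.), word embedding
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)    # plays the role of the gate equations in Eq. (1)
        self.out = nn.Linear(hidden_dim, vocab_size)      # q_t = W_qh h_t

    def init_state(self, img_feat):
        """Feed the image feature once (step -1) to initialize (h, c) from zero vectors."""
        zeros = img_feat.new_zeros(img_feat.size(0), self.lstm.hidden_size)
        return self.lstm(self.img_proj(img_feat), (zeros, zeros))

    def forward(self, img_feat, captions):
        """Teacher-forced log-probabilities used by the XENT loss of Eq. (4).

        img_feat: (B, feat_dim) global CNN feature; captions: (B, T) word ids starting with BOS.
        """
        h, c = self.init_state(img_feat)
        logps = []
        for t in range(captions.size(1)):
            h, c = self.lstm(self.embed(captions[:, t]), (h, c))
            logps.append(F.log_softmax(self.out(h), dim=-1))  # distribution over the next word, Eq. (2)
        return torch.stack(logps, dim=1)  # (B, T, vocab_size)


def xent_loss(logps, targets, pad_id=0):
    """Cross-entropy (XENT) loss of Eq. (4): negative log-likelihood of the ground-truth words."""
    logp = logps.gather(2, targets.unsqueeze(2)).squeeze(2)  # (B, T) log p(w_t | I, w_0..w_{t-1})
    mask = (targets != pad_id).float()                       # ignore padding positions
    return -(logp * mask).sum() / mask.sum()
```
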
Temporal-difference learning: TD(λ)

Reinforcement learning can provide solutions for decision-making problems. We consider the image captioning task as a decision-making problem, or a finite Markov decision process (MDP). In the MDP setting, the state can be defined as the information that is known at the current time step. So we consider the state $s_t$ to be a list consisting of the image and the previous words:

$$s_t = \{I, w_0, w_1, \ldots, w_{t-1}\} \qquad (5)$$

The action is the input image or the word generated at a given time step. The parameters of the network, $\theta$, define the policy network $p_\theta$, which produces an action distribution, in other words, the prediction of the next word. The LSTM decoder can be viewed as an agent that takes an action (an image feature or a word) under the guidance of this action distribution.
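
Since the training procedure below repeatedly draws whole captions from the current policy $p_\theta$, the following is a minimal sketch of multinomial sampling built on the hypothetical CaptionDecoder above; the BOS token id, the maximum length, and the choice not to truncate at EOS are simplifying assumptions.

```python
import torch
import torch.nn.functional as F


def sample_caption(decoder, img_feat, bos_id, max_len=20):
    """Draw one caption W^s from the policy p_theta with multinomial sampling.

    Returns the sampled word ids (B, T) and their log-probabilities (B, T); the
    log-probabilities keep gradients so they can be reused in the REINFORCE update.
    For simplicity the sketch does not truncate sequences at EOS.
    """
    h, c = decoder.init_state(img_feat)                          # step -1: image feature
    word = torch.full((img_feat.size(0),), bos_id,
                      dtype=torch.long, device=img_feat.device)  # step 0: BOS token w_0
    words, logps = [], []
    for _ in range(max_len):
        h, c = decoder.lstm(decoder.embed(word), (h, c))
        logp = F.log_softmax(decoder.out(h), dim=-1)             # action distribution p_theta(.|s_t)
        word = torch.multinomial(logp.exp(), num_samples=1).squeeze(1)  # sample the next action
        words.append(word)
        logps.append(logp.gather(1, word.unsqueeze(1)).squeeze(1))      # log p_theta(w_t^s)
    return torch.stack(words, dim=1), torch.stack(logps, dim=1)
```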

After each action $a_t$, the LSTM updates its internal parameters to increase or decrease the probability of taking action $a_t$ according to the reward. The reward is an important element in RL, as it determines the evolution direction of the agent. Here, we define the reward as the score computed by evaluating the generated caption against the corresponding ground-truth sequences under the standard evaluation metrics, such as BLEU-1,2,3,4, CIDEr, METEOR, etc. We denote this reward by $r$ in the following. In reinforcement learning, the agent's task is to maximize the total amount of reward passed from the environment to the agent. For image captioning, the reward cannot be calculated until EOS, a special token indicating the end of the sequence, is generated by the model. Therefore, it is necessary to define a reward for each word. In this paper, we define the reward of each word $w_t$ as follows:

$$r_t = \begin{cases} r & t = T \\ 0 & t < T \end{cases} \qquad (6)$$

where $r$ is the score calculated using the evaluation metric and $T$ is the final time step. The agent aims to maximize the cumulative reward it receives in the long run. For an episode $(a_0, a_1, \ldots, a_T)$, we define the Q-value function $Q(s_t, a_{t+1})$ as a function of the current state $s_t$ of the model and a possible action $a_{t+1}$, estimating the expected future reward. There are many ways to define the Q-value function. (Liu et al. 2016) exploited the Monte Carlo rollout method, in which the model generates many sequences and the average of their rewards is used as the Q-value. In this paper, we instead adopt temporal-difference (TD) learning to estimate the Q-value function. In the n-step TD method, the n-step expected return $G_{t:t+n}$ is defined as the sum of the next $n$ rewards plus the appropriately discounted estimated value of the state $n$ steps later:

$$G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n}) \qquad (7)$$

where $0 \le t \le T - n$. The n-step expected return can be viewed as an n-step backup starting from the current time step $t$. In the TD(λ) method, the Q-value is a weighted average of several n-step backups, in which all weights sum to 1. Specifically, the Q-value in TD(λ) is defined as follows:

$$Q(s_t, a_{t+1}) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n} \qquad (8)$$

Since the length of a generated sequence is limited to $T$ in image captioning, we have:

$$Q(s_t, a_{t+1}) = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t \qquad (9)$$

where $\lambda$ is the trade-off parameter which decides how much the model depends on the full expected return $G_t$. Here, we set $\lambda = 1$ for our image captioning model. Then, with $\lambda = 1$, Eq. (6) and Eq. (7), we have:

$$Q(s_t, a_{t+1}) = \gamma^{T-t-1} r \qquad (10)$$

Now, we define the RL loss function as follows:

$$L(\theta) = -\mathbb{E}_{W^s \sim p_\theta}\Big[\sum_{t=1}^{T} Q(s_t, a_{t+1})\Big] \qquad (11)$$

where $W^s = (w^s_0, w^s_1, \ldots, w^s_T)$ and $w^s_t$ is sampled from the model at time step $t$. The gradient $\nabla_\theta L(\theta)$ can be calculated as in the REINFORCE algorithm (Williams 1992):

$$\nabla_\theta L(\theta) = -\mathbb{E}_{W^s \sim p_\theta}\Big[\sum_{t=1}^{T} Q(s_t, a_{t+1}) \nabla_\theta \log p_\theta(w^s_t)\Big] \qquad (12)$$

In practice, Eq. (12) can be approximated with a single sequence generated by the network with Monte-Carlo sampling for each training sample. So we have:

$$\nabla_\theta L(\theta) = -\sum_{t=1}^{T} Q(s_t, a_{t+1}) \nabla_\theta \log p_\theta(w^s_t) = -\sum_{t=1}^{T} \gamma^{T-t-1} r \, \nabla_\theta \log p_\theta(w^s_t) \qquad (13)$$

The definition of the Q-value above makes this estimator have high variance.
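
As a concrete reading of Eqs. (10) and (13), the sketch below spreads a single sentence-level score $r$ (e.g., the CIDEr-D of the sampled caption) over the words as $Q(s_t, a_{t+1}) = \gamma^{T-t-1} r$ and builds the surrogate loss whose gradient is Eq. (13); the helper names and tensor shapes are assumptions, not the authors' code.

```python
import torch


def discounted_word_values(sentence_score, seq_len, gamma):
    """Per-word values of Eq. (10) under TD(lambda = 1): Q(s_t, a_{t+1}) = gamma^(T-t-1) * r.

    sentence_score: (B,) sentence-level rewards; returns a (B, T) tensor in which
    later words (closer to the terminal reward) are discounted less.
    """
    exponents = torch.arange(seq_len - 1, -1, -1,
                             dtype=sentence_score.dtype, device=sentence_score.device)
    discounts = gamma ** exponents                               # gamma^(T-1), ..., gamma, 1
    return sentence_score.unsqueeze(1) * discounts.unsqueeze(0)  # (B, T)


def reinforce_loss(sample_logps, q_values):
    """Monte-Carlo surrogate loss whose gradient is Eq. (13): minimizing
    -sum_t Q(s_t, a_{t+1}) log p_theta(w_t^s) raises the probability of high-value words."""
    return -(q_values.detach() * sample_logps).sum(dim=1).mean()
```
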
In order to reduce the variance during training, we introduce a baseline. (Rennie et al. 2017) used the reward of the sequence obtained by the current model with the greedy sampling strategy, and (Liu et al. 2016) used an MLP to estimate the baseline reward. In this paper, we introduce a new baseline strategy similar to (Rennie et al. 2017); the difference is that we use a sequence obtained with a multinomial sampling strategy. The gradient function then becomes:

$$\nabla_\theta L(\theta) = -\sum_{t=1}^{T} \gamma^{T-t-1} (r - r_{\mathrm{baseline}}) \nabla_\theta \log p_\theta(w^s_t) \qquad (14)$$

In fact, the two sequences, one for the gradient and the other for the baseline, are both generated by the current network $p_\theta$ with a multinomial sampling strategy. The idea is that the difference between the reward $r$ and $r_{\mathrm{baseline}}$ is small, since they are computed from two sequences sampled from the same distribution; this achieves a lower variance during training than the approach of (Rennie et al. 2017), resulting in more stable parameter updates. According to the chain rule, the final gradient is:

$$\nabla_\theta L(\theta) = \sum_{t=1}^{T} \frac{\partial L(\theta)}{\partial q_t} \frac{\partial q_t}{\partial \theta} \qquad (15)$$

where $q_t$ is the input of the softmax function at time step $t$, and

$$\frac{\partial L(\theta)}{\partial q_t} = -\gamma^{T-t-1} (r - r_{\mathrm{baseline}}) \big(1_{w^s_t} - p_\theta(w_t \mid h_t)\big) \qquad (16)$$
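
Combining the pieces above, here is a minimal sketch of the training objective whose gradient is Eq. (14): two captions are sampled from the same multinomial policy, one acts as the baseline, and the reward difference is discounted per word. The reward_fn hook, the γ value and the reuse of the earlier hypothetical helpers are assumptions.

```python
import torch


def td_sampling_baseline_loss(decoder, img_feat, refs, reward_fn,
                              gamma=0.95, bos_id=1, max_len=20):
    """Surrogate loss whose gradient matches Eq. (14).

    Two captions are drawn from the same multinomial policy: W^s carries the
    gradient and the second sample provides r_baseline. `reward_fn(word_ids, refs)`
    is an assumed hook returning a (B,) tensor of sentence scores (e.g. CIDEr-D);
    the gamma value here is only a placeholder.
    """
    words_s, logps_s = sample_caption(decoder, img_feat, bos_id, max_len)  # W^s, used for the gradient
    with torch.no_grad():
        words_b, _ = sample_caption(decoder, img_feat, bos_id, max_len)    # baseline sample

    r = reward_fn(words_s, refs)           # (B,) reward of W^s
    r_baseline = reward_fn(words_b, refs)  # (B,) reward of the baseline sequence

    # gamma^(T-t-1) * (r - r_baseline), shared over the words of each sentence
    q = discounted_word_values(r - r_baseline, logps_s.size(1), gamma)

    # minimizing this loss raises the probability of words whose sample beat the baseline
    return -(q.detach() * logps_s).sum(dim=1).mean()
```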

The framework of the proposed method is depicted in Figure 1. First, the CNN extracts the feature of the input image. Then the LSTM absorbs the image feature at the beginning (at time step -1) to initialize the hidden state of the language model. Next, at each time step, the LSTM is fed the word sampled from the current model at the previous time step (except at time step 0), until the special token EOS is generated. The model generates two sequences, $W^s$ and $W^{s'}$, sampled under the multinomial policy. The gradient applied to the words of $W^s$ is determined by the difference between the rewards of $W^s$ and $W^{s'}$. This lowers the variance of the gradients and makes the training procedure stable.

Experiments

Dataset and setting

We evaluate our proposed method on the popular MS COCO dataset (Lin et al. 2014). The MS COCO dataset contains 123,287 images labeled with at least 5 captions each, split into training and validation images; MS COCO also provides a test set for online evaluation. Since the standard test set is not public, we use 5000 images for validation, 5000 images for test and the remaining images for training, as in previous works (Xu et al. 2015; You et al. 2016; Chen et al. 2017c), for offline evaluation. We use publicly available code to preprocess the dataset, e.g., pruning infrequent words, and end up with a vocabulary of 9567 different words. We use several metrics, including BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L and CIDEr, to evaluate the proposed method and compare it with other methods. We extract the image features in two different ways. In the first, the image is encoded as a global feature vector of dimension 2048, and during training this feature vector is only fed into the LSTM unit at the beginning. In the second, the full image is encoded with the final convolutional layer of ResNet-101, yielding a feature map, and at each time step this feature map is input into the LSTM units. In the following, we denote models using image features obtained in the first way as FC models, and those using the second way as attention (ATT) models.

Implementation Details

We use ResNet-101 (He et al. 2016) pretrained on ImageNet to encode images. All images are preprocessed as follows: scaling the smaller edge to 256, color normalization, and cropping to a centered rectangle. The decoder is a one-layer LSTM with a hidden state size of 512. The word embedding dimension is fixed to 512. We set the embedding dimension of the image feature to 512 using a linear layer. When training the attention model, the parameter updating of the LSTM follows (Rennie et al. 2017). We train models under the XENT loss using the ADAM optimizer and finetune the CNN from the beginning. We then train the models under the reinforcement loss to optimize the CIDEr-D metric, without finetuning. For all models, the batch size is set to 16 and the model is evaluated every 1K iterations during training. When training under the RL loss, the learning rate of the language model is lowered after 50K iterations and then decreased every 100K iterations; we use the models trained under the XENT loss as pretrained models to reduce the search space. By default, the beam search size is fixed to 3 for all models at test time.
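
To summarize the schedule just described, here is a hedged sketch of the two training stages (cross-entropy pre-training followed by RL fine-tuning on CIDEr-D); the learning rates, iteration counts and the cycle/data-loader interface are placeholder assumptions rather than the paper's exact settings.

```python
import torch


def train(decoder, loader, reward_fn, xent_iters, rl_iters):
    """Two-stage schedule: XENT pre-training, then RL fine-tuning with the
    TD + sampling-baseline loss. Learning rates are placeholders."""
    optimizer = torch.optim.Adam(decoder.parameters(), lr=5e-4)  # placeholder LR

    for it, (img_feat, captions, refs) in enumerate(cycle(loader)):
        if it == xent_iters:
            # switch to RL fine-tuning, starting from the XENT-pretrained model
            optimizer = torch.optim.Adam(decoder.parameters(), lr=5e-5)  # placeholder LR
        if it == xent_iters + rl_iters:
            break
        if it < xent_iters:
            # Stage 1: teacher-forced cross-entropy training, Eq. (4)
            loss = xent_loss(decoder(img_feat, captions[:, :-1]), captions[:, 1:])
        else:
            # Stage 2: optimize the evaluation metric with the loss of Eq. (14)
            loss = td_sampling_baseline_loss(decoder, img_feat, refs, reward_fn)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def cycle(loader):
    """Iterate over the data loader indefinitely (hypothetical helper)."""
    while True:
        for batch in loader:
            yield batch
```
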
Performance on MS COCO

Performance of our models. To test the effectiveness of the TD(λ) modelling method and the proposed baseline method, we conduct a series of experiments for image captioning on the Karpathy split of the MS COCO dataset. The model configurations are as follows:

XENT-FC: the FC model trained with the XENT loss.
SR-Greedy-FC: the FC model trained with a shared reward for every word in a sampled sentence.
TD-Greedy-FC: the FC model trained with TD learning, where the baseline is computed from the reward of a sequence sampled with the greedy policy.
TD-Multinomial-FC: the FC model trained with TD learning, where the baseline is computed from the reward of a sequence sampled with the multinomial policy.

The results of these four models are listed in Table 1. The model in the first row is trained with the XENT loss and the three models in the second row are trained with reinforcement learning. Comparing the result of XENT-FC with the three RL models in the second row, we find that our proposed reinforcement learning method improves the performance by a large margin. Compared with the SR-Greedy-FC model, the TD-Greedy-FC model performs better on all metrics, indicating the effectiveness of the TD(λ) modelling method. The TD-Multinomial-FC model achieves an improvement of 1.1% and 2.4% in terms of the CIDEr metric over the TD-Greedy-FC model and the SR-Greedy-FC model, respectively. The better performance can be attributed to the TD(λ) modelling method, which assigns different actions a discounted expected future reward, and to the proposed baseline method, which yields lower variance than using a sequence sampled with the greedy policy as the baseline sequence.

Comparison with the state-of-the-art methods. To verify the effectiveness of our proposed method, we also compare our models with several state-of-the-art methods. The comparison results are shown in Table 2, where "-" means that the corresponding scores are not reported in the original papers, and the performance of MIXER is taken from (Rennie et al. 2017). Methods in the first row of the table do not train the image captioning model via reinforcement learning, while those in the second row incorporate reinforcement learning when training the model. For a fair comparison, we only report the FC-2K model of SCST (Rennie et al. 2017), which employs the same CNN model as ours to extract the image feature. The third row lists two of our models. TD-Multinomial-ATT adopts the attention mechanism as in (Rennie et al. 2017) but with a smaller number of region points in the feature map.

Table 1: Performance of the proposed method on the MS COCO dataset (BLEU-1 to BLEU-4, METEOR, ROUGE-L and CIDEr scores for XENT-FC, SR-Greedy-FC, TD-Greedy-FC and TD-Multinomial-FC).

Table 2: Performance comparison of the proposed method with other methods on the MS COCO dataset: Google NIC (Vinyals et al. 2015), Toronto (Xu et al. 2015), ATT (You et al. 2016), m-RNN (Mao et al. 2015), R-LSTM (Chen et al. 2017c), MSM (Yao et al. 2016), MIXER (Ranzato et al. 2015), SCST(FC-2K) (Rennie et al. 2017), TD-Multinomial-FC and TD-Multinomial-ATT.

Table 3: Evaluation on the online MS COCO testing server (c5 and c40 scores for BLEU-1 to BLEU-4, METEOR, ROUGE-L and CIDEr); marked entries are results of ensemble models. Compared systems: MSM (Yao et al. 2016), R-LSTM (Chen et al. 2017c), Adaptive Attention (Lu et al. 2017), Google NIC (Vinyals et al. 2015), ATT (You et al. 2016), ERD (Wu and Cohen 2016), SCA-CNN (Chen et al. 2017b), MS Captivator (Fang et al. 2015) and TD-Multinomial-ATT.

Figure 3: The influence of the beam search size K on the (a) XENT-FC and (b) TD-Multinomial-FC models (CIDEr vs. beam size K).

It can be seen from Table 2 that our two models outperform the models trained without reinforcement learning (comparing the models in the first row with those in the third row). Under the same conditions, our models have a superiority over the MIXER and SCST models, with improvements of 9.7% and 5.3% in terms of the CIDEr metric, respectively.

Performance on the COCO test server. We also submit the results generated by our best model on the official test set to the online COCO testing server, and compare the performance with state-of-the-art systems. The results are shown in Table 3. We can see that our single model achieves the best performance on BLEU-1 (c5), BLEU-2 (c40) and CIDEr (c5 and c40) among these published systems. On the other metrics, our method is also among the best. Our model does not have an advantage on all metrics, for two reasons: (1) we only optimize the CIDEr metric when training our image captioning models; (2) we do not employ model ensembling to further improve the performance. Further exploration of optimizing a fusion of the metrics and of model ensembling is left as future work.

Parameter analysis

We now analyze the influence of the beam search size K at test time. We compare the TD-Multinomial-FC model with the XENT-FC model for beam sizes in the range {1, 3, 5, 7, 9, 10}. The results are depicted in Figure 3. We can see that the beam search size K has a greater impact on the XENT-FC model than on the TD-Multinomial-FC model. Specifically, the performance of the XENT-FC model varies noticeably with K, while the TD-Multinomial-FC model does not change much as K grows. We suppose that our proposed method makes the standard deviation of the action distribution larger, because it encourages actions with a higher future reward to be sampled more frequently by the model during training.

Figure 2: Qualitative examples from our best model (red) compared with the attention model trained under the XENT loss (black).

Qualitative Analysis

Here we provide some qualitative examples from our captioning model, shown in Figure 2. The sentences in black are generated by the pretrained attention model under the XENT loss, and the sentences in red are generated by our best model, trained under the RL loss on top of the pretrained attention model, so we can get an intuitive sense of the improvement brought by reinforcement learning by analysing the captions generated by the two models. In general, the RL model generates more descriptive captions than the base attention model. Specifically, in Figure 2, for the top four images, the base attention model cannot recognize some objects in the image correctly. An example can be found in image 2, where the toothbrush is mistaken for a banana by the base model, whereas the RL model describes it correctly. For the middle four images, the RL model expresses the visual content in more detail and more descriptively; for instance, in image 7 the RL model sees the traffic light and infers that the cars are driving down the street, while the base model only recognizes the city street and the traffic. For the bottom four images, the RL model organizes the language in a way that better matches human cognition than the base attention model. Taking image 12 as an example, this image shows a scene in which a man is talking on a cell phone. The RL model describes the scene correctly while the base attention model does not, though its description of the man is not completely wrong.

Conclusion

In this paper, we proposed to incorporate reinforcement learning into the image captioning task by considering the caption generation procedure as an RL problem. Different from previous RL works for image captioning, which consider the words to be equally important for the whole sequence generation, we formulated the value function with the temporal-difference method, which takes the correlation between temporally successive actions into consideration. Besides, to avoid high variance during training, we introduced a baseline computed from the reward of a sequence sampled by the current model.
Experimental results on the MS COCO dataset and comparisons with state-of-the-art methods demonstrated the effectiveness of our proposed method.

References

Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; and Chua, T.-S. 2017a. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR.
Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; and Chua, T.-S. 2017b. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR.
Chen, M.; Ding, G.; Zhao, S.; Chen, H.; Liu, Q.; and Han, J. 2017c. Reference based LSTM for image captioning. In AAAI.
Cho, K.; Van Merriënboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
Devlin, J.; Gupta, S.; Girshick, R.; Mitchell, M.; and Zitnick, C. L. 2015. Exploring nearest neighbor approaches for image captioning. arXiv preprint.
Elliott, D., and Keller, F. 2013. Image description using visual dependency representations. In EMNLP.
Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In CVPR.
Farhadi, A.; Hejrati, M.; Sadeghi, M. A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; and Forsyth, D. 2010. Every picture tells a story: Generating sentences from images. In ECCV.
Gong, Y.; Wang, L.; Hodosh, M.; Hockenmaier, J.; and Lazebnik, S. 2014. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8).
Jia, X.; Gavves, E.; Fernando, B.; and Tuytelaars, T. 2015. Guiding the long-short term memory model for image caption generation. In ICCV.
Jin, J.; Fu, K.; Cui, R.; Sha, F.; and Zhang, C. 2015. Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. arXiv preprint.
Kulkarni, G.; Premraj, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A. C.; and Berg, T. L. 2011. Baby talk: Understanding and generating simple image descriptions. In CVPR.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.
Liu, S.; Zhu, Z.; Ye, N.; Guadarrama, S.; and Murphy, K. 2016. Optimization of image description metrics using policy gradient methods. arXiv preprint.
Lu, J.; Xiong, C.; Parikh, D.; and Socher, R. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR.
Mao, J.; Xu, W.; Yang, Y.; Wang, J.; and Yuille, A. L. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR.
Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2015. Sequence level training with recurrent neural networks. In ICLR.
Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In CVPR.
Schwenk, H. 2012. Continuous space translation models for phrase-based statistical machine translation. In COLING.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1):9-44.
Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In CVPR.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4).
Yang, Z.; Yuan, Y.; Wu, Y.; Salakhutdinov, R.; and Cohen, W. W. 2016. Encode, review, and decode: Reviewer module for caption generation. In NIPS.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; and Mei, T. 2016. Boosting image captioning with attributes. arXiv preprint.
You, Q.; Jin, H.; Wang, Z.; Fang, C.; and Luo, J. 2016. Image captioning with semantic attention. In CVPR.
Zhao, S.; Yao, H.; Gao, Y.; Ji, R.; and Ding, G. 2017. Continuous probability distribution prediction of image emotions via multitask shared sparse regression. IEEE Transactions on Multimedia 19(3).


Lecture 5: Object Detection

Lecture 5: Object Detection Object Detection CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 5: Object Detection Bohyung Han Computer Vision Lab. bhhan@postech.ac.kr 2 Traditional Object Detection Algorithms Region-based

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Computer Vision Lecture 16 Deep Learning Applications 11.01.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar registration period starts

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

arxiv: v2 [cs.cv] 7 Aug 2017

arxiv: v2 [cs.cv] 7 Aug 2017 Dense Captioning with Joint Inference and Visual Context Linjie Yang Kevin Tang Jianchao Yang Li-Jia Li Snap Inc. {linjie.yang, kevin.tang, jianchao.yang}@snap.com lijiali@cs.stanford.edu arxiv:1611.06949v2

More information

On the Efficiency of Recurrent Neural Network Optimization Algorithms

On the Efficiency of Recurrent Neural Network Optimization Algorithms On the Efficiency of Recurrent Neural Network Optimization Algorithms Ben Krause, Liang Lu, Iain Murray, Steve Renals University of Edinburgh Department of Informatics s17005@sms.ed.ac.uk, llu@staffmail.ed.ac.uk,

More information

arxiv: v1 [cs.cv] 6 Jul 2016

arxiv: v1 [cs.cv] 6 Jul 2016 arxiv:607.079v [cs.cv] 6 Jul 206 Deep CORAL: Correlation Alignment for Deep Domain Adaptation Baochen Sun and Kate Saenko University of Massachusetts Lowell, Boston University Abstract. Deep neural networks

More information

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Efficient Segmentation-Aided Text Detection For Intelligent Robots Efficient Segmentation-Aided Text Detection For Intelligent Robots Junting Zhang, Yuewei Na, Siyang Li, C.-C. Jay Kuo University of Southern California Outline Problem Definition and Motivation Related

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Dense Captioning with Joint Inference and Visual Context

Dense Captioning with Joint Inference and Visual Context Dense Captioning with Joint Inference and Visual Context Linjie Yang Kevin Tang Jianchao Yang Li-Jia Li Snap Inc. Google Inc. {linjie.yang, kevin.tang, jianchao.yang}@snap.com lijiali@cs.stanford.edu Abstract

More information

arxiv: v1 [cs.cv] 31 Mar 2016

arxiv: v1 [cs.cv] 31 Mar 2016 Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arxiv:1603.09742v1 [cs.cv] 31 Mar 2016 University of Southern California Abstract.

More information

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN Background Related Work Architecture Experiment Mask R-CNN Background Related Work Architecture Experiment Background From left

More information

Content-Based Image Recovery

Content-Based Image Recovery Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

More information

arxiv:submit/ [cs.cv] 13 Jan 2018

arxiv:submit/ [cs.cv] 13 Jan 2018 Benchmark Visual Question Answer Models by using Focus Map Wenda Qiu Yueyang Xianzang Zhekai Zhang Shanghai Jiaotong University arxiv:submit/2130661 [cs.cv] 13 Jan 2018 Abstract Inferring and Executing

More information

Early Embedding and Late Reranking for Video Captioning

Early Embedding and Late Reranking for Video Captioning Early Embedding and Late Reranking for Video Captioning Jianfeng Dong Xirong Li Weiyu Lan Yujia Huo Cees G. M. Snoek College of Computer Science and Technology, Zhejiang University Key Lab of Data Engineering

More information

SocialML: machine learning for social media video creators

SocialML: machine learning for social media video creators SocialML: machine learning for social media video creators Tomasz Trzcinski a,b, Adam Bielski b, Pawel Cyrta b and Matthew Zak b a Warsaw University of Technology b Tooploox firstname.lastname@tooploox.com

More information

Reconstruction Network for Video Captioning

Reconstruction Network for Video Captioning Reconstruction Network for Video Captioning Bairui Wang Lin Ma Wei Zhang Wei Liu Tencent AI Lab School of Control Science and Engineering, Shandong University {bairuiwong,forest.linma}@gmail.com davidzhang@sdu.edu.cn

More information

Feature-Fused SSD: Fast Detection for Small Objects

Feature-Fused SSD: Fast Detection for Small Objects Feature-Fused SSD: Fast Detection for Small Objects Guimei Cao, Xuemei Xie, Wenzhe Yang, Quan Liao, Guangming Shi, Jinjian Wu School of Electronic Engineering, Xidian University, China xmxie@mail.xidian.edu.cn

More information

Instance-aware Image and Sentence Matching with Selective Multimodal LSTM

Instance-aware Image and Sentence Matching with Selective Multimodal LSTM Instance-aware Image and Sentence Matching with Selective Multimodal LSTM Yan Huang 1,3 Wei Wang 1,3 Liang Wang 1,2,3 1 Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory

More information

arxiv: v1 [cs.cv] 17 Nov 2016

arxiv: v1 [cs.cv] 17 Nov 2016 Multimodal Memory Modelling for Video Captioning arxiv:1611.05592v1 [cs.cv] 17 Nov 2016 Junbo Wang 1 Wei Wang 1 Yan Huang 1 Liang Wang 1,2 Tieniu Tan 1,2 1 Center for Research on Intelligent Perception

More information

Every Picture Tells a Story: Generating Sentences from Images

Every Picture Tells a Story: Generating Sentences from Images Every Picture Tells a Story: Generating Sentences from Images Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth University of Illinois

More information

RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation

RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation Xirong Li 1, Qin Jin 1, Shuai Liao 1, Junwei Liang 1, Xixi He 1, Yujia Huo 1, Weiyu Lan 1, Bin Xiao 2, Yanxiong Lu

More information

Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting

Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting Yaguang Li Joint work with Rose Yu, Cyrus Shahabi, Yan Liu Page 1 Introduction Traffic congesting is wasteful of time,

More information

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides Deep Learning in Visual Recognition Thanks Da Zhang for the slides Deep Learning is Everywhere 2 Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object

More information

arxiv: v1 [cs.cv] 20 Dec 2016

arxiv: v1 [cs.cv] 20 Dec 2016 End-to-End Pedestrian Collision Warning System based on a Convolutional Neural Network with Semantic Segmentation arxiv:1612.06558v1 [cs.cv] 20 Dec 2016 Heechul Jung heechul@dgist.ac.kr Min-Kook Choi mkchoi@dgist.ac.kr

More information

Recurrent Neural Networks and Transfer Learning for Action Recognition

Recurrent Neural Networks and Transfer Learning for Action Recognition Recurrent Neural Networks and Transfer Learning for Action Recognition Andrew Giel Stanford University agiel@stanford.edu Ryan Diaz Stanford University ryandiaz@stanford.edu Abstract We have taken on the

More information

Automated Neural Image Caption Generator for Visually Impaired People

Automated Neural Image Caption Generator for Visually Impaired People 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

UTS submission to Google YouTube-8M Challenge 2017

UTS submission to Google YouTube-8M Challenge 2017 UTS submission to Google YouTube-8M Challenge 2017 Linchao Zhu Yanbin Liu Yi Yang University of Technology Sydney {zhulinchao7, csyanbin, yee.i.yang}@gmail.com Abstract In this paper, we present our solution

More information

Language Models for Image Captioning: The Quirks and What Works

Language Models for Image Captioning: The Quirks and What Works Language Models for Image Captioning: The Quirks and What Works Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, Margaret Mitchell Microsoft Research Corresponding

More information

arxiv: v1 [cs.cv] 21 Dec 2016

arxiv: v1 [cs.cv] 21 Dec 2016 arxiv:1612.07360v1 [cs.cv] 21 Dec 2016 Top-down Visual Saliency Guided by Captions Vasili Ramanishka Boston University Abir Das Boston University Jianming Zhang Adobe Research Kate Saenko Boston University

More information

Class 5: Attributes and Semantic Features

Class 5: Attributes and Semantic Features Class 5: Attributes and Semantic Features Rogerio Feris, Feb 21, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Project

More information

Rewind to Track: Parallelized Apprenticeship Learning with Backward Tracklets

Rewind to Track: Parallelized Apprenticeship Learning with Backward Tracklets Rewind to Track: Parallelized Apprenticeship Learning with Backward Tracklets Jiang Liu 1,2, Jia Chen 2, De Cheng 2, Chenqiang Gao 1, Alexander G. Hauptmann 2 1 Chongqing University of Posts and Telecommunications

More information

Real-time Object Detection CS 229 Course Project

Real-time Object Detection CS 229 Course Project Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection

More information

arxiv: v1 [cs.cv] 16 Nov 2015

arxiv: v1 [cs.cv] 16 Nov 2015 Coarse-to-fine Face Alignment with Multi-Scale Local Patch Regression Zhiao Huang hza@megvii.com Erjin Zhou zej@megvii.com Zhimin Cao czm@megvii.com arxiv:1511.04901v1 [cs.cv] 16 Nov 2015 Abstract Facial

More information