Temporal-difference Learning with Sampling Baseline for Image Captioning

Hui Chen, Guiguang Ding, Sicheng Zhao, Jungong Han
School of Software, Tsinghua University, Beijing 100084, China
School of Computing and Communications, Lancaster University, Lancaster, LA1 4YW, UK

Abstract

The existing methods for image captioning usually train the language model under the cross-entropy loss, which results in exposure bias and an inconsistency with the evaluation metric. Recent research has shown that these two issues can be well addressed by the policy gradient method from the reinforcement learning domain, attributable to its unique capability of directly optimizing discrete and non-differentiable evaluation metrics. In this paper, we utilize a reinforcement learning method to train the image captioning model. Specifically, we train our image captioning model to maximize the overall reward of the sentences by adopting the temporal-difference (TD) learning method, which takes the correlation between temporally successive actions into account. In this way, we assign different values to different words in one sampled sentence through a discount coefficient when back-propagating the gradient with the REINFORCE algorithm, enabling the correlation between actions to be learned. Besides, instead of estimating a baseline to normalize the rewards with another network, we utilize the reward of another Monte-Carlo sample as the baseline to avoid high variance. We show that our proposed method can improve the quality of generated captions and outperforms the state-of-the-art methods on the benchmark MS COCO dataset in terms of seven evaluation metrics.

Introduction

Scene understanding is one of the ultimate goals of computer vision. Image captioning aims at automatically generating reasonable captions for images, which is of great importance to scene understanding. It is a challenging task, not only because the captioning model must be capable of recognizing which objects are in the image, but also because it must be powerful enough to understand the semantic relationships among the objects and describe them properly in natural language. It is also of great significance to enable machines to mimic the human ability to express rich visual information in descriptive language, and the task thus attracts much attention from academic researchers and industrial companies.

This research was supported by the National Natural Science Foundation of China (Grant Nos. , ), the Royal Society Newton Mobility Grant (IE150997) and the Project Funded by China Postdoctoral Science Foundation (No. 2017M610897). Corresponding authors: Guiguang Ding and Jungong Han. Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Inspired by the machine translation domain, recent works focus on deep-network-based, end-to-end methods, mainly under the encoder-decoder framework. In general, recurrent neural networks (RNN), especially long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997), are employed as the decoder to generate captions (Vinyals et al. 2015; Jin et al. 2015; Xu et al. 2015; You et al. 2016; Zhao et al. 2017) on the basis of visual features of the image extracted by a CNN. These models are usually trained to maximize the likelihood of the next ground-truth word given the previous ground-truth words. However, this method leads to a problem called exposure bias (Ranzato et al.
2015), since at test time the model uses the word sampled from its own predictions as the next input, instead of the ground-truth word. The second problem is the inconsistency between the objective optimized at training time and the evaluation metrics used at test time: the training procedure lowers the cross-entropy loss, while a generated sentence is evaluated with discrete and non-differentiable NLP metrics such as BLEU, ROUGE, CIDEr, and METEOR. These two problems limit the ability of the model to understand the image and describe it with descriptive sentences. It has been shown that reinforcement learning (RL) can provide a solution to the two issues identified above, and several works have explored incorporating reinforcement learning into image captioning. (Ranzato et al. 2015) proposed a novel sequence-level training procedure using the policy gradient method. (Rennie et al. 2017) adopted the same loss function as (Ranzato et al. 2015) but modelled the baseline slightly differently, proposing a self-critical training method that uses the caption generated by the test-time inference algorithm as the baseline. (Liu et al. 2016) employed the same method as (Ranzato et al. 2015) to produce the baseline, and their main contribution lies in using Monte Carlo rollouts to approximate the value function. Despite their better performance, especially compared to the non-RL approaches, these works still have some shortcomings. For example, (Rennie et al. 2017) and (Ranzato et al. 2015) both implicitly assume that every word in one sampled sequence makes the same contribution to the reward, which is clearly not reasonable in general. (Liu et al. 2016) estimated the baseline reward by simply adopting an MLP to learn it from the state vector of the RNN, as Ranzato et al. did.

This method usually exhibits high variance, thus making the training unstable. In this paper, we apply the temporal-difference method (Sutton 1988) to model the RL value function, instead of Monte Carlo rollouts, because the Monte Carlo rollout method only learns from observed values, meaning that the value cannot be obtained until the sequence is finished. In contrast, the temporal-difference method assumes that there are correlations between temporally successive actions, so it can estimate the value of an action from the previously learned estimates of the successive actions, in the spirit of dynamic programming. Since the context of a sentence is strongly correlated, we assume that temporal-difference learning is more appropriate for modelling the value function. Besides, to reduce the variance during model training, we also use a baseline, as suggested by (Rennie et al. 2017), where the caption generated by the test-time inference algorithm serves as the baseline caption. However, we notice that the baseline of (Rennie et al. 2017) cannot approximate the value function well, because the test-time inference algorithm tends to pick a fairly good sentence that is, in most cases, better than a sentence sampled from the model distribution. Instead, we generate two sentences, both sampled from the model distribution, based on the idea that the qualities of actions sampled from the same distribution under a multinomial sampling policy are close in terms of probability. Therefore, we adopt one of the two sentences as the baseline sequence and apply the temporal-difference method. Overall, the contributions of this paper are three-fold: (1) we directly optimize the evaluation metrics during training through a temporal-difference method in reinforcement learning, where actions at different time steps have different impacts on the model; (2) to avoid high variance during training, we employ a novel baseline modelling method that computes the baseline from a sequence sampled from the same distribution as the sequence used for the gradient; (3) we conduct extensive experiments and comparisons with other methods, and the results demonstrate that the proposed method has a significant superiority over the state-of-the-art methods.

Related Work

The literature on image captioning can be divided into three categories based on the way sequences are generated (Jia et al. 2015): template-based methods (Farhadi et al. 2010; Kulkarni et al. 2011; Elliott and Keller 2013), transfer-based methods (Gong et al. 2014; Devlin et al. 2015; Mao et al. 2015) and neural network-based methods. Since the proposed method adopts the same framework as the neural network-based methods, we mainly introduce the related work on image captioning with them. The neural network-based methods draw inspiration from machine translation (Schwenk 2012; Cho et al. 2014), where two RNNs are used as the encoder and the decoder respectively. Vinyals et al. (2015) replaced the RNN encoder with a deep CNN and adopted an LSTM to decode the image vector into a sentence. This work achieved a reasonable result, and many subsequent works followed this idea and studied it further. Xu et al. (2015) applied the attention mechanism to the image captioning task, in which the decoder functions like the human eye, focusing its attention on different regions of the image at each time step. Lu et al.
(2017) improved the attention model by introducing a visual sentinel, allowing the attention module to adaptively attend to the visual regions. You et al. (2016) proposed a semantic attention model which selectively attends to semantic concept regions by fusing the global image feature and the semantic attribute features from an attribute detector. Chen et al. (2017a) proposed a spatial and channel-wise attention model to attend to both image features and visual regions adaptively. Recently, researchers have made efforts to incorporate reinforcement learning into the standard encoder-decoder framework to address the exposure bias and non-differentiable metric issues. Specifically, (Ranzato et al. 2015) used the REINFORCE algorithm (Williams 1992) and proposed a novel sequence-level training method that directly optimizes the non-differentiable test metric. (Liu et al. 2016) applied the policy gradient algorithm in the training procedure for image captioning models, in which the words sampled from the current model at each time step were awarded different future rewards obtained by averaging the rewards of several Monte-Carlo samples. A simple MLP was used to produce an estimate of the future reward, and this estimate was in turn treated as the baseline to reduce the variance. Self-critical sequence training (SCST) (Rennie et al. 2017) also adopted the policy gradient algorithm, but differs from (Liu et al. 2016) in that SCST simply runs the forward process twice and obtains two sequences, one generated by running the inference algorithm used at test time and the other sampled with the multinomial strategy. SCST takes the reward of the sequence from the inference algorithm as a baseline to reduce the training variance. (Ranzato et al. 2015; Rennie et al. 2017) simply assume that each word is equally important to the reward of the sentence, so that each word obtains the same gradient during back-propagation; this assumption is not reasonable in general. Lu et al. (2017) find, by applying an adaptive attention model, that the model tends to attend to visual words like "red", "horse" and "bus" more than to non-visual words such as "of" and "a", which is indeed in accordance with the human attention schema. Chen et al. (2017c) show that assigning different weights to different words helps the model be aware of the different importance of words in a sentence and enhances the model's ability to generate high-quality captions. (Liu et al. 2016) trains an extra MLP based on the output of the LSTM units to estimate the baseline, turning the MLP into an estimator over the action space. However, an MLP does not seem to be a good estimator, since the action space can be enormous; it may cause high variance and thus make the training unstable. In contrast, our method allows the captioning model to learn different values for different words through temporal-difference learning. Besides, we employ a sampling baseline strategy to keep the training stable with low variance.

Figure 1: The framework of the proposed model, including two parts: the encoder (in the blue rectangle) and the decoder (in the red rectangle). The top and bottom LSTMs share the same parameters. The right arrows denote the forward operation and the left arrows denote the backward operation. $W^s = (w^s_1, w^s_2, \ldots, w^s_T)$ and $W^{s'} = (w^{s'}_1, w^{s'}_2, \ldots, w^{s'}_T)$ are two sequences sampled from the model under the multinomial policy. $r^s$ and $r^{s'}$ are the rewards of $W^s$ and $W^{s'}$, respectively. $\gamma$ is the discount coefficient in the temporal-difference method. $s_t$ is the output of the softmax function.

Methodology

Encoder-Decoder framework

Given an image $I$, the image captioning model needs to generate a caption sequence $W = \{w_1, w_2, \ldots, w_T\}$, $w_t \in D$, where $D$ is the vocabulary dictionary. We adopt the standard CNN-RNN architecture for image captioning. The CNN, which can be seen as an encoder, encodes an input image into a vector. The RNN functions as a decoder aiming to generate the captions given the image feature. Here, we use an LSTM (Hochreiter and Schmidhuber 1997) as the decoder. During generation, the LSTM generates a word at each time step conditioned on the previously generated word $w_{t-1}$, the previous hidden state $h_{t-1}$ and the previous cell state $c_{t-1}$, which contains the context information the LSTM has seen. The LSTM updates its hidden units and cells as follows:

$$
\begin{aligned}
x_{-1} &= \mathrm{CNN}(I), \qquad x_0 = E(w_0), \qquad x_t = E(w_t) \\
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) && \text{(output gate)} \\
c_t &= i_t \odot \phi(W_{zx} x_t + W_{zh} h_{t-1} + b_c) + f_t \odot c_{t-1} \\
h_t &= o_t \odot \tanh(c_t) \\
q_t &= W_{qh} h_t
\end{aligned}
\qquad (1)
$$

where $w_0$ is a special token indicating the start of the sequence, $\mathrm{CNN}(I)$ is the feature extractor for image $I$, and $E(\cdot)$ is the embedding function which maps the one-hot representation of a word into the embedding semantic space. We initialize $c_0$ and $h_0$ to the zero vector. A distribution over the next word $w_t$ is then produced by the softmax function:

$$w_t \sim \mathrm{Softmax}(q_t) \qquad (2)$$

The likelihood of a word $w_t$ at time step $t$ is given by the conditional probability conditioned on the input image $I$ and the previous words $w_0, w_1, \ldots, w_{t-1}$: $p(w_t \mid I, w_0, w_1, \ldots, w_{t-1})$. So the probability of a generated sequence $W = (w_0, w_1, w_2, \ldots, w_T)$ given the input image $I$ is the product of the conditional probabilities of the words:

$$p(W \mid I) = \prod_{t=1}^{T} p(w_t \mid I, w_0, w_1, \ldots, w_{t-1}) \qquad (3)$$

The Show and Tell paper (Vinyals et al. 2015) uses the cross-entropy loss (XENT) to train the whole network. The XENT loss maximizes the probability of the description $W$ generated by the model, i.e., it minimizes:

$$L = -\sum_{t=1}^{T} \log p(w_t \mid I, w_0, w_1, \ldots, w_{t-1}) \qquad (4)$$

The XENT loss leads the model to generate the word with the highest posterior probability at each time step $t$, without considering the quality of the whole sequence at test time, causing a phenomenon called search error (Ranzato et al. 2015).
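
To make the decoding step and the XENT objective concrete, below is a minimal PyTorch-style sketch of the decoder described by Eqs. (1)-(4). The class name CaptionDecoder, the dimensions, and the use of nn.LSTMCell in place of the explicit gate equations are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptionDecoder(nn.Module):
    """Minimal sketch of the CNN-LSTM decoder of Eqs. (1)-(4); names and sizes are illustrative."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # maps CNN(I) to the step -1 input x_{-1}
        self.embed = nn.Embedding(vocab_size, embed_dim)  # E(.), word embedding
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)    # plays the role of the gate equations in Eq. (1)
        self.out = nn.Linear(hidden_dim, vocab_size)      # q_t = W_qh h_t

    def init_state(self, img_feat):
        """Feed the image feature once (step -1) to initialize (h, c) from zero vectors."""
        zeros = img_feat.new_zeros(img_feat.size(0), self.lstm.hidden_size)
        return self.lstm(self.img_proj(img_feat), (zeros, zeros))

    def forward(self, img_feat, captions):
        """Teacher-forced log-probabilities used by the XENT loss of Eq. (4).

        img_feat: (B, feat_dim) global CNN feature; captions: (B, T) word ids starting with BOS.
        """
        h, c = self.init_state(img_feat)
        logps = []
        for t in range(captions.size(1)):
            h, c = self.lstm(self.embed(captions[:, t]), (h, c))
            logps.append(F.log_softmax(self.out(h), dim=-1))  # distribution over the next word, Eq. (2)
        return torch.stack(logps, dim=1)  # (B, T, vocab_size)


def xent_loss(logps, targets, pad_id=0):
    """Cross-entropy (XENT) loss of Eq. (4): negative log-likelihood of the ground-truth words."""
    logp = logps.gather(2, targets.unsqueeze(2)).squeeze(2)  # (B, T) log p(w_t | I, w_0..w_{t-1})
    mask = (targets != pad_id).float()                       # ignore padding positions
    return -(logp * mask).sum() / mask.sum()
```
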
Temporal-difference learning: TD(λ)

Reinforcement learning can provide solutions for decision-making problems. We consider the image captioning task as a decision-making problem, or a finite Markov decision process (MDP). In the MDP setting, the state can be defined as the information that is known at the current time step. So we consider the state $s_t$ to be a list consisting of the image and the previous words:

$$s_t = \{I, w_0, w_1, \ldots, w_{t-1}\} \qquad (5)$$

The action is the input image or the word generated at a given time step. The parameters of the network, $\theta$, define the policy network $p_\theta$, which produces an action distribution, in other words, the prediction of the next word. The LSTM decoder can be viewed as an agent that takes an action (an image feature or a word) under the guidance of this action distribution.
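
Since the training procedure below repeatedly draws whole captions from the current policy $p_\theta$, the following is a minimal sketch of multinomial sampling built on the hypothetical CaptionDecoder above; the BOS token id, the maximum length, and the choice not to truncate at EOS are simplifying assumptions.

```python
import torch
import torch.nn.functional as F


def sample_caption(decoder, img_feat, bos_id, max_len=20):
    """Draw one caption W^s from the policy p_theta with multinomial sampling.

    Returns the sampled word ids (B, T) and their log-probabilities (B, T); the
    log-probabilities keep gradients so they can be reused in the REINFORCE update.
    For simplicity the sketch does not truncate sequences at EOS.
    """
    h, c = decoder.init_state(img_feat)                          # step -1: image feature
    word = torch.full((img_feat.size(0),), bos_id,
                      dtype=torch.long, device=img_feat.device)  # step 0: BOS token w_0
    words, logps = [], []
    for _ in range(max_len):
        h, c = decoder.lstm(decoder.embed(word), (h, c))
        logp = F.log_softmax(decoder.out(h), dim=-1)             # action distribution p_theta(.|s_t)
        word = torch.multinomial(logp.exp(), num_samples=1).squeeze(1)  # sample the next action
        words.append(word)
        logps.append(logp.gather(1, word.unsqueeze(1)).squeeze(1))      # log p_theta(w_t^s)
    return torch.stack(words, dim=1), torch.stack(logps, dim=1)
```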

After each action $a_t$, the LSTM updates its internal parameters to increase or decrease the probability of taking action $a_t$ according to the reward. The reward is an important element in RL, as it determines the evolution direction of the agent. Here, we define the reward as the score computed by evaluating the generated caption against the corresponding ground-truth sequences under the standard evaluation metrics, such as BLEU-1,2,3,4, CIDEr, METEOR, etc. We denote this reward by $r$ in the following. In reinforcement learning, the agent's task is to maximize the total amount of reward passed from the environment to the agent. For image captioning, the reward cannot be calculated until EOS, a special token indicating the end of the sequence, is generated by the model. Therefore, it is necessary to define a reward for each word. In this paper, we define the reward of each word $w_t$ as follows:

$$r_t = \begin{cases} r & t = T \\ 0 & t < T \end{cases} \qquad (6)$$

where $r$ is the score calculated using the evaluation metric and $T$ is the final time step. The agent aims to maximize the cumulative reward it receives in the long run. For an episode $(a_0, a_1, \ldots, a_T)$, we define the Q-value function $Q(s_t, a_{t+1})$ as a function of the current state $s_t$ of the model and a possible action $a_{t+1}$, estimating the expected future reward. There are many ways to define the Q-value function. (Liu et al. 2016) exploited the Monte Carlo rollout method, in which the model generates many sequences and the average of their rewards is used as the Q-value. In this paper, we instead adopt temporal-difference (TD) learning to estimate the Q-value function. In the n-step TD method, the n-step expected return $G_{t:t+n}$ is defined as the sum of the next $n$ rewards plus the appropriately discounted estimated value of the state $n$ steps later:

$$G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n}) \qquad (7)$$

where $0 \le t \le T - n$. The n-step expected return can be viewed as an n-step backup starting from the current time step $t$. In the TD(λ) method, the Q-value is a weighted average of several n-step backups, in which all weights sum to 1. Specifically, the Q-value in TD(λ) is defined as follows:

$$Q(s_t, a_{t+1}) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n} \qquad (8)$$

Since the length of a generated sequence is limited to $T$ in image captioning, we have:

$$Q(s_t, a_{t+1}) = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t \qquad (9)$$

where $\lambda$ is the trade-off parameter which decides how much the model depends on the full expected return $G_t$. Here, we set $\lambda = 1$ for our image captioning model. Then, with $\lambda = 1$, Eq. (6) and Eq. (7), we have:

$$Q(s_t, a_{t+1}) = \gamma^{T-t-1} r \qquad (10)$$

Now, we define the RL loss function as follows:

$$L(\theta) = -\mathbb{E}_{W^s \sim p_\theta}\Big[\sum_{t=1}^{T} Q(s_t, a_{t+1})\Big] \qquad (11)$$

where $W^s = (w^s_0, w^s_1, \ldots, w^s_T)$ and $w^s_t$ is sampled from the model at time step $t$. The gradient $\nabla_\theta L(\theta)$ can be calculated as in the REINFORCE algorithm (Williams 1992):

$$\nabla_\theta L(\theta) = -\mathbb{E}_{W^s \sim p_\theta}\Big[\sum_{t=1}^{T} Q(s_t, a_{t+1}) \nabla_\theta \log p_\theta(w^s_t)\Big] \qquad (12)$$

In practice, Eq. (12) can be approximated with a single sequence generated by the network with Monte-Carlo sampling for each training sample. So we have:

$$\nabla_\theta L(\theta) = -\sum_{t=1}^{T} Q(s_t, a_{t+1}) \nabla_\theta \log p_\theta(w^s_t) = -\sum_{t=1}^{T} \gamma^{T-t-1} r \, \nabla_\theta \log p_\theta(w^s_t) \qquad (13)$$

The definition of the Q-value above makes this estimator have high variance.
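
As a concrete reading of Eqs. (10) and (13), the sketch below spreads a single sentence-level score $r$ (e.g., the CIDEr-D of the sampled caption) over the words as $Q(s_t, a_{t+1}) = \gamma^{T-t-1} r$ and builds the surrogate loss whose gradient is Eq. (13); the helper names and tensor shapes are assumptions, not the authors' code.

```python
import torch


def discounted_word_values(sentence_score, seq_len, gamma):
    """Per-word values of Eq. (10) under TD(lambda = 1): Q(s_t, a_{t+1}) = gamma^(T-t-1) * r.

    sentence_score: (B,) sentence-level rewards; returns a (B, T) tensor in which
    later words (closer to the terminal reward) are discounted less.
    """
    exponents = torch.arange(seq_len - 1, -1, -1,
                             dtype=sentence_score.dtype, device=sentence_score.device)
    discounts = gamma ** exponents                               # gamma^(T-1), ..., gamma, 1
    return sentence_score.unsqueeze(1) * discounts.unsqueeze(0)  # (B, T)


def reinforce_loss(sample_logps, q_values):
    """Monte-Carlo surrogate loss whose gradient is Eq. (13): minimizing
    -sum_t Q(s_t, a_{t+1}) log p_theta(w_t^s) raises the probability of high-value words."""
    return -(q_values.detach() * sample_logps).sum(dim=1).mean()
```
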
In order to reduce the variance during training, we introduce a baseline. (Rennie et al. 2017) used the reward of the sequence obtained by the current model with the greedy sampling strategy, and (Liu et al. 2016) used an MLP to estimate the baseline reward. In this paper, we introduce a new baseline strategy similar to (Rennie et al. 2017); the difference is that we use a sequence obtained with a multinomial sampling strategy. The gradient function then becomes:

$$\nabla_\theta L(\theta) = -\sum_{t=1}^{T} \gamma^{T-t-1} (r - r_{\mathrm{baseline}}) \nabla_\theta \log p_\theta(w^s_t) \qquad (14)$$

In fact, the two sequences, one for the gradient and the other for the baseline, are both generated by the current network $p_\theta$ with a multinomial sampling strategy. The idea is that the difference between the reward $r$ and $r_{\mathrm{baseline}}$ is small, since they are computed from two sequences sampled from the same distribution; this achieves a lower variance during training than the approach of (Rennie et al. 2017), resulting in more stable parameter updates. According to the chain rule, the final gradient is:

$$\nabla_\theta L(\theta) = \sum_{t=1}^{T} \frac{\partial L(\theta)}{\partial q_t} \frac{\partial q_t}{\partial \theta} \qquad (15)$$

where $q_t$ is the input of the softmax function at time step $t$, and

$$\frac{\partial L(\theta)}{\partial q_t} = -\gamma^{T-t-1} (r - r_{\mathrm{baseline}}) \big(1_{w^s_t} - p_\theta(w_t \mid h_t)\big) \qquad (16)$$
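
Combining the pieces above, here is a minimal sketch of the training objective whose gradient is Eq. (14): two captions are sampled from the same multinomial policy, one acts as the baseline, and the reward difference is discounted per word. The reward_fn hook, the γ value and the reuse of the earlier hypothetical helpers are assumptions.

```python
import torch


def td_sampling_baseline_loss(decoder, img_feat, refs, reward_fn,
                              gamma=0.95, bos_id=1, max_len=20):
    """Surrogate loss whose gradient matches Eq. (14).

    Two captions are drawn from the same multinomial policy: W^s carries the
    gradient and the second sample provides r_baseline. `reward_fn(word_ids, refs)`
    is an assumed hook returning a (B,) tensor of sentence scores (e.g. CIDEr-D);
    the gamma value here is only a placeholder.
    """
    words_s, logps_s = sample_caption(decoder, img_feat, bos_id, max_len)  # W^s, used for the gradient
    with torch.no_grad():
        words_b, _ = sample_caption(decoder, img_feat, bos_id, max_len)    # baseline sample

    r = reward_fn(words_s, refs)           # (B,) reward of W^s
    r_baseline = reward_fn(words_b, refs)  # (B,) reward of the baseline sequence

    # gamma^(T-t-1) * (r - r_baseline), shared over the words of each sentence
    q = discounted_word_values(r - r_baseline, logps_s.size(1), gamma)

    # minimizing this loss raises the probability of words whose sample beat the baseline
    return -(q.detach() * logps_s).sum(dim=1).mean()
```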

The framework of the proposed method is depicted in Figure 1. First, the CNN extracts the feature of the input image. Then the LSTM absorbs the image feature at the beginning (at time step -1) to initialize the hidden state of the language model. Next, at each time step, the LSTM is fed the word sampled from the current model at the previous time step (except at time step 0), until the special token EOS is generated. The model generates two sequences, $W^s$ and $W^{s'}$, sampled under the multinomial policy. The gradient applied to the words of $W^s$ is determined by the difference between the rewards of $W^s$ and $W^{s'}$. This lowers the variance of the gradients and makes the training procedure stable.

Experiments

Dataset and setting

We evaluate our proposed method on the popular MS COCO dataset (Lin et al. 2014). The MS COCO dataset contains 123,287 images labeled with at least 5 captions each, split into training and validation images; MS COCO also provides a test set for online evaluation. Since the standard test set is not public, we use 5000 images for validation, 5000 images for test and the remaining images for training, as in previous works (Xu et al. 2015; You et al. 2016; Chen et al. 2017c), for offline evaluation. We use publicly available code to preprocess the dataset, e.g., pruning infrequent words, and end up with a vocabulary of 9567 different words. We use several metrics, including BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L and CIDEr, to evaluate the proposed method and compare it with other methods. We extract the image features in two different ways. In the first, the image is encoded as a global feature vector of dimension 2048, and during training this feature vector is only fed into the LSTM unit at the beginning. In the second, the full image is encoded with the final convolutional layer of ResNet-101, yielding a feature map, and at each time step this feature map is input into the LSTM units. In the following, we denote models using image features obtained in the first way as FC models, and those using the second way as attention (ATT) models.

Implementation Details

We use ResNet-101 (He et al. 2016) pretrained on ImageNet to encode images. All images are preprocessed as follows: scaling the smaller edge to 256, color normalization, and cropping to a centered rectangle. The decoder is a one-layer LSTM with a hidden state size of 512. The word embedding dimension is fixed to 512. We set the embedding dimension of the image feature to 512 using a linear layer. When training the attention model, the parameter updating of the LSTM follows (Rennie et al. 2017). We train models under the XENT loss using the ADAM optimizer and finetune the CNN from the beginning. We then train the models under the reinforcement loss to optimize the CIDEr-D metric, without finetuning. For all models, the batch size is set to 16 and the model is evaluated every 1K iterations during training. When training under the RL loss, the learning rate of the language model is lowered after 50K iterations and then decreased every 100K iterations; we use the models trained under the XENT loss as pretrained models to reduce the search space. By default, the beam search size is fixed to 3 for all models at test time.
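
To summarize the schedule just described, here is a hedged sketch of the two training stages (cross-entropy pre-training followed by RL fine-tuning on CIDEr-D); the learning rates, iteration counts and the cycle/data-loader interface are placeholder assumptions rather than the paper's exact settings.

```python
import torch


def train(decoder, loader, reward_fn, xent_iters, rl_iters):
    """Two-stage schedule: XENT pre-training, then RL fine-tuning with the
    TD + sampling-baseline loss. Learning rates are placeholders."""
    optimizer = torch.optim.Adam(decoder.parameters(), lr=5e-4)  # placeholder LR

    for it, (img_feat, captions, refs) in enumerate(cycle(loader)):
        if it == xent_iters:
            # switch to RL fine-tuning, starting from the XENT-pretrained model
            optimizer = torch.optim.Adam(decoder.parameters(), lr=5e-5)  # placeholder LR
        if it == xent_iters + rl_iters:
            break
        if it < xent_iters:
            # Stage 1: teacher-forced cross-entropy training, Eq. (4)
            loss = xent_loss(decoder(img_feat, captions[:, :-1]), captions[:, 1:])
        else:
            # Stage 2: optimize the evaluation metric with the loss of Eq. (14)
            loss = td_sampling_baseline_loss(decoder, img_feat, refs, reward_fn)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def cycle(loader):
    """Iterate over the data loader indefinitely (hypothetical helper)."""
    while True:
        for batch in loader:
            yield batch
```
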
Performance on MS COCO

Performance of our models. To test the effectiveness of the TD(λ) modelling method and the proposed baseline method, we conduct a series of experiments for image captioning on the Karpathy split of the MS COCO dataset. The model configurations are as follows:

XENT-FC: the FC model trained with the XENT loss.
SR-Greedy-FC: the FC model trained with a shared reward for every word in a sampled sentence.
TD-Greedy-FC: the FC model trained with TD learning, where the baseline is computed from the reward of a sequence sampled with the greedy policy.
TD-Multinomial-FC: the FC model trained with TD learning, where the baseline is computed from the reward of a sequence sampled with the multinomial policy.

The results of these four models are listed in Table 1. The model in the first row is trained with the XENT loss and the three models in the second row are trained with reinforcement learning. Comparing the result of XENT-FC with the three RL models in the second row, we find that our proposed reinforcement learning method improves the performance by a large margin. Compared with the SR-Greedy-FC model, the TD-Greedy-FC model performs better on all metrics, indicating the effectiveness of the TD(λ) modelling method. The TD-Multinomial-FC model achieves an improvement of 1.1% and 2.4% in terms of the CIDEr metric over the TD-Greedy-FC model and the SR-Greedy-FC model, respectively. The better performance can be attributed to the TD(λ) modelling method, which assigns different actions a discounted expected future reward, and to the proposed baseline method, which yields lower variance than using a sequence sampled with the greedy policy as the baseline sequence.

Comparison with the state-of-the-art methods. To verify the effectiveness of our proposed method, we also compare our models with several state-of-the-art methods. The comparison results are shown in Table 2, where "-" means that the corresponding scores are not reported in the original papers, and the performance of MIXER is taken from (Rennie et al. 2017). Methods in the first row of the table do not train the image captioning model via reinforcement learning, while those in the second row incorporate reinforcement learning when training the model. For a fair comparison, we only report the FC-2K model of SCST (Rennie et al. 2017), which employs the same CNN model as ours to extract the image feature. The third row lists two of our models. TD-Multinomial-ATT adopts the attention mechanism as in (Rennie et al. 2017) but with a smaller number of region points in the feature map.

Table 1: Performance of the proposed method on the MS COCO dataset (BLEU-1 to BLEU-4, METEOR, ROUGE-L and CIDEr scores for XENT-FC, SR-Greedy-FC, TD-Greedy-FC and TD-Multinomial-FC).

Table 2: Performance comparison of the proposed method with other methods on the MS COCO dataset: Google NIC (Vinyals et al. 2015), Toronto (Xu et al. 2015), ATT (You et al. 2016), m-RNN (Mao et al. 2015), R-LSTM (Chen et al. 2017c), MSM (Yao et al. 2016), MIXER (Ranzato et al. 2015), SCST(FC-2K) (Rennie et al. 2017), TD-Multinomial-FC and TD-Multinomial-ATT.

Table 3: Evaluation on the online MS COCO testing server (c5 and c40 scores for BLEU-1 to BLEU-4, METEOR, ROUGE-L and CIDEr); marked entries are results of ensemble models. Compared systems: MSM (Yao et al. 2016), R-LSTM (Chen et al. 2017c), Adaptive Attention (Lu et al. 2017), Google NIC (Vinyals et al. 2015), ATT (You et al. 2016), ERD (Wu and Cohen 2016), SCA-CNN (Chen et al. 2017b), MS Captivator (Fang et al. 2015) and TD-Multinomial-ATT.

Figure 3: The influence of the beam search size K on the (a) XENT-FC and (b) TD-Multinomial-FC models (CIDEr vs. beam size K).

It can be seen from Table 2 that our two models outperform the models trained without reinforcement learning (comparing the models in the first row with those in the third row). Under the same conditions, our models have a superiority over the MIXER and SCST models, with improvements of 9.7% and 5.3% in terms of the CIDEr metric, respectively.

Performance on the COCO test server. We also submit the results generated by our best model on the official test set to the online COCO testing server, and compare the performance with state-of-the-art systems. The results are shown in Table 3. We can see that our single model achieves the best performance on BLEU-1 (c5), BLEU-2 (c40) and CIDEr (c5 and c40) among these published systems. On the other metrics, our method is also among the best. Our model does not have an advantage on all metrics, for two reasons: (1) we only optimize the CIDEr metric when training our image captioning models; (2) we do not employ model ensembling to further improve the performance. Further exploration of optimizing a fusion of the metrics and of model ensembling is left as future work.

Parameter analysis

We now analyze the influence of the beam search size K at test time. We compare the TD-Multinomial-FC model with the XENT-FC model for beam sizes in the range {1, 3, 5, 7, 9, 10}. The results are depicted in Figure 3. We can see that the beam search size K has a greater impact on the XENT-FC model than on the TD-Multinomial-FC model. Specifically, the performance of the XENT-FC model varies noticeably with K, while the TD-Multinomial-FC model does not change much as K grows. We suppose that our proposed method makes the standard deviation of the action distribution larger, because it encourages actions with a higher future reward to be sampled more frequently by the model during training.

Figure 2: Qualitative examples from our best model (red) compared with the attention model trained under the XENT loss (black).

Qualitative Analysis

Here we provide some qualitative examples from our captioning model, shown in Figure 2. The sentences in black are generated by the pretrained attention model under the XENT loss, and the sentences in red are generated by our best model, trained under the RL loss on top of the pretrained attention model, so we can get an intuitive sense of the improvement brought by reinforcement learning by analysing the captions generated by the two models. In general, the RL model generates more descriptive captions than the base attention model. Specifically, in Figure 2, for the top four images, the base attention model cannot recognize some objects in the image correctly. An example can be found in image 2, where the toothbrush is mistaken for a banana by the base model, whereas the RL model describes it correctly. For the middle four images, the RL model expresses the visual content in more detail and more descriptively; for instance, in image 7 the RL model sees the traffic light and infers that the cars are driving down the street, while the base model only recognizes the city street and the traffic. For the bottom four images, the RL model organizes the language in a way that better matches human cognition than the base attention model. Taking image 12 as an example, this image shows a scene in which a man is talking on a cell phone. The RL model describes the scene correctly while the base attention model does not, though its description of the man is not completely wrong.

Conclusion

In this paper, we proposed to incorporate reinforcement learning into the image captioning task by considering the caption generation procedure as an RL problem. Different from previous RL works for image captioning, which consider the words to be equally important for the whole sequence generation, we formulated the value function with the temporal-difference method, which takes the correlation between temporally successive actions into consideration. Besides, to avoid high variance during training, we introduced a baseline computed from the reward of a sequence sampled by the current model.
Experimental results on the MS COCO dataset and comparisons with state-of-the-art methods demonstrated the effectiveness of our proposed method.

References

Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; and Chua, T.-S. 2017a. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR.
Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; and Chua, T.-S. 2017b. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR.
Chen, M.; Ding, G.; Zhao, S.; Chen, H.; Liu, Q.; and Han, J. 2017c. Reference based LSTM for image captioning. In AAAI.
Cho, K.; Van Merriënboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
Devlin, J.; Gupta, S.; Girshick, R.; Mitchell, M.; and Zitnick, C. L. 2015. Exploring nearest neighbor approaches for image captioning. arXiv preprint.
Elliott, D., and Keller, F. 2013. Image description using visual dependency representations. In EMNLP.
Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In CVPR.
Farhadi, A.; Hejrati, M.; Sadeghi, M. A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; and Forsyth, D. 2010. Every picture tells a story: Generating sentences from images. In ECCV.
Gong, Y.; Wang, L.; Hodosh, M.; Hockenmaier, J.; and Lazebnik, S. 2014. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8).
Jia, X.; Gavves, E.; Fernando, B.; and Tuytelaars, T. 2015. Guiding the long-short term memory model for image caption generation. In ICCV.
Jin, J.; Fu, K.; Cui, R.; Sha, F.; and Zhang, C. 2015. Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. arXiv preprint.
Kulkarni, G.; Premraj, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A. C.; and Berg, T. L. 2011. Baby talk: Understanding and generating simple image descriptions. In CVPR.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.
Liu, S.; Zhu, Z.; Ye, N.; Guadarrama, S.; and Murphy, K. 2016. Optimization of image description metrics using policy gradient methods. arXiv preprint.
Lu, J.; Xiong, C.; Parikh, D.; and Socher, R. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR.
Mao, J.; Xu, W.; Yang, Y.; Wang, J.; and Yuille, A. L. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR.
Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2015. Sequence level training with recurrent neural networks. In ICLR.
Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In CVPR.
Schwenk, H. 2012. Continuous space translation models for phrase-based statistical machine translation. In COLING.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1):9-44.
Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In CVPR.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4).
Yang, Z.; Yuan, Y.; Wu, Y.; Salakhutdinov, R.; and Cohen, W. W. 2016. Encode, review, and decode: Reviewer module for caption generation. In NIPS.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; and Mei, T. 2016. Boosting image captioning with attributes. arXiv preprint.
You, Q.; Jin, H.; Wang, Z.; Fang, C.; and Luo, J. 2016. Image captioning with semantic attention. In CVPR.
Zhao, S.; Yao, H.; Gao, Y.; Ji, R.; and Ding, G. 2017. Continuous probability distribution prediction of image emotions via multitask shared sparse regression. IEEE Transactions on Multimedia 19(3).


Lecture 5: Object Detection

Lecture 5: Object Detection Object Detection CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 5: Object Detection Bohyung Han Computer Vision Lab. bhhan@postech.ac.kr 2 Traditional Object Detection Algorithms Region-based

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Computer Vision Lecture 16 Deep Learning Applications 11.01.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar registration period starts

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

arxiv: v2 [cs.cv] 7 Aug 2017

arxiv: v2 [cs.cv] 7 Aug 2017 Dense Captioning with Joint Inference and Visual Context Linjie Yang Kevin Tang Jianchao Yang Li-Jia Li Snap Inc. {linjie.yang, kevin.tang, jianchao.yang}@snap.com lijiali@cs.stanford.edu arxiv:1611.06949v2

More information

On the Efficiency of Recurrent Neural Network Optimization Algorithms

On the Efficiency of Recurrent Neural Network Optimization Algorithms On the Efficiency of Recurrent Neural Network Optimization Algorithms Ben Krause, Liang Lu, Iain Murray, Steve Renals University of Edinburgh Department of Informatics s17005@sms.ed.ac.uk, llu@staffmail.ed.ac.uk,

More information

arxiv: v1 [cs.cv] 6 Jul 2016

arxiv: v1 [cs.cv] 6 Jul 2016 arxiv:607.079v [cs.cv] 6 Jul 206 Deep CORAL: Correlation Alignment for Deep Domain Adaptation Baochen Sun and Kate Saenko University of Massachusetts Lowell, Boston University Abstract. Deep neural networks

More information

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Efficient Segmentation-Aided Text Detection For Intelligent Robots Efficient Segmentation-Aided Text Detection For Intelligent Robots Junting Zhang, Yuewei Na, Siyang Li, C.-C. Jay Kuo University of Southern California Outline Problem Definition and Motivation Related

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Dense Captioning with Joint Inference and Visual Context

Dense Captioning with Joint Inference and Visual Context Dense Captioning with Joint Inference and Visual Context Linjie Yang Kevin Tang Jianchao Yang Li-Jia Li Snap Inc. Google Inc. {linjie.yang, kevin.tang, jianchao.yang}@snap.com lijiali@cs.stanford.edu Abstract

More information

arxiv: v1 [cs.cv] 31 Mar 2016

arxiv: v1 [cs.cv] 31 Mar 2016 Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arxiv:1603.09742v1 [cs.cv] 31 Mar 2016 University of Southern California Abstract.

More information

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN Background Related Work Architecture Experiment Mask R-CNN Background Related Work Architecture Experiment Background From left

More information

Content-Based Image Recovery

Content-Based Image Recovery Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

More information

arxiv:submit/ [cs.cv] 13 Jan 2018

arxiv:submit/ [cs.cv] 13 Jan 2018 Benchmark Visual Question Answer Models by using Focus Map Wenda Qiu Yueyang Xianzang Zhekai Zhang Shanghai Jiaotong University arxiv:submit/2130661 [cs.cv] 13 Jan 2018 Abstract Inferring and Executing

More information

Early Embedding and Late Reranking for Video Captioning

Early Embedding and Late Reranking for Video Captioning Early Embedding and Late Reranking for Video Captioning Jianfeng Dong Xirong Li Weiyu Lan Yujia Huo Cees G. M. Snoek College of Computer Science and Technology, Zhejiang University Key Lab of Data Engineering

More information

SocialML: machine learning for social media video creators

SocialML: machine learning for social media video creators SocialML: machine learning for social media video creators Tomasz Trzcinski a,b, Adam Bielski b, Pawel Cyrta b and Matthew Zak b a Warsaw University of Technology b Tooploox firstname.lastname@tooploox.com

More information

Reconstruction Network for Video Captioning

Reconstruction Network for Video Captioning Reconstruction Network for Video Captioning Bairui Wang Lin Ma Wei Zhang Wei Liu Tencent AI Lab School of Control Science and Engineering, Shandong University {bairuiwong,forest.linma}@gmail.com davidzhang@sdu.edu.cn

More information

Feature-Fused SSD: Fast Detection for Small Objects

Feature-Fused SSD: Fast Detection for Small Objects Feature-Fused SSD: Fast Detection for Small Objects Guimei Cao, Xuemei Xie, Wenzhe Yang, Quan Liao, Guangming Shi, Jinjian Wu School of Electronic Engineering, Xidian University, China xmxie@mail.xidian.edu.cn

More information

Instance-aware Image and Sentence Matching with Selective Multimodal LSTM

Instance-aware Image and Sentence Matching with Selective Multimodal LSTM Instance-aware Image and Sentence Matching with Selective Multimodal LSTM Yan Huang 1,3 Wei Wang 1,3 Liang Wang 1,2,3 1 Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory

More information

arxiv: v1 [cs.cv] 17 Nov 2016

arxiv: v1 [cs.cv] 17 Nov 2016 Multimodal Memory Modelling for Video Captioning arxiv:1611.05592v1 [cs.cv] 17 Nov 2016 Junbo Wang 1 Wei Wang 1 Yan Huang 1 Liang Wang 1,2 Tieniu Tan 1,2 1 Center for Research on Intelligent Perception

More information

Every Picture Tells a Story: Generating Sentences from Images

Every Picture Tells a Story: Generating Sentences from Images Every Picture Tells a Story: Generating Sentences from Images Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth University of Illinois

More information

RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation

RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation RUC-Tencent at ImageCLEF 2015: Concept Detection, Localization and Sentence Generation Xirong Li 1, Qin Jin 1, Shuai Liao 1, Junwei Liang 1, Xixi He 1, Yujia Huo 1, Weiyu Lan 1, Bin Xiao 2, Yanxiong Lu

More information

Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting

Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting Yaguang Li Joint work with Rose Yu, Cyrus Shahabi, Yan Liu Page 1 Introduction Traffic congesting is wasteful of time,

More information

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides Deep Learning in Visual Recognition Thanks Da Zhang for the slides Deep Learning is Everywhere 2 Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object

More information

arxiv: v1 [cs.cv] 20 Dec 2016

arxiv: v1 [cs.cv] 20 Dec 2016 End-to-End Pedestrian Collision Warning System based on a Convolutional Neural Network with Semantic Segmentation arxiv:1612.06558v1 [cs.cv] 20 Dec 2016 Heechul Jung heechul@dgist.ac.kr Min-Kook Choi mkchoi@dgist.ac.kr

More information

Recurrent Neural Networks and Transfer Learning for Action Recognition

Recurrent Neural Networks and Transfer Learning for Action Recognition Recurrent Neural Networks and Transfer Learning for Action Recognition Andrew Giel Stanford University agiel@stanford.edu Ryan Diaz Stanford University ryandiaz@stanford.edu Abstract We have taken on the

More information

Automated Neural Image Caption Generator for Visually Impaired People

Automated Neural Image Caption Generator for Visually Impaired People 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

UTS submission to Google YouTube-8M Challenge 2017

UTS submission to Google YouTube-8M Challenge 2017 UTS submission to Google YouTube-8M Challenge 2017 Linchao Zhu Yanbin Liu Yi Yang University of Technology Sydney {zhulinchao7, csyanbin, yee.i.yang}@gmail.com Abstract In this paper, we present our solution

More information

Language Models for Image Captioning: The Quirks and What Works

Language Models for Image Captioning: The Quirks and What Works Language Models for Image Captioning: The Quirks and What Works Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, Margaret Mitchell Microsoft Research Corresponding

More information

arxiv: v1 [cs.cv] 21 Dec 2016

arxiv: v1 [cs.cv] 21 Dec 2016 arxiv:1612.07360v1 [cs.cv] 21 Dec 2016 Top-down Visual Saliency Guided by Captions Vasili Ramanishka Boston University Abir Das Boston University Jianming Zhang Adobe Research Kate Saenko Boston University

More information

Class 5: Attributes and Semantic Features

Class 5: Attributes and Semantic Features Class 5: Attributes and Semantic Features Rogerio Feris, Feb 21, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Project

More information

Rewind to Track: Parallelized Apprenticeship Learning with Backward Tracklets

Rewind to Track: Parallelized Apprenticeship Learning with Backward Tracklets Rewind to Track: Parallelized Apprenticeship Learning with Backward Tracklets Jiang Liu 1,2, Jia Chen 2, De Cheng 2, Chenqiang Gao 1, Alexander G. Hauptmann 2 1 Chongqing University of Posts and Telecommunications

More information

Real-time Object Detection CS 229 Course Project

Real-time Object Detection CS 229 Course Project Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection

More information

arxiv: v1 [cs.cv] 16 Nov 2015

arxiv: v1 [cs.cv] 16 Nov 2015 Coarse-to-fine Face Alignment with Multi-Scale Local Patch Regression Zhiao Huang hza@megvii.com Erjin Zhou zej@megvii.com Zhimin Cao czm@megvii.com arxiv:1511.04901v1 [cs.cv] 16 Nov 2015 Abstract Facial

More information