Grounded Compositional Semantics for Finding and Describing Images with Sentences


Paper authors: R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, A. Y. Ng

Presenters: Ali Gharaee (1), Alireza Keshavarzi (2)
(1) Department of Computational Linguistics, University of Tuebingen
(2) Department of Computer Science, University of Tuebingen
July 13, 2017

Outline

1. Introduction
2. Related Work
3. DT-RNN
   - Inputs
   - Forward Propagation
4. Learning Images
   - Setup Deep Neural Network
   - Training
5. Multimodal Mapping
6. Experiment
7. Conclusion

Introduction

Single-word vector spaces can capture the meaning of individual words, but words rarely appear in isolation. Compare the single word "play" with the sentence "Two children are playing in a park."

The paper introduces a model that learns to map sentences and images into a common embedding space, so that one can be retrieved from the other. The model that maps sentences into this space builds on ideas from Recursive Neural Networks (RNNs): it computes compositional vector representations inside dependency trees.

Example query: find images that show the scene "A man wearing a helmet jumps on his bike near a beach."

Conversely, given a query image, we would like to find a description that goes beyond a single label by providing a correct sentence describing it, a task that has recently garnered a lot of attention.

Related Work

The presented model is connected to several areas of NLP and vision research:

1. Semantic vector spaces and their compositionality: most compositionality algorithms and related datasets capture two-word compositions (Mitchell and Lapata, 2010).
2. Multimodal embeddings: multimodal embedding methods project data from multiple sources, such as sound and video or images and text. Recently, single-word vector embeddings have been used for zero-shot learning (Socher et al., 2013c): mapping images to word vectors enabled that system to classify images as depicting objects such as "cat" without seeing any examples of this class.

3. Detailed image annotation: early work in this area generated single words or fixed phrases from images. Later work describes images with more detailed, longer textual descriptions (Yao et al., 2010). The model of this paper is based on compositional sentence vector representations and does not require specific language generation techniques or sophisticated inference methods. Because inference is a neural network forward pass, it is fast and simple.

Figure: The constituency trees differ for the passive and active forms of a sentence.

DT-RNN Inputs: Word Vector

How can we build representations for longer phrases?

- Single word vectors: simple, by averaging.
- Bag of words: good performance, but cannot distinguish important visual differences, e.g. "The car crashed into the bike" vs. "The bike crashed into the car".
- Constituency tree: very good, but captures too much syntactic structure, e.g. "The child was hugged by its mother" vs. "The mother hugged her child".
- Dependency tree: focuses more on recognizing actions and agents.

Figure: The agent and the action remain the same in the dependency tree.

Word vectors:

- A sentence or phrase has m words.
- Each word is represented by a d-dimensional feature vector (d = 50).
- The vectors are learned by a neural network that outputs high scores for windows and documents that occur in a large unlabeled corpus, and low scores for window-document pairs in which one word is replaced by a random word.

- The neural network is optimized with gradient descent. The derivatives are backpropagated into a word embedding matrix $A$, which stores the word vectors as columns.
- An embedding matrix $X$ holds the column vectors of $A$ for each word in our sentences.
- The input sentence is then represented as an ordered list of (word, vector) pairs:

$$s = ((w_1, x_{w_1}), \dots, (w_m, x_{w_m}))$$
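The (word, vector) representation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the random vectors stand in for pre-trained embeddings, and the five-word sentence and the helper names `build_embeddings` and `sentence_to_pairs` are assumptions made for this example.

```python
import random

def build_embeddings(vocab, d=50, seed=0):
    """Toy embedding table: word -> d-dimensional vector (random placeholders)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.1, 0.1) for _ in range(d)] for w in sorted(vocab)}

def sentence_to_pairs(words, embeddings):
    """Represent a sentence as an ordered list of (word, vector) pairs."""
    return [(w, embeddings[w]) for w in words]

# Hypothetical example sentence with m = 5 words
words = ["students", "ride", "bikes", "at", "night"]
emb = build_embeddings(set(words))
s = sentence_to_pairs(words, emb)
```

Each element of `s` pairs a word with its 50-dimensional vector, in sentence order.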

DT-RNN Inputs: Dependency Tree

A dependency parser parses the sentence $s = (w_1, \dots, w_m)$. Define $d(s)$ as an ordered list of (child, parent) indices:

$$d(s) = \{(i, j)\}, \quad i = 1, \dots, m, \quad j \in \{1, \dots, m\} \cup \{0\}$$

The root word has parent index 0. For the example sentence:

$$d = \{(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)\}$$

The final input is the pair $(s, d)$: the word vectors of the sentence together with its dependency tree.
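As a small sketch of how the (child, parent) list can be used, the following Python turns the example $d$ into a parent-to-children map and reads off the root and the leaf nodes. The helper name `children_of` is hypothetical; only the edge list comes from the slide.

```python
def children_of(d):
    """Map each parent index to its sorted list of child indices."""
    kids = {}
    for child, parent in d:
        kids.setdefault(parent, []).append(child)
    return {p: sorted(c) for p, c in kids.items()}

d = [(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)]
kids = children_of(d)

root = kids[0][0]                             # the word whose parent is 0
leaves = [c for c, _ in d if c not in kids]   # words that have no children
```

Here the root is word 2 and the leaves are words 1, 3, and 5, which matches the nodes used in the forward propagation example.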

DT-RNN: Forward Propagation

Define a compositionality function for a leaf node $c$:

$$h_c = g_\theta(x_c) = f(W_v x_c), \quad W_v \in \mathbb{R}^{n \times d}$$

The activation function $f$ is tanh. In our example, we first compute the leaf nodes ($c = 1, 3, 5$), e.g.

$$h_1 = g_\theta(x_1) = f(W_v x_1)$$

The final sentence representation is $h_2$, but to obtain it we first need to compute $h_4$.

For $h_4$ we sum over the word itself and its child node:

$$h_4 = g_\theta(x_4, h_5) = f(W_v x_4 + W_{r1} h_5), \quad W_{r1} \in \mathbb{R}^{n \times n}$$

In general, we have multiple matrices for composing with hidden child vectors, for children to the right and to the left:

$$W_r = (W_{r1}, \dots, W_{rk_r}), \quad W_l = (W_{l1}, \dots, W_{lk_l})$$

The number of matrices $k$ is the maximum number needed in the training data.

What if a test sentence needs more hidden child matrices than $k$? Options:

- Use the identity matrix
- Divide the sentence
- Trim the sentence

Now we can compute the root $h_2$:

$$h_2 = g_\theta(x_2, h_1, h_3, h_4) = f(W_v x_2 + W_{l1} h_1 + W_{r1} h_3 + W_{r2} h_4)$$

Note: this unnormalized sum gives different results for short sentences and long sentences!
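The recursion for $h_1, \dots, h_5$ can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the weights are random placeholders, the sizes are toy values (the slides use d = 50), and the helper `pos` assumes that children are numbered outward from the parent on each side.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    """Small random matrix; a placeholder for trained weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def pos(i, j, kids_i):
    """Relative position of child j with respect to parent i, e.g. 'l1' or 'r2'.
    Assumption: children are numbered outward from the parent on each side."""
    if j < i:
        left = [k for k in kids_i if k < i]
        return "l%d" % (len(left) - left.index(j))
    right = [k for k in kids_i if k > i]
    return "r%d" % (right.index(j) + 1)

def forward(i, x, kids, W_v, W_pos):
    """h_i = tanh(W_v x_i + sum over children j of W_pos(i,j) h_j)."""
    total = matvec(W_v, x[i])
    for j in kids.get(i, []):
        h_j = forward(j, x, kids, W_v, W_pos)
        contrib = matvec(W_pos[pos(i, j, kids[i])], h_j)
        total = [a + b for a, b in zip(total, contrib)]
    return [math.tanh(t) for t in total]

rng = random.Random(0)
d, n = 4, 3                            # toy sizes, chosen only for illustration
kids = {2: [1, 3, 4], 4: [5]}          # from d = {(1,2),(2,0),(3,2),(4,2),(5,4)}
x = {i: [rng.uniform(-1, 1) for _ in range(d)] for i in range(1, 6)}
W_v = rand_matrix(n, d, rng)
W_pos = {name: rand_matrix(n, n, rng) for name in ("l1", "r1", "r2")}
h_2 = forward(2, x, kids, W_v, W_pos)  # sentence vector at the root
```

For the example tree, `pos` assigns `l1` to child 1 and `r1`, `r2` to children 3 and 4 of the root, so the recursion reproduces the formula for $h_2$ term by term.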

DT-RNN: Forward Propagation: Normalization

Normalize the hidden nodes:

$$h_i = f\left(\frac{1}{\ell(i)}\left(W_v x_i + \sum_{j \in C(i)} \ell(j)\, W_{pos(i,j)}\, h_j\right)\right)$$

- $\ell(i)$ is the number of leaf nodes under node $i$; it can be computed as $\ell(i) = 1 + \sum_{j \in C(i)} \ell(j)$.
- $C(i, y)$ is the set of child nodes of node $i$ in dependency tree $y$ (written $C(i)$ above).
- $pos(i, j)$ is the relative position of child $j$ with respect to node $i$, e.g. $l1$ or $r3$.
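To make the normalization concrete, the sketch below computes $\ell(i)$ for the example tree and the resulting weight $\ell(j)/\ell(i)$ on each child term at the root. The function name `leaf_counts` is hypothetical, and the code only illustrates the counting, not the full normalized forward pass.

```python
def leaf_counts(kids, root):
    """l(i) = 1 + sum of l(j) over the children j of node i."""
    counts = {}

    def visit(i):
        counts[i] = 1 + sum(visit(j) for j in kids.get(i, []))
        return counts[i]

    visit(root)
    return counts

kids = {2: [1, 3, 4], 4: [5]}   # example tree, root is node 2
l = leaf_counts(kids, root=2)

# Weights applied inside f at the root: l(j)/l(2) per child, 1/l(2) for the word itself
child_weights = {j: l[j] / l[2] for j in kids[2]}
own_weight = 1 / l[2]
```

In this example $\ell(2) = 5$ and $\ell(4) = 2$, so the subtree under node 4 contributes twice the weight of a single-word child, and all weights at the root sum to 1.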

DT-RNN: Comparison to Constituency Tree RNN

Why does the constituency tree RNN (CT-RNN) not work as well?

- It is a binary tree: each node has only two child nodes $(c_1, c_2)$.
- Its composition function is $h = f(W_{l1} c_1 + W_{r1} c_2)$, with $W = [W_{l1}\; W_{r1}] \in \mathbb{R}^{d \times 2d}$.
- The DT-RNN allows n-ary nodes in the tree.
- In the CT-RNN, whatever is merged last gets a larger weight, so the last words are more important.
- The CT-RNN captures the syntax of sentences more than the DT-RNN does, but to describe an image we need agents and actions.
- Dependency tree structures push the central content words, such as the main action (verb) and its subject and object, to be merged last.
- The final sentence representation of the DT-RNN is therefore more robust to less important adjectival modifiers, word order changes, etc.
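The binary CT-RNN composition can be checked numerically: applying the block matrix $W \in \mathbb{R}^{d \times 2d}$ to the concatenation of $c_1$ and $c_2$ gives the same result as $W_{l1} c_1 + W_{r1} c_2$. The numbers below are made up for illustration, and tanh is assumed for $f$ as in the DT-RNN slides.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

d = 2
W_l1 = [[0.5, 0.0], [0.0, 0.5]]
W_r1 = [[0.1, 0.2], [0.3, 0.4]]
W = [row_l + row_r for row_l, row_r in zip(W_l1, W_r1)]  # d x 2d block matrix

c1, c2 = [1.0, -1.0], [0.5, 2.0]

# h = f(W_l1 c1 + W_r1 c2)
h_split = [math.tanh(a + b) for a, b in zip(matvec(W_l1, c1), matvec(W_r1, c2))]
# h = f(W [c1; c2]) with the children concatenated
h_concat = [math.tanh(t) for t in matvec(W, c1 + c2)]
```

Both forms yield the same parent vector, which shows why a binary node needs exactly one $d \times 2d$ matrix while an n-ary DT-RNN node needs one matrix per child position.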

Learning Images: Setup Deep Neural Network

Data representation

Two datasets:

- 20 million random web images (unsupervised learning)
- 14 million labeled images, classified into 22,000 categories (supervised learning)

Input image: resized and rescaled to a fixed size in pixels.

Layer architecture

3 layers in each of 3 stages (9 layers):

- Filtering: learnable parameters!
- L2 pooling: takes the square of the filtering units, sums them up over a small area of the image, and takes the square root.
- Local contrast normalization: takes inputs in a small area of the lower layer, subtracts the mean, and divides by the standard deviation.
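The two parameter-free layers can be illustrated on a tiny list of values. This is a simplified sketch of the descriptions above, not the network's actual layers: the flat list of inputs and the small constant `eps` (added to avoid division by zero) are assumptions of this example.

```python
import math

def l2_pool(units):
    """Square the filter responses in a small area, sum them, take the square root."""
    return math.sqrt(sum(u * u for u in units))

def local_contrast_normalize(patch, eps=1e-8):
    """Subtract the mean of a small area and divide by its standard deviation."""
    mean = sum(patch) / len(patch)
    var = sum((p - mean) ** 2 for p in patch) / len(patch)
    return [(p - mean) / (math.sqrt(var) + eps) for p in patch]

pooled = l2_pool([3.0, 4.0])
normed = local_contrast_normalize([1.0, 2.0, 3.0, 4.0])
```

After normalization the patch has zero mean and approximately unit variance, independent of the brightness or contrast of the original inputs.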

Figure: The values of the filters (in the first layer) after training.

Learning Images: Training

- Unsupervised objective: try to reconstruct the input while keeping the neurons sparse.
- Supervised objective: adjust the features in the entire network.
- A bottleneck layer (d = 4096) is added between the last layer and the classifier to reduce the number of connections.
- A feedforward computation gives the values of the bottleneck layer, which uses a linear activation.
- The features at the bottleneck layer serve as the feature vector $z$ of an image in the multimodal space.

Some Facts!

- In total, this network has approximately 1.36 billion connections.
- How long does it take to train this network? 8 days on a large cluster of 169 machines (each machine with 16 CPU cores).
- How about the number of connections in the human visual cortex? It is $10^6$ times larger in terms of the number of neurons and synapses.
- How long does it take a person to learn to recognize objects? The training time of an infant is between 3 and 5 months!

Multimodal Mapping

The previous two sections described how to map sentences into a d = 50-dimensional space and how to extract high-quality image feature vectors of 4096 dimensions. We now define the final multimodal objective function for learning joint image-sentence representations with these models.

The training set consists of N images with feature vectors $z_i$. Each image has 5 sentence descriptions $s_{i1}, \dots, s_{i5}$, for which we use the DT-RNN to compute vector representations.

Figure: Sentence length varies greatly and different objects can be mentioned first. Hence, models have to be invariant to word ordering.

For training, we use a max-margin objective function. The ranking cost function to minimize is:

$$J(W_I, \theta) = \sum_{(i,j) \in P} \sum_{c \in S \setminus S(i)} \max(0, \Delta - v_i^T y_j + v_i^T y_c) + \sum_{(i,j) \in P} \sum_{c \in I \setminus I(j)} \max(0, \Delta - v_i^T y_j + v_c^T y_j)$$

Here $v_i = W_I z_i$ is the image vector mapped into the multimodal space, $y_j$ is the DT-RNN vector of sentence $j$, $P$ is the set of correct (image, sentence) pairs, and $\Delta$ is the margin.

- The objective function is very similar to the hinge loss function in SVMs.
- The final objective also includes the regularization term $\lambda\left(\|\theta\|_2^2 + \|W_I\|_F\right)$ (the Frobenius norm is the $\ell_2$ norm of the matrix entries).
- A modified version of AdaGrad is used to optimize both $W_I$ and the DT-RNN, as well as the other baselines.
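The ranking cost can be written out directly as two nested sums. The sketch below is a plain-Python illustration under stated assumptions: `V` holds image vectors already mapped into the multimodal space, `Y` holds sentence vectors, the function name `ranking_cost` and the toy data are hypothetical, and the regularization term is omitted.

```python
def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def ranking_cost(V, Y, pairs, margin=3.0):
    """Max-margin ranking cost over correct (image i, sentence j) pairs."""
    sents_of, images_of = {}, {}
    for i, j in pairs:
        sents_of.setdefault(i, set()).add(j)
        images_of.setdefault(j, set()).add(i)
    cost = 0.0
    for i, j in pairs:
        s_ij = dot(V[i], Y[j])
        for c in range(len(Y)):      # contrastive sentences for image i
            if c not in sents_of[i]:
                cost += max(0.0, margin - s_ij + dot(V[i], Y[c]))
        for c in range(len(V)):      # contrastive images for sentence j
            if c not in images_of[j]:
                cost += max(0.0, margin - s_ij + dot(V[c], Y[j]))
    return cost

V = [[4.0, 0.0], [0.0, 4.0]]         # two toy image vectors
Y = [[1.0, 0.0], [0.0, 1.0]]         # two toy sentence vectors
pairs = [(0, 0), (1, 1)]
cost_ok = ranking_cost(V, Y, pairs, margin=3.0)
cost_violated = ranking_cost(V, Y, pairs, margin=5.0)
```

With a margin of 3 every correct pair already beats its contrastive pairs by more than the margin, so the cost is zero; raising the margin to 5 makes each of the four contrastive terms contribute 1.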

An alternative objective function is based on the squared loss:

$$J(W_I, \theta) = \sum_{(i,j) \in P} \|v_i - y_j\|_2^2$$
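For comparison, the squared loss is a single sum over the correct pairs. The following short Python sketch uses hypothetical names and toy data; note that, unlike the ranking cost, it contains no contrastive term.

```python
def squared_loss(V, Y, pairs):
    """Sum of squared Euclidean distances between correct (image, sentence) pairs."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(V[i], Y[j]))
        for i, j in pairs
    )

V = [[1.0, 2.0], [0.0, 0.0]]   # toy image vectors in the multimodal space
Y = [[0.0, 0.0], [0.0, 3.0]]   # toy sentence vectors
loss = squared_loss(V, Y, [(0, 0), (1, 1)])
```

The loss only pulls the vectors of correct pairs together; it does not push incorrect pairs apart.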

Experiment

The authors use the dataset of Rashtchian et al. (2010), which consists of 1000 images, each with 5 sentences (Figure 1).

Three experiments evaluate and compare the DT-RNN:

1. Analyzing how well the sentence vectors capture similarity in visual meaning.
2. Analyzing image search with query sentences.
3. Describing images by finding suitable sentences.

In the experiments, the data is split into 800 training, 100 development, and 100 test images. Since 5 sentences describe each image, there are 4000 training sentences and 500 test sentences. The dataset has 3020 unique words.

Setup:

- For both DT-RNNs, the weight matrices are initialized to block identity matrices plus Gaussian noise.
- Length of word vectors and hidden vectors: 50.
- $\lambda = 0.08$ and the learning rate of AdaGrad were selected using the development split.
- The best model uses a margin of $\Delta = 3$.

Similarity of Sentences Describing the Same Image

First, all 500 sentences from the test set are mapped into the multimodal space. Then, for each sentence, we find the nearest-neighbor sentences in terms of inner products.
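The nearest-neighbor search by inner product can be sketched as follows. The toy 2-dimensional vectors and the function name `nearest_neighbors` are assumptions of this example; in the experiment the vectors would be the 500 test sentence vectors in the multimodal space.

```python
def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def nearest_neighbors(query_idx, vectors, k=2):
    """Indices of the k vectors with the largest inner product to the query."""
    scored = [
        (dot(vectors[query_idx], v), idx)
        for idx, v in enumerate(vectors)
        if idx != query_idx
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest score first, index breaks ties
    return [idx for _, idx in scored[:k]]

sentence_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = nearest_neighbors(0, sentence_vectors, k=2)
```

For sentence 0 the closest neighbor is sentence 1, whose vector points in almost the same direction.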

Figure: Left: Comparison of methods for sentence similarity judgments. Lower numbers are better. Center: Comparison of methods for image search with query sentences. Shown is the average rank of the single correct image that is being described. Right: Average rank of a correct sentence description for a query image.

Image Search with Query Sentences

This experiment evaluates how well we can find images that display the visual meaning of a given sentence. A query sentence is first mapped into the vector space, and images are then retrieved in the same space using simple inner products. As shown in Table 1 (center), the new DT-RNN outperforms all other models.

Describing Images by Finding Suitable Sentences

For an image, suitable textual descriptions are again found simply by searching for nearby sentence vectors in the multimodal embedding space. The average rank of 25.3 for a correct sentence description is out of 500 possible sentences.
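The average-rank metric used in both retrieval experiments can be sketched as below. This is a simplified illustration: each query is assumed to have exactly one correct candidate, ties are resolved in favor of the correct candidate, and all names and vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_of_correct(query, candidates, correct_idx):
    """1-based rank of the correct candidate when sorted by inner product, highest first."""
    scores = [dot(query, c) for c in candidates]
    return 1 + sum(1 for s in scores if s > scores[correct_idx])

def mean_rank(queries, candidates, correct):
    """Average rank of the correct candidate over all queries."""
    ranks = [rank_of_correct(q, candidates, correct[k]) for k, q in enumerate(queries)]
    return sum(ranks) / len(ranks)

images = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # toy image vectors
queries = [[1.0, 0.0], [1.0, 0.4]]              # toy sentence vectors
correct = [0, 2]                                # index of the correct image per query
avg = mean_rank(queries, images, correct)
```

The first query ranks its image first and the second ranks its image second, so the mean rank is 1.5; lower is better, and the same function applies to sentence retrieval by swapping the roles of images and sentences.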

Figure: Images and their sentence descriptions assigned by the DT-RNN.

The main failure mode of the SDT-RNN occurs when a sentence that should describe the same image does not use a verb, but the other sentences of that image do include a verb. For example, the following sentence pair has vectors that are very far apart from each other even though they are supposed to describe the same image:

1. A blue and yellow airplane flying straight down while emitting white smoke
2. Airplane in dive position

Conclusion

Our new model outperforms baselines and other commonly used models that can compute continuous vector representations for sentences. In comparison to related models, the DT-RNN is more invariant and robust to surface changes such as word order.

Appendix: For Further Reading

- Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge.
- Chris Shallue. A TensorFlow implementation of the image-to-text model described above.
- Andrej Karpathy, Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions.
