Grounded Compositional Semantics for Finding and Describing Images with Sentences

R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, A. Y. Ng

Ali Gharaee (1), Alireza Keshavarzi (2)
(1) Department of Computational Linguistics, University of Tuebingen
(2) Department of Computer Science, University of Tuebingen
July 13, 2017

Outline

1. Introduction
2. Related Work
3. DT-RNN
   - Inputs
   - Forward Propagation
4. Learning Images
   - Setup Deep Neural Network
   - Training
5. Multimodal Mapping
6. Experiment
7. Conclusion

Introduction

- Single-word vector spaces can capture the meaning of individual words.
- But words rarely appear in isolation: compare "play" with "Two children are playing in a park".

- The paper introduces a model that learns to map sentences and images into a common embedding space, so that one can be retrieved from the other.
- The model for mapping sentences into this space is based on ideas from Recursive Neural Networks (RNNs): it computes compositional vector representations inside dependency trees.

- Sentence-to-image search: find images that show a scene such as "A man wearing a helmet jumps on his bike near a beach."
- Conversely, given a query image, we would like to find a description that goes beyond a single label by providing a correct sentence describing it, a task that has recently garnered a lot of attention.

Related Work

The presented model is connected to several areas of NLP and vision research:

1. Semantic vector spaces and their compositionality
   - Most compositionality algorithms and related datasets capture two-word compositions (Mitchell and Lapata, 2010).
2. Multimodal embeddings
   - Multimodal embedding methods project data from multiple sources, such as sound and video or images and text.
   - Recently, single-word vector embeddings have been used for zero-shot learning: mapping images to word vectors enabled the system of Socher et al. (2013c) to classify images as depicting objects such as "cat" without seeing any examples of this class.

3. Detailed image annotation
   - Early work in this area generated single words or fixed phrases from images.
   - Later work describes images with more detailed, longer textual descriptions (Yao et al., 2010).
   - The model of this paper is based on compositional sentence vector representations and does not require specific language generation techniques or sophisticated inference methods.
   - Since inference is a neural network computation, it is fast and simple.

DT-RNN: Inputs

How can we build representations for longer phrases from word vectors?

- Single word vectors: combined simply by averaging.
- Bag of words: good performance, but cannot distinguish important visual differences, for example:
    The car crashed into the bike.
    The bike crashed into the car.
- Constituency tree: very good, but encodes too much syntactic structure, for example:
    The child was hugged by its mother.
    The mother hugged her child.

Figure: Different constituency trees for the passive and active forms of a sentence.

- Dependency tree: focuses more on recognizing actions and agents.

Figure: Agent and action remain the same in the dependency tree.
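
The bag-of-words limitation from the comparison above can be reproduced with a toy example. This is only an illustrative sketch: the two-dimensional vectors in `toy_vectors` and the helper `average_vector` are invented here and are not taken from the paper. Averaging the word vectors gives the same representation for both "crash" sentences, because averaging ignores word order.

```python
# Toy 2-dimensional word vectors; the values are made up for illustration.
toy_vectors = {
    "the": [0.1, 0.0],
    "car": [0.9, 0.2],
    "crashed": [0.3, 0.8],
    "into": [0.0, 0.1],
    "bike": [0.2, 0.9],
}


def average_vector(sentence):
    """Bag-of-words representation: the mean of the word vectors."""
    words = sentence.lower().split()
    dim = len(next(iter(toy_vectors.values())))
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(toy_vectors[word]):
            total[k] += value
    return [t / len(words) for t in total]


a = average_vector("The car crashed into the bike")
b = average_vector("The bike crashed into the car")

# Compare with a small tolerance, since floating-point sums depend on order.
same = all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print(same)
```

Any order-insensitive combination (sum, mean, max) has the same property, which is the motivation for using tree structure.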

Word vectors

- A sentence or phrase has m words.
- Each word is represented by a d-dimensional feature vector (d = 50).
- The vectors come from a neural network trained to output high scores for window-document pairs that occur in a large unlabeled corpus and low scores for window-document pairs in which one word is replaced by a random word.

- The neural network is optimized with gradient descent.
- The derivatives are backpropagated into a word embedding matrix A, which stores the word vectors as columns.
- We use an embedding matrix X that contains the columns of A for the words in our sentences.
- The input sentence is then represented as an ordered list of (word, vector) pairs: $s = ((w_1, x_{w_1}), \dots, (w_m, x_{w_m}))$.
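
As a minimal sketch of this input representation, the snippet below builds the ordered list of (word, vector) pairs from a lookup table. The table `embedding` (three dimensions instead of d = 50) and the function `sentence_pairs` are hypothetical names used only for illustration.

```python
# Hypothetical 3-dimensional embedding table (the slides use d = 50).
embedding = {
    "two": [0.1, 0.2, 0.3],
    "children": [0.4, 0.5, 0.6],
    "play": [0.7, 0.8, 0.9],
}


def sentence_pairs(words, table):
    """Return s = ((w_1, x_w1), ..., (w_m, x_wm)) as a list of pairs."""
    return [(w, table[w]) for w in words]


s = sentence_pairs(["two", "children", "play"], embedding)
for word, vector in s:
    print(word, vector)
```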

Dependency tree

- A dependency parser is used to parse the sentence $s = (w_1, \dots, w_m)$.
- Define $d(s)$ as an ordered list of (child, parent) indices: $d(s) = \{(i, j)\}$, with $i = 1, \dots, m$ and $j \in \{1, \dots, m\} \cup \{0\}$, where the parent index 0 marks the root.
- Example: $d = \{(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)\}$.
- The final input is the pair $(s, d)$: the word vectors of the sentence and its dependency tree.
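
The (child, parent) list can be turned into a tree structure with a few lines of code. The sketch below uses the example list from the slide; the helper name `build_children` is an assumption made for illustration. It recovers the root (the word whose parent is 0) and the leaf nodes.

```python
from collections import defaultdict

# (child, parent) pairs from the slide's example; parent 0 marks the root.
d = [(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)]


def build_children(pairs):
    """Return the root index and a mapping parent -> sorted list of children."""
    children = defaultdict(list)
    root = None
    for child, parent in pairs:
        if parent == 0:
            root = child
        else:
            children[parent].append(child)
    return root, {p: sorted(c) for p, c in children.items()}


root, children = build_children(d)
# Leaves are the words that never appear as a parent.
leaves = [child for child, _ in d if child not in children]

print(root)
print(children)
print(leaves)
```

The leaves found this way are the nodes whose hidden vectors depend only on their own word vector, so they are computed first.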

DT-RNN: Forward Propagation

- Define a compositionality function for a node $c$: $h_c = g_\theta(x_c) = f(W_v x_c)$, with $W_v \in \mathbb{R}^{n \times d}$.
- Use tanh as the activation function $f$.
- In our example, first compute the leaf nodes ($c = 1, 3, 5$), for example $h_1 = g_\theta(x_1) = f(W_v x_1)$.
- The final sentence representation is $h_2$, but to obtain it we first need $h_4$.

- For $h_4$ we sum over its child nodes: $h_4 = g_\theta(x_4, h_5) = f(W_v x_4 + W_{r1} h_5)$, with $W_{r1} \in \mathbb{R}^{n \times n}$.
- In general, there are multiple matrices for composing with hidden child vectors: $W_r = (W_{r1}, \dots, W_{rk_r})$ and $W_l = (W_{l1}, \dots, W_{lk_l})$.
- $k$ is the maximum number of matrices needed in the training data, i.e. the largest number of left or right children of any node.
- What if a test sentence needs more hidden child matrices than $k$? Options: use the identity matrix, divide the sentence, or trim the sentence.
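
Which matrix a child uses depends on its position relative to its head. The sketch below derives labels such as `l1`, `r1`, `r2` for the children of a node. It assumes that children are numbered outward from the head word on each side; this convention matches the matrices used for the example tree in the slides, but the function `position_labels` itself is a hypothetical helper.

```python
def position_labels(head, children):
    """Map each child index to a label such as 'l1' or 'r2'.

    Assumption: children are numbered outward from the head word,
    so the nearest right child is 'r1', the next one 'r2', and so on.
    """
    left = sorted((c for c in children if c < head), reverse=True)
    right = sorted(c for c in children if c > head)
    labels = {c: f"l{k}" for k, c in enumerate(left, start=1)}
    labels.update({c: f"r{k}" for k, c in enumerate(right, start=1)})
    return labels


root_labels = position_labels(2, [1, 3, 4])
node4_labels = position_labels(4, [5])
print(root_labels)
print(node4_labels)
```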

- Now we can compute the root: $h_2 = g_\theta(x_2, h_1, h_3, h_4) = f(W_v x_2 + W_{l1} h_1 + W_{r1} h_3 + W_{r2} h_4)$.
- Note: this unnormalized sum behaves differently for short and long sentences.
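
The recursion for the example tree can be sketched in plain Python. All numbers below (the matrices `W_v` and `W`, the word vectors `x`, and the choice n = d = 2) are invented for illustration; only the structure of the computation follows the formulas above, and the normalization step is not included.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


# Toy parameters with n = d = 2; all values are invented for illustration.
W_v = [[0.5, 0.0], [0.0, 0.5]]
W = {
    "l1": [[0.1, 0.0], [0.0, 0.1]],
    "r1": [[0.2, 0.0], [0.0, 0.2]],
    "r2": [[0.3, 0.0], [0.0, 0.3]],
}

# Word vectors x_1..x_5 and the example tree: node -> [(child, position label)].
x = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0], 4: [0.5, 0.5], 5: [0.0, 0.5]}
tree = {2: [(1, "l1"), (3, "r1"), (4, "r2")], 4: [(5, "r1")]}


def hidden(node):
    """h_node = tanh(W_v x_node + sum over children of W_pos h_child)."""
    total = matvec(W_v, x[node])
    for child, pos in tree.get(node, []):
        total = add(total, matvec(W[pos], hidden(child)))
    return [math.tanh(t) for t in total]


h_root = hidden(2)  # sentence representation h_2
print(h_root)
```

The recursion visits the leaves 1, 3 and 5 first, then node 4, and finally the root, which mirrors the order described in the slides.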

DT-RNN: Normalization

- Normalize the hidden nodes:
  $h_i = f\left(\frac{1}{\ell(i)}\left(W_v x_i + \sum_{j \in C(i)} \ell(j)\, W_{\mathrm{pos}(i,j)}\, h_j\right)\right)$
- $\ell(i)$ is the number of leaf nodes (words) under node $i$, computed recursively as $\ell(i) = 1 + \sum_{j \in C(i)} \ell(j)$.
- $C(i)$, short for $C(i, y)$, is the set of child nodes of node $i$ in dependency tree $y$.
- $\mathrm{pos}(i, j)$ is the relative position of child $j$ with respect to node $i$, e.g. l1 or r3.
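
The recursive definition of the count l(i) is easy to check on the example tree. The sketch below uses the child lists of the running example; the function name `count_words` is chosen here for illustration.

```python
# Child lists of the example tree: node 2 is the root, node 5 hangs under node 4.
children = {2: [1, 3, 4], 4: [5]}


def count_words(node):
    """l(i) = 1 + sum of l(j) over the children j of node i."""
    return 1 + sum(count_words(c) for c in children.get(node, []))


counts = {i: count_words(i) for i in range(1, 6)}
print(counts)
```

For the root the count equals the sentence length m = 5, so the root's weighted sum is divided by 5, while each child's contribution is weighted by the number of words it covers.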

DT-RNN: Comparison to Constituency Tree RNN

Why does the constituency tree RNN (CT-RNN) not work as well?

- It uses a binary tree: each node has exactly two child nodes $(c_1, c_2)$.
- Its composition function is $h = f(W_{l1} c_1 + W_{r1} c_2)$, where $W = [W_{l1}\; W_{r1}] \in \mathbb{R}^{d \times 2d}$.
- The DT-RNN allows n-ary nodes in the tree.
- In the CT-RNN, whatever is merged last receives the largest weight, so the last-merged words are the most important.
- The CT-RNN captures the syntax of sentences more than the DT-RNN does.
- But to describe an image, we need agents and actions.
- Dependency tree structures push the central content words, such as the main action (verb) and its subject and object, to be merged last.
- The final sentence representation of the DT-RNN is therefore more robust to less important adjectival modifiers, word order changes, etc.

Learning Images: Setup Deep Neural Network

Data representation

- Two datasets:
  - 20 million random web images (unsupervised learning)
  - 14 million labeled images to classify into 22,000 categories (supervised learning)
- Input image: resized and rescaled to a fixed pixel size.

Layer architecture

Three stages of three layers each (9 layers in total):

- Filtering: the layer with learnable parameters.
- L2 pooling: takes the square of the filtering units, sums them over a small area of the image, and takes the square root.
- Local contrast normalization: takes inputs in a small area of the lower layer, subtracts the mean, and divides by the standard deviation.
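
The two parameter-free layers can be written down directly from their descriptions. The sketch below works on a flat list of values standing in for one small neighborhood; the real network applies these operations over 2-D image regions. The function names and the small constant `eps` (added to avoid division by zero) are assumptions of this sketch.

```python
import math


def l2_pool(values):
    """Square the inputs, sum them over the neighborhood, take the square root."""
    return math.sqrt(sum(v * v for v in values))


def local_contrast_normalize(values, eps=1e-8):
    """Subtract the neighborhood mean and divide by its standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]


pooled = l2_pool([3.0, 4.0])
normalized = local_contrast_normalize([1.0, 2.0, 3.0])
print(pooled)
print(normalized)
```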

Figure: The values of the first-layer filters after training.

Learning Images: Training

- Unsupervised objective: reconstruct the input while keeping the neurons sparse.
- Supervised objective: adjust the features in the entire network.
- A bottleneck layer with d = 4096 units is added between the last layer and the classifier to reduce the number of connections.
- For a given image, a feedforward computation yields the values of the bottleneck layer.
- This layer uses a linear activation.
- The features at the bottleneck layer serve as the feature vector z of an image in the multimodal space.

Some facts

- The total number of connections in this network is approximately 1.36 billion.
- How long does it take to train this network? 8 days on a large cluster of 169 machines, each with 16 CPU cores.
- How does this compare with the human visual cortex? The cortex is $10^6$ times larger in terms of the number of neurons and synapses.
- How long does it take a person to learn to recognize objects? The "training time" of an infant is between 3 and 5 months.

Multimodal Mapping

- The previous two sections described how to map sentences into a d = 50-dimensional space and how to extract high-quality image feature vectors of 4096 dimensions.
- We now define the final multimodal objective function for learning joint image-sentence representations with these models.
- The training set consists of N images with feature vectors $z_i$; each image has 5 sentence descriptions $s_{i1}, \dots, s_{i5}$, for which the DT-RNN computes vector representations.

Figure: Sentence length varies greatly, and different objects can be mentioned first. Hence, models have to be invariant to word ordering.

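- For training, we use a max-margin objective function. The ranking cost function to minimize is:

  $J(W_I, \theta) = \sum_{(i,j) \in P} \sum_{c \in S \setminus S(i)} \max(0, \Delta - v_i^T y_j + v_i^T y_c) + \sum_{(i,j) \in P} \sum_{c \in I \setminus I(j)} \max(0, \Delta - v_i^T y_j + v_c^T y_j)$

  Here $v_i = W_I z_i$ is the image vector mapped into the multimodal space, $y_j$ is the DT-RNN vector of sentence $j$, $P$ is the set of matching image-sentence pairs, $S(i)$ is the set of sentences describing image $i$, $I(j)$ is the set of images described by sentence $j$, and $\Delta$ is the margin.
- The objective function is very similar to the hinge loss of an SVM.
- The final objective also includes the regularization term $\lambda(\|\theta\|_2^2 + \|W_I\|_F)$ (the Frobenius norm is the matrix counterpart of the $\ell_2$ norm).
- A modified version of AdaGrad is used to optimize both $W_I$ and the DT-RNN, as well as the other baselines.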
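
A direct, unoptimized transcription of the max-margin ranking cost may help to see what each sum does. In this sketch the image vectors are assumed to be already mapped into the multimodal space, the margin is an explicit parameter, and the regularization term is left out. The function `ranking_cost` and the toy vectors are illustrative assumptions, not the authors' code.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def ranking_cost(image_vecs, sentence_vecs, pairs, margin):
    """Max-margin ranking cost over matching (image, sentence) index pairs."""
    matching = set(pairs)
    cost = 0.0
    for i, j in pairs:
        good = dot(image_vecs[i], sentence_vecs[j])
        # Contrast sentences: those that do not describe image i.
        for c in range(len(sentence_vecs)):
            if (i, c) not in matching:
                bad = dot(image_vecs[i], sentence_vecs[c])
                cost += max(0.0, margin - good + bad)
        # Contrast images: those that sentence j does not describe.
        for c in range(len(image_vecs)):
            if (c, j) not in matching:
                bad = dot(image_vecs[c], sentence_vecs[j])
                cost += max(0.0, margin - good + bad)
    return cost


images = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[2.0, 0.0], [0.0, 2.0]]
pairs = [(0, 0), (1, 1)]

cost_small_margin = ranking_cost(images, sentences, pairs, margin=1.0)
cost_large_margin = ranking_cost(images, sentences, pairs, margin=3.0)
print(cost_small_margin, cost_large_margin)
```

In the toy data each correct pair scores 2 and each incorrect pair scores 0, so a margin of 1 is already satisfied, while a margin of 3 leaves a violation of 1 in each of the four contrast terms.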

- An alternative objective function is based on the squared loss: $J(W_I, \theta) = \sum_{(i,j) \in P} \|v_i - y_j\|_2^2$.

Experiment

- The authors use the dataset of Rashtchian et al. (2010), which consists of 1000 images, each with 5 sentences (see Figure 1).
- Three experiments evaluate and compare the DT-RNN:
  1. Analyzing how well the sentence vectors capture similarity in visual meaning.
  2. Analyzing image search with query sentences.
  3. Describing images by finding suitable sentences.
- The data is split into 800 training, 100 development, and 100 test images. Since 5 sentences describe each image, there are 4000 training sentences and 500 test sentences.
- The dataset has 3020 unique words.

- For both DT-RNNs (DT-RNN and SDT-RNN), the weight matrices are initialized to block identity matrices plus Gaussian noise.
- Word vectors and hidden vectors both have length 50.
- λ = 0.08 and the AdaGrad learning rate were chosen using the development split.
- The best model uses a margin of Δ = 3.

Similarity of sentences describing the same image

- First, all 500 sentences from the test set are mapped into the multimodal space.
- Then, for each sentence, we find the nearest-neighbor sentences in terms of inner products.

Table 1: Left: comparison of methods for sentence similarity judgments; lower numbers are better. Center: comparison of methods for image search with query sentences; shown is the average rank of the single correct image that is being described. Right: average rank of a correct sentence description for a query image.

Image search with query sentences

- This experiment evaluates how well we can find images that display the visual meaning of a given sentence.
- A query sentence is first mapped into the vector space; images are then found in the same space using simple inner products.
- As shown in Table 1 (center), the new DT-RNN outperforms all other models.

Describing images by finding suitable sentences

- For an image, suitable textual descriptions are again found simply by looking for nearby sentence vectors in the multimodal embedding space.
- The average rank of 25.3 for a correct sentence description is out of 500 possible sentences.
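
The "average rank" metric used in these retrieval experiments can be sketched as follows: score every candidate by its inner product with the query, and record the position of the correct candidate in the sorted list. The helper names and the toy vectors are assumptions made for illustration; ties are resolved in favor of the correct item in this sketch.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_of_correct(query, candidates, correct_index):
    """1-based rank of the correct candidate when sorted by inner product."""
    scores = [dot(query, c) for c in candidates]
    target = scores[correct_index]
    return 1 + sum(1 for s in scores if s > target)


def mean_rank(queries, candidates, correct):
    """Average rank of the correct candidate over all queries."""
    ranks = [rank_of_correct(q, candidates, k) for q, k in zip(queries, correct)]
    return sum(ranks) / len(ranks)


candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[1.0, 0.1], [0.2, 1.0]]
avg = mean_rank(queries, candidates, correct=[0, 2])
print(avg)
```

The first query ranks its correct candidate first and the second query ranks it second, so the mean rank of this toy example is 1.5; lower values are better.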

97 Experiment Multimodal Mapping Figure: Images and their sentence descriptions assigned by the DT-RNN.


99 Experiment The main failure mode of the SDT-RNN occurs when one sentence describing an image does not use a verb while the other sentences for that image do. For example, the vectors of the following two sentences are very far apart even though both are supposed to describe the same image:
1. A blue and yellow airplane flying straight down while emitting white smoke
2. Airplane in dive position


101 Conclusion Our new model outperforms baselines and other commonly used models that can compute continuous vector representations for sentences. Compared with related models, the DT-RNN is more invariant and robust to surface changes such as word order.



More information

Encoding RNNs, 48 End of sentence (EOS) token, 207 Exploding gradient, 131 Exponential function, 42 Exponential Linear Unit (ELU), 44

Encoding RNNs, 48 End of sentence (EOS) token, 207 Exploding gradient, 131 Exponential function, 42 Exponential Linear Unit (ELU), 44 A Activation potential, 40 Annotated corpus add padding, 162 check versions, 158 create checkpoints, 164, 166 create input, 160 create train and validation datasets, 163 dropout, 163 DRUG-AE.rel file,

More information

Music Genre Classification

Music Genre Classification Music Genre Classification Matthew Creme, Charles Burlin, Raphael Lenain Stanford University December 15, 2016 Abstract What exactly is it that makes us, humans, able to tell apart two songs of different

More information

Natural Language Processing with Deep Learning CS224N/Ling284. Christopher Manning Lecture 4: Backpropagation and computation graphs

Natural Language Processing with Deep Learning CS224N/Ling284. Christopher Manning Lecture 4: Backpropagation and computation graphs Natural Language Processing with Deep Learning CS4N/Ling84 Christopher Manning Lecture 4: Backpropagation and computation graphs Lecture Plan Lecture 4: Backpropagation and computation graphs 1. Matrix

More information

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer

More information

arxiv: v1 [cs.cv] 6 Jul 2016

arxiv: v1 [cs.cv] 6 Jul 2016 arxiv:607.079v [cs.cv] 6 Jul 206 Deep CORAL: Correlation Alignment for Deep Domain Adaptation Baochen Sun and Kate Saenko University of Massachusetts Lowell, Boston University Abstract. Deep neural networks

More information

Apparel Classifier and Recommender using Deep Learning

Apparel Classifier and Recommender using Deep Learning Apparel Classifier and Recommender using Deep Learning Live Demo at: http://saurabhg.me/projects/tag-that-apparel Saurabh Gupta sag043@ucsd.edu Siddhartha Agarwal siagarwa@ucsd.edu Apoorve Dave a1dave@ucsd.edu

More information

Natural Language Processing with Deep Learning CS224N/Ling284

Natural Language Processing with Deep Learning CS224N/Ling284 Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 8: Recurrent Neural Networks Christopher Manning and Richard Socher Organization Extra project office hour today after lecture Overview

More information

Natural Language Processing. SoSe Question Answering

Natural Language Processing. SoSe Question Answering Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation

More information