arXiv: v2 [cs.CV] 14 Apr 2014


Deep Convolutional Ranking for Multilabel Image Annotation

Yunchao Gong (UNC Chapel Hill), Alexander Toshev (Google Research), Yangqing Jia (Google Research), Thomas K. Leung (Google Research), Sergey Ioffe (Google Research)

Abstract

Multilabel image annotation is one of the most important challenges in computer vision, with many real-world applications. While existing work usually uses conventional visual features for multilabel annotation, features based on deep neural networks have shown potential to significantly boost performance. In this work, we propose to leverage the advantage of such features and analyze key components that lead to better performance. Specifically, we show that a significant performance gain can be obtained by combining convolutional architectures with approximate top-k ranking objectives, as they naturally fit the multilabel tagging problem. In our experiments on the NUS-WIDE dataset, this approach outperforms conventional visual features by about 10%, obtaining the best reported performance in the literature.

1 Introduction

Multilabel image annotation [25, 14] is an important and challenging problem in computer vision. Most existing work focuses on single-label classification problems [6, 21], where each image is assumed to have only one class label. However, this is not necessarily true for real-world applications, as an image may be associated with multiple semantic tags (Figure 1). As a practical example, images from Flickr are usually accompanied by several tags that describe the content of the image, such as objects, activities, and scene descriptions. Images on the Internet, in general, are usually associated with sentences or descriptions instead of a single class label, which may be regarded as a type of multi-tagging. Therefore, accurately assigning multiple labels to one image is a practical and important problem.

Single-label image classification has been extensively studied in the vision community, with the most recent advances reported on the large-scale ImageNet database [6].
Most existing work focuses on designing visual features for improving recognition accuracy. For example, sparse coding [33, 36], Fisher vectors [28], and VLAD [18] have been proposed to reduce the quantization error of bag-of-words-type features. Spatial pyramid matching [21] has been developed to encode spatial information for recognition. Very recently, deep convolutional neural networks (CNNs) have demonstrated promising results for single-label image classification [20]. Such algorithms have all focused on learning a better feature representation for one-vs-rest classification problems, and it is not yet clear how best to train an architecture for multilabel annotation problems.

In this work, we are interested in leveraging the highly expressive convolutional network for the problem of multilabel image annotation. We employed a network structure similar to that of [20], which contains several convolutional and densely connected layers, as the basic architecture. We studied


and compared several other popular multilabel losses, such as the ranking loss [19] that optimizes the area under the ROC curve (AUC), and the cross-entropy loss used in Tagprop [14]. Specifically, we propose to use the top-k ranking loss, inspired by its use for embedding in [34], to train the network. Using the largest publicly available multilabel dataset, NUS-WIDE [4], we observe a significant performance boost over conventional features, reporting the best retrieval performance on the benchmark dataset in the literature.

Figure 1: Sample images from the NUS-WIDE dataset, where each image is annotated with several tags. Example tag sets: (green, flower sun, flowers, zoo, day, sunny, sunshine); (london, traffic, raw); (art, girl, woman, wow, dance, jump, dancing).

1.1 Previous Work

In this section we first review related work on multilabel image annotation and then briefly discuss work on deep convolutional networks.

Modeling Internet images and their corresponding textual information (e.g., sentences, tags) has been of great interest in the vision community [2, 10, 12, 15, 30, 29, 34]. In this work, we focus on the image annotation problem and summarize several important lines of related research. Early work in this area was mostly devoted to annotation models inspired by machine translation techniques [1, 8]. The work by Barnard et al. [1, 8] applied machine translation methods to parse natural images and tried to establish a relationship between image regions and words.

More recently, image annotation has been formulated as a classification problem. Early works focused on generative-model-based tagging [1, 3, 26], centred upon the idea of learning a parametric model to perform predictions. However, because image annotation is a highly nonlinear problem, a parametric model might not be sufficient to capture the complex distribution of the data. Several recent works on image tagging have mostly focused on nonparametric nearest-neighbor methods, which offer higher expressive power. The work by Makadia et al.
[25], which proposed a simple nearest-neighbor-based tag transfer approach, achieved significant improvement over previous model-based methods. Recent improvements on the nonparametric approach include Tagprop [14], which learns a discriminative metric for nearest neighbors to improve tagging.

Convolutional neural networks (CNNs) [20, 22, 23, 17, 5] are a special type of neural network that utilizes specific network structures, such as convolutions and spatial pooling, and have exhibited good generalization power in image-related applications. Combined with recent techniques such as Dropout and fast parallel training, CNN models have outperformed existing handcrafted features. Krizhevsky et al. [20] reported record-breaking results on ILSVRC 2012, which contains 1000 visual-object categories. However, this study was mostly concerned with single-label image classification, and the images in the dataset only contain one prominent object class. At a finer scale, several methods focus on improving specific network designs. Notably, Zeiler et al. [37] investigated different pooling methods for training CNNs, and several different regularization methods, such as Dropout [16], DropConnect [32], and Maxout [13], have been proposed to improve the robustness and representation power of the networks. In addition, earlier studies [7] have shown that CNN features are suitable as a general feature for various tasks under conventional classification schemes. Our work instead focuses on how to directly train a deep network from raw pixels, using a multilabel ranking loss, to address the multilabel annotation problem.

2 Multilabel Deep Convolutional Ranking Net

In our approach for multilabel image annotation, we adopted the architecture proposed in [20] as our basic framework and mainly focused on training the network with loss functions tailored for multilabel prediction tasks.

2.1 Network Architecture

The basic architecture of the network we use is similar to the one used in [20]. We use five convolutional layers and three densely connected layers. Before feeding the images to the convolutional layers, each image is resized to [...]. Next, patches are extracted from the whole image, at the center and the four corners, to provide an augmentation of the dataset. Convolution filter sizes are set to squares of size 11, 9, and 5 respectively for the different convolutional layers, and max pooling layers are used in some of the convolutional layers to introduce invariance. Each densely connected layer has output sizes of [...]. Dropout layers follow each of the densely connected layers with a dropout ratio of 0.6. For all the layers, we used rectified linear units (ReLU) as our nonlinear activation function.

The optimization of the whole network is achieved by asynchronous stochastic gradient descent with a momentum term of weight 0.9 and a mini-batch size of 32. The global learning rate for the whole network is set to [...] at the beginning, and a staircase weight decay is applied after a few epochs. The same optimization parameters and procedure are applied to all the different methods. For our dataset with 150,000 training images, it usually takes one day to obtain a good model by training on a cluster. Unlike previous work that usually used ImageNet to pre-train the network, we train the whole network directly from the training images of the NUS-WIDE dataset for a fair comparison with conventional vision baselines.

2.2 Multilabel Ranking Losses

We mainly focused on the loss layer, which specifies how the network training penalizes the deviation between the predicted and true labels, and investigated several different multilabel loss functions for training the network. The first loss function was inspired by Tagprop [14], for which we minimized the multilabel softmax regression loss. The second loss was a simple modification of a pairwise-ranking loss [19], which takes multiple labels into account.
The third loss function was a multilabel variant of the WARP loss [34], which uses a sampling trick to optimize top-k annotation accuracy.

For notation, assume that we have a set of images $x$ and that we denote the convolutional network by $f(\cdot)$, where the convolutional layers and densely connected layers filter the images. The output of $f(\cdot)$ is a scoring function of the data point $x$ that produces a vector of activations. We assume there are $n$ training images and $c$ tags.

Softmax

The softmax loss has been used for multilabel annotation in Tagprop [14], and is also used in single-label image classification [20]; therefore, we adopted it in our context. The posterior probability of an image $x_i$ and class $j$ can be expressed as

$$p_{ij} = \frac{\exp(f_j(x_i))}{\sum_{k=1}^{c} \exp(f_k(x_i))}, \tag{1}$$

where $f_j(x_i)$ means the activation value for image $x_i$ and class $j$. We then minimized the KL-divergence between the predictions and the ground-truth probabilities. Assuming that each image has multiple labels, and that we can form a label vector $y \in \mathbb{R}^{1 \times c}$ where $y_j = 1$ means the presence of a label and $y_j = 0$ means the absence of a label for an image, we can obtain the ground-truth probability by normalizing $y$ as $y / \|y\|_1$. If the ground-truth probability for image $i$ and class $j$ is defined as $\bar{p}_{ij}$, the cost function to be minimized is

$$J = -\frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{c} \bar{p}_{ij} \log(p_{ij}) = -\frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{c_+} \frac{1}{c_+} \log(p_{ij}),$$

where $c_+$ denotes the number of positive labels for each image. For ease of exposition and without loss of generality, we set $c_+$ to be the same for all images.

Pairwise Ranking

The second loss function we considered was the pairwise-ranking loss [19], which directly models the annotation problem. In particular, we wanted to rank the positive labels to always have higher

scores than negative labels, which led to the following minimization problem:

$$J = \sum_{i=1}^{n} \sum_{j=1}^{c_+} \sum_{k=1}^{c_-} \max(0,\, 1 - f_j(x_i) + f_k(x_i)), \tag{2}$$

where $c_+$ is the number of positive labels and $c_-$ is the number of negative labels. During back-propagation, we computed the sub-gradient of this loss function. One limitation of this loss is that it optimizes the area under the ROC curve (AUC) but does not directly optimize the top-k annotation accuracy. Because for image annotation problems we were mostly interested in top-k annotations, this pairwise ranking loss did not best fit our purpose.

Weighted Approximate Ranking (WARP)

The third loss we considered was the weighted approximate ranking (WARP) loss, which was first described in [34]. It specifically optimizes the top-k accuracy for annotation by using a stochastic sampling approach. Such an approach fits the stochastic optimization framework of the deep architecture very well. It minimizes

$$J = \sum_{i=1}^{n} \sum_{j=1}^{c_+} \sum_{k=1}^{c_-} L(r_j) \max(0,\, 1 - f_j(x_i) + f_k(x_i)), \tag{3}$$

where $L(\cdot)$ is a weighting function for different ranks, and $r_j$ is the rank of the $j$th class for image $i$. The weighting function $L(\cdot)$ used in our work is defined as

$$L(r) = \sum_{j=1}^{r} \alpha_j, \quad \text{with } \alpha_1 \ge \alpha_2 \ge \dots \ge 0. \tag{4}$$

We defined $\alpha_j$ as equal to $1/j$, which is the same as in [34]. The weights defined by $L(\cdot)$ control the top-k emphasis of the optimization. In particular, if a positive label is ranked at the top of the label list, then $L(\cdot)$ assigns a small weight to the loss, so the label contributes little to it. However, if a positive label is not ranked at the top, $L(\cdot)$ assigns a much larger weight to the loss, which pushes the positive label to the top.

The last question was how to estimate the rank $r_j$. We followed the sampling method in [34]: for a positive label, we continued to sample negative labels until we found a violation; then we recorded the number of trials $s$ we sampled for negative labels. The rank was estimated by the following formulation:

$$r_j = \left\lfloor \frac{c - 1}{s} \right\rfloor, \tag{5}$$

for $c$ classes and $s$ sampling trials. We computed the sub-gradient for this layer during optimization.
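The three losses above can be sketched for a single image in plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, scores are passed as a list of activations with positive labels given as indices, the WARP variant uses one sampled violator per positive label (as in the sampling procedure described above), and negatives are sampled without replacement so that the loop always terminates.

```python
import math
import random


def softmax_multilabel_loss(scores, positives):
    """Cross-entropy against the normalized label vector y / ||y||_1 (one image)."""
    peak = max(scores)  # subtract the max before exponentiating, for stability
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return -sum(scores[j] - log_z for j in positives) / len(positives)


def pairwise_ranking_loss(scores, positives):
    """Hinge loss summed over every (positive, negative) label pair, as in Eq. (2)."""
    negatives = [k for k in range(len(scores)) if k not in positives]
    return sum(max(0.0, 1.0 - scores[j] + scores[k])
               for j in positives for k in negatives)


def rank_weight(rank):
    """L(r) = sum_{j=1}^{r} 1/j, i.e. Eq. (4) with alpha_j = 1/j."""
    return sum(1.0 / j for j in range(1, rank + 1))


def warp_loss(scores, positives, rng):
    """Sampled WARP loss for one image, following Eqs. (3)-(5).

    Assumption: negatives are drawn without replacement, and only the first
    sampled violator contributes for each positive label.
    """
    c = len(scores)
    negatives = [k for k in range(c) if k not in positives]
    total = 0.0
    for j in positives:
        shuffled = rng.sample(negatives, len(negatives))
        for trials, k in enumerate(shuffled, start=1):
            violation = 1.0 - scores[j] + scores[k]
            if violation > 0.0:
                rank = (c - 1) // trials  # Eq. (5): floor((c - 1) / s)
                total += rank_weight(rank) * violation
                break
    return total


example_scores = [1.0, 0.5, 0.2, -1.0]
example_positives = [0, 2]
print(softmax_multilabel_loss(example_scores, example_positives))
print(pairwise_ranking_loss(example_scores, example_positives))
print(warp_loss(example_scores, example_positives, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampled WARP value reproducible. A violator found after few trials yields a large estimated rank and therefore a large weight, which is the mechanism that pushes poorly ranked positive labels toward the top.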
As a minor note, the approximate objective we optimize is a looser upper bound compared to the original WARP loss proposed in [34]. To see this, notice that in the original paper, it is assumed that the probability of sampling a violator is $p = \frac{r}{\#Y - 1}$ for a positive example $(x, y)$ with rank $r$, where $\#Y$ is the number of labels. Thus, there are $r$ labels with higher scores than $y$. This is true only if all these $r$ labels are negative. However, in our case, since there may be other positive labels having higher scores than $y$ due to the multilabel nature of the problem, $p$ and $\frac{r}{\#Y - 1}$ are related only by an inequality rather than by this equality.

3 Visual Feature based Image Annotation Baselines

We used a set of 9 different visual features and combined them to serve as our baseline features. Although such a set of features might not be the best possible one we could obtain, it already serves as a very strong visual representation, and the computation of such features is nontrivial. On top of these features, we ran two simple but powerful classifiers (kNN and SVM) for image annotation. We also experimented with Tagprop [14], but found that it cannot easily scale to a large training set because of its $O(n^2)$ time complexity. After using a small training set to train the Tagprop model, we found the performance to be unsatisfactory and therefore do not compare it here.

3.1 Visual Features

GIST [27]: We resized each image to [...] and used three different scales [8, 8, 4] to filter each RGB channel, resulting in 960-dimensional (320 × 3) GIST feature vectors.

SIFT: We used two different sampling methods and three different local descriptors to extract texture features, which gave us a total of 6 different features. We used dense sampling and a Harris corner detector as our patch-sampling methods. For local descriptors, we extracted SIFT [24], CSIFT [31], and RGBSIFT [31], and formed a codebook of size 1000 using k-means clustering; we then built a two-level spatial pyramid [21] that resulted in a 5000-dimensional vector for each image. We will refer to these six features as D-SIFT, D-CSIFT, D-RGBSIFT, H-SIFT, H-CSIFT, and H-RGBSIFT.

HOG: To represent texture information at a larger scale, we used 2 × 2 overlapping HOG as described in [35]. We quantized the HOG features to a codebook of size 1000 and used the same spatial pyramid scheme as above, which resulted in 5000-dimensional feature vectors.

Color: We used a joint RGB color histogram of 8 bins per dimension, for a 512-dimensional feature.

The same set of features was used in [11] and achieved state-of-the-art performance for image retrieval and annotation. The combination of this set of features has a total dimensionality of 36,472 (960 + 6 × 5000 + 5000 + 512), which makes learning very expensive. We followed [11] in performing simple dimensionality reductions to reduce computation. In particular, we performed a kernel PCA (KPCA) separately on each feature to reduce its dimensionality to 500. Then we concatenated all of the feature vectors to form a 4500-dimensional global image feature vector and ran the different learning algorithms on it.

3.2 Visual feature + kNN

The simplest baseline, which remains very powerful, involves directly applying a weighted kNN on the visual feature vectors. kNN is a very strong baseline for image annotation, as suggested by Makadia et al.
[25], mainly because multilabel image annotation is a highly nonlinear problem and handling the heavy-tailed label distribution is usually very difficult. By contrast, kNN is a highly nonlinear and adaptive algorithm that better handles rare tags. For each test image $i$, we found its $k$ nearest neighbors in the training set and computed the posterior probability $p(c \mid i)$ of tag $c$ as

$$p(c \mid i) = \frac{1}{k} \sum_{j=1}^{k} \exp\left(-\frac{\|x_i - x_j\|_2^2}{\sigma}\right) y_{jc}, \tag{6}$$

where $y_{jc}$ indexes the labels of the training data: $y_{jc} = 1$ when training image $j$ carries tag $c$, and $y_{jc} = 0$ when it does not. $\sigma$ is the bandwidth, which needs to be tuned. After obtaining the prediction probabilities for each image, we sorted the scores and annotated each test image with the top-k tags.

3.3 Visual feature + SVM

Another way to perform image annotation is to treat each tag separately and to train different one-vs-all classifiers. We trained a linear SVM [9] for each tag and used the outputs of the different SVMs to rank the tags. Because we had already applied a nonlinear mapping to the data during the KPCA stage, we found a linear SVM to be sufficient. Thus we assigned the top-k tags to each image based on the ranking of the output scores of the SVMs.

4 Experiments

4.1 Dataset

We performed experiments on the largest publicly available multilabel dataset, NUS-WIDE [4]. This dataset contains 269,648 images downloaded from Flickr that have been manually annotated, with several tags (2-5 on average) per image. After ignoring the small subset of images that are not annotated with any tag, we had a total of 209,347 images for training and testing. We used a subset of 150,000 images for training and the rest of the images for testing. The tag dictionary for the images contains 81 different tags. Some sample images and annotations are shown in Figure 1.
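As a concrete illustration of the weighted kNN baseline of Section 3.2 (Eq. 6), the scoring step can be sketched in plain Python. This is a minimal sketch under assumptions, not the authors' code: the function names are hypothetical, features are plain lists of floats, labels are 0/1 lists with one entry per tag, and the defaults k = 50 and σ = 1 are the values reported in Section 4.3. When fewer than k training images exist, the sketch averages over the neighbors actually found.

```python
import math


def knn_tag_scores(query, train_features, train_labels, k=50, sigma=1.0):
    """Score every tag for `query` from its k nearest training images (Eq. 6)."""
    # Squared Euclidean distance from the query to each training image.
    dists = [sum((q - t) ** 2 for q, t in zip(query, feat))
             for feat in train_features]
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    num_tags = len(train_labels[0])
    scores = [0.0] * num_tags
    for j in nearest:
        weight = math.exp(-dists[j] / sigma)  # closer neighbors count more
        for c in range(num_tags):
            scores[c] += weight * train_labels[j][c] / len(nearest)
    return scores


def top_k_tags(scores, k):
    """Indices of the k highest-scoring tags, best first."""
    return sorted(range(len(scores)), key=lambda c: -scores[c])[:k]


features = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
labels = [[1, 0], [0, 1], [0, 1]]
tag_scores = knn_tag_scores([0.0, 0.0], features, labels, k=2, sigma=1.0)
print(tag_scores, top_k_tags(tag_scores, 1))
```

Because each neighbor's vote decays exponentially with distance, a rare tag carried by one very close neighbor can outrank a frequent tag carried by several distant ones, which is consistent with the text's remark that kNN handles rare tags well.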

Table 1: Image annotation results on NUS-WIDE with k = 3 annotated tags per image. See the text in Section 4.4 for the definition of "Upper bound". Columns: per-class recall, per-class precision, overall recall, overall precision, N+. Rows: Upper bound, Visual Feature + kNN, Visual Feature + SVM, CNN + Softmax, CNN + Ranking, CNN + WARP.

Table 2: Image annotation results on NUS-WIDE with k = 5 annotated tags per image. See the text in Section 4.4 for the definition of "Upper bound". Columns and rows are the same as in Table 1.

4.2 Evaluation Protocols

We followed previous research [25] in our use of the following protocols to evaluate different methods. For each image, we assigned the k (e.g., k = 3, 5) highest-ranked tags to the image and compared the assigned tags to the ground-truth tags. We computed the recall and precision for each tag separately, and then report the mean per-class recall and mean per-class precision:

$$\text{per-class recall} = \frac{1}{c} \sum_{i=1}^{c} \frac{N_i^c}{N_i^g}, \qquad \text{per-class precision} = \frac{1}{c} \sum_{i=1}^{c} \frac{N_i^c}{N_i^p}, \tag{7}$$

where $c$ is the number of tags, $N_i^c$ is the number of correctly annotated images for tag $i$, $N_i^g$ is the number of ground-truth images for tag $i$, and $N_i^p$ is the number of predictions for tag $i$. The above evaluations are biased toward infrequent tags, because getting them correct would have a very significant impact on the final accuracy. Therefore we also report the overall recall and overall precision:

$$\text{overall recall} = \frac{\sum_{i=1}^{c} N_i^c}{\sum_{i=1}^{c} N_i^g}, \qquad \text{overall precision} = \frac{\sum_{i=1}^{c} N_i^c}{\sum_{i=1}^{c} N_i^p}. \tag{8}$$

For these two metrics, the frequent classes are dominant and have a larger impact on final performance. Finally, we also report the percentage of recalled tag words out of all tag words as N+. We believe that evaluating all of these metrics makes the evaluation unbiased.

4.3 Baseline Parameters

In our preliminary evaluation, we optimized the parameters for the visual-feature-based baseline systems.
For visual-feature dimensionality reduction, we followed the suggestions in Gong et al. [11] to reduce the dimensionality of each feature to 500 and then concatenated the PCA-reduced vectors into a 4500-dimensional global image descriptor, which worked as well as the original feature. For kNN, we set the bandwidth σ to 1 and k to 50, having found that these settings work best. For SVM, we set the regularization parameter to C = 2, which works best for this dataset.

4.4 Results

We first report results with respect to the metrics introduced above. In particular, we vary the number k of predicted keywords for each image and mainly consider k = 3 and k = 5. Before doing so, however, we must define an upper bound for our evaluation. In the dataset, each image has a different number of ground-truth tags, which made it hard for us to precisely compute an upper

bound for performance with different k. For each image, when the number of ground-truth tags was larger than k, we randomly chose k ground-truth tags and assigned them to that image; when the number of ground-truth tags was smaller than k, we assigned all ground-truth tags to that image and randomly chose other tags to fill the remainder. We believe this baseline represents the best possible performance when the ground truth is known.

Figure 2: Analysis of per-class recall of the 81 tags in the NUS-WIDE dataset with k = 3. (Recall plotted against tags in decreasing frequency for Softmax, Ranking, and WARP.)

Figure 3: Analysis of per-class precision of the 81 tags in the NUS-WIDE dataset with k = 3. (Precision plotted against tags in decreasing frequency for Softmax, Ranking, and WARP.)

The results for assigning 3 keywords per image are reported in Table 1. The results indicate that the deep network achieves a substantial improvement over existing visual-feature-based annotation methods. The CNN+Softmax method outperforms the VisualFeature+SVM baseline by about 10%. Comparing the same CNN network with different loss functions, the results show that softmax already gives a very powerful baseline. Although the pairwise ranking loss does not improve on softmax, the weighted approximate ranking loss (WARP) achieves a substantial improvement over softmax. This is probably because pairwise ranking does not directly optimize the top-k accuracy, and because WARP penalizes classes that are not ranked at the top more heavily, which boosts the performance of rare tags. From these results, we can see that all loss functions achieved comparable overall recall and overall precision, but that the WARP loss achieved significantly better per-class recall and per-class precision. Results for k = 5, which are given in Table 2, show similar trends to k = 3.

We also provide a more detailed analysis of per-class recall and per-class precision. The recall for each tag appears in Figure 2, and the precision for each tag in Figure 3.
The results for different tags are sorted by the frequency of each tag, in descending order. From these results, we see that the accuracy for frequent tags is greater than for infrequent tags. Different losses performed comparably to each other for frequent classes, and WARP worked better than the other loss functions for infrequent classes. Finally, we show some image annotation examples in Figure 4. Even though some of the predicted tags for these images do not match the ground truth, they are still very meaningful.

5 Discussion and Future Work

In this work, we proposed to use ranking to train deep convolutional neural networks for multilabel image annotation problems. We investigated several different ranking-based loss functions for training the CNN, and found that the weighted approximate ranking loss works particularly well for multilabel annotation problems. We performed experiments on the largest publicly available multilabel image dataset, NUS-WIDE, and demonstrated the effectiveness of using top-k ranking to train the network. In the future, we would like to use a very large amount of noisily labeled multilabel images from the Internet (e.g., from Flickr or image searches) to train the network.
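For reference, the evaluation protocol of Section 4.2 (Eqs. 7 and 8) can be sketched in plain Python. This is an illustrative sketch rather than the authors' evaluation code: the function name is hypothetical, each image's predicted and ground-truth tags are given as sets of tag indices, tags with a zero denominator are assumed to contribute zero to the per-class averages, and N+ is read here as the fraction of tags that are correctly recalled at least once.

```python
def annotation_metrics(predicted, ground_truth, num_tags):
    """Per-class and overall recall/precision plus N+ for top-k annotations."""
    correct = [0] * num_tags  # N_i^c: images correctly annotated with tag i
    truth = [0] * num_tags    # N_i^g: ground-truth occurrences of tag i
    pred = [0] * num_tags     # N_i^p: predictions of tag i
    for pred_tags, true_tags in zip(predicted, ground_truth):
        for t in true_tags:
            truth[t] += 1
        for t in pred_tags:
            pred[t] += 1
            if t in true_tags:
                correct[t] += 1
    # Eq. (7): average the per-tag ratios, skipping empty denominators.
    per_class_recall = sum(n / g for n, g in zip(correct, truth) if g) / num_tags
    per_class_precision = sum(n / p for n, p in zip(correct, pred) if p) / num_tags
    # Eq. (8): pool the counts over all tags before dividing.
    overall_recall = sum(correct) / sum(truth)
    overall_precision = sum(correct) / sum(pred)
    n_plus = sum(1 for n in correct if n > 0) / num_tags
    return {
        "per_class_recall": per_class_recall,
        "per_class_precision": per_class_precision,
        "overall_recall": overall_recall,
        "overall_precision": overall_precision,
        "n_plus": n_plus,
    }


metrics = annotation_metrics([{0, 1}, {1, 2}], [{0}, {1, 2}], num_tags=3)
print(metrics)
```

The per-class numbers give every tag equal weight regardless of frequency, while the overall numbers are dominated by frequent tags, which is why the paper reports both.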


Figure 4: Qualitative image annotation results obtained with WARP. (Each example shows an image with its ground-truth tags and its predicted tags.)

References

[1] Kobus Barnard and David Forsyth. Learning the semantics of words and pictures. In ICCV, 2001.
[2] Tamara Berg and David Forsyth. Animals on the web. CVPR.
[3] Gustavo Carneiro, Antoni B Chan, Pedro J Moreno, and Nuno Vasconcelos. Supervised learning of semantic classes for image annotation and retrieval. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(3).
[4] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proc. of ACM Conf. on Image and Video Retrieval (CIVR 09), Santorini, Greece, July 8-10, 2009.
[5] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc Le, Mark Mao, Marc Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Ng. Large scale distributed deep networks. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25.
[6] Jia Deng, W. Dong, R. Socher, Lijia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[7] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[8] Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV.
[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR.
[10] Rob Fergus, Antonio Torralba, and Yair Weiss. Semi-supervised learning in gigantic image collections. NIPS.
[11] Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. A multi-view embedding space for internet images, tags, and their semantics. IJCV.
[12] Yunchao Gong and Svetlana Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. CVPR.
[13] Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. ICML.
[14] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. ICCV.
[15] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Multimodal semi-supervised learning for image classification. CVPR.
[16] Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv.
[17] K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? CVPR.
[18] Hervé Jégou, M. Douze, Cordelia Schmid, and Patrick Perez. Aggregating local descriptors into a compact image representation. CVPR.
[19] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 02, New York, NY, USA. ACM.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS.
[21] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR.
[22] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Handwritten digit recognition with a back-propagation network. NIPS.
[23] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. ICML.
[24] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV.
[25] Ameesh Makadia, Vladimir Pavlovic, and Sanjiv Kumar. A new baseline for image annotation. In ECCV.
[26] F. Monay and D. Gatica-Perez. PLSA-based image auto-annotation: Constraining the latent space. In ACM Multimedia.
[27] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV.
[28] Florent Perronnin and Christopher R. Dance. Fisher kernels on visual vocabularies for image categorization. CVPR.
[29] A. Quattoni, M. Collins, and T. Darrell. Learning visual representations using images with captions. CVPR.
[30] N. Rasiwasia, PJ Moreno, and N. Vasconcelos. Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia.
[31] Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluating color descriptors for object and scene recognition. PAMI.
[32] Li Wan, Matt Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. ICML.
[33] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. Locality-constrained linear coding for image classification. CVPR.
[34] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI.
[35] Jianxiong Xiao, James Hays, Krista Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. CVPR.
[36] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. CVPR.
[37] Matt Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks. ICLR.
