CEA LIST's participation to the Scalable Concept Image Annotation task of ImageCLEF 2015

Etienne Gadeski, Hervé Le Borgne, and Adrian Popescu

CEA, LIST, Laboratory of Vision and Content Engineering, France
etienne.gadeski@cea.fr, herve.le-borgne@cea.fr, adrian.popescu@cea.fr

Abstract. This paper describes our participation in the ImageCLEF 2015 scalable concept image annotation task. Our system is based only on visual features extracted with a deep convolutional neural network (CNN). The network is trained on noisy web data corresponding to the concepts to be detected in this task. We introduce a simple concept localization pipeline that provides the localization of the detected concepts, among the 251 concepts of the task, within the images.

Keywords: image annotation, classification, concept localization, convolutional neural networks

1 Introduction

The scalable concept image annotation subtask of ImageCLEF 2015 [1] is described in detail in [2]. The system we propose relies only on visual features extracted with a deep convolutional neural network (CNN). In addition to training this CNN specifically for the task, we propose a concept localization pipeline that exploits the spatial information CNNs offer. The result is a single annotation and localization framework for all 251 concepts. Section 2 outlines our method in both the training and testing steps. In Sect. 3 we present our experiments, report our results and discuss them.

2 Method

In this section we detail the training and testing frameworks we used. Our method builds on deep CNNs, which have lately shown outstanding performance in diverse computer vision tasks such as object classification, localization and action recognition [3, 4].

2.1 Training

Data. We crawled a set of 251,000 images (1,000 images per concept) from the Bing Images search engine. For each concept we used its name and its synonyms (if present) to query the search engine. This dataset is of course noisy, but we believe this is not a major issue when training a deep CNN [5, 6]. We used this additional data only to train a 16-layer CNN [7], keeping 90% of the dataset for training and 10% for validation.

Network Settings. The network is initialized with ImageNet weights. The initial learning rate is set to 0.001 and the batch size to 256. The last layer (the classifier) is trained from scratch, i.e. it is initialized with random weights sampled from a Gaussian distribution (σ = 0.01, µ = 0), and its learning rate is 10 times larger than that of the other layers. During training, the dataset is augmented with random transformations: RGB jittering, scale jittering, shearing, rotation, contrast adjustment, etc. Data augmentation is known to lead to better models [8] and to reduce overfitting. Finally, the network takes a 224×224 RGB image as input and produces 251 outputs, one per concept. The models were trained on a single Nvidia Titan Black with our modified version of the Caffe framework [9].
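The models above were trained with the authors' modified Caffe, which is not available here; purely as an illustration of the settings just described, a rough PyTorch sketch might look as follows (the module names, the torchvision weight identifier and the momentum value are our assumptions, not the paper's):

```python
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

NUM_CONCEPTS = 251  # one output per concept

# 16-layer CNN [7] initialized with ImageNet weights; the weight identifier
# below is torchvision's, an assumption on our part.
model = models.vgg16(weights="IMAGENET1K_V1")

# Train the classifier from scratch: 251 outputs, weights sampled from a
# Gaussian with sigma = 0.01 and mu = 0.
head = nn.Linear(4096, NUM_CONCEPTS)
nn.init.normal_(head.weight, mean=0.0, std=0.01)
nn.init.zeros_(head.bias)
model.classifier[6] = head

# Base learning rate 0.001, 10x larger for the new classifier; batch size 256.
# (The momentum value is our assumption; the paper does not state it.)
base_lr = 0.001
optimizer = optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters()
                    if not n.startswith("classifier.6")]},
        {"params": model.classifier[6].parameters(), "lr": 10 * base_lr},
    ],
    lr=base_lr,
    momentum=0.9,
)
```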

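The augmentations listed in Sect. 2.1 correspond loosely to standard image transforms. As a purely illustrative sketch in the same spirit (the paper does not give the jittering parameters, so the values below are our guesses), e.g. with torchvision:

```python
import torchvision.transforms as T

# Hypothetical approximation of the augmentations of Sect. 2.1; all parameter
# values are our assumptions, not the paper's.
train_transform = T.Compose([
    T.RandomResizedCrop(224),                     # scale jittering, 224x224 input
    T.RandomAffine(degrees=10, shear=10),         # rotation and shearing
    T.ColorJitter(brightness=0.4, contrast=0.4),  # RGB jittering, contrast adjustment
    T.ToTensor(),
])
```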
2.2 Testing

Feature Extraction. We convert the fine-tuned CNN model into a fully convolutional network. To do so, we explicitly convert the first fully-connected layer into a 7×7 convolution layer; the last two fully-connected layers are then converted into 1×1 convolution layers. We are able to do such a conversion because fully-connected layers are nothing more than 1×1 convolutional kernels with a full connection table. This conversion allows us to feed images of any size (H×W pixels) to the network. In practice, any image with a side smaller than 256 pixels is isotropically rescaled so that its smallest side equals 256. As a consequence, the output of the last layer (i.e. the classifier) is no longer a single 251-dimensional vector but a spatial map S of M×N×251 dimensions. The network has five 2×2 pooling layers and therefore a global stride of 32 pixels. We have

$M = \frac{H}{32} - 7 + 1 \quad \text{and} \quad N = \frac{W}{32} - 7 + 1.$  (1)

For instance, with a 512×384 image the network produces a 10×6×251 feature map.
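A minimal sketch of such an FC-to-convolution conversion, assuming a PyTorch-style VGG-16 network (the function and layer names are ours, not the paper's):

```python
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, kernel_size: int) -> nn.Conv2d:
    """Copy the weights of a fully-connected layer into an equivalent convolution."""
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size)
    conv.weight.data.copy_(
        fc.weight.data.view(fc.out_features, in_channels,
                            kernel_size, kernel_size))
    conv.bias.data.copy_(fc.bias.data)
    return conv

# fc6 (512*7*7 -> 4096) becomes a 7x7 convolution; fc7 (4096 -> 4096) and the
# 251-way classifier (4096 -> 251) become 1x1 convolutions:
#   conv6 = fc_to_conv(fc6, in_channels=512, kernel_size=7)
#   conv7 = fc_to_conv(fc7, in_channels=4096, kernel_size=1)
#   conv8 = fc_to_conv(fc8, in_channels=4096, kernel_size=1)

def map_size(h: int, w: int) -> tuple[int, int]:
    """Spatial size (M, N) of the output map, per Eq. (1): global stride 32,
    then the 7x7 'fc6' convolution removes 7 - 1 cells per dimension."""
    return h // 32 - 7 + 1, w // 32 - 7 + 1

assert map_size(512, 384) == (10, 6)  # the 10 x 6 x 251 example above
```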

Concept Localization. For each image, our network outputs a spatial feature map S of size M×N×251. Each of the M×N cells of this map contains the probability of each concept appearing at that location. In our experiments, we chose to apply a max-pooling over the 251 concept scores in each cell, keeping only the most probable concept at that location. We thus obtain a single map giving the most probable concept at each cell location:

$R = \{\, r_{i,j} : r_{i,j} = \arg\max_{c} s_{i,j}(c),\ \forall i \in [1; M],\ \forall j \in [1; N] \,\},$  (2)

where $s_{i,j}$ is the 251-dimensional vector at position (i, j) in S.

Fig. 1. Example of an image of 600×544 pixels (left) with its corresponding 13×11 map (right). Concept 97 corresponds to flower, 21 to bee and 104 to garden.

As illustrated in Fig. 1, this gives an approximate yet useful localization of the different concepts within the image. To limit the number of regions where a concept is likely to appear, we further merge neighboring cells that share the same most probable concept, retaining the largest rectangular region containing all the contiguous cells assigned to that concept. We then map these region coordinates back to the original image coordinates. The final result can be seen in Fig. 2.

Fig. 2. On the left, the concept regions proposed by our pipeline: the purple region, corresponding to flower, covers the whole area; the yellow region, corresponding to bee, is in the bottom right corner; and a green region, corresponding to garden, is in the bottom left corner. We also display these mapped regions on the original image.
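A rough sketch of this localization and merging step, assuming the map S is available as a NumPy array; the function name is ours, and the stride-32 mapping back to image coordinates is our assumption, as the paper does not spell out the exact mapping:

```python
import numpy as np
from scipy import ndimage

def localize(S: np.ndarray, stride: int = 32):
    """Eq. (2) plus region merging, for an M x N x 251 probability map S.

    Returns the per-cell concept map R and a list of (concept, x0, y0, x1, y1)
    boxes in approximate image coordinates.
    """
    R = S.argmax(axis=2)  # Eq. (2): most probable concept in each cell
    boxes = []
    for c in np.unique(R):
        # Contiguous cells sharing concept c form connected components; the
        # bounding rectangle of each component is the merged region.
        components, _ = ndimage.label(R == c)
        for sl in ndimage.find_objects(components):
            # Mapping cells back with the 32-pixel stride is our assumption.
            y0, y1 = sl[0].start * stride, sl[0].stop * stride
            x0, x1 = sl[1].start * stride, sl[1].stop * stride
            boxes.append((int(c), x0, y0, x1, y1))
    return R, boxes
```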

3 Experiments and Results

We report the results obtained for the two runs submitted to the campaign and propose ways of improving our framework.

3.1 Runs

Two runs were submitted for this task. We used two models, one per run; they differ only in the number of training iterations. Results are reported in Table 1. Figure 3 shows the mean average precision (mAP) of our runs for varying percentages of overlap between our predicted bounding boxes and the ground truth.

Table 1. Results for the image concept detection and localization task.

Run               # iterations   mAP (0% overlap)   mAP (50% overlap)
all img.bb.run1   26,000         0.420532           0.271942
all img.bb.run2   34,000         0.445121           0.294559
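Table 1 and Fig. 3 score a predicted bounding box against the ground truth at several overlap thresholds. As a reading aid, here is a minimal sketch of a standard intersection-over-union overlap measure; the exact overlap criterion used by the campaign is defined in [2], so treat this as an assumption:

```python
def overlap(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A prediction counts as correct at, e.g., the 50% threshold when
# overlap(pred_box, gt_box) >= 0.5; at the 0% threshold any box for the
# right concept presumably counts.
```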

3.2 Discussion

We note that, although only a small gap of 8,000 iterations of stochastic gradient descent separates the two models, the second one performs significantly better (+2.5% mAP) than the first. Unfortunately, due to lack of time, we could not train the model further; we believe doing so would bring a non-negligible improvement. Regarding localization, our method is quite efficient, but its design limits its ability to detect several similar objects forming a compact group in the image, such as a herd of cows. We nevertheless see several ways of improving the method. The first would be to train the CNN directly in a fully convolutional fashion, to improve the localization of concepts that can appear at different scales; robustness could likewise be improved by extracting the spatial feature map at several scales.

Fig. 3. The mean average precision (mAP) of our two runs, reported for several percentages of overlap between our predicted bounding boxes and the ground truth.

4 Conclusion

We proposed a simple and effective framework to annotate and localize concepts within images. The results obtained in this campaign are an encouraging step for us toward building a better framework for concept annotation and localization.

References

1. Villegas, M., Müller, H., Gilbert, A., Piras, L., Wang, J., Mikolajczyk, K., de Herrera, A.G.S., Bromuri, S., Amin, M.A., Mohammed, M.K., Acar, B., Uskudarli, S., Marvasti, N.B., Aldana, J.F., del Mar Roldán García, M.: General overview of ImageCLEF at the CLEF 2015 labs. Lecture Notes in Computer Science, Springer International Publishing (2015)
2. Gilbert, A., Piras, L., Wang, J., Yan, F., Dellandrea, E., Gaizauskas, R., Villegas, M., Mikolajczyk, K.: Overview of the ImageCLEF 2015 scalable image annotation, localization and sentence generation task. In: CLEF 2015 Working Notes, CEUR Workshop Proceedings, Toulouse, France, CEUR-WS.org (September 8-11, 2015)
3. Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (June 2014)

4. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
5. Ginsca, A.L., Popescu, A., Le Borgne, H., Ballas, N., Vo, P., Kanellos, I.: Large-scale image mining with Flickr groups. In: 21st International Conference on Multimedia Modelling (MMM'15) (2015)
6. Vo, P.D., Ginsca, A.L., Le Borgne, H., Popescu, A.: Effective training of convolutional networks using noisy web images. In: 13th International Workshop on Content-Based Multimedia Indexing (CBMI), Prague, Czech Republic (2015)
7. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
8. Wu, R., Yan, S., Shan, Y., Dang, Q., Sun, G.: Deep Image: scaling up image recognition. CoRR abs/1501.02876 (2015)
9. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)