Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task

Similar documents
Object Detection Based on Deep Learning

Unified, real-time object detection

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018

Spatial Localization and Detection. Lecture 8-1

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Rich feature hierarchies for accurate object detection and semantic segmentation

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

Lecture 5: Object Detection

Object detection with CNNs

Rich feature hierarchies for accurate object detection and semantic segmentation

Weakly Supervised Object Recognition with Convolutional Neural Networks

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

Object Recognition II

arxiv: v1 [cs.cv] 4 Jun 2015

Deformable Part Models

G-CNN: an Iterative Grid Based Object Detector

CPSC340. State-of-the-art Neural Networks. Nando de Freitas November, 2012 University of British Columbia

Category-level localization

Optimizing Object Detection:

PASCAL VOC Classification: Local Features vs. Deep Features. Shuicheng YAN, NUS

CS 1674: Intro to Computer Vision. Object Recognition. Prof. Adriana Kovashka University of Pittsburgh April 3, 5, 2018

Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation

Video Semantic Indexing using Object Detection-Derived Features

Project 3 Q&A. Jonathan Krause

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Structured Prediction using Convolutional Neural Networks

Yiqi Yan. May 10, 2017

arxiv: v1 [cs.cv] 1 Apr 2017

YOLO9000: Better, Faster, Stronger

segdeepm: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

arxiv: v2 [cs.cv] 22 Sep 2014

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Feature-Fused SSD: Fast Detection for Small Objects

Detection and Localization with Multi-scale Models

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

arxiv: v1 [cs.cv] 7 Jul 2014

Faster R-CNN Implementation using CUDA Architecture in GeForce GTX 10 Series

Optimizing Object Detection:

LEARNING OBJECT SEGMENTATION USING A MULTI NETWORK SEGMENT CLASSIFICATION APPROACH

SSD: Single Shot MultiBox Detector

Learning From Weakly Supervised Data by The Expectation Loss SVM (e-svm) algorithm

arxiv: v2 [cs.cv] 1 Oct 2014

Semantic Pooling for Image Categorization using Multiple Kernel Learning

Object Detection on Self-Driving Cars in China. Lingyun Li

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

Beyond Sliding Windows: Object Localization by Efficient Subwindow Search

arxiv: v1 [cs.cv] 3 Apr 2016

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL

Content-Based Image Recovery

Hierarchical Image-Region Labeling via Structured Learning

Subspace Alignment Based Domain Adaptation for RCNN Detector

Segmentation as Selective Search for Object Recognition in ILSVRC2011

arxiv: v3 [cs.cv] 18 Nov 2014

Yield Estimation using faster R-CNN

Computer Vision Lecture 16

arxiv: v1 [cs.cv] 26 Jun 2017

Future directions in computer vision. Larry Davis Computer Vision Laboratory University of Maryland College Park MD USA

Towards Large-Scale Semantic Representations for Actionable Exploitation. Prof. Trevor Darrell UC Berkeley

Actions and Attributes from Wholes and Parts

MEDICAL IMAGE SEGMENTATION WITH DIGITS. Hyungon Ryu, Jack Han Solution Architect NVIDIA Corporation

Deep condolence to Professor Mark Everingham

arxiv: v1 [cs.cv] 5 Oct 2015

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. Deepak Pathak, Philipp Krähenbühl and Trevor Darrell

arxiv: v2 [cs.cv] 30 Mar 2016

Computer Vision Lecture 16

Computer Vision Lecture 16

Deep Neural Networks:

Final Report: Smart Trash Net: Waste Localization and Classification

CNN BASED REGION PROPOSALS FOR EFFICIENT OBJECT DETECTION. Jawadul H. Bappy and Amit K. Roy-Chowdhury

Rich feature hierarchies for accurate object detection and semant

Real-time Object Detection CS 229 Course Project

Exploiting Depth from Single Monocular Images for Object Detection and Semantic Segmentation

arxiv: v3 [cs.cv] 15 Sep 2018

Attributes. Computer Vision. James Hays. Many slides from Derek Hoiem

CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015

Regionlet Object Detector with Hand-crafted and CNN Feature

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

Learning Representations for Visual Object Class Recognition

arxiv: v3 [cs.cv] 13 Apr 2016 Abstract

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Return of the Devil in the Details: Delving Deep into Convolutional Nets

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

arxiv: v1 [cs.cv] 23 Apr 2015

arxiv: v4 [cs.cv] 2 Apr 2015

arxiv: v2 [cs.cv] 26 Sep 2015

Single-Shot Refinement Neural Network for Object Detection -Supplementary Material-

RECOGNIZING objects and localizing them in images is

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Microscopy Cell Counting with Fully Convolutional Regression Networks

Using RGB, Depth, and Thermal Data for Improved Hand Detection

HIERARCHICAL JOINT-GUIDED NETWORKS FOR SEMANTIC IMAGE SEGMENTATION

Deep Learning for Object detection & localization

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Industrial Technology Research Institute, Hsinchu, Taiwan, R.O.C ǂ

Detection III: Analyzing and Debugging Detection Methods

Automatic Detection of Multiple Organs Using Convolutional Neural Networks

Multiple Instance Detection Network with Online Instance Classifier Refinement

Transcription:

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task Kyunghee Kim Stanford University 353 Serra Mall Stanford, CA 94305 kyunghee.kim@stanford.edu Abstract We use a pre-trained model on ILSVRC dataset for classification to train a model for a smaller dataset for different task i.e., detection. The result shows promising performance of this approach and we furthur think about the difference between training the model for classification and detection and the method that can improve the detection where classification works fairly good meanwhile the detection fails. 1. Introduction and related work In RCNN[1] Girshick et al. shows that supervised pretrained model on large dataset(ilsvrc) to a smaller dataset(pascal VOC). They extract region proposals using selective search[2] to extract candidate regions for bounding boxes and distinguish between negative and positive bounding boxes using the overlap with the ground truth bounding box. Using pre-trained model on large dataset for classification gives very good result also on classification on smaller dataset for example for pre-trained VGG 16 layer model performance on PASCAL VOC 2007 dataset is 89.3%(mean AP)[4] while finetuning with Krizhevsky et al. s model[5] on PASCAL VOC 2007 dataset gives 54.2% (mean AP) for detection task [6]. We follow this approach and use VGG 16 layer model [3] for pre-trained large scale ImageNet model. 2. Approach We use Caffe[7] to fine-tune VGG 16 layer model on PASCAL VOC 2011[8] dataset. VGG 16 layer model architecture is like the following in Table 1 as described in [3]. The last Fully-Connected layer of this model has been changed to FC-21 for our fine-tuning on PASCAL VOC 2011 dataset since there are 20 object classes in PASCAL VOC dataset and one more class is for the background. The 20 object classes in VOC dataset is like the following;, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable,, horse, motorbike, person,, sheep, sofa, train, tvmonitor We use 224 x224 as input image size instead of 227 x 227 in RCNN[1] and batch size 10 to fit the parameters in VGG model. Input(224 x 224 RGB image) conv3-64 conv3-64 Conv3-128 Conv3-128 FC-4096 FC-4096 FC-1000 Soft-max Table 1. Architecture of VGG 16 layer model 3. Experiment Fine-tuning We use Caffe[7] and did 100000 iterations which took about 10 hours for training. For training and validation data we used the code for RCNN[1] and extracted region proposals in PASCAL VOC 2011 training and validation datasets. We add one more layer at the last layer of VGG model which is the accuracy layer to see the performance during training and it was about 0.8~1 accuracy from 10000 iterations. In the experiment below we are using the model that was saved in 90000 iterations. However by the observation of the accuracy and loss during training the model might have converged already at 10000 iterations but we do not test the performance of this model here. Detection We use python wrapper for Caffe[7] for 225

detection. It was not easy to understand and figure out how to use Deployment Input [9] in Caffe so this step took a while to setup prototxt file as a caffe model definition. In the results below we show the detection results using this caffe python wrapper. Results Table 2 below shows the detection results. For example for the first example motor bike the left column show the region proposal that has the highest detection score for motorbike class. And the below 5 scores mean for this given detection box the top 5 prediction class for this box. We can see that for all example we can find the gound truth label of the object. But the thing is that the location of the bounding box is not always centered on the object. For example for the aeoplane example on the 3 rd and train on the 7 th, the bounding box is drawn on the nearby context cloud and grains near the ground of the railroad. From this we can conclude that for classification this model works good because the nearby context also helps the scene to be correctly classified. For example if there is cloud that it is likely the scene is in the sky so it is likely that we will find an or a bird. The original purpose of this project was to see these features from filters by visualizing them so that whether we can distinguish filters that learn context or the object so probably we can figure out which filters are helpful for detection task. But we have not made a progress on that point in this quarter. But that could be a good future work relating to this project. And on the right column it shows the top 3 detection bounding box after applying Felzenszwalb et al. s non-maxima suppression[10]. Underneath there is a yellow/red box indicating whether human would judge this detection result succeeded or not. Yellow means success and red means a failure case. For example for the 5 th example in example, its classification will be correct because we are getting the top 1 st score as a but its detection is a failure because even though the detected bounding box says it is a it does not overlap with a ground truth bounding box of a. Meanwhile interestingly the detection result is a nearby context which looks like a context for garden dining table and. A is likely to appear in that kind of environment. Therefore it looks like the filters that extract features for context is helpful for classification task but they are not helpful to get the exact location of the object for detection task so it will be helpful if we can investigate among the learned features which filters contribute more to the object rather than the object. The pre-trained model is trained for classification task so this performance gap between classification and detection seems to be reasonable. Therefore in 13 detection results in Table 2 all of them are classified correctly but 8 of them are detected correctly and 5 of them failed in detection. motorbike 5.209 person 4.671 potted plant 2.94 cow 2.37 motorbike 2.32 Person and bicycle For person: person 6.25 5.985 bird 5.22 4.027 motorbike 3.07 12.034 bicycle 3.0557 bottle 1.94 cow 0.88 0.614 For bicyle 12.06 bicycle 5.746 boat 2.35 bird 1.806 cat r : 2.329 b: 1.338 y : 1.26 bicycle: r : 5.74 b: 5.67 y : 5.49 r : 12.0346 b: 11.922 y : 11.375 226

5.494 2.16 train 2.03 chair 1.04 sofa 0.96 r : 2.16 b: 1.9547 y : 1.85 10.008 train 2.66 sheep 2.37 cow 1.411 1.159 bottle r : 2.66 b: 2.2965 y : 2.03 5.46 5.42 cow 4.429 sheep 3.205 train 1.75 r : 5.465 b: 3.299 y : 3.266 9.577 bottle 6.930 tvmonitor 3.09 bicycle2.87 bus 1.639 chair r : 6.930 b: 5.944 y : 5.62 car 6.784 chair 2.68 sofa 1.8027 diningtable 1.486 sheep 1.2448 r : 2.680 b: 2.312 y : 2.30 cat 8.36 car 7.24 6.80 tvmonitor 2.04 bicycle 0.92 train r : 7.2476 b: 5.874 y : 5.522 and bottle 227

background 0.83 b: 2.937 y : 2.00 For 4.833 4.24 sheep 1.85 boat 1.12 background 0.18 For bottle 11.47 cow 2.40 bottle 2.02 background 1.60 train 0.89 For potted plant r: 4.24 b: 3.996 y : 3.75 10.07 sheep 1.82 cow 1.53 train 1.29 1.27 bird r : 1.27 b: 1.24 y : 0.789 Table 2. Detection results on PASCAL VOC 2011 dataset 8.332 boat 7.48 bird 3.42 bicycle 2.16 0.38 And as a second part of the experiment we found that for example the 10th example seems interesting because the door looks like a big part of the picture even though the PASCAL VOC object class does not have the door label. So we did a classification with vgg 16 layer model but not fine-tuned with PASCAL VOC object using MatConvNet[11] matlab library. Figure 1 below shows a very interesting result that this image classified as a sliding door which has a lot of correlated to room door. r: 3.4275 b: 2.015 y : 1.74 person and bus For person For person: 5.105 4.90 cat 2.674 bottle 1.48 bird 0.64 For bus: 6.706 bus 3.17 2.43 cow 1.48 r : 0.62 b: 0.0387 y : -0.025 for bus Figure 1. The 10th figure in Table 2 classified as a sliding door in vgg 16 layer model And this classification is different from the classification r : 3.17 228

task in [4] since in [4] the classification is to figure out the ground truth label in PASCAL VOC class labels. In Figure 1 result here though is more general classification that can be confirmed with human observation that cannot be measured in computer since the ground truth for door is not annotated and therefore not provided. We thought this is interesting so we also listed the top 10 prediction result for this image and those are with score; sliding door(0.4436), wardrobe, closer, press(0.1836), medicine chest, medicine cabinet(0.1428), window shade(0.0585), shower curtain(0.0275), refrigerator, icebox(0.0180), safe(0.0079), washbasin, washbowl(0.0072), doormat(0.0069), radiator(0.0062). References [1] R. B. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in:cvpr, 2014, pp. 580 587. [2] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013. [3] Very Deep Convolutional Networks for Large-Scale Image Recognition, Karen Simonyan, Andrew Zisserman, ICLR 2015 [4] http://www.robots.ox.ac.uk/~vgg/research/very_deep/ [5] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet clas- sification with deep convolutional neural networks. In NIPS, 2012. [6] https://github.com/rbgirshick/rcnn [7] http://caffe.berkeleyvision.org/ [8] Visual Object Classes Challenge 2011. http:// pascallin.ecs.soton.ac.uk/challenges/voc/voc2011/#d ata [9] http://caffe.berkeleyvision.org/tutorial/data.html. [10] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ra- manan. Object detection with discriminatively trained part based models. TPAMI, 2010. [11] MatConvNet: CNNs for MATLAB http://www.vlfeat.org/ matconvnet/ 229