Object Detection with YOLO on Artwork Dataset

Similar documents
Spatial Localization and Detection. Lecture 8-1

Object Detection Based on Deep Learning

Deformable Part Models

Lecture 5: Object Detection

Object detection with CNNs

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

YOLO9000: Better, Faster, Stronger

You Only Look Once: Unified, Real-Time Object Detection

Yiqi Yan. May 10, 2017

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

YOLO: You Only Look Once Unified Real-Time Object Detection. Presenter: Liyang Zhong Quan Zou

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Rich feature hierarchies for accurate object detection and semantic segmentation

Unified, real-time object detection

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018

Rich feature hierarchies for accurate object detection and semantic segmentation

CS 1674: Intro to Computer Vision. Object Recognition. Prof. Adriana Kovashka University of Pittsburgh April 3, 5, 2018

Detection III: Analyzing and Debugging Detection Methods

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

G-CNN: an Iterative Grid Based Object Detector

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

Real-time Object Detection CS 229 Course Project

Deep Learning for Object detection & localization

Category-level localization

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

Object Detection on Self-Driving Cars in China. Lingyun Li

Towards Real-Time Automatic Number Plate. Detection: Dots in the Search Space

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

Object Recognition II

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task

An Exploration of Computer Vision Techniques for Bird Species Classification

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b

Classification of objects from Video Data (Group 30)

Object Detection with Discriminatively Trained Part Based Models

A Comparison of CNN-based Face and Head Detectors for Real-Time Video Surveillance Applications

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model

arxiv: v1 [cs.cv] 31 Mar 2016

Regionlet Object Detector with Hand-crafted and CNN Feature

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Supplementary Material: Unconstrained Salient Object Detection via Proposal Subset Optimization

Recap Image Classification with Bags of Local Features


Hand Detection For Grab-and-Go Groceries

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

Find that! Visual Object Detection Primer

Rich feature hierarchies for accurate object detection and semant

Pedestrian Detection Using Structured SVM

Project 3 Q&A. Jonathan Krause

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

Finding Tiny Faces Supplementary Materials

Development in Object Detection. Junyuan Lin May 4th

Final Report: Smart Trash Net: Waste Localization and Classification

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

OBJECT DETECTION HYUNG IL KOO

CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015

Deep Neural Networks:

Hierarchical Learning for Object Detection. Long Zhu, Yuanhao Chen, William Freeman, Alan Yuille, Antonio Torralba MIT and UCLA, 2010

Traffic Multiple Target Detection on YOLOv2

Learning to Localize Objects with Structured Output Regression

Computer Vision Lecture 16

Feature-Fused SSD: Fast Detection for Small Objects

Adaptive Feeding: Achieving Fast and Accurate Detections by Adaptively Combining Object Detectors

arxiv: v1 [cs.cv] 15 Oct 2018

Pedestrian Detection with Deep Convolutional Neural Network

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

A Novel Representation and Pipeline for Object Detection

arxiv: v1 [cs.cv] 30 Apr 2018

Selective Search for Object Recognition

Object Recognition with Deformable Models

arxiv: v1 [cs.cv] 26 Jun 2017

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection

Object Detection in Sports Videos

arxiv: v2 [cs.cv] 25 Aug 2014

Using k-poselets for detecting people and localizing their keypoints

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

arxiv: v1 [cs.cv] 10 Jul 2014

Content-Based Image Recovery

Lecture 7: Semantic Segmentation

Introduction to Deep Learning for Facial Understanding Part III: Regional CNNs

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection

arxiv: v1 [cs.cv] 3 Apr 2016

Object Category Detection: Sliding Windows

CAP 6412 Advanced Computer Vision

Texture Complexity based Redundant Regions Ranking for Object Proposal

arxiv: v1 [cs.cv] 18 Jun 2017

Pedestrian Detection via Mixture of CNN Experts and thresholded Aggregated Channel Features

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm

All You Want To Know About CNNs. Yukun Zhu

An Efficient Divide-and-Conquer Cascade for Nonlinear Object Detection

SSD: Single Shot MultiBox Detector

Object Recognition and Detection

A Discriminatively Trained, Multiscale, Deformable Part Model

Object Category Detection: Sliding Windows

Object Detection by 3D Aspectlets and Occlusion Reasoning

Transcription:

Object Detection with YOLO on Artwork Dataset Yihui He Computer Science Department, Xi an Jiaotong University heyihui@stu.xjtu.edu.cn Abstract Person: 0.64 Horse: 0.28 I design a small object detection network, which is simplified from YOLO(You Only Look Once[15]) network. YOLO is a fast and elegant network that can extract meta features, predict bounding boxes and assign scores to bounding boxes. Compared with RCNN, it doesn t have complex pipline, which is easier for me to implement. Start from a ImageNet pretrained model, I train my YOLO on PASCAL VOC200 training dataset. And validate my YOLO on PASCAL VOC200 validation dataset. Finally, I evaluate my YOLO on an artwork dataset(picasso dataset). With the best parameters, I got 40% precision and 5% recall. 1. Resize image. 2. Run convolutional network.. Non-max suppression. Dog: 0.0 Figure 1: The YOLO Detection System. (1) resizes the input image to 448 448, (2) runs a single convolutional network on the image, and () thresholds the resulting detections by the model s confidence. segment objects based on a parts model. Each part is described by appearance and shape features. In [], Arbelaez et al. use a contour detector which has been shown to provide excellent results. Although segmentation methods perform well, they are often computationally expensive. Region olutional Network approaches attempt to use additional neural network to propose bounding boxes. Girshick et al. use convolutional network features to train a pipeline consisting of a classifier and regression step, called R-CNN [11]. R-CNN, fast R-CNN and faster R-CNN are faster and more accurate compared with previous methods. However, their training pipline is complex and still not efficient enough. Single olutional Network approach, namely YOLO, is a fast and accurate detection system. They frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. In this work, I implement and investigate YOLO. 1. Introduction & Related Work For object detection task in the past, Exhaustive search [6, 1, 1], segmentation based approaches [5, ], and Region olutional Network based approaches[11, 10, 16] have been widely used. Exhaustive search approachs attempts to search for all object proposals in an image. The major drawback is computationally expensive. And thus requires constraints to prune the search space and reduce the number of considered locations, at the cost of completeness. In [8], Felzenszwalb et al. performed an exhaustive search using HOG features and a linear SVM. This resulted in excellent object detection performance. To better guide the exhaustive search, in [14], Lampert et al. used an appearance model, combined with a branch and bound technique, to remove scale, aspect ratio, and grid constraints from the previous exhaustive search formulation. However, their method still requires evaluation on over 100,000 proposals [2]. Segmentation approaches attempts to construct object proposals from the underlying pixels. In [12], the authors 2. Feature Extraction & Classification Scheme in One Network: YOLO Yihui is a CS sophomore, with Xi an Jiaotong University. This work is his course project for 2016 Spring CS281B Advanced Computer Vision taught by Prof. Yuan-Fang Wang, when he was an international exchange student at University of California, Santa Barbara. My UCSB email is yihui@umail.ucsb.edu. Perm Number: X2899 In this section, I ll briefly introduce YOLO algoithm that I employed. For more details, you can read the original paper[15]. The pipline of standard YOLO is show in figure1. 1

. Technical Details In this section, I ll discuss technical details in my project..1. Features Bounding boxes + confidence S S grid on input The convolutional neural layers before output are used for generate features. Since the original YOLO described in previous section is a large network. I design a smaller network for this project, which is similar with YOLO. My architecture is shown below(: Full-connected, :convolution, :max-ing, Drop:Dropout) Final detections Class probability map Warp Drop Figure 2: The Model. It divides the image into an S S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S S (B 5 + C) tensor. YOLO unify the separate components of object detection into a single neural network. It uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. First, YOLO divides the input image into an S S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally, confidence is Pr(Object) IOUtruth pred. Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) represent the center of the box.(w, h) represents size of a bounding box. Confidence prediction represents the IOU between the predicted box and any ground truth box. Each grid cell also predicts C conditional class probabilities, Pr(Classi Object). At test time, the conditional class probabilities times the box confidence gives class-specific confidence scores for each box, which encode both the probability of that class appearing in the box and how well the predicted box fits the object. truth truth Pr(Classi Object) Pr(Object) IOUpred = Pr(Classi ) IOUpred image > 448 x 448 x image 448 x448x image, 16 f i l t e r s 448 x448x16 image, 2 s i z e, 2 s t r i d e 224 x224x16 image, 2 f i l t e r s 224 x224x2 image, 2 s i z e, 2 s t r i d e 112 x112x2 image, 64 f i l t e r s 112 x112x64 image, 2 s i z e, 2 s t r i d e 56 x56x64 image, 128 f i l t e r s 56 x56x128 image, 2 s i z e, 2 s t r i d e 28 x28x128 image, 256 f i l t e r s 28 x28x256 image, 2 s i z e, 2 s t r i d e 14 x14x256 image, 512 f i l t e r s 14 x14x512 image, 2 s i z e, 2 s t r i d e xx512 image, f i l t e r s xx image, f i l t e r s xx image, f i l t e r s 5016 i n p u t s, 256 o u t p u t s 256 i n p u t s, 4096 o u t p u t s 4096 i n p u t s, 0. 5 p r o b a b i l i t y 4096 i n p u t s, 140 o u t p u t s There s no explicit features in YOLO,since features are extracted by neural network automatically. Actually, the last convolutional layer can be regard as feature map(xx)..2. Classification Scheme As I ve described in the previous section, YOLO is a unified neural network for multi-objects detection. Specifically, YOLO converts detection task into a regression problem. In this regression problem, the position of bounding boxes and confidence for each class are our regression object Y. The raw image is X. (1).. Dataset Selection & Partition I select two small dataset: PASCAL VOC200(400MB, 9000 images) and Picasso Artwork Dataset[9](218 paintings). Pretraining For objects detection tasks, researchers usually initalize their network with pretrained models on ImageNet. I initalize my neural network with caffe pretrained In practice, I use S =, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Network output is a 0 tensor. The architecture of YOLO is similar with ordinary convolutional neural network(like AlexNet), only difference is object function. So I ll not discuss the architecture in details, which is shown in figure. 2

448 112 448 112 192 56 56 256 28 28 512 14 14 4096 0. Layer xx64-s-2 Max Layer. Layer xx192 Max Layer. Layers 1x1x128 xx256 1x1x256 xx512 Max Layer. Layers 1x1x256 xx512 1x1x512 xx Max Layer. Layers 1x1x512 xx xx xx-s-2. Layers } 4 } 2 xx xx Conn. Layer Conn. Layer Figure : The Architecture. The convolutional layers are pretrained on the ImageNet classification task at half the resolution (224 224 input image) and then double the resolution for detection. Note that since this network is too large, I design a simplified network which is described in section.1. model 1. So that for training part, I only need to perform fine-tuning on PASCAL VOC200. Training & Validation Since PASCAL VOC200 is not large(9000 images) and it is also a commonly used dataset for object detection tasks, I choose it for training and testing. I use 6000 images from PASCAL VOC200 as training dataset, and the rest 000 images from PASCAL VOC200 as validation dataset. Testing I choose Picasso Artwork Dataset as testing dataset for reasons. (i) Artwork dataset is interesting and new for me. (ii) There s many algoithms tested on it. (iii) Since artwork style is different from real world, it can tell whether a classifier have good generalization to other domains..4. Cross-Validation I ve mentioned in previous subsection that I have 6000 images for training and 000 images for testing. So I performed a -fold cross-validation. One also needed to mentioned is that, I also performed exponential moving average, which also helps improving performance and preventing overfitting. This is a special technique for neural network. It is formalized as follow: W t = α W t + (1 α) W t 1 Here, W is the weights of my model, αis the exponential decay rate. So that the final model is an ensemble of previous iterations. I ll show in experiments section that these two techniques both help improve performance. 1 https://github.com/bvlc/caffe/wiki/model-zoo 4. Experiments I implement my YOLO with TensorFlow[1] in python. Training, validation, and testing are performed on my laptop Thinkpad T440p. 4.1. Training & Cross-Validation First, I perform -fold cross-validation on PASCAL VOC200 as described in previous section. For each training time, I also perform exponential moving average weights. To show the effect of cross-validation, I compare standalone models with the combined model. It turns out that the combined model have % map improvement, which have 4% map. To show the effect of exponential moving average weights, I train a model without exponential moving average weights as control group. Then I compare it with the model above. It turns out that Without exponential moving average weights, map drops 1%. After training, I compared my model with other public algoithms on Pascal VOC200, which is shown in table1. It seems obvious that my YOLO implementation has lower map compared with the original paper. That makes sense because my implementation has fewer layers and fewer filters compared with the original paper. As for frames per second, since the benchmark results got by other algorithms are tested on GPU. My result is from my laptop CPU. So it is not comparable. However, you can still see that the FPS is higher that Fast R-CNN. We can safely claim that YOLO has a good trade-off between speed and performance compared with other state-of-the-art algorithms.

Real-Time Detectors 0Hz DPM[18] YOLO Fast R-CNN[10] Faster R-CNN[16] Mine(simple YOLO) Train 200 200+2012 200+2012 200+2012 200 map 26.1 6.4 0.0.2 4 FPS 0 45 0.5 0.(CPU) Table 1: Detection systems comparision PASCAL VOC. Comparing the performance and speed of detectors. YOLO have good trade-off between speed and performance. That s also a reason that I employ YOLO in this project. Figure 5: The two on the left are successful examples. The two on the right are failed examples. Figure 4: Some interesting results in real world Humans I include some interesting detection results in figure4. Some are successful, some failed. YOLO 4.2. Testing My YOLO After training on PASCAL VOC200, I start testing on Picasso Artwork Dataset. Since it is a small dataset with only 218 images, it only takes me minites to go through the full dataset, even on my CPU. Some successful and failed examples are shown in figure5. 4.2.1 DPM Poselets RCNN D&T Tuning Threshold: Precision Recall Trade-off Threshold of object bounding boxes confidence is the most important parameter for me to tune. I ve mentioned in second section that confidence threshold is defined as IoU(intersection over union) between predicted box and the ground truth. a confidence threshold of 50% means that we ll accept proposals that believes their bounding boxes have more that 50% overlap with real object. Increase confidence threshold will lead to less bounding box proposals for each image. Decrease confidence threshold will result in more bounding boxes. It s like the threshold for SIFT, which controls number of feature points. I ve tested 5 value for confidence threshold(100%, 5%, 50%, 25%, 0%) and compare my result with other public algoithms, which is Figure 6: Picasso Dataset precision-recall curves shown in figure6. Human perfomance on Picasso Artwork Dataset is the dot in right up corner. 4.2.2 Comparision with Other Models After considering the trade-off between precision and recall, I select confidence threshold of 50% as my final model, which has precision of 40% and recall of 5%. I compared my YOLO with other algoithms in table2. 4

AP Best F1 YOLO 5. 0.590 R-CNN 10.4 0.225 DPM.8 0.458 Poselets[4] 1.8 0.21 D&T[6] 1.9 0.051 Mine 4 0. Table 2: AP and F 1 score evaluation of different algorithms on Picasso Dataset From this table, you can easily see that, (i) current stateof-the-art is still far away from human perfromance, which also means that current algoithms do not have good generalization on non-real image. (ii) YOLO have a better generalization than other algoithms. (iii) The result I got is reasonable, since my model is a simplified YOLO. 5. Conclusion & Lessons Learned I learned a lot from through this project. Here I draw some conclusions and show some lessons I learned. I extensively read papers on objects detection tasks. Get the general idea of exhaustive search approaches, segmentation approaches and convolutional neural network approach. Important parameters like threshold need to be tuned very carefully. Precision and recall trade-off need to be decide. Cross-Validation is a good way for most models to further improve their accuracy. However, need some time to implement myself, since there s no cross-validation module in most machine learning packages. Objects detection is a computation expensive problem. It takes long time to train, validate and test, if I want to get good perfomance. Though many computer vision tasks like image classification have surpass human performance. However, there s still a lot of work to be done on objects detection task. Finally, thanks to this project, it helps me gain a deep and broad view on objects detection task! [2] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, 2010. 1 [] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 2011. 1 [4] L. Bourdev and J. Malik. Poselets: Body part detectors trained using d human pose annotations. In Computer Vision, 2009 IEEE 12th International Conference on, pages 165 12. IEEE, 2009. 5 [5] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR. IEEE, 2010. 1 [6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 1, 5 [] I. Endres and D. Hoiem. Category independent object proposals. In ECCV. 2010. 1 [8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained partbased models. PAMI, 2010. 1 [9] S. Ginosar, D. Haas, T. Brown, and J. Malik. Detecting people in cubist art. In Computer Vision-ECCV 2014 Workshops, pages 101 116. Springer, 2014. 2 [10] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440 1448, 2015. 1, 4 [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1 [12] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009. 1 [1] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classification. In CVPR. IEEE, 2009. 1 [14] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. PAMI, 2009. 1 [15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. arxiv preprint arxiv:1506.02640, 2015. 1 [16] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91 99, 2015. 1, 4 [1] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 2004. 1 [18] J. Yan, Z. Lei, L. Wen, and S. Li. The fastest deformable part model for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 249 2504, 2014. 4 References [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow. org. 5