SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

Similar documents
Object detection with CNNs

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

Modern Convolutional Object Detectors

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

YOLO 9000 TAEWAN KIM

R-FCN: Object Detection with Really - Friggin Convolutional Networks

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

Lecture 5: Object Detection

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors

Object Detection on Self-Driving Cars in China. Lingyun Li

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Yiqi Yan. May 10, 2017

YOLO9000: Better, Faster, Stronger

Efficient Segmentation-Aided Text Detection For Intelligent Robots

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Object Detection Based on Deep Learning

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs

Photo OCR ( )

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

Fully Convolutional Networks for Semantic Segmentation

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia

Rich feature hierarchies for accurate object detection and semantic segmentation

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

Spatial Localization and Detection. Lecture 8-1

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Feature-Fused SSD: Fast Detection for Small Objects

Regionlet Object Detector with Hand-crafted and CNN Feature

SSD: Single Shot MultiBox Detector

CS 1674: Intro to Computer Vision. Object Recognition. Prof. Adriana Kovashka University of Pittsburgh April 3, 5, 2018

Towards Weakly- and Semi- Supervised Object Localization and Semantic Segmentation

Rich feature hierarchies for accurate object detection and semant

RON: Reverse Connection with Objectness Prior Networks for Object Detection

Pixel Offset Regression (POR) for Single-shot Instance Segmentation

arxiv: v1 [cs.cv] 26 Jun 2017

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

arxiv: v1 [cs.cv] 15 Oct 2018

Lecture 7: Semantic Segmentation

Unsupervised Deep Learning. James Hays slides from Carl Doersch and Richard Zhang

arxiv: v1 [cs.cv] 26 May 2017

Finding Tiny Faces Supplementary Materials

G-CNN: an Iterative Grid Based Object Detector

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus

arxiv: v1 [cs.cv] 31 Mar 2016

Semantic Segmentation

Team G-RMI: Google Research & Machine Intelligence

arxiv: v1 [cs.cv] 9 Aug 2017

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Amodal and Panoptic Segmentation. Stephanie Liu, Andrew Zhou

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA

YOLO: You Only Look Once Unified Real-Time Object Detection. Presenter: Liyang Zhong Quan Zou

arxiv: v2 [cs.cv] 8 Apr 2018

AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)

Flow-Based Video Recognition

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material

arxiv: v2 [cs.cv] 23 Jan 2019

arxiv: v1 [cs.cv] 22 Mar 2018

Xiaowei Hu* Lei Zhu* Chi-Wing Fu Jing Qin Pheng-Ann Heng

An Anchor-Free Region Proposal Network for Faster R-CNN based Text Detection Approaches

Mask R-CNN. By Kaiming He, Georgia Gkioxari, Piotr Dollar and Ross Girshick Presented By Aditya Sanghi

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL

arxiv: v2 [cs.cv] 30 Mar 2016

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm

Advanced Video Analysis & Imaging

TextBoxes++: A Single-Shot Oriented Scene Text Detector

Traffic Multiple Target Detection on YOLOv2

Cascade Region Regression for Robust Object Detection

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University.

Yield Estimation using faster R-CNN

Improving Small Object Detection

arxiv: v4 [cs.cv] 30 Nov 2016

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Deep Learning for Object detection & localization

Learning Globally Optimized Object Detector via Policy Gradient

arxiv: v2 [cs.cv] 24 Mar 2018

Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies

Pedestrian Detection based on Deep Fusion Network using Feature Correlation

Hand Detection For Grab-and-Go Groceries

Martian lava field, NASA, Wikipedia

An algorithm for highway vehicle detection based on convolutional neural network

arxiv: v3 [cs.cv] 12 Mar 2018

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

Automatic detection of books based on Faster R-CNN

Pelee: A Real-Time Object Detection System on Mobile Devices

Object Proposal Generation with Fully Convolutional Networks

Learning to Segment Object Candidates

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Hide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization

Structured Prediction using Convolutional Neural Networks

EFFECTIVE OBJECT DETECTION FROM TRAFFIC CAMERA VIDEOS. Honghui Shi, Zhichao Liu*, Yuchen Fan, Xinchao Wang, Thomas Huang

DEEP NEURAL NETWORKS FOR OBJECT DETECTION

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. By Joa õ Carreira and Andrew Zisserman Presenter: Zhisheng Huang 03/02/2018

Object Detection. Part1. Presenter: Dae-Yong

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Adaptive Feeding: Achieving Fast and Accurate Detections by Adaptively Combining Object Detectors

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan

Places Challenge 2017

Transcription:

SSD: Single Shot MultiBox Detector Author: Wei Liu et al. Presenter: Siyu Jiang

Outline 1. Motivations 2. Contributions 3. Methodology 4. Experiments 5. Conclusions 6. Extensions

Motivation

Motivation Faster R-CNN: Box Classification and Regression are being done 2 times.

Motivation 1. Two stage object detection is time-consuming. Faster R-CNN is faster but not fast enough.

Motivation YOLO: Fast but not accurate enough.

Motivation 1. Two stage object detection is time-consuming. Faster R-CNN is faster but not fast enough. 2. YOLO: Fast but not accurate enough. 3. Object detection needs a good tradeoff between accuracy and speed.

Contribution SSD: Single Shot MultiBox Detection (ECCV 2016)

Contribution 1. Achieves competitive map (72.1) as faster R-CNN (73.2). 2. Much faster (58 fps) than faster R-CNN (7fps) and YOLO (45fps), making accurate real-time detection possible. 3. Makes predictions on multiple feature maps with different resolutions to handle objects of different sizes.

Network Architecture VGG network - extract features Detectors on multiple feature maps

Default (Anchor) Boxes Similar to faster R-CNN, SSD also uses anchor boxes. At each feature map position, anchor boxes have different aspect ratios. S k a r S k a r a r : Aspect ratio a r {1, 2, 3, 1 2, 1 3 } Feature Map S k a r S k a r S k S k Each position has anchor boxes of different aspect ratio

Default (Anchor) Boxes Anchor boxes of different feature maps have unique scales. So different feature map are responsible for objects of different sizes. S k denotes the scale of of the k-th feature map. m is the number of feature maps for prediction. S min = 0.2, S max = 0.9.

Convolutional Filters for Prediction On eachfeature map, two types of convolutional filters will be applied: C filters for category prediction, where C is the number of object categories. 4 filters for bounding box regression, 4 for the coordinates x, y, w, h. Together there will be c + 4 k filters for each feature map, k is the number of anchor box types. The output for a m n feature map will be a map of m n c + 4 k, indicating the category and coordinates for each bounding box. m n d feature map 3 3 c + 4 k convolutional filter m n c + 4 k predicted maps

Training objective The loss used in SSD is a combination of confidence loss and localization loss. Confidence loss is the softmax loss over multiple classes confidences. Localization loss is the same smooth L1 loss as faster R-CNN.

Matching Strategy During training, we need to determine which anchor boxes are corresponding to ground truth boxes. 1. Match each ground truth box to the anchor box with the best jaccard overlap. 2. Match default box to any ground truth with jaccard overlap higher than 0.5. Question : Why not using predicted box (anchor box after regression) for matching? Question : Negative anchor boxes are much more than positive anchor boxes, how to deal with this imbalance?

Hard Negative Mining Number of negative anchor boxes >> number of positive anchor boxes Calculate the confidence loss of eachnegative anchor boxes. Select anchor boxes that have highest loss as negative training samples. Keep the ratio between negatives and positives 3:1. This leads to faster convergence and stable training. Question: Isthis approach sufficient for training negative samples? No. A recent paper show that negative samples need more elegant loss calculation. (ICCV 2017 best student paper award: Focal Loss for Dense Object Detection, Linet al.)

Experiments Base network: VGG16, pretrained on ILSVRC dataset. Add layers on top of VGG Conv5 layer. Use dilated convolution in Conv6 layer. Dataset: Pascal VOC and MS COCO.

Results using multiple layers for prediction Predicting on multiple layers is better. Removing boundary boxes hurts the accuracy for high level feature maps.

Results scale & data augmentation Using more scales of anchor boxes is better. Data Augmentation is crucial to SSD. Increases map by 8.8%.

Data Augmentation Photo-metric distortions. Random crop: Use the entire original input image. Crop a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7 or 0.9. Randomly crop a patch. Question: Will data augmentation also benefit this much to Faster R- CNN? crop & resize No, because faster R-CNN uses a feature pooling step which is relatively robust to object translation.

Results small objects SSD does not perform well on small objects. No feature resampling step in SSD. Relatively low-level feature maps are responsible for detecting small objects. These feature maps do not have sufficient high-level semantic information.

More data augmentation for small objects Keep small objects small. Random expand 4x width 4x height Fill with grey pixels Random sample crop & resize

Results after data augmentation for small objects An increase of 2% - 3% in map.

Results - comparison with other methods It is claimed in the paper that SSD outperforms Faster R-CNN and YOLO in both speed and accuracy. SSD300 is the first real-time method to achieve above 70% map.

Results - comparison with other methods Outperforms Faster R-CNN in every category. VOC 2007, VOC 2012, MS COCO. Notice that the input image size of Faster R-CNN is at least 600 x 600.

Results - comparison with other methods Outperforms Faster R-CNN in every category. VOC 2007, VOC 2012, MS COCO. Notice that the input image size of Faster R-CNN is at least 600 x 600.

Are the results really this good? Speed/accuracy trade-offs for modern convolutional object detectors (CVPR 2017) A detailed comparison of SSD, Faster R-CNN, and R-FCN. SSD is not as good as faster R-CNN or R- FCN in most cases. Better in large objects, worse in small ones.

Results - comparison with other methods SSD leads in detection speed; it is good at speed / accuracy tradeoff.

Conclusion Introduces a single-stage detector for object detection. Uses convolutional predictor on multiple feature maps, each responsible for a unique scale of objects. Achieves competitive accuracy and faster speed on various datasets.

Extensions? Use skip connections and deconvolution to integrate low-level location information and high-level semantic information. Feature Pyramid Networks for Object Detection (CVPR 2017) DSSD : Deconvolutional Single Shot Detector (2017)

References https://www.youtube.com/watch?v=p8e-g-mhx4k Huang, Jonathan, et al. "Speed/accuracy trade-offs for modern convolutional object detectors." IEEE CVPR. 2017. Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." CVPR. Vol. 1. No. 2. 2017. Fu, Cheng-Yang, et al. "DSSD: Deconvolutional single shot detector." arxiv preprint arxiv:1701.06659 (2017).