YOLO9000: Better, Faster, Stronger


YOLO9000: Better, Faster, Stronger
Date: January 24, 2018
Prepared by Haris Khan (University of Toronto)
CSC2548: Machine Learning in Computer Vision

Overview
1. Motivation for one-shot object detection and weakly-supervised learning
2. YOLO
3. YOLOv2 / YOLO9000
4. Future Work

One-Shot Detection
Eliminates the region proposal step used in R-CNN [4], Fast R-CNN [5] and Faster R-CNN [6]
Motivation: develop object detection methods that predict bounding boxes and class probabilities at the same time
Want to achieve real-time detection speeds
Maintain or exceed the accuracy benchmarks set by previous region proposal methods

Improving Detection Datasets
VOC 2007 / 2012: 20 classes, e.g. person, cat, dog, car, chair, bottle
MS COCO: 80 classes, e.g. book, apple, teddy bear, scissors
ImageNet1000: 1000 classes, e.g. German shepherd, golden retriever, European fire salamander
Motivation: increase the number and detail of classes that can be learned during training using existing detection and classification datasets

You Only Look Once (YOLO) [1]
1. Divide the input image into an S × S grid; assume each grid cell contains at most B objects.
2. Each grid cell predicts B bounding box feature vectors [x, y, w, h, c] and probabilities for C object classes.
3. Merge predictions into an S × S × (5B + C) output tensor.
PASCAL VOC: S = 7, B = 2, C = 20, so the output tensor size = 7 × 7 × 30
Image Credit: [1]
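As a quick sanity check of the output tensor size above, a minimal sketch (variable names are illustrative only):

```python
# Output tensor shape for YOLO on PASCAL VOC.
S, B, C = 7, 2, 20      # grid size, boxes per cell, number of classes
depth = 5 * B + C       # each box contributes [x, y, w, h, c]
print((S, S, depth))    # (7, 7, 30), i.e. the 7 x 7 x 30 output tensor
```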

YOLO - Architecture
Inspired by GoogLeNet: 24 convolutional layers + 2 fully connected layers
Grid creation, bounding box and class predictions
Image Credit: [1]

YOLO - Training Loss
Only back-propagate the box loss terms when an object is present in the corresponding grid cell
Image Credit: [1]
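For reference, the loss shown in the slide's figure is the sum-squared error defined in [1]; in the paper's notation, where $\mathbb{1}_{ij}^{\text{obj}}$ indicates that box predictor $j$ in cell $i$ is responsible for an object:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```

with $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$ in [1]. The coordinate and object-confidence terms are only active where $\mathbb{1}_{ij}^{\text{obj}} = 1$, which is what "only back-propagate when an object is present" refers to.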

YOLO - Test Results
Primary evaluation done on the VOC 2007 and VOC 2012 test sets
[Tables: VOC 2007 test results; VOC 2012 test results. Table Credits: [1]. Speed measured on a Titan X GPU.]

YOLO - Limitations
Produces more localization errors than Fast R-CNN
Struggles to detect small, repeated objects (e.g. flocks of birds)
Bounding box priors are not used during training
Image Credit: [1]

YOLO9000 - Paper Overview
YOLOv2 [2]: modified version of the original YOLO that increases detection speed and accuracy
YOLO9000 [2]: training method that increases the number of classes a detection network can learn by using weakly-supervised training on the union of detection (e.g. VOC, COCO) and classification (e.g. ImageNet) datasets

YOLOv2 - Modifications (bounding boxes, architecture, training)
Modification -> Effect
- Anchor boxes -> 7% recall increase
- Dimension clusters + new bounding box parameterization -> 4.8% mAP increase
- New Darknet-19 replaces GoogLeNet -> 33% computation decrease, 0.4% mAP increase
- Convolutional prediction layer -> 0.3% mAP increase
- Batch normalization -> 2% mAP increase
- High-resolution fine-tuning of weights -> 4% mAP increase
- Multi-scale images -> 1.1% mAP increase
- Passthrough for fine-grained features -> 1% mAP increase

YOLOv2 - Bounding Boxes
Anchor boxes allow multiple objects of various aspect ratios to be detected in a single grid cell
Anchor box sizes are determined by k-means clustering of the VOC 2007 training set
k = 5 provides the best trade-off between average IOU and model complexity (average IOU = 61.0%)
Feature vector parameterization directly predicts the bounding box centre point, width and height
Image Credit: [2]
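The clustering and parameterization details come from [2]: k-means uses an IOU-based distance instead of Euclidean distance, and each box is predicted relative to its grid-cell offset $(c_x, c_y)$ and prior dimensions $(p_w, p_h)$:

```latex
d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})

\begin{aligned}
b_x &= \sigma(t_x) + c_x, \qquad & b_y &= \sigma(t_y) + c_y, \\
b_w &= p_w \, e^{t_w},    \qquad & b_h &= p_h \, e^{t_h},
\end{aligned}
\qquad \Pr(\text{object}) \cdot \text{IOU}(b, \text{object}) = \sigma(t_o)
```

Constraining the centre offsets with the logistic function $\sigma(\cdot)$ keeps each predicted box inside its grid cell, which [2] reports makes early training more stable.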

YOLOv2 - Darknet-19
19 convolutional layers and 5 max-pooling layers
Reduced number of FLOPs per forward pass:
- VGG-16: 30.67 billion
- YOLO: 8.52 billion
- YOLOv2: 5.58 billion
[Table: Darknet-19 for image classification. Table Credit: [2]]
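A minimal sketch of the repeated building block in Darknet-19 (a PyTorch-style illustration, not the reference implementation; the channel widths below are only an example of one early stage):

```python
import torch.nn as nn

def darknet_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 convolution -> batch norm -> leaky ReLU, the basic Darknet-19 unit."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# Darknet-19 alternates 3x3 blocks with 1x1 "bottleneck" convolutions and
# 2x2 max pooling; one early stage might look like:
stage = nn.Sequential(
    darknet_block(64, 128),
    nn.Conv2d(128, 64, kernel_size=1, bias=False),  # 1x1 compression
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.1, inplace=True),
    darknet_block(64, 128),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```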

YOLOv2 - Example
Video link: https://youtu.be/cgxsv1rijhi?t=290

YOLO9000 - Concept
[Figures. Image Credits: [2]]

Slide Credit: Joseph Redmon [3]

YOLO9000 - WordTree
[Figure: WordTree label hierarchy. Image Credit: [2]]
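With WordTree, the network predicts a conditional probability at every node, and the absolute probability of a class is the product of the conditionals along its path to the root; the example below follows the one given in [2]:

```latex
\Pr(\text{Norfolk terrier}) = \Pr(\text{Norfolk terrier} \mid \text{terrier})
  \cdot \Pr(\text{terrier} \mid \text{hunting dog}) \cdots
  \Pr(\text{mammal} \mid \text{animal})
  \cdot \Pr(\text{animal} \mid \text{physical object})
```

For classification the image is assumed to contain an object, so $\Pr(\text{physical object}) = 1$; for detection, YOLOv2's objectness prediction provides that value.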

Slide Credit: Joseph Redmon [3]

Image Credit: Joseph Redmon [3]

Image Credit: Joseph Redmon [3]

YOLOv2 - Detection Training
Datasets: VOC 2007+2012, COCO trainval35k
Data augmentation: random crops, colour shifting
Hyperparameters: 160 epochs, learning rate = 0.001, weight decay = 0.0005, momentum = 0.9
Training enhancements:
- Batch normalization
- High-resolution fine-tuning
- Multi-scale images
- Three 3×3 and 1×1 convolutional layers replace the last convolutional layer of the Darknet-19 base model
- Passthrough connection between the 3×3×512 layer and the second-to-last convolutional layer, adding fine-grained features to the prediction layer

YOLO9000 - Detection Training
Datasets (9418 classes in the combined WordTree):
- ImageNet (top 9000 classes)
- COCO detection dataset
- ImageNet detection challenge classes
Bounding boxes: minimum IOU threshold = 0.3, number of dimension clusters = 3
Backpropagating loss (see the sketch below):
- For detection images, backpropagate the full loss as in YOLOv2
- For weakly-supervised classification images, backpropagate only the classification loss, computed at the bounding box that predicts the highest probability for that class on the WordTree
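A minimal sketch of how the joint training routes the loss (illustrative pseudocode; names such as `yolov2_detection_loss` and `wordtree_classification_loss` are hypothetical helpers, not functions from the paper or the Darknet code):

```python
def joint_training_loss(predictions, sample):
    """Route the gradient depending on what labels the sample carries."""
    if sample.has_boxes:
        # Detection image (e.g. COCO): backpropagate the full YOLOv2 loss
        # (localization + objectness + WordTree classification).
        return yolov2_detection_loss(predictions, sample.boxes, sample.labels)

    # Classification-only image (e.g. ImageNet): pick the predicted box that
    # assigns the highest probability to the image-level label, then
    # backpropagate only the classification loss along its WordTree path.
    best_box = max(predictions.boxes,
                   key=lambda box: box.class_probs[sample.label])
    return wordtree_classification_loss(best_box, sample.label)
```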

YOLOv2 - Test Results
[Chart: VOC 2007 test results. Image Credit: Joseph Redmon [3]]

[Tables: VOC 2012 test results; COCO test-dev 2015 results. Table Credits: [2]]

YOLO9000 - Test Results
Evaluated on the ImageNet detection task: 200 classes total
- 44 classes with detection labels shared between ImageNet and COCO
- 156 unsupervised classes (no detection labels)
Overall detection accuracy = 19.7% mAP; 16.0% mAP achieved on the unsupervised classes
[Table: best and worst classes on ImageNet. Table Credit: [2]]

YOLO9000 - Paper Evaluation
Strengths:
- Speed of YOLOv2 far exceeds competitors (e.g. SSD)
- Anchor box priors obtained via clustering allow the detector to learn ideal aspect ratios from the training data
- The WordTree method increases the number of learnable classes using existing datasets
Weaknesses:
- Detection performance of YOLOv2 on COCO is well below state-of-the-art
- The description of how the loss function uses unsupervised training examples is vague
- Results from the YOLO9000 tests are inconclusive
- Does not compare the method with alternative weakly-supervised techniques

Future Work
- Improve the accuracy of one-shot detectors in dense object scenes (e.g. RetinaNet [7])
- Investigate the transferability of weakly-supervised training to other domains, such as image segmentation or dense captioning

Questions?

References
[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[2] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint arXiv:1612.08242, 2016.
[3] J. Redmon, "YOLO9000: Better, Faster, Stronger," presented at CVPR, 2017.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
[5] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," arXiv preprint arXiv:1708.02002, 2017.