Object Detection
CS698N Final Project Presentation
Akshat Agarwal, Siddharth Tanwar


Problem Description
Object detection is arguably the most important part of perception. Long-term goals for object recognition include:
- Generalization from a few examples
- Robustness to illumination, deformation and scale changes
- The 100/100 tracking challenge, discussed by Dieter Fox at ICRA 2016: identify and track 100% of the objects and activities in a scene, with 100% accuracy
- Learning semantic information about objects, such as how to interact with them, by observing and learning their behavior

Dataset: PASCAL VOC
- PASCAL Visual Object Classes (VOC) challenge
- Contains a number of visual object classes in realistic scenes
- 20 classes, ~10k images with ~24k annotated objects
- Annotations contain both the object class and a bounding box
- Released in 2007
- Evaluation: Average Precision (AP) over the entire range of recall; a good score requires both high precision and high recall
Everingham, Mark, et al. "The PASCAL Visual Object Classes (VOC) challenge." International Journal of Computer Vision 88.2 (2010): 303-338.
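For concreteness, the AP metric can be sketched as below. This is a simplified, all-point interpolated version, not the official VOC development kit (VOC2007 originally used 11-point interpolation), and it assumes detections have already been matched to ground truth at some IoU threshold:

```python
import numpy as np

def voc_average_precision(scores, is_true_positive, num_gt):
    """AP over the full recall range for one class.

    scores           -- confidence of each detection
    is_true_positive -- 1 if the detection matches an unmatched ground-truth box
                        (e.g. IoU >= 0.5), else 0
    num_gt           -- number of ground-truth boxes for the class
    """
    order = np.argsort(-np.asarray(scores, dtype=float))   # rank by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp

    recall = np.cumsum(tp) / max(num_gt, 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-12)

    # Make precision monotonically non-increasing, then integrate over recall.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```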

Dataset: KITTI
- Geared towards autonomous driving
- ~15k images, ~80k labeled objects
- Provides LIDAR ground-truth data
- Dense urban scenes with up to 15 cars and 30 pedestrians visible in a single image
- 3 classes: Cars, Pedestrians and Cyclists
Geiger, Andreas, Philipp Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? The KITTI vision benchmark suite." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.

Related Work (mAP on PASCAL VOC in parentheses)
- DPM: uses HOG features at multiple scales, then filters and classifies (33.4)
- Selective Search: bottom-up segmentation by merging similar regions (35)
- R-CNN: extracts features from region proposals, then classifies (59.2)
- Fast R-CNN: introduced an RoI pooling layer which creates fixed-size feature maps for each RoI, then classifies and regresses (70.7 with YOLO)
- Faster R-CNN: introduced a Region Proposal Network (RPN) which generates proposals using convolutional features shared with the detection network (75.9)
- SSD: uses default bounding boxes over different aspect ratios and scales on convolutional feature maps at multiple resolutions; no proposal stage (75.8)
- PVANET: redesigned feature extraction with fewer channels and more layers; uses concatenated ReLU and Inception modules, and combines multi-scale intermediate outputs (82.5)
- R-FCN: fully convolutional network that learns position-sensitive score maps, takes region proposals from an RPN and applies RoI pooling
- Current leader: R-FCN + ResNet ensemble trained on VOC+COCO (mAP 88.4)

Problems
Most of these methods have:
- A large, complex detection pipeline
- Independently, precisely tuned stages
- A slow forward pass:
  - Selective Search takes 2 s/image just to generate proposals
  - R-CNN takes 40 s/image at test time
  - Faster R-CNN runs at 5 FPS
  - R-FCN takes 170 ms/image (~6 FPS)
Dai, Jifeng, et al. "R-FCN: Object detection via region-based fully convolutional networks." arXiv preprint arXiv:1605.06409 (2016).
Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
Uijlings, Jasper R. R., et al. "Selective search for object recognition." International Journal of Computer Vision 104.2 (2013): 154-171.
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).

YOLO: Method Overview
YOLO replaces this pipeline with a single convolutional neural network which:
- Trains its features end-to-end and optimizes them for detection
- Predicts bounding boxes
- Performs non-maximal suppression (sketched below)
- Predicts class probabilities for each bounding box
Only a single network evaluation is required to predict which objects are present and where they are.
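The non-maximal suppression step listed above can be sketched as a generic greedy procedure. This is a textbook NumPy version, not the Darknet implementation; the corner-coordinate box format and the 0.5 threshold are illustrative choices:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximal suppression.

    boxes  -- (N, 4) array of [x1, y1, x2, y2]
    scores -- (N,) confidence scores
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the chosen box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-12)
        # keep only boxes that overlap the chosen box less than the threshold
        order = order[1:][iou < iou_threshold]
    return keep
```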

Unified Detection
- The image is divided into an S x S grid of cells
- Each cell predicts B bounding boxes
- The prediction is encoded as an S x S x (B * 5 + C) tensor
- For evaluation on PASCAL VOC, S = 7, B = 2 and C = 20 are used
Image source: Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).
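As a rough illustration of the size of that prediction (7 x 7 x 30 = 1470 values for the VOC setting), here is a minimal NumPy sketch of splitting it into box predictions and class probabilities. The actual ordering of values inside Darknet's output differs, so treat this layout as an assumption for illustration only:

```python
import numpy as np

S, B, C = 7, 2, 20                       # VOC setting from the slide
output = np.zeros(S * S * (B * 5 + C))   # 1470 values from the final layer

pred = output.reshape(S, S, B * 5 + C)                # one 30-vector per grid cell
boxes = pred[..., :B * 5].reshape(S, S, B, 5)         # (x, y, w, h, confidence) per box
class_probs = pred[..., B * 5:]                       # C conditional class probabilities per cell
```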

Unified Detection
- Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object), conditioned on the cell containing an object
- At test time these are multiplied with the per-box confidence to give class-specific confidence scores:
  Pr(Class_i | Object) * Pr(Object) * IOU(truth, pred) = Pr(Class_i) * IOU(truth, pred)
Image source: Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).
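Continuing the layout sketch above, the multiplication in that equation can be written out as follows; random arrays stand in for network outputs, and the shapes follow the S x S x (B * 5 + C) encoding:

```python
import numpy as np

S, B, C = 7, 2, 20
boxes = np.random.rand(S, S, B, 5)       # (x, y, w, h, confidence); random stand-ins
class_probs = np.random.rand(S, S, C)    # Pr(Class_i | Object) per grid cell

box_conf = boxes[..., 4]                                          # Pr(Object) * IOU(truth, pred)
class_scores = box_conf[..., None] * class_probs[:, :, None, :]   # shape (S, S, B, C)
# class_scores[i, j, b, c] ~ Pr(Class_c) * IOU(truth, pred) for box b of cell (i, j)
```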

Our Network: Scaled Version of YOLO
- 8 convolutional layers alternating with max-pool layers, followed by 1 fully connected layer
- Pre-trained on the ImageNet classification task
- The fully connected layer outputs an S*S*(B*5+C) vector, where S is the grid size, B is the number of bounding boxes per cell and C is the number of classes (20 for the VOC dataset)
- Batch sizes of 32, 64 and 128 were used to make full use of the available GPU memory
- Dropout of 0.5 to prevent co-adaptation
- Trained for roughly 250 epochs
- Data augmentation: random scaling and translations of up to 20% of the original image size
- Leaky ReLU used as the activation function in the convolutional layers
(an architecture sketch follows below)
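The slide only fixes the depth (8 conv layers with max-pools, then one fully connected layer), the activation, the dropout and the output size, so the following sketch fills in the rest. It is written in PyTorch rather than Darknet, and the 448x448 input, the channel widths and the pooling positions are tiny-YOLO-like assumptions of ours, not the authors' configuration:

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20   # grid size, boxes per cell, classes (VOC)

def conv_block(in_ch, out_ch, pool):
    """3x3 conv + leaky ReLU, optionally followed by a 2x2 max-pool."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.LeakyReLU(0.1, inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2, 2))
    return layers

class ScaledYOLO(nn.Module):
    def __init__(self):
        super().__init__()
        # 8 conv layers; channel widths and pool positions are assumptions.
        cfg = [(3, 16, True), (16, 32, True), (32, 64, True), (64, 128, True),
               (128, 256, True), (256, 512, True), (512, 1024, False), (1024, 1024, False)]
        layers = []
        for in_ch, out_ch, pool in cfg:
            layers += conv_block(in_ch, out_ch, pool)
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                                   # dropout of 0.5, as on the slide
            nn.Linear(1024 * S * S, S * S * (B * 5 + C)),      # S*S*(B*5+C) output vector
        )

    def forward(self, x):                                      # x: (N, 3, 448, 448)
        return self.head(self.features(x)).view(-1, S, S, B * 5 + C)

# Shape check: ScaledYOLO()(torch.zeros(1, 3, 448, 448)).shape -> (1, 7, 7, 30)
```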

Results on VOC: Comparison with Reported Numbers
- Since the VOC test evaluation is not open, we could not use the VOC2012 validation set during training and instead tested on it. We have ~10.7k training images and ~5.8k testing images.
- The author used the VOC2012 validation set for training; his results are with ~16.5k training images and ~11k testing images.

Network pre-trained on ImageNet only: 0.27 mAP
Network trained by us on VOC: 44.65 mAP
Reported by the author: 52.7 mAP

Source: http://host.robots.ox.ac.uk:8080/pascal/voc/voc2007/dbstats.html

Results: Parameter Study
Analysis was done varying two parameters:
- Number of grid cells (S)
- Number of bounding boxes per cell (B)
This was done to improve YOLO's performance in cluttered scenes.
[Figure: (a) original image with 4 bicyclists; (b) detections with S = 7: 2 people and 2 bicycles; (c) detections with S = 9: 5 people and 5 bicycles]

Results: Parameter Study, Quantitative Analysis
Results obtained by varying the grid size S, with B kept fixed at 2:

Class       | S = 5 (AP) | S = 7 (AP) | S = 9 (AP)
Aeroplane   | 58.69      | 61.70      | 67.10
Bus         | 64.17      | 61.74      | 67.11
Cat         | 61.57      | 59.10      | 70.09
Train       | 59.40      | 58.44      | 63.44
Person      | 47.88      | 48.84      | 55.19
Overall mAP | 38.86      | 36.92      | 44.65

- S = 9 heavily outperforms S = 7 in all object classes
- S = 5 outperforms S = 7 in the large-object classes, possibly because a finer grid is not required for them

Results: Parameter Study, Quantitative Analysis
Results obtained by varying the number of bounding boxes per grid cell B, with S kept fixed at 9:

Class       | B = 2 (AP) | B = 4 (AP) | B = 6 (AP)
Aeroplane   | 67.10      | 60.92      | 56.53
Bus         | 67.11      | 65.09      | 55.01
Cat         | 70.09      | 59.60      | 50.07
Train       | 63.44      | 61.14      | 52.35
Person      | 55.19      | 50.81      | 45.77
Overall mAP | 44.65      | 38.10      | 27.38

Surprisingly, B = 2 outperforms B = 4 and B = 6 by a significant margin. We postulate that this is because a larger B increases the chance of the wrong box being selected from a grid cell.

Results: Precision-Recall Curves

Results: Some Examples on the VOC Dataset

Results: Artwork Corpus and Images in the Wild

Results: Examples on the KITTI Object Detection Task

Results: Videos
The smaller YOLO model runs at around 50-70 FPS on an NVIDIA GeForce GTX 760 GPU. Two videos are played: one shows detection using the network trained on VOC, and the other uses the network trained on the KITTI Vision Benchmark Suite for object detection.

Our Work
- Did an extensive parameter study on YOLO, studying the changes observed when increasing or decreasing the grid resolution and when increasing the number of bounding boxes per grid cell
- Obtained object detection results on the KITTI benchmark using a YOLO network pre-trained on ImageNet
Broadly speaking, both of us did most of the work together. However, Akshat worked on converting the KITTI data into VOC- and Darknet-compatible formats (see the conversion sketch below), and Siddharth worked on the parameter study for YOLO.
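As an illustration of the kind of conversion this involved, here is a minimal sketch of turning one KITTI label file into Darknet's normalized "class x_center y_center width height" text format. The class mapping, the helper name and the example image size are assumptions; the project's actual conversion script is not shown on the slides:

```python
# Sketch: convert a KITTI label file ("type truncated occluded alpha x1 y1 x2 y2 ...")
# into Darknet's format, with coordinates normalized by the image size.

KITTI_CLASSES = {"Car": 0, "Pedestrian": 1, "Cyclist": 2}   # the 3 classes used here

def kitti_to_darknet(kitti_label_path, img_width, img_height):
    lines = []
    with open(kitti_label_path) as f:
        for row in f:
            fields = row.split()
            cls = fields[0]
            if cls not in KITTI_CLASSES:            # skip DontCare, Van, Truck, ...
                continue
            x1, y1, x2, y2 = map(float, fields[4:8])   # pixel bounding box corners
            x_center = (x1 + x2) / 2.0 / img_width
            y_center = (y1 + y2) / 2.0 / img_height
            width = (x2 - x1) / img_width
            height = (y2 - y1) / img_height
            lines.append(f"{KITTI_CLASSES[cls]} {x_center:.6f} {y_center:.6f} "
                         f"{width:.6f} {height:.6f}")
    return "\n".join(lines)

# Example (illustrative image size): kitti_to_darknet("label_2/000123.txt", 1242, 375)
```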

Thank You