Object detection using Region Proposals (RCNN) Ernest Cheung COMP790-125 Presentation


Object detection using Region Proposals (RCNN) Ernest Cheung COMP790-125 Presentation 1

2 Problem to solve: Object detection. Input: Image. Output: Bounding box of the object.

3 Object detection using CNN Faster R-CNN 78.8 VOC 2012

4 Transforming the problem to Classification. Krizhevsky et al. [A25] showed substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [A9, A10] using a ConvNet trained on 1.2 million labeled images, together with a few twists on LeCun's CNN.

5 Transforming the problem to Classification. Classification: Image => Class Label. Detection: Image => Bounding box. Image source: [B]

6 Transforming the problem to Classification. Krizhevsky et al.'s work, ConvNet: Image => 1000 class labels. Image source: [A25]

7 Transforming the problem to Classification Localizing object with a deep network using Region proposals Image source: [A]

8 Outline: Region Proposals; RCNN training; RCNN fine-tuning and results; RCNN variants; LCrowdV: generating a large amount of data to train Faster R-CNN.

9 Region Proposals

10 RCNN

11 Region Proposal. A variety of work uses different approaches to generate region proposals: objectness [A1], selective search [C], category-independent object proposals [A14], constrained parametric min-cuts (CPMC) [A5], multi-scale combinatorial grouping [A3], and CNN [A6].

12 Selective Search for producing Region Proposals. Algorithm design challenges: Capture All Scales; Diversification; Fast to Compute. Image source: [C]


14 Selective Search for producing Region Proposals. Algorithm design challenges: Capture All Scales; Diversification, since no single cue separates all objects: a) objects appear at different scales, b) textures are the same, c) colors are the same, d) the wheels differ from the car in both color and texture; Fast to Compute. Image source: [C]

15 Selective Search for producing Region Proposals. Outline of Algorithm: 1. Initialize the regions using [C13]. 2. Greedily group regions together by selecting the pair with the highest similarity. 3. Repeat until the whole image becomes a single region. 4. This generates a hierarchy of bounding boxes.
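The greedy grouping loop outlined above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [C]: the region representation (a dict with `box` and `size`), the helper names, and the toy similarity function are all assumptions, and for brevity every pair of regions is considered rather than only neighbouring ones.

```python
from itertools import combinations

def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) that encloses boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def hierarchical_grouping(regions, similarity):
    """regions: list of dicts with 'box' and 'size' (hypothetical format).
    Returns the boxes of all regions created along the hierarchy."""
    active = list(regions)
    proposals = [r["box"] for r in active]
    while len(active) > 1:
        # Step 2: select the pair of current regions with the highest similarity.
        i, j = max(combinations(range(len(active)), 2),
                   key=lambda p: similarity(active[p[0]], active[p[1]]))
        merged = {"box": union_box(active[i]["box"], active[j]["box"]),
                  "size": active[i]["size"] + active[j]["size"]}
        # Step 3: replace the pair by the merged region and repeat.
        active = [r for k, r in enumerate(active) if k not in (i, j)] + [merged]
        # Step 4: every merged region contributes one bounding box.
        proposals.append(merged["box"])
    return proposals

# Toy similarity (assumption): prefer merging small regions first.
def small_first(a, b):
    return -(a["size"] + b["size"])
```

Starting from three initial regions, the loop performs two merges and returns five boxes, the last of which covers the union of all regions.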


18 Similarity Heuristic. Defined by a combination of: Color similarity (histogram intersection); Texture (HOG-like feature, Gaussian derivatives in 8 directions); Size (encourages small regions to merge early); Shape (how well two regions fit together).

19 Similarity Heuristic. Defined by a combination of: Color similarity (histogram intersection, 25 bins for each color channel); Texture (HOG-like feature, Gaussian derivatives in 8 directions); Size (encourages small regions to merge early); Shape (how well two regions fit together). Image source: [C]

20 Similarity Heuristic. Defined by a combination of: Color similarity (histogram intersection); Texture (HOG-like feature, Gaussian derivatives in 8 directions, a 10-bin histogram for each direction, compared by histogram intersection); Size (encourages small regions to merge early); Shape (how well two regions fit together). Image source: [C]
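A rough sketch of the four measures may help. The size and fill formulas follow the definitions given in [C] as far as this sketch is aware; the function names, the pixel-array input format, and the value range of 0 to 255 are assumptions made for illustration.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Sum of element-wise minima of two L1-normalised histograms.
    Used for both the color and the texture similarity."""
    return float(np.minimum(h1, h2).sum())

def color_histogram(pixels, bins=25):
    """pixels: (n, 3) array with values in [0, 255] (assumed layout).
    Builds a 25-bin histogram per channel and L1-normalises the result."""
    hists = [np.histogram(pixels[:, c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def size_similarity(size_a, size_b, image_size):
    """Close to 1 when both regions are small, so small regions merge early."""
    return 1.0 - (size_a + size_b) / image_size

def fill_similarity(size_a, size_b, bbox_size, image_size):
    """Close to 1 when the joint bounding box (bbox_size pixels) has few
    pixels outside the two regions, i.e. the regions fit together well."""
    return 1.0 - (bbox_size - size_a - size_b) / image_size
```

Two regions with identical color histograms have an intersection of 1, and two regions with no overlapping bins have an intersection of 0.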



25 Selective Search for producing Region Proposals

26 RCNN training

27 RCNN

28 Feature Extraction. [A25] takes a 227 x 227 pixel image as input. [A] uses the simplest approach to convert the region proposals to CNN input: warping each region to that size, regardless of its size or aspect ratio.
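The warping step can be illustrated with plain NumPy. This is only a sketch under stated assumptions: the function name and the (x1, y1, x2, y2) box format are hypothetical, nearest-neighbour sampling is used for brevity, and any extra context padding around the box is left out.

```python
import numpy as np

def warp_region(image, box, out_size=227):
    """Crop box = (x1, y1, x2, y2) from an (H, W, C) image and stretch it to
    out_size x out_size with nearest-neighbour sampling, ignoring the
    original aspect ratio."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Map each output row/column back to a source row/column of the crop.
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return crop[rows[:, None], cols]
```

A 100 x 40 proposal and a 30 x 90 proposal both come out as 227 x 227 inputs, which is what lets a fixed-size CNN process arbitrary regions.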

29 Training. The R-CNN is based on Krizhevsky et al. [A25]. [A25] produces a 4096-dimensional feature vector.

30 RCNN

31 Classify regions. With around 2000 region proposals obtained in step 2, 2000 CNN feature vectors are computed in step 3. In step 4, one linear SVM per class is used to score the features. Non-maximum suppression: a scored region is rejected if its IoU overlap with a higher-scoring region is greater than a learned threshold. Image source: http://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
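The IoU measure and the greedy non-maximum suppression step can be sketched as follows. The box format (x1, y1, x2, y2) and the function names are assumptions, and the threshold is passed in as a parameter because the slide only states that it is learned.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_maximum_suppression(boxes, scores, threshold):
    """Greedy NMS for one class: visit boxes from highest to lowest score and
    keep a box only if it does not overlap an already kept box by more than
    the threshold. Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep
```

In R-CNN this would be run independently for each class on the SVM scores of that class.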

32 RCNN Fine-tuning & Results

33 Supervised pre-training. The CNN is pre-trained on ILSVRC2012 classification using image-level annotations only. The authors report that its performance nearly matches that of the original model in [A25].

34 Fine-tuning. Continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals. Replace the last 1000-way classification layer with a randomly initialized (N+1)-way classification layer, where N is the number of object classes and the extra class is background.
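As a small illustration of the layer swap, the sketch below builds a freshly initialized (N+1)-way linear layer on top of a 4096-dimensional feature. The function name, the Gaussian initialization, and its standard deviation are assumptions; only the output width N+1 and the random initialization come from the slide.

```python
import numpy as np

def new_classification_layer(num_classes, feature_dim=4096, std=0.01, seed=0):
    """Randomly initialised (N+1)-way linear layer: N object classes plus
    one background class. std and seed are illustrative choices."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(0.0, std, size=(feature_dim, num_classes + 1))
    bias = np.zeros(num_classes + 1)
    return weights, bias

def class_scores(features, weights, bias):
    """Linear scores for a batch of features with shape (batch, feature_dim)."""
    return features @ weights + bias
```

With N = 20 object classes, for example, the new layer outputs 21 scores per region, while all earlier layers keep their pre-trained weights as the starting point for SGD.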

35 Training data Positive samples Negative samples How about this?

36 Training data. If the intersection-over-union (IoU) overlap with the ground truth is below a threshold, the region is a negative sample. The authors performed a grid search over {0, 0.1, ..., 0.5} and found that a threshold of IoU = 0.3 gives the best mAP.
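The negative-sample rule can be written down directly. This is a sketch under assumptions: boxes are (x1, y1, x2, y2) tuples, the helper names are hypothetical, and the default threshold of 0.3 is the value quoted on the slide.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def is_negative(proposal, ground_truth_boxes, threshold=0.3):
    """A proposal is a negative sample for a class if its IoU with every
    ground-truth box of that class is below the threshold."""
    return all(iou(proposal, g) < threshold for g in ground_truth_boxes)
```

A proposal far from every ground-truth box is labeled negative, while one that largely covers an object is not.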

37 Bounding box regression. To improve localization performance, the authors propose a bounding-box regression that learns to predict a refined bounding box from the pool5 features of a region proposal. N class-specific bounding-box regressors are set up.

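38 Bounding box regression. Given a set of training pairs $\{(P^i, G^i)\}_{i=1,\dots,N}$, where $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ is a detected region proposal bounding box (center coordinates, width, height) and $G^i = (G^i_x, G^i_y, G^i_w, G^i_h)$ is the matching ground-truth bounding box, the authors establish $\hat{G}_x = P_w d_x(P) + P_x$, $\hat{G}_y = P_h d_y(P) + P_y$, $\hat{G}_w = P_w \exp(d_w(P))$, $\hat{G}_h = P_h \exp(d_h(P))$, with $d_\star(P) = w_\star^T \phi_5(P)$ for $\star \in \{x, y, w, h\}$, where $\phi_5(P)$ are the pool5 features of proposal $P$ and $w_\star$ are the learnable parameters.

39 Bounding box regression. The problem is then formulated as a regularized least squares problem, where the objective is $w_\star = \arg\min_{\hat{w}_\star} \sum_{i=1}^{N} (t^i_\star - \hat{w}_\star^T \phi_5(P^i))^2 + \lambda \|\hat{w}_\star\|^2$, where the regression targets are $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$.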

40 Bounding box regression. Two subtle issues observed: (1) Regularization is important: λ = 1000. (2) Selection of the (P, G) pairs is important: only proposals with IoU overlap > 0.6 with a ground-truth box are used; proposals with IoU overlap <= 0.6 are discarded for regression.
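The regression setup can be sketched with NumPy, assuming the parameterization described in [A]: center offsets scaled by the proposal width and height, and log-scale ratios for width and height. The function names, the corner-format boxes, and the closed-form ridge solver are illustrative choices, not the authors' code.

```python
import numpy as np

def to_center(box):
    """(x1, y1, x2, y2) -> (center x, center y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h

def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) that map proposal P onto ground truth G."""
    px, py, pw, ph = to_center(P)
    gx, gy, gw, gh = to_center(G)
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def apply_deltas(P, d):
    """Apply predicted deltas d = (d_x, d_y, d_w, d_h) to proposal P."""
    px, py, pw, ph = to_center(P)
    cx, cy = pw * d[0] + px, ph * d[1] + py
    w, h = pw * np.exp(d[2]), ph * np.exp(d[3])
    return np.array([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])

def fit_ridge(features, targets, lam=1000.0):
    """Closed-form regularised least squares: W = (X^T X + lam I)^-1 X^T T.
    features: (n, d) pool5 features; targets: (n, 4) regression targets."""
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + lam * np.eye(d),
                           features.T @ targets)
```

Applying the exact targets to a proposal reproduces the ground-truth box, which is a quick sanity check on the parameterization. At test time the deltas would instead be `features @ W` for the regressor of the predicted class.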

41 Results of RCNN

42 RCNN Variants

43 Why is RCNN slow? RCNN is slow because every region proposal is passed through the CNN separately to compute its features. No computation is shared among region proposals of the same image.

44 SPPNet [D]. Observation: feature maps also retain information about spatial position.

45 Spatial Pyramid representation Image source: http://slazebni.cs.illinois.edu/slides/ima_poster.pdf

46 SPPNet

47 SPPNet. Much faster than RCNN because each image is passed through the CNN only once. A multi-scale variant can be used to improve (or maintain) accuracy.
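The idea of pooling a variable-sized window of a shared feature map into a fixed-length vector can be sketched as below. The pyramid levels (1, 2, 4), the max-pooling choice, and the function names are assumptions for illustration and may differ from the configuration in [D].

```python
import numpy as np

def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """feature_map: (C, H, W) window of the convolutional feature map.
    For each level n, max-pool the window into n x n bins and concatenate,
    giving a vector of length C * sum(n * n) regardless of H and W."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            r0, r1 = (i * h) // n, ceil_div((i + 1) * h, n)
            for j in range(n):
                c0, c1 = (j * w) // n, ceil_div((j + 1) * w, n)
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)
```

Because the convolutional layers run once per image, each region proposal only costs one such pooling pass over its window of the feature map, for example `spatial_pyramid_pool(conv_features[:, y1:y2, x1:x2])` with window coordinates expressed in feature-map cells.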

48 Problem of SPPNet. The layers below the spatial pyramid pooling layer cannot be updated during fine-tuning, which limits accuracy. Their weights CANNOT be updated.

49 Fast RCNN [E] Fast RCNN solves this problem by proposing a single network trained in one stage

50 Faster R-CNN [F]. Adds a Region Proposal Network (RPN): a fully convolutional network that takes an image/feature map and outputs object proposals. Fast R-CNN is used once the proposals are obtained. Features are shared between Fast R-CNN and the RPN.

51 LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning

52 Traditional training with human annotator

53 Traditional training with human annotator. Obtaining crowd video annotations: 1 hour of video * 30 FPS = 108,000 frames; an average of 100 persons per frame => 2M annotations; 500 annotations per man-hour => 4,000 man-hours.

54 Training with LCrowdV. LCrowdV annotations: 108 five-minute videos released = about 1M image frames, roughly 10M annotations.
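The annotation-cost figures can be checked with a few lines of arithmetic. The sketch takes the slide values at face value; the 30 FPS rate for the LCrowdV videos is an assumption carried over from the human-annotation estimate. Multiplying 108,000 frames by 100 persons per frame gives 10.8M rather than 2M, so the 2M figure presumably rests on an assumption not recorded in the transcript, such as annotating only a subset of frames.

```python
# Human annotation estimate (values quoted on the slide).
frames_per_hour = 60 * 60 * 30            # 1 hour of video at 30 FPS
annotations_stated = 2_000_000            # "2M annotations" as quoted
annotations_per_man_hour = 500
man_hours = annotations_stated / annotations_per_man_hour

# For comparison: every frame annotated with 100 persons each.
annotations_all_frames = frames_per_hour * 100

# LCrowdV release: 108 videos of 5 minutes each, assumed to be 30 FPS.
lcrowdv_frames = 108 * 5 * 60 * 30

print(frames_per_hour, man_hours, annotations_all_frames, lcrowdv_frames)
```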

55 Strength of LCrowdV

56 Traditional vs. LCrowdV. Ref: the image on the left is from UCF-CC50.

57 LCrowdV Framework (diagram). Parameters: Density, Pedestrian Count, Personality characteristic, Background, Noise, Agent model, Lighting, Camera Angle. Procedural Simulation: Goal Selection, Plan Computation, Preferred Goal Velocity, Plan Adaptation, Velocity, Motion Synthesis, Trajectory. Procedural Rendering. Results: Videos, Head location, Bounding Boxes, Attributes.

58 Parameters of LCrowdV


67 Impact of fixing one parameter on the results Precision-recall graph

68 Results on Pedestrian Detection. Precision-recall graph comparing: trained with data from the same scene + LCrowdV; trained with data from the same scene; original model.

69 Results on Pedestrian Detection. Precision-recall graph. Varying the number of samples from the same scene as the test data, we observe a consistent improvement in AP when the training data is complemented with LCrowdV data.
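For readers unfamiliar with the metric, the sketch below shows one common way to turn ranked detections into a precision-recall curve and an average precision (AP) value. The slides do not state which AP definition was used, so the all-point interpolation here is an assumption, as are the function names and inputs.

```python
import numpy as np

def precision_recall(scores, is_true_positive, num_ground_truth):
    """Precision and recall after each detection, ranked by descending score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    return precision, recall

def average_precision(precision, recall):
    """Area under the precision-recall curve after making precision
    monotonically non-increasing in recall (all-point interpolation)."""
    p = np.concatenate(([0.0], precision, [0.0]))
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))
```

A detector whose ranked detections are all correct and cover every ground-truth object reaches AP = 1; false positives ranked above true positives pull the value down.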

70 Results on Pedestrian Detection

71 Results on Pedestrian Detection

72 Further improvement on LCrowdV. More 3D models of characters, walking cycle animations, and background scenes. Perform a comprehensive analysis of how to improve accuracy using LCrowdV. Develop novel ways to combine real-world data with synthetic data.

73 Benefits of LCrowdV. Precise annotations generated automatically: avoids annotator error and intensive labor. Large variety in data: can be used for different applications of crowd understanding. Huge amount of data: provides samples that real-life data cannot cover. Complementary to machine learning on crowds: shown in the pedestrian detection experiment.

74 Reference
A. Rich feature hierarchies for accurate object detection and semantic segmentation, Ross Girshick et al. https://arxiv.org/abs/1311.2524v5
B. You Only Look Once: Unified, Real-Time Object Detection, Joseph Redmon et al., CVPR2016 http://www.cs.virginia.edu/~vicente/recognition/presentations/yolo.pdf

75 Reference
C. Selective Search for Object Recognition, J.R.R. Uijlings et al., IJCV 2013 http://www.huppelen.nl/publications/selectivesearchdraft.pdf
D. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, Kaiming He et al., ILSVRC2014 https://arxiv.org/pdf/1406.4729v4.pdf

76 Reference
E. Fast R-CNN, Ross Girshick https://arxiv.org/pdf/1504.08083v2.pdf
F. Faster R-CNN, Shaoqing Ren et al. https://arxiv.org/abs/1506.01497
G. LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning, Ernest Cheung et al., http://gamma.cs.unc.edu/lcrowdv/