Joint Object Detection and Viewpoint Estimation using CNN features

Size: px

Start display at page:

Download "Joint Object Detection and Viewpoint Estimation using CNN features"

Cecil Dickerson
5 years ago
Views:

1 Joint Object Detection and Viewpoint Estimation using CNN features Carlos Guindel, David Martín and José M. Armingol Intelligent Systems Laboratory Universidad Carlos III de Madrid Wien 28 June 2017

2 Outline 2 Introduction Object detection Viewpoint estimation Results Conclusion

3 Outline 3 Introduction Object detection Viewpoint estimation Results Conclusion

Situational awareness for vehicles 4 Advanced Driver Assistance Systems (ADAS) and autonomous vehicles rely on a trustable on-board obstacle detection module.

4 Situational awareness for vehicles 4 Advanced Driver Assistance Systems (ADAS) and autonomous vehicles rely on a trustable on-board obstacle detection module. A precise classification of the obstacles enables accurate predictions of future traffic situations, including those involving VRU. Another significant cue that can be used to anticipate future events is the orientation of the objects moving on the ground plane.

5 On-board object detection 5 Laser rangefinder Computer Vision Hand-crafted features Convolutional Neural Networks Recognition Detection

6 Proposal overview 6 RPN Regions RGB Image Convolutional Layers Features Fast R-CNN Faster R-CNN 1 Classification Refined Bounding Box 1 S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp , Viewpoint

7 Outline 7 Introduction Object detection Viewpoint estimation Results Conclusion

Object detection 8 Traffic environments Diversity of agents Unstructured environment Faster R-CNN End-to-end feature learning Highly efficient No prior constraints about the location of objects in

8 Object detection 8 Traffic environments Diversity of agents Unstructured environment Faster R-CNN End-to-end feature learning Highly efficient No prior constraints about the location of objects in the image Meant for more than 21 classes S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp , 2016.

9 Faster R-CNN framework 9 Parameters are learned through a multi-task loss Conv. features in these regions are pooled for classification A RPN generates proposals wrt. a fixed set of anchors Convolutional features computed only once per image S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp , 2016.

10 Fine-tuning for traffic environments 10 Optimized anchors Management of class imbalance Information gain multinomial logistic loss for the class inference N NK L = L = 1 N 1,klog( pƹ n,k ) N H ln,l n pƹ n,ln ) n=1 n=1 k=1 Infogain matrix H 0, H K,K

11 Outline 11 Introduction Object detection Viewpoint estimation Results Conclusion

12 Viewpoint estimation 12 1, 2 RPN Regions RGB Image Convolutional Layers Features Fast R-CNN Features can be learned for multiple tasks 4 3 Refined Classification Bounding Box 5 Viewpoint

13 Discrete viewpoint inference 13 N b angle bins Θ i Θ Nb N b = 8 Training: θ i0 Θ i Inference output: r Δ N b 1 r Final estimation: Θ i መθ Elements of r

14 Joint detection and viewpoint estimation 14 CNN outputs RPN Fast R-CNN For each anchor: Objectness For each proposal: Class Predicted bounding box Bounding box refinement Per class Number of angle bins Number of classes Viewpoint N b K N b K elements

15 Joint detection and viewpoint estimation 15 Fast R-CNN B. Box regression Class Viewpoint Softmax Softmax Softmax FC layer FC layer FC layer Fully connected (FC) layers Fixed size feat. vector Only N b K 4096 new weights Conv. design is unchanged Proposal Feature map

16 Loss function and training 16 Approximate joint training strategy Unweighted muli-task loss with five components Logistic loss for RPN objectness Smooth-L1 loss for RPN b.box regression Infogain loss for class L inf p i v i N = H vi,k log(p i,k ) n=1 frequent H vi,k Logistic loss for viewpoint estimation Only the N b elements of the ground-truth class Smooth-L1 loss for b.box regression

17 Outline 17 Introduction Object detection Viewpoint estimation Results Conclusion

18 Experiments 18 On the KITTI Vision Benchmark Suite Object detection dataset Parameters: Scale: 500 px. in height 50k iter. l r = k iter. l r = k iter. l r = VGG16 architecture, initialized with ImageNet weights. Source: KITTI: A. Geiger, P. Lenz, and R. Urtasun, Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp

19 Experiments 19 N b = 8 (resolution: π/4 rad) Infogain matrix values: H k,k = 2 f min f k 1 8 Number of ocurrences of the less frequent class Evaluation criteria Number of instances of class k Average precision Average orientation similarity (performance of detection + orientation) Minimum overlaps established by the KITTI benchmark

Results (KITTI submission: FRCNN+Or) 20 Detection (AP as %) Easy Moderate Hard Car 89.60 78.59 68.69 Pedestrian 72.21 56.99 53.72 Cyclist 68.81 55.80 50.52 map 76.87 63.79 57,64 SubCNN 84.52 77.14 64.

20 Results (KITTI submission: FRCNN+Or) 20 Detection (AP as %) Easy Moderate Hard Car Pedestrian Cyclist map ,64 SubCNN Slightly different (generally better) than the ones in the paper: Used the whole KITTI training set Trained only with Car, Pedestrian and Cyclist Non-fixed weights (and bias) at the first convolutional layers Det + Or (AOS as %) Easy Moderate Hard Car Pedestrian Cyclist maos ,65 SubCNN SubCNN: Y. Xiang, W. Choi, Y. Lin and S. Savarese, Subcategory-aware Convolutional Neural Networks for Object Proposals and Detection, in IEEE Winter Conference on Applications of Computer Vision (WACV), 2017, pp

21 AOS (%) Comparison with other methods TOP 10 AOS ranking MODERATE difficulty Only published methods Car Pedestrians Cyclists Global rankings (including unpublished methods): Car: 11 st Pedestrian: 9 th Cyclist: 10 th

22 RUNTIME PER FRAME (S) Comparison with other methods 22 5 Reported runtimes for TOP 10 AOS Only published 4,5 4 3,5 3 2,5 2 1,5 1 0, ms* *Average running time using an implementation based on py-faster-rcnn (Python & Caffe) and a NVIDIA Titan Xp kindly donated by NVIDIA Corporation

23 Sample test images 23

24 Outline 24 Introduction Object detection Viewpoint estimation Results Conclusion

Conclusion 25 Single-image object detection + orientation estimation in traffic scenes. The same convolutional features can be successfully used for both tasks.

25 Conclusion 25 Single-image object detection + orientation estimation in traffic scenes. The same convolutional features can be successfully used for both tasks. Results comparable with non-real-time, sophisticated approaches. Orientation is a step towards a real scene understanding. Future work Fine-grained orientation inference using the cross-entropy logistic loss n L = 1 n n=1 [p n log pƹ n + 1 p n log(1 pƹ n )] Improvements: Network architecture Methods to overcome the fixed-size receptive field Continuously-evolving code at:

26 Thank you for your attention! Carlos Guindel Intelligent Systems Laboratory Universidad Carlos III de Madrid Wien 28 June

27 AVERAGE PRECISION (%) Precision change when introducing viewpoint ,25 77,72 MODERATE DIFFICULTY LEVEL 69,6 67,12 53,08 44,82 Car Pedestrian Cyclist Detection Detection+Viewpoint Detection (ΔAP as %) Easy Moderate Hard Car +0,72 +0,47 +0,26 Pedestrian -1,85-2,47-3,00 Cyclist -12,16-8,25-8,11

Multi-View 3D Object Detection Network for Autonomous Driving

Multi-View 3D Object Detection Network for Autonomous Driving Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia CVPR 2017 (Spotlight) Presented By: Jason Ku Overview Motivation Dataset Network Architecture