Finding Tiny Faces Supplementary Materials

Peiyun Hu, Deva Ramanan
Robotics Institute, Carnegie Mellon University
{peiyunh,deva}@cs.cmu.edu

1. Error analysis

Quantitative analysis. We plot the distribution of error modes among false positives in Fig. 1, and the impact of object characteristics on detection performance in Fig. 2 and Fig. 3.

Qualitative analysis. We show the top 20 scoring false positives in Fig. 4.

2. Experimental details

Multi-scale features. Inspired by the way [3] trains FCN-8s at-once, we scale the learning rate of the predictor built on top of each layer by a fixed constant. Specifically, we use a scaling factor of 1 for res4, 0.1 for res3, and 0.01 for res2. One more difference between our model and [3] is that instead of predicting at the original resolution, our model predicts at the resolution of the res3 feature map (downsampled 8X compared to the input resolution).

Input sampling. We first randomly re-scale the input image by 0.5X, 1X, or 2X. We then randomly crop a 500x500 region out of the re-scaled input, padding with the average RGB value (prior to average subtraction) when the crop extends outside the image boundary.

Border cases. Similar to [2], we ignore gradients coming from heatmap locations whose detection windows cross the image boundary. The only difference is that we also treat padded average pixels (as described under Input sampling) as outside the image boundary.

Online hard mining and balanced sampling. We apply hard mining to both positive and negative examples. Our implementation is simpler than [4], yet still effective. We set a small threshold (0.03) on the classification loss to filter out easy locations, then sample at most 128 locations each for positives and negatives from the remaining ones whose losses are above the threshold. We compare training with and without hard mining on validation performance in Table 1.

Loss function. Our loss function is formulated in the same way as [2]. Note that we also use the Huber loss for bounding box regression.

Bounding box regression. Our bounding box regression is formulated as in [2] and trained jointly with classification using stochastic gradient descent. We compare testing with and without regression in terms of performance on the WIDER FACE validation set (Table 2).
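As a concrete illustration of the Input sampling step above, here is a minimal NumPy sketch. It is not the authors' code: the function name is illustrative, the resize is a nearest-neighbor stand-in for a proper image-resize op, and `mean_rgb` is assumed to be the dataset mean (before subtraction).

```python
import numpy as np

def random_crop_with_mean_pad(img, mean_rgb, crop=500, rng=None):
    """Sketch of "Input sampling": re-scale by 0.5X/1X/2X, then take a random
    500x500 crop, padding with the mean RGB value outside the image."""
    rng = rng or np.random.default_rng()
    scale = rng.choice([0.5, 1.0, 2.0])
    # Nearest-neighbor resize via index sampling (stand-in for a real resize).
    h, w = img.shape[:2]
    ys = np.minimum((np.arange(int(h * scale)) / scale).astype(int), h - 1)
    xs = np.minimum((np.arange(int(w * scale)) / scale).astype(int), w - 1)
    img = img[ys][:, xs]
    nh, nw = img.shape[:2]
    # Random crop origin; it may fall (partly) outside the re-scaled image.
    y0 = rng.integers(min(0, nh - crop), max(0, nh - crop) + 1)
    x0 = rng.integers(min(0, nw - crop), max(0, nw - crop) + 1)
    # Start from a mean-colored canvas, then paste the visible image region.
    out = np.broadcast_to(np.asarray(mean_rgb, img.dtype), (crop, crop, 3)).copy()
    ty, by = max(0, -y0), min(crop, nh - y0)
    tx, bx = max(0, -x0), min(crop, nw - x0)
    if by > ty and bx > tx:
        out[ty:by, tx:bx] = img[y0 + ty:y0 + by, x0 + tx:x0 + bx]
    return out
```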

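The balanced hard-mining scheme described above can likewise be sketched in a few lines. The function name and array layout here are illustrative assumptions, not the authors' implementation; only the threshold (0.03) and the per-class cap (128) come from the text.

```python
import numpy as np

def mine_hard_locations(cls_loss, labels, thresh=0.03, max_per_class=128, rng=None):
    """Balanced online hard mining over heatmap locations (sketch).

    cls_loss: per-location classification loss, shape (N,)
    labels:   per-location labels, 1 for positive, 0 for negative
    Returns indices of the locations that contribute gradients this step."""
    rng = rng or np.random.default_rng()
    keep = []
    for cls in (1, 0):  # positives first, then negatives
        # Filter out easy locations whose loss falls below the threshold.
        hard = np.flatnonzero((labels == cls) & (cls_loss > thresh))
        # Sample at most `max_per_class` of the remaining hard locations.
        if hard.size > max_per_class:
            hard = rng.choice(hard, size=max_per_class, replace=False)
        keep.append(hard)
    return np.concatenate(keep)
```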
[Figure 1: plot of the percentage of each error type (Loc, BG) among the top 25 to 3200 total false positives.]

Figure 1: Distribution of error modes of false positives. Background confusion seems to be the dominant error mode among top-scoring detections; however, we found that 15 of the 20 top-scoring false positives, as shown in Fig. 4, are in fact due to missed annotations.

Method            Easy    Medium  Hard
w/ hard mining    0.919   0.908   0.822
w/o hard mining   0.917   0.904   0.825

Table 1: Comparison between training with and without hard mining. We show performance on the WIDER FACE validation set. Both models are trained with balanced sampling and use the ResNet-101 architecture. Results suggest hard mining has no noticeable effect on final performance.

Method            Easy    Medium  Hard
w/ regression     0.919   0.908   0.823
w/o regression    0.911   0.900   0.798

Table 2: Comparison between testing with and without regression. We show performance on the WIDER FACE validation set. Both models use the ResNet-101 architecture. Results suggest that regression helps slightly more on detecting small faces (2.5% on the hard set).

Bounding ellipse regression. Our bounding ellipse regression is formulated as Eqs. (1)-(5):

t_{x_c} = (x^*_c - x_c) / w          (1)
t_{y_c} = (y^*_c - y_c) / h          (2)
t_{r_a} = \log(r^*_a / (h/2))        (3)
t_{r_b} = \log(r^*_b / (w/2))        (4)
t_\theta = \cot(\theta^*)            (5)

where x^*_c, y^*_c, r^*_a, r^*_b, \theta^* denote the center x- and y-coordinates, the half axes, and the rotation angle of the ground-truth ellipse, and x_c, y_c, h, w denote the center x- and y-coordinates, height, and width of our predicted bounding box. We learn the bounding ellipse linear regression offline, with the same features used for training the bounding box regression.
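A direct transcription of Eqs. (1)-(5) into code; the function signature and tuple layouts are illustrative assumptions.

```python
import numpy as np

def ellipse_regression_targets(box, ellipse):
    """Compute the bounding-ellipse regression targets of Eqs. (1)-(5) (sketch).

    box:     predicted bounding box (x_c, y_c, w, h), center format
    ellipse: ground-truth ellipse (x_c*, y_c*, r_a*, r_b*, theta*)"""
    x_c, y_c, w, h = box
    x_cs, y_cs, r_a, r_b, theta = ellipse
    t_xc = (x_cs - x_c) / w           # Eq. (1)
    t_yc = (y_cs - y_c) / h           # Eq. (2)
    t_ra = np.log(r_a / (h / 2.0))    # Eq. (3)
    t_rb = np.log(r_b / (w / 2.0))    # Eq. (4)
    t_theta = 1.0 / np.tan(theta)     # Eq. (5), cot(theta*)
    return np.array([t_xc, t_yc, t_ra, t_rb, t_theta])
```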

[Figure 2: bar plot of the maximum and minimum normalized AP over area, aspect ratio, occlusion, blur, expression, illumination, and pose.]

Figure 2: Summary of the sensitivity plot. We plot the maximum and minimum of the AP_N values shown in Figure 3. Our detector is most affected by object scale (from 0.44 to 0.896) and blur (from 0.259 to 0.798).

Other hyper-parameters. We use a fixed learning rate of 10^-4, a weight decay of 0.0005, and a momentum of 0.9. We use a batch size of 20 images, randomly cropping one 500x500 region from the re-scaled version of each image. In general, we train models for 50 epochs and then select the best-performing epoch on the validation set.
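Assuming a PyTorch-style trainer (the authors' original framework is not shown here), the stated hyper-parameters would combine with the per-layer learning-rate scaling from Section 2 into an optimizer configuration like the following sketch; the `res2`/`res3`/`res4` modules are hypothetical stand-ins for the predictors built on each feature map.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the predictors on top of res2/res3/res4.
res2, res3, res4 = nn.Conv2d(3, 8, 3), nn.Conv2d(8, 16, 3), nn.Conv2d(16, 32, 3)

base_lr = 1e-4  # fixed learning rate of 10^-4
optimizer = torch.optim.SGD(
    [
        # Per-layer scaling factors from "Multi-scale features" in Section 2.
        {"params": res4.parameters(), "lr": base_lr * 1.0},
        {"params": res3.parameters(), "lr": base_lr * 0.1},
        {"params": res2.parameters(), "lr": base_lr * 0.01},
    ],
    lr=base_lr,
    momentum=0.9,
    weight_decay=0.0005,
)
```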

[Figure 3: normalized-AP bar plots for BBox Area, BBox Height, Aspect Ratio, Blur, Expression, Illumination, Occlusion, and Pose.]

Figure 3: Sensitivity and impact of object characteristics. We show the normalized AP [1] for each characteristic. Please refer to [1] for the definitions of BBox Area, BBox Height, and Aspect Ratio, and to [5] for the definitions of the per-face attributes Blur, Expression, Illumination, Occlusion, and Pose. Our detector performs below average for extremely small scale, extremely skewed aspect ratio, heavy blur, and heavy occlusion. Surprisingly, exaggerated expression and extreme illumination correlate with better performance. Pose variation does not have a noticeable effect.

Figure 4: Top 20 scoring false positives on the validation set. The error type is labeled at the bottom left of each image: (bg) represents background confusion and (loc) represents inaccurate localization. "ov" denotes overlap with ground-truth bounding boxes, and "1-r" denotes the percentage of detections whose confidence is below the current one's. Our detector seems to find faces that were not annotated (when the prediction is on a face while ov equals zero).

References

[1] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In European Conference on Computer Vision, pages 340-353. Springer, 2012.
[2] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[3] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[4] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. arXiv preprint arXiv:1604.03540, 2016.
[5] S. Yang, P. Luo, C.-C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016.