arxiv: v1 [cs.cv] 20 Dec 2016

End-to-End Pedestrian Collision Warning System based on a Convolutional Neural Network with Semantic Segmentation

arXiv:1612.06558v1 [cs.CV] 20 Dec 2016

Heechul Jung heechul@dgist.ac.kr
Min-Kook Choi mkchoi@dgist.ac.kr
Woo Young Jung wyjung@dgist.ac.kr
Kwon Soon soonyk@dgist.ac.kr

Abstract

Traditional pedestrian collision warning systems sometimes raise alarms even when there is no danger (e.g., when all pedestrians are walking on the sidewalk). These false alarms can make it difficult for drivers to concentrate on their driving. In this paper, we propose a novel framework for an end-to-end pedestrian collision warning system based on a convolutional neural network. Semantic segmentation information is used to train the convolutional neural network, and two loss functions, the cross-entropy and Euclidean losses, are minimized. Finally, we demonstrate the effectiveness of our method in reducing false alarms and increasing warning accuracy compared to a traditional histogram of oriented gradients (HoG)-based system.

1 Introduction

Advanced driver assistance systems (ADAS) are important for the prevention of vehicle accidents. ADAS contain several components, such as forward collision warning (FCW), lane departure warning (LDW), and pedestrian collision warning (PCW) systems. The PCW system is especially helpful for preventing serious damage and casualties, and as a result, many researchers and developers have tried to improve this system [7].

PCW systems usually rely on pedestrian detection and have an unfortunate tendency to produce false alarms in safe situations. For example, as shown in the image on the right side of Fig. 1, if pedestrians are walking on the sidewalk, then an alarm is not necessary. However, traditional pedestrian-detection-based PCW systems produce alerts in this situation. Avoiding such alerts using only handcrafted features is not trivial, because it requires several complex steps, including pedestrian detection and scene recognition.
In this study, we build a system that solves this problem using an end-to-end framework based on a convolutional neural network (CNN). Our contributions in this paper are summarized as follows:

- We propose a novel framework for a PCW system composed of an end-to-end CNN-based learning algorithm.
- We show performance improvements of the proposed PCW system, which are achieved using semantic information from images.

Figure 1: Limitations of traditional PCW systems. The image on the left shows a dangerous situation, in which the driver must be warned. The image on the right shows a safe situation, for which a warning is not required.

There are two main advantages to our CNN-based PCW system. First, our system is effective in reducing false alarms compared to traditional pedestrian-detection-based PCW systems. Second, unlike traditional methods, our system can give warning alarms in response to cyclists as well as to pedestrians.

2 End-to-End PCW System

Fig. 2 shows the proposed architecture of the CNN for our PCW system. We use five convolutional layers (CONV1 to CONV5) and four fully connected layers (FC1 to FC4). Unlike traditional approaches, which have distinct pedestrian detection and warning decision stages, as shown in Fig. 3, our PCW system does not include a pedestrian detection stage. In other words, our system predicts whether the situation is dangerous directly from the $M \times N$ raw input image. This makes our system more accurate than traditional systems, whose pedestrian detection stage is imperfect because the appearance of pedestrians varies widely.

Our network is a combination of two networks, one responsible for prediction and the other for semantic segmentation, as shown in Fig. 2. The prediction network determines whether the input image shows a warning situation, which is a binary classification problem. The semantic segmentation network segments the input image and extracts useful semantic information to feed into the prediction network. The two networks are trained simultaneously by minimizing the following loss function:

$$L_t = L_c + \lambda L_e, \qquad (1)$$

where $L_t$, $L_c$, and $L_e$ are the total, cross-entropy, and Euclidean loss functions, respectively. $\lambda$ is a tuning parameter, which we set to $10^{-3}$ in order to adjust the scale between the two loss values.
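As a rough illustration of how the joint objective in Eq. (1) combines its two terms, the following pure-Python sketch evaluates it on a toy batch. It assumes the standard batch-averaged cross-entropy for $L_c$ and the halved, batch-averaged squared Euclidean distance for $L_e$; the function names, the toy values, and the weight `lam = 1e-3` are illustrative assumptions, not the authors' implementation.

```python
import math


def cross_entropy_loss(targets, probs):
    """Batch-averaged cross-entropy: -(1/B) * sum_i sum_j t_ij * log(o_ij)."""
    batch = len(targets)
    total = 0.0
    for t_row, o_row in zip(targets, probs):
        for t, o in zip(t_row, o_row):
            if t > 0:  # skip zero targets so log(0) is never evaluated
                total -= t * math.log(o)
    return total / batch


def euclidean_loss(outputs, segmentations):
    """Euclidean loss: 1/(2B) * sum_i ||o_i - s_i||^2 over flattened maps."""
    batch = len(outputs)
    total = 0.0
    for o_vec, s_vec in zip(outputs, segmentations):
        total += sum((o - s) ** 2 for o, s in zip(o_vec, s_vec))
    return total / (2.0 * batch)


def total_loss(targets, probs, outputs, segmentations, lam):
    """Joint objective: L_t = L_c + lam * L_e."""
    return (cross_entropy_loss(targets, probs)
            + lam * euclidean_loss(outputs, segmentations))


# Toy batch of two samples (hypothetical values, two classes, 2-pixel "maps").
targets = [[1.0, 0.0], [0.0, 1.0]]        # one-hot warning labels
probs = [[0.5, 0.5], [0.5, 0.5]]          # softmax outputs of the prediction net
outputs = [[1.0, 2.0], [0.0, 0.0]]        # flattened segmentation outputs
segmentations = [[0.0, 0.0], [0.0, 0.0]]  # flattened ground-truth maps

l_c = cross_entropy_loss(targets, probs)
l_e = euclidean_loss(outputs, segmentations)
l_t = total_loss(targets, probs, outputs, segmentations, lam=1e-3)
print(l_c, l_e, l_t)
```

On this toy batch the cross-entropy term equals log 2 (about 0.693) and the Euclidean term equals 1.25, so with the small weight the segmentation term adds only 0.00125 to the total.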
The cross-entropy loss $L_c$ is defined as follows:

$$L_c = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{C} t_{ij} \log\left(o^{(p)}_{ij}\right), \qquad (2)$$

where $B$ is the total number of data samples in a batch, $t_{ij}$ is the $j$-th value of the ground truth label for the $i$-th data sample of the current training batch, and $o^{(p)}_{ij}$ is the $j$-th softmax output value of the prediction network for the $i$-th data sample. $C$ is the total number of classes; in this case, $C = 2$. This loss function helps the network to predict the situation correctly.

The Euclidean loss function $L_e$ for the semantic segmentation network is defined as follows:

$$L_e = \frac{1}{2B} \sum_{i=1}^{B} \left\| o^{(s)}_i - s_i \right\|_2^2, \qquad (3)$$

where $o^{(s)}_i \in \mathbb{R}^{M \times N}$ is the output vector of the last FC layer (FC4) of the semantic segmentation network for the $i$-th data sample of a batch, and $s_i \in \mathbb{R}^{M \times N}$ is the vectorized ground truth segmentation image for the $i$-th data sample of a batch. Using this loss function, the semantic segmentation network learns how to segment the input image semantically.

The two networks share their two low-level layers, CONV1 and CONV2, as shown in Fig. 2. This design is efficient, since these lower layers produce common features such as edges or blobs, allowing us to reduce the total number of learnable parameters.

Figure 2: Proposed CNN architecture for the PCW system. The network in the box with dotted lines is the prediction network, and the network in the gray-filled box is the semantic segmentation network.

The outputs of the two networks are integrated at the FC2 layer. The high-level features of FC1 and FC3 are concatenated, and the result is used as the input to the FC2 layer. Unlike the features extracted by FC1, the features extracted by FC3 represent semantic features of the input image. The semantic features can detect and classify objects implicitly, so we expect them to be helpful for inferring dangerous situations. We do not use the output of the FC4 layer because its dimensionality is too large: 131,072 for FC4 compared to 2048 for FC3. Such a large number of dimensions would require many weight connections at the FC2 layer, which can cause over-fitting.

3 Balancing Training Data

It is difficult to train deep neural networks if the training data are imbalanced; that is, if the amount of data varies significantly between classes [4]. This results in imbalanced loss values between classes and failure in training the CNN. In our case, the number of images with no warning case is five times greater than the number of images showing a warning case. To resolve this problem, we duplicate the warning-case training images until their number matches that of the non-warning case.

4 Experimental Results

4.1 CNN Parameter Details

The detailed parameter settings for our network, which is similar to AlexNet [2], are as follows. The input image size is 512 × 256, with three channels representing RGB values.
For the prediction network, we use CONV(11, 96, 4) - ReLU - MaxPool(3, 2) - CONV(5, 256, 1) - ReLU - MaxPool(3, 2) - CONV(3, 384, 1) - ReLU - CONV(3, 256, 1) - ReLU - MaxPool(3, 2) - FC(256) - ReLU - FC(256) - ReLU - Softmax(2), where the arguments denote CONV(kernel size, number of channels, stride), MaxPool(kernel size, stride), and FC(number of output nodes). For the semantic segmentation network, we use CONV(11, 96, 4) - ReLU - MaxPool(3, 2) - CONV(5, 256, 1) - ReLU - MaxPool(3, 2) - FC(2048) - FC(131072). The mini-batch size is 128 and the learning rate is 0.001. Further, we set the weight decay to 0.0001 and train for a total of 2000 iterations. All weight layers are initialized using the method described in [5].

Figure 3: Structure of a traditional HoG-based PCW system. The baseline method has two stages: pedestrian detection and warning decision.

4.2 Dataset

We use the Cityscapes dataset, captured in urban environments, to evaluate our method [3]. The dataset includes various objects, such as vehicles, pedestrians, and cyclists. Furthermore, each image is densely or sparsely annotated for semantic segmentation tasks. In our experiments, we use the densely annotated data, consisting of 2975 training images and 1525 test images. Unfortunately, the dataset does not provide ground truth data for PCW, so we annotated each image manually (0: no alarm, 1: warning).

4.3 Comparison to an HoG-based PCW System

For comparison with our method, we built a baseline algorithm, shown in Fig. 3, that includes an HoG-based pedestrian detection method [1]. The rule for deciding whether a situation is dangerous was as follows: if pedestrians exist in a dangerous region, determined manually by setting a region of interest from (128, 0) to (383, 255) in the input image, then a warning alarm should be provided to the driver. (The size of the input image is 512 × 256.) This assumption is natural, since the region of interest excludes much of the sidewalk area in these images.

4.4 Results

Fig. 4 shows the receiver operating characteristic (ROC) curves for each method, showing the rates of both true and false positives. The blue line represents the HoG-based algorithm, and the red and green lines denote our proposed method with and without a semantic segmentation network, respectively.
The true positive rate of the HoG-based algorithm is better than that of the other methods when the false positive rate is under 0.05; however, our proposed methods are superior to the HoG-based algorithm in other cases. In particular, our proposed method with a semantic segmentation network shows the best performance among the three approaches. Table 1 shows the accuracy (true positive rate) of each method at a 15% false positive rate. Our proposed method without the semantic segmentation network improves on the HoG-based algorithm by about 19 percentage points. With the semantic segmentation network, the improvement increases to 26 percentage points.

Figure 4: ROC curves for each method. The graph on the left shows the original ROC curve; the graph on the right shows a magnified version of the ROC curve at false positive rates in the range of [0, 0.35].
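Table 1 reports the true positive rate at a fixed 15% false positive rate. One way such an operating point could be read off a set of warning scores is sketched below in pure Python. The function name `tpr_at_fpr`, the thresholding convention (warn when the score is at least the threshold), and the toy scores are assumptions for illustration, not the evaluation code used in the paper.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Best true positive rate achievable while keeping FPR <= max_fpr.

    scores: warning scores, where higher means "more likely a warning case".
    labels: ground truth, 1 for warning and 0 for no alarm.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one sample of each class")

    best_tpr = 0.0
    # Each distinct score is tried as a threshold: warn when score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best_tpr = max(best_tpr, tp / positives)
    return best_tpr


# Hypothetical scores for eight images (four warning, four no-alarm).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.25))
```

Sweeping `max_fpr` over [0, 1] with the same function traces out the upper envelope of an ROC curve like those in Fig. 4.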

Table 1: Warning accuracies at a 15% false positive rate.

Method                                      Accuracy
Baseline (HoG-based)                        45%
Proposed (without semantic segmentation)    63.94%
Proposed (with semantic segmentation)       71%

Fig. 5 shows qualitative results from our proposed PCW system with a semantic segmentation network. Surprisingly, our system was aware of both pedestrians and cyclists. Additionally, our system did not raise an alarm when there was no risk of collision with pedestrians or cyclists. This is a significant benefit of our PCW system.

Figure 5: Qualitative results. The first row shows two result images for warning cases produced by the proposed method. The second row shows safe cases predicted by the proposed method.

5 Conclusion

We have proposed an end-to-end PCW system based on a CNN. The usefulness of traditional systems, which are based on pedestrian detection, is limited by their false alarm rate. Our system, in contrast, is effective in reducing false alarms and improves warning accuracy. Our model combines two networks that individually perform prediction and semantic segmentation tasks, and this combination helped improve the proposed PCW system. One of the main contributions of this paper is a demonstration of the feasibility of an end-to-end PCW system. Our system could potentially be improved by replacing the network architecture with a more recent one, for example, a deep residual network [6]. Additionally, we believe that our proposed framework can be applied to other ADAS, such as LDW and FCW systems.

6 Acknowledgement

This work was supported by the R&D Program of the Ministry of Science of Korea (16-FA-07).

References

[1] Dalal, N., and Triggs, B. Histograms of oriented gradients for human detection. CVPR'05, Vol. 1. IEEE, 2005.
[2] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. NIPS (pp. 1097-1105), 2012.
[3] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., and Schiele, B. The Cityscapes dataset for semantic urban scene understanding. arXiv preprint arXiv:1604.01685, 2016.
[4] Masko, D., and Hensman, P. The impact of imbalanced training data for convolutional neural networks. 2015.
[5] He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV (pp. 1026-1034), 2015.
[6] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[7] Geronimo, D., Lopez, A. M., Sappa, A. D., and Graf, T. Survey of pedestrian detection for advanced driver assistance systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7), 1239-1258, 2010.