arxiv: v1 [cs.cv] 17 Apr 2019

Similar documents
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. Charles R. Qi* Hao Su* Kaichun Mo Leonidas J. Guibas

Classification of objects from Video Data (Group 30)

arxiv: v1 [cs.cv] 20 Dec 2016

Multi-View 3D Object Detection Network for Autonomous Driving

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

arxiv: v1 [cs.cv] 31 Mar 2019

2 OVERVIEW OF RELATED WORK

Deep Learning with Tensorflow AlexNet

Visual object classification by sparse convolutional neural networks

ImageNet Classification with Deep Convolutional Neural Networks

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Automotive 3D Object Detection Without Target Domain Annotations

A Deep Learning Approach to Vehicle Speed Estimation

Supplementary A. Overview. C. Time and Space Complexity. B. Shape Retrieval. D. Permutation Invariant SOM. B.1. Dataset

Laserscanner Based Cooperative Pre-Data-Fusion

CAP 6412 Advanced Computer Vision

POINT CLOUD DEEP LEARNING

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

A Neural Network for Real-Time Signal Processing

Robot localization method based on visual features and their geometric relationship

3D Convolutional Neural Networks for Landing Zone Detection from LiDAR

Instance-aware Semantic Segmentation via Multi-task Network Cascades

PSU Student Research Symposium 2017 Bayesian Optimization for Refining Object Proposals, with an Application to Pedestrian Detection Anthony D.

Volumetric and Multi-View CNNs for Object Classification on 3D Data Supplementary Material

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing

Joint Object Detection and Viewpoint Estimation using CNN features

Object Detection on Self-Driving Cars in China. Lingyun Li

Deep Learning for Computer Vision II

Training Convolutional Neural Networks for Translational Invariance on SAR ATR

Kaggle Data Science Bowl 2017 Technical Report

Fusing Radar and Scene Labeling Data for Multi-Object Vehicle Tracking

arxiv: v1 [cs.cv] 30 Jan 2018

Project 3 Q&A. Jonathan Krause

(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard

Spatio temporal Segmentation using Laserscanner and Video Sequences

Face Detection and Tracking Control with Omni Car

Learning visual odometry with a convolutional network

Postprint.

CS 223B Computer Vision Problem Set 3

Regionlet Object Detector with Hand-crafted and CNN Feature

An Exploration of Computer Vision Techniques for Bird Species Classification

Spatial Localization and Detection. Lecture 8-1

Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Robot Autonomy Final Report : Team Husky

3D model classification using convolutional neural network

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing Supplementary Material

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

HOG-based Pedestriant Detector Training

CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015

Human Motion Detection and Tracking for Video Surveillance

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning

A Generic Approach Towards Image Manipulation Parameter Estimation Using Convolutional Neural Networks

Towards Fully-automated Driving. tue-mps.org. Challenges and Potential Solutions. Dr. Gijs Dubbelman Mobile Perception Systems EE-SPS/VCA

A Novel Approach to Image Segmentation for Traffic Sign Recognition Jon Jay Hack and Sidd Jagadish

arxiv: v1 [cs.cv] 2 Sep 2018

Visual features detection based on deep neural network in autonomous driving tasks

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm

From 3D descriptors to monocular 6D pose: what have we learned?

Object Localization, Segmentation, Classification, and Pose Estimation in 3D Images using Deep Learning

arxiv: v2 [cs.cv] 3 Sep 2018

Synscapes A photorealistic syntehtic dataset for street scene parsing Jonas Unger Department of Science and Technology Linköpings Universitet.

CS 231A Computer Vision (Fall 2012) Problem Set 3

Chapter 3 Image Registration. Chapter 3 Image Registration

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

Using Machine Learning for Classification of Cancer Cells

Selective Search for Object Recognition

Perceptron: This is convolution!

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation

Automatic detection of books based on Faster R-CNN

Exploring Curve Fitting for Fingers in Egocentric Images

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

Investigation of Machine Learning Algorithm Compared to Fuzzy Logic in Wild Fire Smoke Detection Applications

Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling

Including the Size of Regions in Image Segmentation by Region Based Graph

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Using Faster-RCNN to Improve Shape Detection in LIDAR

Channel Locality Block: A Variant of Squeeze-and-Excitation

Dynamic Routing Between Capsules

Temporal HeartNet: Towards Human-Level Automatic Analysis of Fetal Cardiac Screening Video

Character Recognition

Canny Edge Based Self-localization of a RoboCup Middle-sized League Robot

Part Localization by Exploiting Deep Convolutional Networks

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University.

Creating Affordable and Reliable Autonomous Vehicle Systems

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

W4. Perception & Situation Awareness & Decision making

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

A System for Real-time Detection and Tracking of Vehicles from a Single Car-mounted Camera

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images

Non-rigid body Object Tracking using Fuzzy Neural System based on Multiple ROIs and Adaptive Motion Frame Method

Transcription:

2D Car Detection in Radar Data with PointNets Andreas Danzer, Thomas Griebel, Martin Bach, and Klaus Dietmayer arxiv:1904.08414v1 [cs.cv] 17 Apr 2019 Abstract For many automated driving functions, a highly accurate perception of the vehicle environment is a crucial prerequisite. Modern high-resolution radar sensors generate multiple radar targets per object, which makes these sensors particularly suitable for the 2D object detection task. This work presents an approach to detect object hypotheses solely depending on sparse radar data using PointNets. In literature, only methods are presented so far which perform either object classification or bounding box estimation for objects. In contrast, this method facilitates a classification together with a bounding box estimation of objects using a single radar sensor. To this end, PointNets are adjusted for radar data performing 2D object classification with segmentation, and 2D bounding box regression in order to estimate an amodal bounding box. The algorithm is evaluated using an automatically created dataset which consist of various realistic driving maneuvers. The results show the great potential of object detection in highresolution radar data using PointNets. I. INTRODUCTION For autonomous driving, the perception of vehicle surrounding is an important task. In particular, tracking of multiple objects using noisy data from sensors, such as radar, lidar and camera, is a crucial task. A standard approach is to preprocess received sensor data in order to generate object detections, which are used as input for a tracking algorithm. For camera and lidar sensors, various object detectors have already been developed to obtain object hypotheses in form of classified rectangular 2D or 3D bounding boxes. To the best of the authors knowledge, no object detection method for radar has been presented in literature so far which performs a classification as well as a bounding box estimation. Although modern high-resolution radar sensors generate multiple detections per object, the received radar data is extremely sparse compared to lidar point clouds or camera images. For this reason, it is a challenging task to recognize different objects solely using radar data. This contribution presents a method to detect objects in high-resolution radar data using a machine learning approach. As shown in Figure 1, radar data is presented as a point cloud, called radar target list, consisting of two spatial coordinates, ego-motion compensated Doppler velocity and radar cross section (RCS). Since radar targets are represented as point cloud, it is desirable to use simply point clouds as input for a neural network. Existing approaches which process radar data use certain representation transformations in order to use neural networks. However, PointNets [1], [2] make it possible to directly process point clouds and are A. Danzer, T. Griebel, M. Bach, and K. Dietmayer are with the Institute of Measurement, Control and Microtechnology, Ulm University, 89081 Ulm, Germany firstname.lastname@uni-ulm.de Fig. 1: 2D object detection in radar data. Radar point cloud with reflections belonging to a car (red) or clutter (blue). The length of arrows displays the Doppler velocity, the size of points represents the radar cross section value. The red box is a predicted amodal 2D bounding box. suitable for this data format. Frustum PointNets [3] extend the concept of PointNets for the detection of objects by combining a 2D object detector with a 3D instance segmentation and 3D bounding box estimation. The proposed object detection method for radar data is based on the approach of Frustum PointNets to perform 2D object detection, i.e., object classification and segmentation of radar target lists together with a 2D bounding box regression. The 3D lidar point cloud used in [1] [3] is dense and even captured finegrained structures of objects. Although radar data is sparse compared to lidar data, radar data contains strong features in form of Doppler and RCS information. For example, wheels of a vehicle generate notable Doppler velocities, and license plates result in targets with high RCS values. Another advantage of radar data is that it often contains reflections of an object part which is not directly visible, e.g., wheel houses at the opposite side of a vehicle. All these features can be very beneficial to classify and to segment radar targets, and above all to perform bounding box estimations of objects. This work is structured as follows. Section II displays related work in the field of object classification and bounding box estimation using radar data. Furthermore, deep learning approaches on point clouds using PointNets are introduced. In Section III the problem of this contribution is stated. Section IV presents the proposed method to detect 2D object hypotheses in radar data. In addition to that, the automatically generated radar dataset is described, and the training process is explained. Section V shows results which are evaluated on This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

real-world radar data. Finally, a short conclusion is given. II. RELATED WORK Object Classification in Radar Data: For the object classification task in radar data, Heuel and Rohling [4] [6] present approaches with self-defined extracted features to recognize and classify pedestrians and vehicles. Wöhler et al. [7] extract stochastic features and use them as input for a random forest classifier and a long short-term memory network. In order to classify static objects, Lombacher et al. [8] accumulate raw radar data over time and transform them into a grid map representation [9]. Then, windows around potential objects are cut out and used as input for a deep neural network. Furthermore, Lombacher et al. [10] infer a semantic representation for stationary objects using radar data. To this end, a convolutional neural network is fed with an occupancy radar grid map. Bounding Box Estimation in Radar Data: Instead of classifying objects, another task is to estimate a bounding box, i.e., position, orientation and dimension of an object. Roos et al. [11] present an approach to estimate the orientation as well as the dimension of a vehicle using highresolution radar. For this purpose, single measurements of two radars are collected and enhanced versions of orientated bounding box algorithms and the L-fit algorithm are applied. Schlichenmaier et al. [12] show another approach to estimate bounding boxes using high-resolution radar data. For this reason, position and dimension of vehicles are estimated using a variant of the k-nearest-neighbors method. Furthermore, Schlichenmaier et al. [13] present an algorithm to estimate bounding boxes representing vehicles using template matching. Especially in challenging scenarios, the templating matching algorithm outperforms the orientated bounding box methods. A disadvantage of the proposed method is that clutter points not belonging to the vehicle are however taken into account for the bounding box estimation. PointNets: The input of most neural networks has to follow a regular structure, e.g., image grids or grid map representation. This requires that data such as point clouds or radar targets have to be transformed in a regular format before feeding them into a neural network. The PointNet architecture overcomes this constraint and support point clouds as input. Qi et al. [1] present a 3D classification and a semantic segmentation of 3D lidar point clouds using PointNet. Since the PointNet architecture does not capture local structures induced by the metric space, the ability to capture details is limited. For this reason, Qi et al. [2] also propose a hierarchical neural network, called PointNet++, which applies PointNet recursively on small regions of the input point set. The PointNet++ architecture facilitates to leverage neighborhoods at multiple scales resulting in the ability to learn deep point set features, i.e., to capture finegrained patterns, and robustness. Schumann et al. [14] use the same PointNet++ architecture for semantic segmentation on radar point clouds. For this purpose, the architecture is modified to handle point clouds with two spatial and two further feature dimensions. The radar data is accumulated over a time period of 500 ms to get a denser point cloud with more reflections per object. Subsequently, each radar target is classified among six different classes. Thus, only a semantic segmentation is performed but no semantic instance segmentation or bounding box estimation. Utilizing image and lidar data, Qi et al. [3] present the Frustum PointNets, to detect 3D objects. First, a 3D frustum point cloud including an object is extracted using a 2D bounding box from an image based object detector. Second, a 3D instance segmentation in the frustum is performed using a segmentation PointNet. Third, an amodal 3D bounding box is estimated using a regression PointNet. Hence, Frustum PointNets are the first method performing object detection including bounding box estimation using PointNets on unstructured data. III. PROBLEM STATEMENT Given radar point clouds, the goal of the presented method is to detect objects, i.e., to classify and localize objects in 2D space (Figure 1). The radar point cloud is represented as a set of four-dimensional points P = {p i i = 1,..., n}, where n N denotes the number of radar targets. Furthermore, each point p i = (x, y, ṽ r, σ) contains (x, y)-coordinates, ego-motion compensated Doppler velocity ṽ r, and radar cross section σ. The radar data are generated from one measurement cycle of a single radar and are not accumulated over time. For the object classification task, this contribution distinguishes between two classes, car and clutter. Note that this method can easily be extended to multiple classes. Additionally, for radar targets that are assigned to the class car, an amodal 2D bounding is predicted, i.e., even if only parts are captured by the radar sensor, the entire object is estimated. The 2D bounding box is described by its center (x c, y c ), its heading angle θ in the xy-plane and its size containing length l and width w. IV. 2D OBJECT DETECTION WITH POINTNETS An overview of the proposed 2D object detection system using radar data is shown in Figure 2. This sections introduces the three major modules of the system: patch proposal, classification and segmentation, and amodal 2D bounding box estimation. A. Patch Proposal The patch proposal divides the radar point cloud into region of interests. For this purpose, a patch with a specific length and width is determined around each radar target. The lenght and width of the patch must be selected in such a way that it comprises the entire object of interest, here a car. The patch proposal generates multiple patches containing the same object. As a result, the final 2D object detector provides multiple hypotheses for a single object. This behavior is desirable because the object tracking system in the further processing chain for environmental perception deals with multiple hypotheses per object. Note that the tracking system is not part of this work. As described in [3], the patches

Radar Targets Generation of 2D Patches Patch Proposal nx4 Classification and Segmentation PointNet Classification and 2D Segmentation Masking mx4 Transformer PointNet Amodal 2D Bounding Box Estimation PointNet Amodal 2D Bounding Box Estimation Box Parameters Fig. 2: 2D object detection in radar data with PointNets. First, a patch proposal determines multiple region of interests, called patches, using the entire radar target list. Second, a classification and segmentation network classifies these patches. Subsequently, each of the n radar targets are classified to get an instance segmentation. Finally, a regression network estimates an amodal 2D bounding box for objects using the m segmented car radar targets. are normalized to a center view which ensures rotationinvariance of the algorithm. Finally, all radar targets within a patch are forwarded to the classification and segmentation network. B. Classification and Segmentation The classification and object segmentation module consists of a classification and a segmentation network to get classified object with segmented radar targets. To this end, the entire patches are classified using the classification network to distinguish between car and clutter patches. For car patches, the segmentation network predicts a probability score for each radar target which indicates the probability of radar targets belonging to a car. In the masking step, radar targets which are classified as car targets are extracted. As presented in [3], coordinates of the segmented radar targets are normalized to ensure translational invariance of the algorithm. Note that the classification and segmentation module can easily be extended to multiple classes. For this purpose, the patch is classified as a certain class and, consequently the predicted classification information is used for the segmentation step. C. Amodal 2D Bounding Box Estimation After the segmentation of object points, this module estimates an associated amodal 2D bounding box. First, a lightweight regression PointNet, called Transformer PointNet (T-Net), estimates the center of the amodal bounding box and transforms radar targets into a local coordinate system relative to the predicted center. This step and the T-Net architecture are described in detail in [3]. The transformation using T-Net is reasonable, because regarding the viewing angle, the centroid of segmented points can differ to the true center of the amodal bounding box. For the 2D bounding box estimation, a box regression PointNet, which is conceptually the same as proposed in [3], is used. The regression network predicts the parameters of a 2D bounding box by its center (x c, y c ), its heading angle θ and its size (l, w). For the box center estimation, the residual based 2D localization of [3] is performed. Heading angle and size of bounding box is predicted using a combination of a classification and a segmentation approach TABLE I: Radar dataset. Dataset distribution among classes. class number of patches number of targets car 289 597 (34.14%) 1 967 761 (10.55%) clutter 558 616 (65.86%) 16 676 716 (89.45%) as explained in [3]. More precisely, for size estimation, predefined size templates are incorporated for the classification task. Furthermore, residual values regarding those categories are predicted. In case of multiple classes, the box estimation network also uses the classification information for the bounding box regression. Therefore, the size templates have to be extended by additional classes, e.g. pedestrians or cyclists. D. Network Architectures For the object detecion task in radar data, two network architectures were considered, the v1 model based on the concepts of PointNet [1], and the v2 model based on the concepts of PointNet++ [2]. Figure 3 shows both network architectures. For the classification and segmentation network the architecture is conceptually similar to PointNet [1] and [2]. The network for amodal 2D bounding box estimation is the same as proposed in [3]. Since this work uses radar data as input, the input of both networks is extended. For the classification and segmentation network, the input is a four-dimensional radar target list of a patch containing n points with 2D spatial data, ego-motion compensated Doppler velocity and RCS information. For the bounding box regression network, the input is spatial information of a segmented radar target list consisting of m points. E. Dataset For training and testing of the proposed 2D object detection method in radar data, a dataset with real-world radar data was created on a test track. Two test vehicles of Ulm University [15] were used, an ego vehicle and a target vehicle. The ego vehicle is equipped with two Continental ARS 408-21 Premium radar sensors that are installed on the front corners of the vehicle. Note in this work, only the radar sensor mounted on the front left corner is used. Both vehicles are equipped with an Automotive Dynamic Motion Analyzer (ADMA) sensor unit from GeneSys and a highly accurate Differential Global Positioning System (DGPS). The target

v1 Classification Network v2 Classification Network n x 4 mlp(64,64) n x 64 mlp(64,128,1024) n x 1024 max pool 1024 FCs(512,256,k) 48 x 192 32 x 256 (MSG) (MSG) n x 4 np=32, r=[2,4], np=48, r=[1,3], n neigh=[16,32], n neigh=[8,32], mlp=[[64,64,128], mlp=[[32,32,64], [64,64,128]] [64,64,128]] 1024 r=inf, mlp=[128,256,1024] FCs(256,128,k) v1 Segmentation Network n x (1088+k) mlp(512,256,128,128,2) n x 2 n x 1 object prob. 1024 k FP 128 x 128 FP 32 x 128 FP n x 128 mlp=(128,128) mlp=(128,128) mlp=(128,128) v2 Segmentation Network mlp=(128,2) n x 2 n x 1 object prob. v1 Amodal 2D Box Estimation Network v2 Amodal 2D Box Estimation Network m x 2 mlp(128,128,256,512) m x 512 max pool 512 k FCs (512,256, 2+3n s+2n h) box parameters m x 2 32 x 128 np=32, r=0.5, mlp=[64,64,128] 16 x 256 np=16, r=1.0, mlp=[128,128,256] 512 k r=inf, mlp=[256,256,512] FCs(512,256, 2+3n s+2n h) box parameters Fig. 3: Network architectures for 2D object detection in radar data with PointNets. On the left, the v1 model is based on PointNet, on the right, the v2 model is based on PointNet++. Input points are a list of radar targets. To get s in the classification network, v1 model consists of a multi-layer perceptron (mlp) and a max pooling layers, v2 model contains set abstraction () layers with multi-scale grouping (MSG) and single-scale grouping. After fully connected (FC) layers, classification scores for the whole point cloud are obtained. In the segmentation network, local and s are combined with from the classification, v1 model applies mlp, v2 model uses feature propagation (FG) to obtain object probabilities. In the box estimation network, v1 model contains mlp and max pooling layers, v2 consists of layers with SSG to get s. Features are combined with classification scores and are forwarded to FC layers to determine box parameters. vehicle is a Mercedes E-Class station wagon S212. With the ADMA data and the vehicle specification, ground truth data is generated. Ground truth data of the vehicle is represented as a bounding box comprising position, orientation, and dimension. At the time of recording the dataset, there were different weather conditions. The dataset consists of eleven different driving maneuvers, e.g., driving circles and figure eights, driving in front or towards the ego vehicle, as well as passing maneuvers. The idea is to cover as many maneuvers under training conditions, which can be transferred to real-world traffic scenarios, as possible. The labeling process generates automatically annotated radar targets by using the ground truth bounding box of the target vehicle as reference. Each radar target inside the reference bounding box is marked as car target, and all the others as clutter targets. Since radar measurements are typically noisy, targets close to the target vehicle can belong to the car class. Therefore, the ground truth bounding box is extended by 0.35 m in length and width. The big advantage of the automatic label procedure is that a large amount of data can be annotated with minimal effort. After patch proposal, i.e., generating patches around each radar target, patches are annotated as car if the reference bounding box is inside the associated patch, and all the others as clutter. Since a certain number of radar targets are necessary to assign a class as well as to estimate a bounding box, the dataset is created under certain conditions. Hence, each car patch has to comprise at least 2 radar targets belonging to the car class, and each clutter patch has to comprise at least 16 radar targets belonging to the clutter class. Table I shows the resulting distribution of the annotated patches and radar targets. For training, validation, and testing purpose, the dataset is divided into three parts: training, validation and testing data. The training data is used to train the model. The evaluation of the model during the training process uses the validation data. The validation data consists of the same driving maneuvers as the training data. For example, if a maneuver was driven five times, then three of them are added to the training data and the rest to the validation data. This ensures that all trained driving maneuvers are covered within the validation data. The testing data is used to show the generalization ability of models after the training process. The testing data consists of different driving maneuvers which differ from the trained maneuvers. As a result, the testing data is completely disjoint from the training and the validation data. In total, 54, 20% of the patches are used for training, 19, 51% for validation and 26, 29% for testing. F. Training As proposed in [3], training is performed with a multi-task loss to optimize classification and segmentation PointNet, T-Net and amodal 2D bounding box estimation PointNet simultaneously. Since this work uses a patch classification before segmentation, the multi-task loss is extended by the

classification part as proposed in [1]. As a result, the multitask loss is defined as L multi task =w cls L cls + w seg L seg + w box (L c1 reg + L c2 reg + L h cls + L h reg + L s cls + L s reg + w corner L corner ). (1) In Equation (1), L cls denotes the loss for patch classification and L seg for segmentation of radar targets in the patch. Both losses can be weighted by the parameters w cls and w seg respectively. Since the distribution of car and clutter patches is imbalanced (Table I), during training process the weights for w cls and w seg are chosen to be higher for car patches. All other losses are used for bounding box regression. Here, L c1 reg is for residual based center regression of T-net and L c2 reg for center regression of amodal box estimation network. Furthermore, L h cls and L h reg are losses for heading angle estimation, while L s cls and L s reg are for size estimation of the bounding box. The corner loss L corner, weighted by w corner, is a novel regularization loss proposed in [3] for jointly optimizing center, size and heading angle of the object of interest. This loss is constructed to ensure a good 2D bounding box estimation under the intersection over union metric. The parameter w box weights the bounding box estimation. If a patch is classified as clutter patch, w box is set to zero with the result that no bounding box estimation is performed. Moreover, softmax with cross-entropy loss is used for the classification and the segmentation task, and smooth-l 1 (huber) loss is used for loss calculations of the regression task. To guarantee a fixed number of input radar targets, sampling is performed. For the classification and segmentation, radar targets are sampled up to 48. Since the number of radar targets generated by car class is sparse, the sample process uses all car targets and only samples clutter targets. For the amodal 2D bounding box estimation, 32 points are randomly sampled from the segmented radar targets. Data augmentation is a helpful concept to avoid model overfitting [16]. Hence, for all radar targets of a patch, data augmentation is applied during training. First, spatial information in a radar patch are shifted randomly and uniformly. Second, for car targets ego-motion compensated Doppler velocity is perturbed using random Gaussian noise with zero mean and standard deviation of 0.2. Third, for perturbing RCS value of car targets, random noise of a Gaussian distribution with zero mean and standard deviation of one is used. The training of both models uses Adam optimizer and performs batch normalization for all layers excluding the last classification and regression layers. The initial learning of both models is chosen to be 0.0001. For v1 or rather v2 model, each 45.000 and 50.000 iterations learning rate is decreased by half. For both models, initial decay rate is 0.5, which is increased by 0.5 up to 0.99. The batch size for both models is chosen to be 32. For v1 or rather v2 model, training is performed for 31 and 51 epochs on a single NVIDIA GeForce GTX 1070 GPU. Both models are trained on the full dataset containing patches with at least 2 car targets and 16 clutter targets per patch. Furthermore, the weights for multi-task loss during training are heuristically chosen to be w cls = 2, w seg = 2 for car patches and w cls = 1, w seg = 1 for clutter patches, and w box = 1, w corner = 10 for the bounding box estimation. V. EXPERIMENTS This section shows the evaluation of the proposed 2D object detector for radar data. For this purpose, results for the selected v1 model which is based on PointNet and the v2 model which is based on PointNet++ are compared. Also, the impact of the number of radar targets belonging to the car class is analyzed. Finally, the results and the restrictions are discussed. A. Evaluation Since the proposed 2D object detection method contains classification, segmentation, and bounding box estimation, the performance of these three modules will be evaluated. For the classification and segmentation, precision, recall, and F 1 score is used. For the bounding box regression, performance is measured by Intersection over Union (IoU). Precision denotes the accuracy of all positive predictions and is given by T P P = T P + F P, (2) where T P is the number of true positives and F P the number of false positives. Recall depicts the ratio of all positive examples which are correctly classified as positives and is defined as R = T P T P + F N, (3) where F N is the number of false negatives. F 1 is the harmonic mean between precision and recall and is given by F 1 = 2 P R P + R. (4) The IoU compares a predicted bounding box b pred with the ground truth bounding box b gt and is defined as IoU = b gt b pred b gt b pred, where measures the area of the underlying set. If the ground truth bounding box and the predicted bounding box are almost the same, IoU score tends to be close to one. If the two bounding boxes do not overlap at all, IoU score will be zero. Testing was performed on the full dataset as well as on subsets. Therefore, combinations with different number of car and clutter targets per patch are chosen, namely 4 and 20, 6 and 24, 8 and 30, 10 and 32. The length and the width of the patches (IV-A) are chosen to be 10 m, which guarantees that a patch comprises an entire vehicle at any time. Results of the 2D object detector in radar data are

TABLE II: Results for 2D object detection in radar data. Both models are evaluated on the test set. Precision, recall, and F 1 score are evaluated for classification and segmentation. IoU evaluates the 2D bounding box estimation by using average IoU and ratio of IoUs with a threshold of 0.7. Furthermore, different thresholds for the number of car and clutter targets are used. Model Number of Car / Classification Segmentation 2D Bounding Box Clutter Targets P R F 1 P R F 1 IoU IoU ( 0.7) v1 10/ 32 98.04% 99.33% 98.69% 93.61% 92.17% 92.89% 0.64 52.26% v2 10/ 32 98.82% 98.82% 98.82% 96.50% 93.34% 94.90% 0.65 55.96% v1 8/ 30 97.91% 98.75% 98.33% 93.65% 92.28% 92.96% 0.63 50.17% v2 8/ 30 98.69% 98.07% 98.38% 96.69% 93.61% 95.12% 0.65 52.97% v1 6/ 24 96.95% 98.49% 97.71% 93.57% 93.01% 93.29% 0.62 47.91% v2 6/ 24 98.14% 97.93% 97.99% 96.68% 94.08% 95.36% 0.64 50.27% v1 4/ 20 97.13% 97.81% 97.47% 93.38% 93.67% 93.53% 0.59 42.75% v2 4/ 20 98.36% 97.40% 97.88% 96.60% 94.27% 95.42% 0.61 45.17% v1 2/ 16 96.66% 95.27% 95.96% 92.22% 93.47% 93.84% 0.52 33.52% v2 2/ 16 98.23% 95.30% 96.74% 96.05% 93.74% 94.88% 0.54 35.98% shown in Table II. The more targets belong to a specific class, the better the results are for classification, segmentation, and bounding box estimation. Especially, for the prediction of the amodal 2D bounding box, the number of car targets has a great impact. In total, v2 model outperforms v1 model in almost all test cases. In particular, the results for bounding box estimation are always better. Considering the inference time for the proposed object detector, v1 model takes 2.7 ms and v2 model takes 6.0 ms for prediction (classification and segmentation together with amodal 2D bounding box estimation) per patch. B. Discussion Generally, the results of the 2D object detector in radar data are promising. However, it is important to note that the presented dataset is limited. First, the dataset contains only one car object in one radar measurement. Second, a further restriction is that the resulting car object has always the same ground truth size, such that the result of size estimation must not be considered in more detail. Nevertheless, the proposed method is able to deal with objects with different sizes. The same argumentation holds for the detection of multiple classes. The proposed object detector is able to recognize multiple classes, but the current dataset does not include objects with different classes. For future research, the dataset will be extended to multiple objects per radar measurement and multiple classes. Another important point to note is that in further work, the object detector for radar data will be a preprocessing module for a multi-object tracking system which fuses data of multiple radar sensors. This is the main reason why no accumulation of multiple radar measurements is used in order to get point clouds that are more dense. The fusion of radar measurements with different time stamps, as well as measurements from multiple radar sensors, can be processed with a sophisticated object tracking system. Additionally, the object tracker probabilistically models multiple hypotheses per object. So it is desirable that the object detector generates multiple hypotheses which is guaranteed by the patch proposal module. Since one radar sensor is only able to provide sparse data per measurement cycle, the proposed object detector will naturally generate misdetections or clutter measurements. However, a multi-object tracker such as the Labeled Multi-Bernoulli Filter [17] is able to handle this. VI. CONCLUSION This contribution has proposed an approach to detect 2D objects hypotheses in sparse radar data. The object detector performs an object classification and a 2D segmentation, together with an amodal 2D bounding box estimation using variants of PointNets. Input of the PointNets is a point cloud with radar targets generated from one measurement cycle of one single radar sensor. Although, in this work only one object class was considered, the results are promising and the proposed approach should be further investigated. In future work, the dataset will be extended by several classes to investigate the performance of the object detector for multiple classes. Additionally, objects in the dataset belonging to the same class will have different size to examine the impact of object dimension. ACKNOWLEDGMENT This work was performed as part of the second author s research within a thesis to obtain the Master of Science. This research is accomplished within the project UNICARagil (FKZ 16EMO0290). We acknowledge the financial support for the project by the Federal Ministry of Education and Research of Germany (BMBF). REFERENCES [1] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. [2] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, in Advances in Neural Information Processing Systems 30, 2017, pp. 5099 5108. [3] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, Frustum pointnets for 3d object detection from rgb-d data, Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2018. [4] S. Heuel and H. Rohling, Two-Stage Pedestrian Classification in Automotive Radar Systems, in 2011 12th International Radar Symposium (IRS), Sep. 2011, pp. 477 484. [5], Pedestrian Classification in Automotive Radar Systems, in 2012 13th International Radar Symposium, May 2012, pp. 39 44. [6], Pedestrian Recognition in Automotive Radar Sensors, in 2013 14th International Radar Symposium (IRS), vol. 2, June 2013, pp. 732 739.

[7] C. Wöhler, O. Schumann, M. Hahn, and J. Dickmann, Comparison of Random Forest and Long Short-Term Memory Network Performances in Classification Tasks Using Radar, in 2017 Sensor Data Fusion: Trends, Solutions, Applications (SDF), Oct 2017, pp. 1 6. [8] J. Lombacher, M. Hahn, J. Dickmann, and C. Wöhler, Potential of Radar for Static Object Classification using Deep Learning Methods, in 2016 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM), May 2016, pp. 1 4. [9] K. Werber, M. Rapp, J. Klappstein, M. Hahn, J. Dickmann, K. Dietmayer, and C. Waldschmidt, Automotive Radar Gridmap Representations, in 2015 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM), April 2015, pp. 1 4. [10] J. Lombacher, K. Laudt, M. Hahn, J. Dickmann, and C. Wöhler, Semantic Radar Grids, in 2017 IEEE Intelligent Vehicles Symposium (IV), June 2017, pp. 1170 1175. [11] F. Roos, D. Kellner, J. Dickmann, and C. Waldschmidt, Reliable Orientation Estimation of Vehicles in High-Resolution Radar Images, IEEE Transactions on Microwave Theory and Techniques, vol. 64, no. 9, pp. 2986 2993, Sep. 2016. [12] J. Schlichenmaier, F. Roos, M. Kunert, and C. Waldschmidt, Adaptive Clustering for Contour Estimation of Vehicles for High-Resolution Radar, in 2016 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM), May 2016, pp. 1 4. [13] J. Schlichenmaier, N. Selvaraj, M. Stolz, and C. Waldschmidt, Template Matching for Radar-Based Orientation and Position Estimation in Automotive Scenarios, in 2017 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM), March 2017, pp. 95 98. [14] O. Schumann, M. Hahn, J. Dickmann, and C. Wöhler, Semantic segmentation on radar point clouds, in 2018 21st International Conference on Information Fusion (FUSION), July 2018, pp. 2179 2186. [15] F. Kunz, D. Nuss, J. Wiest, H. Deusch, S. Reuter, F. Gritschneder, M. Stübler, M. Bach, P. Hatzelmann, C. Wild, and K. Dietmayer, Autonomous Driving at Ulm University: A Modular, Robust, and Sensor-Independent Fusion Approach, in Intelligent Vehicles Symposium, 2015, pp. 666 673. [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS 12. U: Curran Associates Inc., 2012, pp. 1097 1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257 [17] S. Reuter, B.-T. Vo, B.-N. Vo, and K. Dietmayer, The Labeled Multi- Bernoulli Filter, IEEE Transactions on Signal Processing, vol. 62, no. 12, pp. 3246 3260, 2014.