Focal Loss in 3D Object Detection

Peng Yun1, Lei Tai2, Yuan Wang2, Chengju Liu3, Ming Liu2

Fig. 1. Upper two rows show projected 3D object detection results from the detector trained with binary cross entropy. Lower two rows present related results from the detector trained with the focal loss. Purple and blue bounding boxes are the ground-truth and the estimated results respectively.

Abstract—3D object detection is still an open problem in autonomous driving scenes. When recognizing and localizing key objects from sparse 3D inputs, autonomous vehicles suffer from a larger continuous searching space and higher fore-background imbalance compared to image-based object detection. In this paper, we aim to solve this fore-background imbalance in 3D object detection. Inspired by the recent use of focal loss in image-based object detection, we extend this hard-mining improvement of binary cross entropy to point-cloud-based object detection and conduct experiments to show its performance based on two different 3D detectors: 3D-FCN and VoxelNet. The evaluation results show up to 11.2 AP gains through the focal loss in a wide range of hyperparameters for 3D object detection.

Index Terms—Deep Learning in Robotics and Automation; Object Detection, Segmentation and Categorization; Recognition.

This work was supported by the National Natural Science Foundation of China (Grant No. U ), and was partially supported by Shenzhen Science Technology and Innovation Commission (SZSTI) JCYJ , the Research Grant Council of Hong Kong SAR Government, China, under Project No. , No. and No. awarded to Prof. Ming Liu. (Corresponding author: Peng Yun.)
1 Peng Yun is with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong (e-mail: pyun@ust.hk). 2 Lei Tai, Yuan Wang and Ming Liu are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong (e-mail: {ltai, ywangeq, eelium}@ust.hk). 3 Chengju Liu is with the College of Electrical and Information Engineering, Tongji University, China (e-mail: liuchengju@tongji.edu.cn).

I. INTRODUCTION

OBJECT detection in 3D is still challenging in robotics perception, the applied scenes of which widely include urban and suburban roads, highways, bridges and indoor settings. Robots recognize and localize key objects from data in the 3D form and predict their locations, sizes and orientations, which provides both semantic and spatial information for high-level decision making. The point cloud is one of the most commonly used 3D data forms, and can be gathered by range cameras, like LiDAR and RGB-D cameras. Since the coordinate information of point clouds is not influenced by appearance changes, point clouds are also robust in extreme weather and various seasons. In addition, the point cloud is naturally scale-invariant: the scale of an object is invariant anywhere in a point cloud, while it always changes in an image due to foreshortening effects. Moreover, the increasing perception distance and decreasing price of 3D LiDARs make them a promising direction for autonomous driving researchers [1]. Current image-based detectors benefit from the translation invariance of convolution operations and can perform with human-comparable accuracy. However, the successful image-based architectures cannot be directly applied in 3D space. Point-cloud-based object detection consumes point clouds, which are sparse point lists instead of dense arrays. If drawing

on the success of image-based detectors and conducting dense convolution operations to acquire translation invariance, preprocessing must be implemented to convert the sparse point clouds into dense arrays. Otherwise, special layers should be carefully designed to extract meaningful features from the sparse inputs. Additionally, the fore-background imbalance is much more serious than in 2D scenarios, since the new z-axis further enlarges the searching space and the extent of imbalance is different for each z value.

Lin et al. [2] proposed focal loss to tackle the fore-background imbalance in image-based object detection, so that one-stage detectors could achieve state-of-the-art accuracy like two-stage detectors. As a hard-mining improvement of binary cross entropy, it helps the network focus on hard classified objects, in case they are overwhelmed by a large number of easily classified objects.

Similar to image-based detection methods, point-cloud-based detection methods can also be classified into two-stage [3], [4], [5] and one-stage detectors [6], [7]. In this paper, inspired by [2], we aim to solve the fore-background imbalance for 3D object detection through the focal loss. We claim the following contributions:

- We extend focal loss to 3D object detection to solve the huge fore-background imbalance in one-stage detectors, and conduct experiments on two different one-stage 3D object detectors, 3D-FCN [6] and VoxelNet [7]. The experiment results demonstrate up to 11.2 AP gains from the focal loss in a wide range of hyperparameters.
- To further understand focal loss in 3D object detection, we analyze its effect on foreground and background estimations, and validate that it plays a role similar to that in image-based detection. We also find that the special architecture of VoxelNet can naturally handle hard negatives well.
- We plot the final posterior probability distributions of the two detectors and demonstrate that the focal loss with increasing hyperparameter γ decreases the estimated posterior probabilities.

II. RELATED WORK

A. Two-Stage 3D Object Detection

When extending two-stage image detectors to the 3D space, researchers encounter the following problems: (1) the input is sparse and at low resolution; (2) the original image-based methods are not guaranteed to have enough information to generate region proposals. Ku et al. [4] proposed AVOD, which fuses RGB images and point clouds. It first proposes aligned 3D bounding boxes with a multimodal fusion region proposal network. Then, the proposed bounding boxes are classified and regressed with fully connected layers. Both the appearance and the 3D information are well-utilized to improve the accuracy and robustness of the proposed model in extreme scenes. Their hand-crafted features could be further improved to learn representations directly from raw LiDAR inputs to alleviate information loss. Qi et al. [3] proposed F-PointNet and leveraged both 2D object detectors and 3D deep learning for object localization.

TABLE I
IMAGE-BASED AND POINT-CLOUD-BASED OBJECT DETECTION

            Image-Based     Point-Cloud-Based
Method      -               3D-FCN [6]   VoxelNet [7]
Dimension   2D              3D           3D
Input       Dense Grid      Dense Grid   Sparse Point List
Network     Dense Conv      Dense Conv   Heterogeneous Pipeline
Stage       One/Two-Stage   One-Stage    One-Stage

They extracted the 3D bounding frustum of an object with a 2D object detector. Then 3D instance segmentation and 3D bounding box regression were applied with two variants of PointNet [8]. F-PointNet achieves state-of-the-art accuracy on the KITTI 3D object detection challenge [9], and also performs at real-time speed for 3D object detection. Their image detector needs to be carefully designed with a high recall rate, since the accuracy upper bound is determined by the first stage.

B. One-Stage 3D Object Detection

Li [6] extended a 2D fully convolutional network to 3D. The voxelized point clouds are processed by an encoder-decoder network. The 3D fully convolutional network (3D-FCN) finally proposes a probability and a regression map for the whole detection region.
It thoroughly consists of 3D dense convolutions with high computation and memory costs, so that the network depth is limited and it is hard to extract high-level features. Unlike 3D-FCN and AVOD, both of which adopt hand-crafted features to represent the point clouds, Zhou et al. [7] designed an end-to-end network, called VoxelNet, to implement point-cloud-based 3D object detection with learned representations. Compared to 3D-FCN [6], the computation cost is mitigated by the Voxel Feature Encoding Layers (VFELayers) and 2D convolution. In this paper, we adopt 3D-FCN [6] and VoxelNet [7] as two different types of one-stage 3D detectors. As shown in Table I, 3D-FCN consumes dense grids and consists of only 3D dense convolution layers, where the 2D FCN architecture [10] is extended to 3D for dense feature extraction. In contrast, VoxelNet consumes sparse point lists and is a heterogeneous network, which first extracts sparse features with its novel VFELayers and then conducts 3D and 2D convolution sequentially.

C. Imbalance between Foreground and Background

Image-based object detectors can be classified into two-stage and one-stage detectors. For two-stage detectors, like R-CNN [11], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as the background using a convolutional neural network. The two-stage detectors [12], [13] achieve state-of-the-art accuracy on the COCO benchmark. On the other hand, one-stage detectors, like YOLO [14] and SSD [15], aim to simplify the pipeline. They improve the training speed of deep models and also demonstrate promising results in terms of accuracy.

Lin et al. [2] explored both one-stage and two-stage detectors in image-based object detection, and claimed that the hurdle that obstructs the one-stage detectors from better accuracy is the extreme fore-background class imbalance encountered during training of dense detectors. They reshaped the standard cross entropy loss and proposed the focal loss such that the losses assigned to well-classified examples are down-weighted. This can be seen as a hard-mining improvement of binary cross entropy to help networks focus on hard classified objects in case they are overwhelmed by a large number of easily classified objects.

We extend focal loss to 3D object detection to tackle the fore-background imbalance problem. Different from image-based detection, point-cloud-based object detection is a more challenging perception problem in 3D space with sparse sensor data and suffers from more serious fore-background imbalance. To thoroughly evaluate the performance of the focal loss in this harder task, we conduct experiments based on two different types of one-stage 3D detectors: 3D-FCN and VoxelNet. We analyze the focal loss effect on these two 3D detectors following a similar method to that in [2], and further discuss the decreasing posterior probability effect of the focal loss.

III. FOCAL LOSS

In this section, we first declare notations and revisit the focal loss [2], and then further analyze the fore-background imbalance in 3D object detection.

A. Preliminaries

We define y ∈ {±1} as the ground-truth class, and p as the estimated probability for the class with label y = 1. For notational convenience, we define the posterior probability p_t as

    p_t = { p       if y = 1
          { 1 − p   if y = −1,                                (1)

where p is calculated with p = sigmoid(x). The binary cross entropy (BCE) loss and its derivative can be formulated as

    ε_BCE(p_t) = −log(p_t),                                   (2)

    dε_BCE(p_t)/dx = y(p_t − 1).                              (3)

As claimed in [2], when the network is trained with BCE loss, its gradient will be dominated by the vast easily classified negative samples if a huge fore-background imbalance exists.
Focal loss can be considered as a dynamically scaled cross entropy loss, which is defined as

    ε_FL(p_t) = −(1 − p_t)^γ log(p_t),                        (4)

    dε_FL(p_t)/dx = y(1 − p_t)^γ (γ p_t log(p_t) + p_t − 1).  (5)

The contribution from the well classified samples (p_t ≥ 0.5) to the loss is down-weighted. The hyperparameter γ of the focal loss can be used to tune the weight of different samples. As γ increases, fewer easily classified samples contribute to the training loss. Obviously, when γ reaches 0, the focal loss degrades to the BCE loss. In the following sections, all the cases with γ = 0 represent BCE loss cases.

Researchers have previously either introduced hyperparameters to balance the losses calculated from positive and negative anchors, or normalized positive and negative losses by the frequency of corresponding anchors. However, one essential problem that these two previous methods cannot handle is the gradient salience of hard negative samples: the gradients of hard negative anchors (p_t < 0.5) are overwhelmed by those of a large number of easy negative anchors (p_t ≥ 0.5). Due to the dynamic scaling with the posterior probability p_t, a weighted focal loss can be used to handle both the fore-background imbalance and the gradient salience of hard negative samples with the following form,

    ε_FL(p_t) = −λ(1 − p_t)^γ log(p_t),                       (6)

where λ is introduced to weight different classes. In the following sections, we adopt hyperparameters α and β to weight the positive and negative focal loss respectively.

B. Fore-background Imbalance in 3D Object Detection

The methods for 3D object detection can be classified as one-stage [6], [7] and two-stage [3], [4], [5] detectors. The two-stage detectors first adopt an algorithm with a high recall rate to propose regions that possibly contain objects and then adopt a convolution network to classify classes and regress bounding boxes. The one-stage detectors are end-to-end networks that learn representations and implement classification and regression on all anchors.
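As a concrete illustration, the weighted focal loss of Equation 6 can be sketched in a few lines of NumPy (function and argument names are ours; α and β play the role of the class weight λ for positive and negative samples):

```python
import numpy as np

def weighted_focal_loss(x, y, gamma=2.0, alpha=1.0, beta=1.0):
    """Per-sample weighted focal loss (Eq. 6) on raw logits x,
    with labels y in {+1, -1}. alpha weights positive samples,
    beta negative ones (the lambda of Eq. 6)."""
    p = 1.0 / (1.0 + np.exp(-x))            # p = sigmoid(x)
    p_t = np.where(y == 1, p, 1.0 - p)      # posterior probability, Eq. 1
    lam = np.where(y == 1, alpha, beta)     # class weight lambda
    return -lam * (1.0 - p_t) ** gamma * np.log(p_t)
```

With γ = 0 and α = β = 1 this reduces to the BCE loss of Equation 2, and increasing γ shrinks the loss of well-classified samples (p_t ≥ 0.5) much faster than that of hard ones.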
In one-stage methods, anchors are proposed at each location, and thus a huge fore-background imbalance exists. For instance, there are 50k bounding boxes proposed in each frame for 3D-FCN and 70k for VoxelNet, but fewer than 30 anchors among them contain positive objects (e.g. car, pedestrian, cyclist). Compared to image detectors, the extra estimation along the z-axis further increases the fore-background imbalance. Additionally, positive samples are always located at positions with small z values in some specific scenes. For instance, cars and pedestrians are always on the road in autonomous driving scenes. In such situations, the distribution of fore-background imbalance is different along the z-axis: the extent of imbalance increases with higher z values.

The one-stage methods for 3D detectors are different from the 2D detectors because of their larger searching space, sparse input and different types of network architecture. Therefore, we select two different networks, 3D-FCN and VoxelNet, to conduct experiments to evaluate the performance of focal loss in 3D object detection. The features of these two 3D detectors are discussed in the following two sections, and the experimental details and results are shown in Section VI.

IV. 3D-FCN FEATURES

In this section, we discuss the dense convolution network architecture of 3D-FCN and introduce our enhanced loss function for 3D-FCN. The details of 3D-FCN can be found in [6]. Please refer to the APPENDIX for our implementation of 3D-FCN.

Fig. 2. The dense convolution network architecture of 3D-FCN [6]: the [40,800,800,1] input grid is processed by the BodyNet into [5,100,100,96] features, from which the HeadNet produces the P-Map [5,100,100,1] and the R-Map [5,100,100,24]. The whole network consists of only 3D convolution layers. All intermediate tensors in the hidden space are dense 3D grids (represented by tensors with dimensions [height, width, length, feature]).

A. Dense Convolution Network Architecture

3D-FCN [6] draws on experience from image-based recognition tasks, and extends the 2D convolution layer to 3D space to acquire translation invariance. The input point cloud is firstly voxelized into a 3D dense grid. In each voxel of the 3D dense grid, the values {0,1} are used to represent whether any point is observed. The network architecture of 3D-FCN is shown in Figure 2. The voxelized point cloud is convolved by four blocks sequentially. The output features are then processed by two blocks individually to generate a probability map and a regression map (P-Map and R-Map). Different from image-based object detection, the probability map and regression map are both 3D dense grids, so that the searching space is exponentially increased.

B. Enhanced Loss Function

The original loss function for 3D-FCN [6] is shown on the left of Equations 7 to 11, where ε_P and ε_R represent the classification loss and regression loss, and ε_cls and ε_reg are the loss functions used for classification and regression respectively. In the regression loss ε_R, u and u* are the regression output and ground truth for positive anchors. In the classification loss ε_P, p_pos and p_neg represent the posterior probabilities of positive and negative estimations.

    Original:                       Enhanced:
    ε = ε_P + ε_R                   ε = ε_P + ε_R                             (7)
    ε_P = η(ε_pos + ε_neg)          ε_P = η(ε_pos + ε_neg)                    (8)
    ε_R = Σ ε_reg(u, u*)            ε_R = (1/N_pos) Σ ε_reg(u, u*)            (9)
    ε_pos = Σ ε_cls(p_pos, 1)       ε_pos = α (1/N_pos) Σ ε_cls(p_pos, 1)     (10)
    ε_neg = Σ ε_cls(p_neg, 0)       ε_neg = β (1/N_neg) Σ ε_cls(p_neg, 0)     (11)

In the original form, a large imbalance exists between ε_pos and ε_neg, which represent the classification losses of positive and negative samples respectively.
Therefore, we adopt the loss function used in VoxelNet [7], which normalizes each sub-loss by the corresponding frequency and balances ε_pos and ε_neg with two more hyperparameters α and β. The adopted loss function is shown on the right of Equations 7 to 11. In Section VI, we use the loss function on the right of Equations 7 to 11 to demonstrate the focal loss improvement compared with the BCE loss, where ε_reg denotes the square loss and ε_cls denotes the focal loss. We also show the improvement of the enhanced loss function form compared with the original loss function [6] in the APPENDIX, where ε_reg denotes the square loss and ε_cls denotes the BCE loss.

V. VOXELNET FEATURES

In this section, we discuss the heterogeneous network architecture of VoxelNet, and its bird's-eye-view estimation. The details of VoxelNet can be found in [7]. Please refer to the APPENDIX for our implementation of VoxelNet.

A. Heterogeneous Network Architecture

The heterogeneous architecture overview of VoxelNet is shown in Figure 3. It consists of three main parts: FeatureNet, MiddleLayer and RPN. FeatureNet extracts features directly from sparse point lists. It adopts Voxel Feature Encoding Layers (VFELayers) [7] to extract both point-wise and voxel-wise features directly from points, where fully connected layers are used to extract point-wise features and a symmetric function is used to aggregate local features from all points within a local voxel. Compared to sub-optimally deriving hand-crafted features from voxels, VFELayers can learn representations that minimize the loss function. The derived voxel-wise representations from VFELayers are sparse, which saves memory and time in the computation. In contrast, if a point cloud of the KITTI dataset is partitioned into a [10, 400, 352] dense grid for vehicle detection, only around 5300 voxels (about 0.3%) are non-empty. However, the sparse representation is currently unfriendly to convolutional operations.
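The point-wise and voxel-wise feature extraction of a VFELayer can be sketched as follows, a minimal NumPy version for a single voxel (function and argument names are ours; batch normalization is omitted):

```python
import numpy as np

def vfe_layer(points, W, b):
    """One VFELayer on the T points of a single voxel.
    points: (T, m) array; W: (m, c) and b: (c,) are the shared
    fully connected parameters. Returns (T, 2c): each point-wise
    feature concatenated with the voxel-wise feature obtained by
    element-wise max-pooling over the voxel's points."""
    h = np.maximum(points @ W + b, 0.0)   # shared FC + ReLU, point-wise
    g = h.max(axis=0, keepdims=True)      # symmetric aggregation (max)
    return np.concatenate([h, np.broadcast_to(g, h.shape)], axis=1)
```

Because the max over points is permutation-invariant, the voxel-wise half of the output does not depend on point ordering, which is what lets FeatureNet consume raw sparse point lists.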
In order to implement convolution, VoxelNet compromises on efficiency and converts the sparse representation to a dense representation at the end of FeatureNet: each sparse voxel-wise representation is copied to its specific entry in the dense grid.

MiddleLayer consumes the 3D dense grid and converts it to a 2D bird's-eye-view form, so that further processing can be done in 2D space. The role of MiddleLayer is to learn features from all voxels in the same bird's-eye-view location. Therefore, the 3D convolutional kernel is of size [d,1,1], if we denote the dense grid in the order of z,x,y. The 3D kernel of size [d,1,1] helps aggregate voxel-wise features within a progressively expanding receptive field along the z-axis and keeps the shape in the x,y dimensions.

RPN predicts the probability and regression maps from the 2D bird's-eye-view feature map. Since the increased invariance and large receptive fields of top-level nodes would yield smooth responses and cause inaccurate localization, it does not utilize max-pooling but adopts skip-layers [10] to combine high-level semantic features and low-level spatial features.

B. Estimation in Bird's-eye-view Form

The final probability and regression estimation maps are both in bird's-eye-view form, which is similar to the final estimation of image-based detection methods. This saves both memory and time in the calculation compared to 3D maps, but only one object per location can be estimated in the bird's-eye view.

Fig. 3. VoxelNet heterogeneous architecture [7]. It consists of three main parts: FeatureNet (point-wise and voxel-wise feature transformation, from a K × T × m sparse point list to a K × n sparse feature, followed by a Sparse2Dense conversion), MiddleLayer (3D dense convolution) and RPN (2D dense convolution). The probability and regression maps (P-Map [200,176,2] and R-Map [200,176,14]) are in bird's-eye-view form.

This is acceptable in autonomous driving scenes but will meet problems in indoor scenes, where objects can be stacked up (e.g., a mug on a stack of books). MiddleLayer saves calculation for further processing by aggregating the 3D dense grid into a 2D bird's-eye-view feature map. Otherwise, thoroughly 3D dense convolution in such a deep network (22 convolution layers) would bring exponentially more parameters and calculation. We note that MiddleLayer is still a bottleneck of the whole network, as shown in Table VII, because of its 3D dense convolution operations. An efficient sparse convolutional implementation is still an open problem and deserves effort to solve.

C. Loss Function

We adopt the loss function form from the original VoxelNet [7], which is the same as the right half of Equations 7 to 11. In Section VI, we use the SmoothL1Norm [16] for ε_reg as in the original paper [7] and use the focal loss for ε_cls.

VI. EXPERIMENTS

In this section, we intend to answer two questions: 1) Can focal loss help improve accuracy in the 3D object detection task? 2) Does focal loss have an effect in 3D object detection equal to its effect in image-based detection? To answer the former question, we conduct experiments to compare the performance of 3D-FCN and VoxelNet trained with BCE loss and focal loss on the challenging KITTI benchmark [9]. To answer the second question, we analyze the cumulative distribution curves of 3D-FCN and VoxelNet following a similar method to that in [13].
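The cumulative-distribution analysis just mentioned can be sketched as follows, a minimal NumPy version (the function name is ours): per-sample losses are sorted from low to high, normalized so that they sum to one, and accumulated.

```python
import numpy as np

def loss_cdf(sample_losses):
    """Cumulative distribution of normalized per-sample losses,
    sorted from low to high (the form of the curves in Fig. 4)."""
    s = np.sort(np.asarray(sample_losses, dtype=float))
    return np.cumsum(s / s.sum())   # normalize so losses sum to one
```

Reading such a curve at, e.g., the 85th sample percentile shows what fraction of the total loss the 15% hardest samples account for.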
The code and weights for our experiments are available at .

A. BCE Loss vs. Focal Loss

The KITTI 3D object detection dataset [9] contains 3D annotations for cars, pedestrians and cyclists in urban driving scenarios. The sensor setup mainly consists of a wide-angle camera and a Velodyne LiDAR (HDL-64E), both of which are well-calibrated. The training dataset contains 7481 frames, including both raw sensor data and annotations. The KITTI 3D detection dataset contains some bad annotations, which are empty bounding boxes containing few points. In order to avoid overfitting to those bad annotations, we remove all bounding boxes containing few points (fewer than 10). Following [5], we split the dataset into training and validation sets, each containing around half of the entire set. For simplicity, we conduct experiments only on the car class to show the focal loss improvement. We do so because both 3D-FCN and VoxelNet are trained class-specifically and extending them to other classes is only a matter of tuning; also, the focal loss in the form of Equation 6 is agnostic to the class of objects.

We set α = 1, β = 5, η = 10 in 3D-FCN and α = 1, β = 10, η = 0.5 in VoxelNet so that ε_pos and ε_neg, as well as ε_P and ε_R, are of the same orders of magnitude. As claimed in [2], training a network from scratch with the focal loss is unstable in the beginning. Therefore, we first train the network (both 3D-FCN and VoxelNet) for 30 epochs with the BCE loss and the learning rate lr, and then for another 30 epochs with the focal loss and a discounted learning rate 0.1lr. The minimum overlap thresholds are 0.7, 0.5, 0.5 for 2D evaluation on the image plane, evaluation on the ground plane, and 3D evaluation. The network details of both 3D-FCN and VoxelNet are shown in Table VI and Table VII in the APPENDIX. Non-maximum suppression with the threshold 0.8 is used at the end of 3D-FCN and VoxelNet for estimation refinement. In order to control the single variable γ, we firstly make comparisons among last models, which are trained with the same number of steps.
Additionally, we also make comparisons among best models to make the conclusion more concrete. The best models are selected according to the mean value among easy, moderate and hard 3D detection APs (3D detection mAP). We compare the results of the last models in Table II and Table III, where the rows with γ = 0 and γ > 0 represent the results from the BCE loss and the focal loss respectively. Bolded numbers are the results in which the focal loss cases outperform the BCE loss case. In general, VoxelNet outperforms 3D-FCN in accuracy, since the input of VoxelNet retains the original point clouds, while 3D-FCN suffers from information loss when voxelizing the point clouds into binary representations. Additionally, VoxelNet benefits from its deeper network structure, which is able to extract more useful high-level features. In 3D-FCN, the focal loss helps improve accuracy in all metrics in a wide range of hyperparameters (0 < γ ≤ 2.0), providing gains from 0.3 AP to 11.2 AP. In VoxelNet, the cases with γ = 0.1, 0.5, 1 show gains from the focal loss in all metrics, ranging from 0.6 AP to 9.1 AP. Both gains and losses happen when γ is 0.2 or 2. However, gains (up to 9.1 AP) are generally much greater than losses (at most 2.7 AP). The training processes include some randomness due to sample shuffling and the sophisticated gradient descent training scheme. We further evaluate all intermediate weights and select the best models

Fig. 4. Cumulative distributions of 3D-FCN and VoxelNet for different values of γ. In 3D-FCN (a, b), as γ increases, the loss of both foreground and background samples concentrates on the harder partitions; the effect on the background is stronger. In VoxelNet (c, d), the effect of the focal loss increases as γ increases, but the effect on the foreground is stronger than on the background. Note that the VoxelNet background cumulative distribution (d) is in the range of [0.998, 1].

TABLE II
EVALUATION RESULTS ON THE KITTI VALIDATION DATASET FOR THE LAST MODELS OF 3D-FCN
(columns: γ; Bird's Eye View AP (%) and 3D Detection AP (%), each for Easy/Mod/Hard)

to make the comparison in Table IV. It shows that the focal loss helps improve accuracy in all metrics with a proper γ. The performance losses of γ = 0.2 in Table III might be caused by training randomness and model degradation from redundant training. Table II, Table III and Table IV show that the focal loss in 3D object detection provides better than or comparable results to the BCE loss. Therefore, the focal loss works in 3D object detection and helps improve accuracy in a wide range of γ (normally γ ≤ 2).

B. Analysis of Focal Loss in 3D Detectors

We analyze the empirical cumulative distributions of the loss from the converged 3D-FCN and VoxelNet models as in [2]. We apply the two converged models trained with the focal loss

TABLE III
EVALUATION RESULTS ON THE KITTI VALIDATION DATASET FOR THE LAST MODELS OF VOXELNET
(columns: γ; Bird's Eye View AP (%) and 3D Detection AP (%), each for Easy/Mod/Hard)

TABLE IV
EVALUATION RESULTS ON THE KITTI VALIDATION DATASET FOR THE BEST MODELS
(columns: Detector, γ, lr, Step; Bird's Eye View AP (%) and 3D Detection AP (%), each for Easy/Mod/Hard)

3D-FCN     0     1e-2   126k
3D-FCN     2     1e-2   137k
VoxelNet   0     1e-4   134k
VoxelNet   0.2   1e-4   215k

Note that all cases in Table IV are the evaluation results of the best models selected among all intermediate weights. Thus the accuracy improvement is from the focal loss instead of longer training steps.
(row 2 and row 4 in Table IV) on the validation dataset and sample the predicted probabilities for 10^7 negative windows and

Fig. 5. Posterior probability histograms of 3D-FCN (a) and VoxelNet (b). As γ increases, the peak decreases and moves towards lower values in both 3D-FCN and VoxelNet.

10^5 positive windows. Then, we calculate the focal loss with these probability data. The calculated focal loss is normalized such that it sums to one and is sorted from low to high. We plot the cumulative distributions for 3D-FCN and VoxelNet for different γ in Figure 4.

In 3D-FCN, approximately 15% of the hardest positive samples account for roughly half of the positive loss. As γ increases, more of the loss gets concentrated in the top 15% of examples. However, compared to the effect of the focal loss on negative samples, its effect on the positive samples is minor. For γ = 0, the positive and negative CDFs are quite similar. As γ increases, more weight becomes concentrated on the hard negative examples. With γ = 2 (the best result for 3D-FCN), the vast majority of the loss comes from a small fraction of samples. As claimed in [2], the focal loss can effectively discount the effect of easy negatives, so that the network focuses on learning the hard negative examples.

In VoxelNet, the condition is different. From (c) and (d) in Figure 4, we can see that the effect of the focal loss increases in both the positive and negative samples as γ increases. However, the cumulative distribution functions for the negative samples are quite similar among different values of γ, even though we adjust the x-axis to [0.998, 1]. This shows that VoxelNet trained with the BCE loss is already able to handle hard negative samples. Compared with the results on the negative samples, the effects of focal loss on the positive samples are stronger. Therefore, the accuracy gains of the focal loss in VoxelNet come mainly from the hard positive samples. From the analysis of cumulative distributions, we believe that the focal loss in 3D object detection helps networks alleviate hard sample gradient salience in the training process.

C. Focal Loss Decreases the Posterior Probabilities

When undertaking the experiments, we found that networks trained with the focal loss should be set with a lower threshold for non-maximum suppression. This inspired us to explore the influence of the focal loss on the output posterior probabilities. We take the models in Table II and Table III, and evaluate them on the validation set. We record all the evaluation results and plot the probability histogram for positive bounding boxes. The results are shown in Figure 5. As γ increases, the peak decreases and moves towards lower values. This demonstrates that networks trained with the focal loss output positive estimations with lower posterior probabilities. A probable explanation is that objects with high posterior probabilities are easily classified, and the loss they contribute is down-weighted in the training process due to the focal loss. In other words, they will be relatively ignored in the training process if they are estimated with high posterior probabilities, so that their posterior probabilities cannot be further improved. However, they can still be accurately classified if we decrease the non-maximum suppression threshold in the final output step.

VII. CONCLUSION

In this paper, we extended the focal loss of image detectors to 3D object detection to solve the fore-background imbalance. We conducted experiments on two different types of 3D object detectors to demonstrate the performance of the focal loss in point-cloud-based object detection. The experimental results show that the focal loss helps improve accuracy in 3D object detection, and that it protects the network from fore-background imbalance and alleviates hard sample gradient salience for both positive and negative anchors in the training process. The posterior probability histograms show that networks trained with the focal loss output positive estimations with lower posterior probabilities.

REFERENCES

[1] Z. Wang, Y. Liu, Q. Liao, H. Ye, M. Liu, and L. Wang, "Characterization of a RS-LiDAR for 3D perception," in IEEE International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), July.
[2] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1.
[3] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum pointnets for 3d object detection from rgb-d data," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018.
[4] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, "Joint 3d proposal generation and object detection from view aggregation," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018, pp. 1-8.

[5] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. [6] B. Li, "3D fully convolutional network for vehicle detection in point cloud," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sep. 2017, pp. [7] Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3D object detection," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, pp. [8] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. [9] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. [10] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. [13] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. [14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. [15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Computer Vision - ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. [16] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp., June.

APPENDIX

A. Improvement of the Enhanced Loss Function for 3D-FCN

We demonstrate the improvement that adopting the loss function from VoxelNet [7] (normalization, new hyperparameters, BCE loss) provides over the original loss function [6] for 3D-FCN. We set α = 1, β = 5, η = 10 in the enhanced 3D-FCN so that ε_pos and ε_neg, as well as ε_P and ε_R, are of the same order of magnitude. We set η = 0.1 in the original 3D-FCN so that ε_P and ε_R are of the same order of magnitude. γ is set to 0, so the BCE loss is used. We train both cases from scratch for 30 epochs. The threshold for non-maximum suppression is set as. η is 100 times larger in the enhanced 3D-FCN because the enhanced loss is normalized and N_neg is much greater than N_pos. We compare the final models in Table V, which shows the improvement of the enhanced loss function.

B. Our 3D-FCN Implementation Details

The network details of 3D-FCN are shown in Table VI. Each block in the BodyNet includes a 3D convolution layer, a ReLU layer and a batch normalization layer, applied sequentially. In the HeadNet, each block is an individual 3D convolution layer. In the training phase, we create the ground truth for the P-Map by setting the object-voxel which contains an object center to 1. For the regression map, we create the ground truth by assigning each object-voxel a 24-length residual vector, which holds the coordinates of the eight points of the bounding box in a fixed order. The result of the 3D-FCN baseline implemented by us is shown in the first row of Table IV.

C. Our VoxelNet Implementation Details

The network details of VoxelNet are shown in Table VII. The FC block in VoxelNet consists of a fully connected layer, a batch normalization layer and a ReLU layer, applied sequentially.
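As a rough illustration of the FC block just described (fully connected layer, then batch normalization, then ReLU), the following NumPy sketch computes one forward pass; the function name, parameter names and shapes are ours for illustration, not taken from the VoxelNet code:

```python
import numpy as np

def fc_bn_relu(x, W, b, gamma, beta, eps=1e-5):
    """One FC block in the VoxelNet style: linear -> batch norm -> ReLU.

    x: (N, d_in) batch of features; W: (d_in, d_out); b, gamma, beta: (d_out,).
    Batch statistics are computed over the batch axis, as in training mode.
    """
    h = x @ W + b                      # fully connected layer
    mu = h.mean(axis=0)                # per-feature batch mean
    var = h.var(axis=0)                # per-feature batch variance
    h_norm = (h - mu) / np.sqrt(var + eps)
    h_bn = gamma * h_norm + beta       # learnable scale and shift
    return np.maximum(h_bn, 0.0)       # ReLU
```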
Each block in the MiddleLayer includes a 3D convolution layer, a ReLU layer and a batch normalization layer. The block in the RPN consists of a 2D convolution layer, a ReLU layer and a batch normalization layer. The P-Map and R-Map modules are each an individual 2D convolution layer. We adopt the original parameterization method and residual vector for regression from VoxelNet [7]. The result of our VoxelNet baseline is shown in the third row of Table IV.

TABLE V
THE IMPROVEMENT OF THE ENHANCED LOSS FUNCTION FOR 3D-FCN

Detector  | Bird's Eye View AP (%)  | 3D Detection AP (%)
          | Easy   Mod    Hard      | Easy   Mod    Hard
Original  |
Enhanced  |

TABLE VI
OUR IMPLEMENTATION DETAILS OF 3D-FCN

Block Name  | Layer Name  | Kernel Size | Strides  | Filter | GFLOPs
Body        | conv3d 1    | [5,5,5]     | [2,2,2]  |        |
Body        | conv3d 2    | [5,5,5]     | [2,2,2]  |        |
Body        | conv3d 3    | [3,3,3]     | [2,2,2]  |        |
Body        | conv3d 4    | [3,3,3]     | [1,1,1]  |        |
Head-PMap   | conv3d obj  | [3,3,3]     | [1,1,1]  |        |
Head-RMap   | conv3d cor  | [3,3,3]     | [1,1,1]  |        |

TABLE VII
OUR IMPLEMENTATION DETAILS OF VOXELNET

Block Name  | Layer Name | Kernel Size / Output Unit | Strides  | Filter | GFLOPs
FeatureNet  | vfe        | 32                        | N/A      | N/A    | <0.1
FeatureNet  | vfe        | 128                       | N/A      | N/A    | <0.1
FeatureNet  | fc         | 128                       | N/A      | N/A    | <0.1
MiddleLayer | conv3d     | [3,3,3]                   | [2,1,1]  |        |
MiddleLayer | conv3d     | [3,3,3]                   | [1,1,1]  |        |
MiddleLayer | conv3d     | [3,3,3]                   | [2,1,1]  |        |
RPN         | reshape    | N/A                       | N/A      | N/A    | /
RPN         | conv2d     | [3,3]                     | [2,2]    |        |
RPN         | conv2d 3   | [3,3]                     | [1,1]    |        |
RPN         | deconv     | [3,3]                     | [1,1]    |        |
RPN         | conv2d     | [3,3]                     | [2,2]    |        |
RPN         | conv2d 5   | [3,3]                     | [1,1]    |        |
RPN         | deconv     | [2,2]                     | [2,2]    |        |
RPN         | conv2d     | [3,3]                     | [2,2]    |        |
RPN         | conv2d 5   | [3,3]                     | [1,1]    |        |
RPN         | deconv     | [4,4]                     | [4,4]    |        |
Prob-Map    | conv2d     | [1,1]                     | [1,1]    |        |
Reg-Map     | conv2d     | [1,1]                     | [1,1]    |        |
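The role of γ discussed in Appendix A, where setting γ = 0 recovers the BCE loss, can be sketched as a generic binary focal loss in the style of Lin et al. [2]; the function name and the α, γ defaults below are illustrative and not necessarily the configuration used in our experiments:

```python
import numpy as np

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss as a modulated binary cross entropy.

    p: predicted foreground probability in (0, 1); y: label in {0, 1}.
    With gamma = 0 this reduces to alpha-weighted BCE; larger gamma
    down-weights well-classified examples, which is the hard-mining effect.
    """
    p = np.clip(p, eps, 1.0 - eps)             # avoid log(0)
    p_t = np.where(y == 1, p, 1.0 - p)         # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

Because (1 - p_t)^γ shrinks toward 0 for confident predictions, the abundant, easily classified background voxels contribute little to the total loss, which is why the focal loss helps with the fore-background imbalance described in the paper.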


More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Accounting for the Use of Different Length Scale Factors in x, y and z Directions

Accounting for the Use of Different Length Scale Factors in x, y and z Directions 1 Accountng for the Use of Dfferent Length Scale Factors n x, y and z Drectons Taha Soch (taha.soch@kcl.ac.uk) Imagng Scences & Bomedcal Engneerng, Kng s College London, The Rayne Insttute, St Thomas Hosptal,

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

A high precision collaborative vision measurement of gear chamfering profile

A high precision collaborative vision measurement of gear chamfering profile Internatonal Conference on Advances n Mechancal Engneerng and Industral Informatcs (AMEII 05) A hgh precson collaboratve vson measurement of gear chamferng profle Conglng Zhou, a, Zengpu Xu, b, Chunmng

More information

ALEXNET FEATURE EXTRACTION AND MULTI-KERNEL LEARNING FOR OBJECT- ORIENTED CLASSIFICATION

ALEXNET FEATURE EXTRACTION AND MULTI-KERNEL LEARNING FOR OBJECT- ORIENTED CLASSIFICATION ALEXNET FEATURE EXTRACTION AND MULTI-KERNEL LEARNING FOR OBJECT- ORIENTED CLASSIFICATION Lng Dng 1, Hongy L 2, *, Changmao Hu 2, We Zhang 2, Shumn Wang 1 1 Insttute of Earthquake Forecastng, Chna Earthquake

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

High resolution 3D Tau-p transform by matching pursuit Weiping Cao* and Warren S. Ross, Shearwater GeoServices

High resolution 3D Tau-p transform by matching pursuit Weiping Cao* and Warren S. Ross, Shearwater GeoServices Hgh resoluton 3D Tau-p transform by matchng pursut Wepng Cao* and Warren S. Ross, Shearwater GeoServces Summary The 3D Tau-p transform s of vtal sgnfcance for processng sesmc data acqured wth modern wde

More information

WIRELESS CAPSULE ENDOSCOPY IMAGE CLASSIFICATION BASED ON VECTOR SPARSE CODING.

WIRELESS CAPSULE ENDOSCOPY IMAGE CLASSIFICATION BASED ON VECTOR SPARSE CODING. WIRELESS CAPSULE ENDOSCOPY IMAGE CLASSIFICATION BASED ON VECTOR SPARSE CODING Tao Ma 1, Yuexan Zou 1 *, Zhqang Xang 1, Le L 1 and Y L 1 ADSPLAB/ELIP, School of ECE, Pekng Unversty, Shenzhen 518055, Chna

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Novel Fuzzy logic Based Edge Detection Technique

Novel Fuzzy logic Based Edge Detection Technique Novel Fuzzy logc Based Edge Detecton Technque Aborsade, D.O Department of Electroncs Engneerng, adoke Akntola Unversty of Tech., Ogbomoso. Oyo-state. doaborsade@yahoo.com Abstract Ths paper s based on

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm Recommended Items Ratng Predcton based on RBF Neural Network Optmzed by PSO Algorthm Chengfang Tan, Cayn Wang, Yuln L and Xx Q Abstract In order to mtgate the data sparsty and cold-start problems of recommendaton

More information

Large-scale Web Video Event Classification by use of Fisher Vectors

Large-scale Web Video Event Classification by use of Fisher Vectors Large-scale Web Vdeo Event Classfcaton by use of Fsher Vectors Chen Sun and Ram Nevata Unversty of Southern Calforna, Insttute for Robotcs and Intellgent Systems Los Angeles, CA 90089, USA {chensun nevata}@usc.org

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

An Improved Image Segmentation Algorithm Based on the Otsu Method

An Improved Image Segmentation Algorithm Based on the Otsu Method 3th ACIS Internatonal Conference on Software Engneerng, Artfcal Intellgence, Networkng arallel/dstrbuted Computng An Improved Image Segmentaton Algorthm Based on the Otsu Method Mengxng Huang, enjao Yu,

More information

A Computer Vision System for Automated Container Code Recognition

A Computer Vision System for Automated Container Code Recognition A Computer Vson System for Automated Contaner Code Recognton Hsn-Chen Chen, Chh-Ka Chen, Fu-Yu Hsu, Yu-San Ln, Yu-Te Wu, Yung-Nen Sun * Abstract Contaner code examnaton s an essental step n the contaner

More information