Deep Spatial-Temporal Joint Feature Representation for Video Object Detection

sensors
Article

Deep Spatial-Temporal Joint Feature Representation for Video Object Detection

Baojun Zhao 1,2, Boya Zhao 1,2, Linbo Tang 1,2,*, Yuqi Han 1,2 and Wenzheng Wang 1,2

1 School of Information and Electronics, Beijing Institute of Technology, Beijing, China; zbj@bit.edu.cn (B.Z.); zhaoboya@bit.edu.cn (B.Z.); yuqi_han@bit.edu.cn (Y.H.); wwz@bit.edu.cn (W.W.)
2 Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing, China
* Correspondence: tanglinbo@bit.edu.cn

Received: 2 February 2018; Accepted: 27 February 2018; Published: 4 March 2018

Abstract: With the development of deep neural networks, many object detection frameworks have shown great success in the fields of smart surveillance, self-driving cars, and facial recognition. However, the data sources are usually videos, while most object detection frameworks are established on still images and use only spatial information, which means that feature consistency cannot be ensured because the training procedure loses temporal information. To address these problems, we propose a single, fully-convolutional neural network-based object detection framework that involves temporal information by using Siamese networks. In the training procedure, first, the prediction network combines multiscale feature maps to handle objects of various sizes. Second, we introduce a correlation loss by using the Siamese network, which provides neighboring frame features; this correlation loss represents object co-occurrences across time to aid consistent feature generation. Since the correlation loss requires track ID and detection label information, our video object detection network is evaluated on the large-scale ImageNet VID dataset, which provides such annotations; it achieves a 69.5% mean average precision (mAP).

Keywords: deep neural network; video object detection; temporal information; Siamese network; multiscale feature representation

1. Introduction

Object detection in images has received much attention in recent years, with tremendous progress mostly due to the emergence of deep neural networks, especially deep convolutional neural networks [1-4] and their region-based descendants [5-11]. These methods achieve excellent results on still-image datasets such as the Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes (PASCAL VOC) and Microsoft Common Objects in Context (COCO) datasets. With this success, computer vision tasks have been extended from the still-image domain to the video domain because, in reality, the data sources of practical applications such as smart surveillance, self-driving, and face recognition are mostly videos. The additional challenges are [12]: (1) motion blur, due to rapid camera or object movement; (2) quality, since internet video clips are of lower quality than still images, even at the same resolution; (3) partial occlusion, due to position changes of the camera or the object; and (4) pose, since unconventional object-to-camera poses frequently appear in video clips. To overcome this gap, most video object detection methods [13-16] use exhaustive post-processing on top of still-image detectors. For example, T-CNN [13] applies the two-stage Faster RCNN [6] detection framework to individual video frames; context suppression and tracking are then applied to the detection results. Since they do not actually involve temporal information, those video object detection methods do not achieve favourable results on video sources.

Sensors 2018, 18, 774

In this paper, to introduce time-series information into the network, we propose a novel training framework based on a Siamese network [17,18] that is fully-convolutional with respect to adjacent frames. By introducing the Siamese network, we can effectively maintain feature consistency by computing feature similarity across adjacent frames, which in turn maintains feature consistency in the detection process. In the training stream, the input is a pair of neighboring frames. Then, a backbone network and a set of fully-convolutional layers are cascaded to generate a multiscale feature representation for detection; in this way, we can handle multiscale objects. Afterwards, we design a novel loss function based on the Siamese network to maintain feature consistency. The correlation loss is a constraint between neighboring frames, calculated by computing the similarity of object features with the same label and track ID. In the detection stream, which shares the multiscale feature representation, the detection results are optimized by Soft-NMS [19]. Our network is evaluated on the large-scale ImageNet VID validation dataset, where it yields a 1.6% improvement over the baseline Single Shot Multibox Detector. Moreover, to demonstrate the effectiveness of the proposed framework, we also conduct an experiment on the YouTube Objects dataset, where the proposed framework obtains a 4.4% gain over the baseline.

2. Related Works

In this section, we review still-image object detection, video object detection, video action detection, and 3D shape classification.

2.1. Still Image Object Detection

There are two classes of detection methods: one is based on region proposal classification, and the other is based on sliding windows. Before the large improvement in computational resources, the best of these two approaches were the Deformable Part Model (DPM) [20] and Selective Search (SS) [21]. After the dramatic improvement shown by R-CNN [8], region proposal object detection methods became prevalent. Among one-stage detection frameworks based on sliding windows, You Only Look Once (YOLO) [22] and the Single Shot Multibox Detector (SSD) [10] became renowned. These two detection methods have comparable results, and there are added benefits to using a one-stage architecture: the processing speed of a one-stage method is faster than that of a region proposal method. For example, SSD300 reaches 46 fps, whereas Faster RCNN [6] runs at 7 fps.

2.2. Video Object Detection

Recently, ImageNet [23] introduced a new challenge for object detection from video clips, the well-known ImageNet VID, which brings the object detection task into the video domain. In this competition, nearly all detection methods involve time-series information through post-processing applied after detection on individual frames. T-CNN [13] incorporates optical flow information to fix neighboring frame results. MCMOT [16] approaches post-processing as a multi-object tracking problem by applying a series of hand-crafted rules (e.g., confidence thresholds and changing point detection). Seq-NMS [14] regards post-processing as a confidence re-scoring problem: the boxes of a video sequence are re-scored to the average confidence. Unfortunately, these methods incorporate temporal information only by post-processing, and detection in a video becomes a multi-stage pipeline; the temporal information is not truly involved in the algorithm.

2.3. Video Action Detection

Different from video object detection, video action detection aims to detect every occurrence of a given action within a long video and to localize each detection in both space and time. How to use temporal information effectively is a problem common to video object detection and video action detection.
Finding Action Tubes [24] extracts frame-level action proposals using selective search and links them using the Viterbi algorithm. Multi-region RCNN [25] applies two-stream

R-CNNs for action detection, where a spatial Region Proposal Network (RPN) and a motion RPN are used to generate frame-level action proposals. Tube CNN [26] proposes an end-to-end Tube Convolutional Neural Network for action detection, which exploits a 3D convolutional network to extract spatial-temporal features.

2.4. 3D Shape Classification

3D data carry three-dimensional information, consisting of 2D information and depth information. A video clip is similar to 3D data in that it also has three dimensions (2D information and time-series information). The processing of 3D data is mainly focused on shape classification. The problem common to video object detection and 3D shape classification is how to use the third-dimensional information (depth or time-series information). LWU [27] leverages stochastic gradient Markov Chain Monte Carlo (SG-MCMC) and Bayesian interpretations to learn weight uncertainty and improve robustness in deep neural networks. GM3D [28] extracts geodesic moments as shape features and uses a two-layer stacked sparse auto-encoder to digest these features and predict the shape category.

3. Method

Figure 1 illustrates the entire pipeline of the proposed method, which is introduced in two parts: the training stream and the testing stream. First, to give the training procedure a better starting point, the backbone of the proposed fully-convolutional neural network is VGG-16, pre-trained on the ImageNet CLS-LOC dataset. Second, by feeding forward a set of fully-convolutional layers cascaded after the backbone network, a multiscale feature representation is generated to handle multiscale objects. Third, a set of different anchor shapes is generated on the multiscale feature maps to adapt to objects with different scales and aspect ratios. Fourth, considering the predictions for each anchor and the intersection over union (IOU) of the anchor and the ground truth, the detection loss is formed. Finally, the above procedures are replicated twice to establish a Siamese network; by measuring similarity with the neighboring frame feature, the correlation loss, which consists of a center-value loss and an anchor coordinate loss, is computed. The testing procedure shares the backbone, multiscale feature representation, anchor generation, and prediction, and the testing results are computed after Soft-NMS optimization.

3.1. Network Architecture

The following section details the proposed network, which contains the backbone network, multiscale feature representation, anchor generation, anchor prediction, and training sample selection. The network architecture describes the data flow in the feed-forward procedure.

3.1.1. Backbone Network

In this paper, the backbone network is VGG-16 [3], pre-trained on the ImageNet CLS-LOC dataset [23]. As shown in Figure 2, the original VGG-16 is a deep CNN that includes 13 convolutional layers and three fully-connected layers. The convolutional layers generate deep features, which are then fed into the fully-connected layers. Similar to SSD, we remove the fully-connected layers and only use the convolutional layers to generate feature maps. The following details the entire backbone network:

(1) Input: 300 x 300 images with RGB channels.
(2) Convolutional layers: The convolutional layers mainly contain five groups: conv1, conv2, conv3, conv4, and conv5. Conv1 includes two convolutional layers with 3 x 3 x 64 kernels. Conv2 includes two convolutional layers with 3 x 3 x 128 kernels. Similar to conv1 and conv2, conv3, conv4, and conv5 include three convolutional layers each, with 256, 512, and 512 kernels, respectively.

(3) Activation and pooling: The activation function is rectified linear units (ReLU) [29], and the kernel size of the max pooling layers is 2 x 2.

Figure 1. The architecture of the proposed method. The orange part is the training procedure on neighboring frames (frame t and frame t + 1). The green part is the testing procedure of frame n.

Figure 2. The architecture of VGG16, showing the convolutional and pooling layers. Along the feed-forward procedure, the size of the feature map decreases. s and p refer to the stride and padding size, respectively.
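As a concrete reference, the following is a minimal PyTorch sketch of the truncated VGG-16 backbone described above (13 convolutional layers in five groups, ReLU activations, 2 x 2 max pooling, fully-connected layers removed). The module structure and helper name are our own illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

def vgg16_backbone():
    """Truncated VGG-16: 13 conv layers in five groups (64, 128, 256, 512, 512
    channels), ReLU activations, 2 x 2 max pooling, FC layers removed.
    ceil_mode keeps the odd map sizes (75 -> 38) that SSD-style heads expect."""
    cfg = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
    layers, in_ch = [], 3
    for n_convs, out_ch in cfg:
        for _ in range(n_convs):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True))
    return nn.Sequential(*layers)

net = vgg16_backbone()
x = torch.randn(1, 3, 300, 300)   # 300 x 300 RGB input
print(net(x).shape)               # torch.Size([1, 512, 10, 10])
# conv4_3 (the 38 x 38 detection map) is the activation just before the
# fourth pooling layer; SSD additionally replaces pool5 with a 3x3/s1 pool.
```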

3.1.2. Multiscale Feature Representation

After the backbone network, the multiscale feature representation network is cascaded, generated by feed-forward convolutional networks [4,30]. According to SSD [10], conv4_3 has a different feature scale compared to the other layers in the backbone network. Therefore, the multiscale feature representation starts from conv4_3 after applying an L2 normalization that scales the feature norm at each location in the feature map to 20; this scale is learned during back propagation.

Considering that the size of the input image is initialized to 300 x 300, conv4_3 has a 38 x 38 feature map size, and conv4_3 is the first feature scale map. The multiscale feature representation is generated after conv4_3 by applying feed-forward convolutions. Traditionally, a low-resolution feature map can be generated from a high-resolution feature map by a single convolutional layer, but the computational costs are high. To reduce computational costs, each scale block contains two convolutional layers. The first convolutional kernel size is 1 x 1 x M x N, where M is the upper layer's channel number and N is the current channel number; by using a 1 x 1 x M x N kernel, the intermediate feature map channel count is decreased from M to N. The second convolution kernel size is 3 x 3 x N x K, where K is the current scale-feature channel number. Table 1 shows the details of the multiscale feature representation. There are six scale feature maps with feature map sizes of 38 x 38, 19 x 19, 10 x 10, 5 x 5, 3 x 3, and 1 x 1, where 38 x 38 is conv4_3 in VGG16. In the options column, s is the convolutional kernel stride, p is the padding size, and dilation denotes dilated convolution; in the dilation option, the stride is 1, the padding size is 6, and the convolution uses a 3 x 3 kernel with dilation 6.

Table 1. Details of the multiscale feature representation.

Stage | Conv Kernel Size | Feature Map Size | Usage | Options
Conv4_3 | (backbone) | 38 x 38 | Detection for scale 1 | s-1, p-1
Conv6_1 | 3 x 3 x 512 x 1024 | 19 x 19 | Enlarge receptive field | dilation
Conv6_2 | 1 x 1 x 1024 x 1024 | 19 x 19 | Detection for scale 2 | s-1, p-0
Conv7_1 | 1 x 1 x 1024 x 256 | 19 x 19 | Reduce channels | s-1, p-0
Conv7_2 | 3 x 3 x 256 x 512 | 10 x 10 | Detection for scale 3 | s-2, p-1
Conv8_1 | 1 x 1 x 512 x 128 | 10 x 10 | Reduce channels | s-1, p-0
Conv8_2 | 3 x 3 x 128 x 256 | 5 x 5 | Detection for scale 4 | s-2, p-1
Conv9_1 | 1 x 1 x 256 x 128 | 5 x 5 | Reduce channels | s-1, p-0
Conv9_2 | 3 x 3 x 128 x 256 | 3 x 3 | Detection for scale 5 | s-1, p-0
Conv10_1 | 1 x 1 x 256 x 128 | 3 x 3 | Reduce channels | s-1, p-0
Conv10_2 | 3 x 3 x 128 x 256 | 1 x 1 | Detection for scale 6 | s-1, p-0
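A sketch of the cascaded scale blocks in Table 1, assuming the channel widths reconstructed there (the 1 x 1 reduction followed by a 3 x 3 convolution per block); the comments map each block to its table rows, and the function name is illustrative.

```python
import torch.nn as nn

def extra_layers():
    """Scale blocks cascaded after the backbone, following Table 1."""
    return nn.ModuleList([
        nn.Sequential(  # conv6_1 (dilated) + conv6_2: 19 x 19
            nn.Conv2d(512, 1024, 3, padding=6, dilation=6), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, 1), nn.ReLU(inplace=True)),
        nn.Sequential(  # conv7_1 + conv7_2: 19 -> 10
            nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
        nn.Sequential(  # conv8_1 + conv8_2: 10 -> 5
            nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
        nn.Sequential(  # conv9_1 + conv9_2: 5 -> 3 (s1, p0)
            nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3), nn.ReLU(inplace=True)),
        nn.Sequential(  # conv10_1 + conv10_2: 3 -> 1 (s1, p0)
            nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3), nn.ReLU(inplace=True)),
    ])
```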

3.1.3. Anchor Generation

The anchors in the proposed network play two roles. The first role is in the data selection procedure: the IOU of the ground truth and an anchor decides whether this anchor is a positive sample [31]. Traditionally, if the IOU is more than 50%, the anchor is a positive sample. The second role is in the training and testing procedures. In the training procedure, the classification and location losses are computed over anchors. In the testing procedure, bounding box results are computed by the anchor location layers, and the detection results are obtained after non-maximum suppression of the bounding box results. Figure 3 shows the flow of anchor generation.

Traditional object detection methods suggest processing an image at different scales and then combining the results. Since we have a multiscale feature representation, we can utilize these feature maps with different scales in a single network to achieve a similar effect to the traditional image processing method. Similar to Faster RCNN and SSD, anchors are generated on the feature maps by dense sampling. First, to handle objects of different sizes, objects of different scales are detected on different feature maps; therefore, we allocate different scales to different feature maps. In a feature map with low resolution, the anchor scale is large, and in a feature map with high resolution, the anchor scale is small. Second, for a certain scale, the anchors should have different aspect ratios, and the design of these aspect ratios is decided by the objects' aspect ratios. Suppose that we have m feature maps and detection is applied to these m feature maps; the anchor scale of the nth feature map is computed as follows:

$$s_n = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(n - 1), \quad n \in [1, m] \tag{1}$$

In the formulation, $s_{\min}$ is the minimum scale of objects to be detected, and $s_{\max}$ is the maximum scale of objects to be detected; n indexes the nth square feature map, and $s_n$ is the scale of a certain feature map. According to the scales, the anchor area is obtained as $area = s_n^2$. For the aspects of the anchors, different aspect ratios are set according to the objects' aspect ratios. Traditionally, the aspect ratios contain $a_r = [1, 2, 3, 1/2, 1/3]$. Furthermore, for scale consistency, a scale of $s'_n = \sqrt{s_n s_{n+1}}$ with an aspect ratio of 1 is also considered. The width and height are computed as follows:

$$\begin{cases} anchor\_w = s_n \sqrt{a_r} \\ anchor\_h = s_n / \sqrt{a_r} \end{cases} \tag{2}$$

Finally, there are six anchors for each pixel in the feature maps.

In the proposed model, we allocate different scales from 0.1 to 0.95, in which 0.95 indicates that the object occupies the entire image and 0.1 indicates the down-sampling rate on Conv4_3. Moreover, we set the starting scale of Conv6_2 to 0.2. The aspect ratios contain $a_r = [1, 2, 3, 1/2, 1/3]$, plus an aspect ratio of 1 for the scale of $\sqrt{s_n s_{n+1}}$.

Table 2 shows the anchor details, with the height, width, and number for each feature map described in Section 3.1.2. In the table, for clarity, the channel number of each feature map is hidden. To include more objects with different aspect ratios and scales, there are six anchor shapes for each pixel on the feature map.
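The anchor bookkeeping of Equations (1) and (2) is small enough to sketch directly; `anchor_scales` and `anchor_shapes` are hypothetical helper names.

```python
import math

def anchor_scales(s_min=0.2, s_max=0.95, m=5):
    """Eq. (1): linearly spaced scales for the m feature maps used for detection."""
    return [s_min + (s_max - s_min) / (m - 1) * (n - 1) for n in range(1, m + 1)]

def anchor_shapes(s_n, s_next, aspect_ratios=(1.0, 2.0, 3.0, 1/2, 1/3)):
    """Eq. (2): (width, height) pairs for one pixel; the sixth shape uses the
    intermediate scale sqrt(s_n * s_{n+1}) with aspect ratio 1."""
    shapes = [(s_n * math.sqrt(ar), s_n / math.sqrt(ar)) for ar in aspect_ratios]
    shapes.append((math.sqrt(s_n * s_next),) * 2)
    return shapes

# Example: the scale-0.2 map (Conv6_2) reproduces the widths in Figure 3,
# e.g. 0.346 (ar = 3), 0.283 (ar = 2), 0.141 (ar = 1/2), 0.115 (ar = 1/3),
# and the extra square shape (about 0.28 here; Figure 3 lists 0.27).
for w, h in anchor_shapes(0.2, anchor_scales()[1]):
    print(round(w, 3), round(h, 3))
```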
Figure 3. Anchor generation flow. For each pixel on the feature map, six anchor shapes are generated that share the same center, with scales of 0.2 and 0.27 (anchor 1: scale 0.2, aspect ratio 3: 0.346; anchor 2: scale 0.2, aspect ratio 2: 0.283; anchor 3: scale 0.2, aspect ratio 1: 0.2 x 0.2; anchor 4: scale 0.2, aspect ratio 0.5: 0.141; anchor 5: scale 0.2, aspect ratio 0.33: 0.115; anchor 6: scale 0.27, aspect ratio 1: 0.27 x 0.27). The center of each anchor is the pixel center; in the figure, the center is (0.5, 0.5).

Table 2. Anchor details in the multiscale feature representation. Each pixel carries six anchor shapes whose heights and widths follow Equation (2) for the given scale.

Feature Map | Feature Map Size | Scale | Number of Anchors
Conv4_3 | 38 x 38 | 0.1 | 8664
Conv6_2 | 19 x 19 | 0.2 | 2166
Conv7_2 | 10 x 10 | 0.39 | 600
Conv8_2 | 5 x 5 | 0.58 | 150
Conv9_2 | 3 x 3 | 0.76 | 54
Conv10_2 | 1 x 1 | 0.95 | 6

3.1.4. Anchor Prediction

After associating a set of anchors with each feature map, anchor prediction is the next key procedure in the algorithm. At each anchor, we predict the offsets relative to the anchor shape and the per-class scores that indicate the presence of a class instance in this anchor. Suppose that the number of classes we want to predict is n. For each anchor, the output size is (n + 1) + 4: (n + 1) covers the classes plus background, and 4 is the bounding box offset for this anchor. Figure 4 shows the prediction procedure for an anchor. Referring to the multiscale feature representation in Section 3.1.2, Table 3 shows the detailed prediction kernels. The confidence output at each anchor position is (n + 1) x 6 dimensions, with (n + 1) classes and six shapes. The bounding box output at each anchor is (n + 1) x 4 dimensions, with (n + 1) classes and (bx, by, bw, bh).

Figure 4. Confidence and bounding box prediction from an anchor. The red point is the center of the example anchor. There are two kernels for this anchor: the red kernel is the confidence kernel, and the blue kernel is the bounding box prediction kernel.

Table 3. Details of the prediction kernels.

Feature | Feature Map Size | Confidence Kernel | Location Kernel
Conv4_3 | 38 x 38 | 3 x 3 x (n + 1) x 6 | 3 x 3 x (n + 1) x 4 x 6
Conv6_2 | 19 x 19 | 3 x 3 x (n + 1) x 6 | 3 x 3 x (n + 1) x 4 x 6
Conv7_2 | 10 x 10 | 3 x 3 x (n + 1) x 6 | 3 x 3 x (n + 1) x 4 x 6
Conv8_2 | 5 x 5 | 3 x 3 x (n + 1) x 6 | 3 x 3 x (n + 1) x 4 x 6
Conv9_2 | 3 x 3 | 3 x 3 x (n + 1) x 6 | 3 x 3 x (n + 1) x 4 x 6
Conv10_2 | 1 x 1 | 3 x 3 x (n + 1) x 6 | 3 x 3 x (n + 1) x 4 x 6
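A sketch of the per-map prediction convolutions of Table 3. We follow the paper's statement that each anchor outputs (n + 1) class scores and (n + 1) x 4 class-specific offsets (standard SSD instead predicts four class-agnostic offsets); the source channel counts are taken from Table 1, and the function name is illustrative.

```python
import torch.nn as nn

def prediction_heads(source_channels=(512, 1024, 512, 256, 256, 256),
                     num_classes=30, shapes_per_pixel=6):
    """3x3 conv heads over the six source maps (conv4_3 ... conv10_2).
    Per pixel: shapes * (n+1) confidence channels and shapes * (n+1) * 4
    location channels, matching Table 3."""
    n_out = num_classes + 1  # classes + background
    conf, loc = nn.ModuleList(), nn.ModuleList()
    for c in source_channels:
        conf.append(nn.Conv2d(c, shapes_per_pixel * n_out, 3, padding=1))
        loc.append(nn.Conv2d(c, shapes_per_pixel * n_out * 4, 3, padding=1))
    return conf, loc
```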

3.1.5. Training Sample Selection

In the training procedure, training samples are generated from the anchors by a matching strategy between anchors and ground truths. The strategy begins by matching each anchor to a ground truth. Different from Multibox, positive samples are all anchors with a Jaccard overlap higher than a threshold; the others are negative samples. This strategy simplifies the training sample selection problem and provides the network with more positive samples. Moreover, to improve performance, the network also incorporates hard example mining and data augmentation.

1. Hard Example Mining

Since negative samples are selected from the background, the number of negative samples is much larger than the number of positive samples. This difference causes a significant imbalance between positive and negative training samples. To overcome this difficulty, following OHEM [32], we sort the samples by their prediction scores. By selecting the top scores, we restrict the ratio between negative and positive training samples to 3:1. In this way, we avoid training mainly on negative samples.

2. Data Augmentation

To improve the generalization of the network, we also apply data augmentation. Similar to SSD, each training image is augmented by one of the following options:

- Use the entire original input image;
- Sample a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9;
- Randomly sample a patch.

After the aforementioned sampling step, each augmented image is resized to a fixed size and is horizontally flipped with a probability of 0.5.

3.2. Loss Function

In the proposed network, the loss function mainly contains two parts: the detection loss and the correlation loss. The detection loss relates to object classification and object bounding box regression. The correlation loss is the neighboring objects' feature correlation, which involves time-series information in the network.

3.2.1. Detection Loss

The detection loss relates to an object in a single frame, and it involves the frame information concerning the object. Similar to Faster RCNN [6], the detection loss consists of a combined classification loss $L_{cls}$ and a bounding box regression loss $L_{bbox}$; the overall detection loss is the sum of the two. Suppose that f is a feature at a certain anchor with a Jaccard overlap higher than 0.5. Then:

$$L_{det}(f, l, g, c) = \frac{1}{n}\big(\alpha L_{bbox}(f, l, g) + L_{cls}(f, c)\big), \tag{3}$$

where n is the number of anchors whose Jaccard overlap is higher than 0.5; f is the feature of a certain anchor; c is the classification score of the anchor; and l and g are the bounding box offsets and ground truth, respectively. Similar to SSD, $L_{cls}$ is formed by a multi-class softmax:

$$L_{cls}(f, c) = -\Big(\sum_{i \in positive}^{n} f_{ij}^{label} \log(\hat{c}_i^{label}) + \sum_{i \in negative} \log(\hat{c}_i^{0})\Big), \tag{4}$$

where:

$$\hat{c}_i^{label} = \frac{\exp(c_i^{label})}{\sum_{label} \exp(c_i^{label})}, \tag{5}$$

In the formulation, $c_i^{label}$ is the anchor classification prediction score of a certain label in the ith anchor box; for example, $c_5^{person}$ is the prediction score of "person" in the 5th anchor box. Let $f_{ij}^{label} \in \{1, 0\}$ be the indicator for matching the ith anchor box to the jth ground truth of a label. Moreover, the background label is 0; therefore, for negative samples, $\hat{c}_i^{label} = \hat{c}_i^{0}$.
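A minimal sketch of the mined classification loss, combining the softmax loss of Equations (4) and (5) with the 3:1 negative:positive selection of Section 3.1.5. One assumption to flag: we rank background anchors by their per-anchor loss, the usual OHEM criterion, which is equivalent in spirit to the paper's sorting by prediction score.

```python
import torch
import torch.nn.functional as F

def mined_cls_loss(conf, labels, neg_pos_ratio=3):
    """conf: (num_anchors, n_classes + 1) logits; labels: (num_anchors,) with
    0 = background. Keeps all positives plus the hardest negatives at a 3:1
    negative:positive ratio, as in OHEM-style hard example mining."""
    ce = F.cross_entropy(conf, labels, reduction='none')  # per-anchor loss
    pos = labels > 0
    num_neg = neg_pos_ratio * int(pos.sum())
    neg_losses = ce[~pos]
    hard_neg, _ = neg_losses.topk(min(num_neg, neg_losses.numel()))
    return (ce[pos].sum() + hard_neg.sum()) / max(int(pos.sum()), 1)
```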

$L_{bbox}$ is based on the Huber loss [33] between the ground truth and the bounding box; the Huber loss is less sensitive to outliers in the data than the squared error loss. $L_{bbox}$ is formulated as follows:

$$L_{bbox} = \sum_{i \in positive}^{n} \sum_{m \in (bx, by, bw, bh)} f_{ij}^{label}\, \mathrm{Huber}\big(pre_i^m - \hat{g}_j^m\big), \tag{6}$$

in which:

$$\mathrm{Huber}(x) = \begin{cases} 0.5x^2 & (|x| \le \alpha) \\ \alpha(|x| - 0.5\alpha) & (|x| > \alpha) \end{cases} \tag{7}$$

Similar to R-CNN, $\alpha$ is a hyper-parameter and is set to 1. Suppose that a bounding box has four parameters (bx, by, bw, bh), where (bx, by) is the center of the bounding box and (bw, bh) are the width and height, respectively. pre and g refer to the prediction box and ground truth box, respectively, and i and j are the same as in $L_{cls}$. Moreover, when formulating $L_{bbox}$, we adopt the parameterizations of the anchor bounding box (a) and ground truth box (g) following R-CNN [8]:

$$\hat{g}_j^{bx} = \frac{g_j^{bx} - a^{bx}}{a^{bw}}, \quad \hat{g}_j^{by} = \frac{g_j^{by} - a^{by}}{a^{bh}}, \quad \hat{g}_j^{bw} = \log\frac{g_j^{bw}}{a^{bw}}, \quad \hat{g}_j^{bh} = \log\frac{g_j^{bh}}{a^{bh}} \tag{8}$$

This parameterization enhances the effect of the center (bx, by) and weakens the effect of the width (bw) and height (bh).
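Equations (7) and (8) translate directly to code; a sketch with the box tensors assumed to be in center-size (bx, by, bw, bh) form:

```python
import torch

def encode_boxes(gt, anchors):
    """Eq. (8): encode ground truth boxes against anchors; both are (N, 4)
    tensors in (bx, by, bw, bh) center-size form."""
    txy = (gt[:, :2] - anchors[:, :2]) / anchors[:, 2:]   # centers over anchor w, h
    twh = torch.log(gt[:, 2:] / anchors[:, 2:])           # log width/height ratios
    return torch.cat([txy, twh], dim=1)

def huber(x, alpha=1.0):
    """Eq. (7): quadratic below alpha, linear above (smooth L1 for alpha = 1)."""
    absx = x.abs()
    return torch.where(absx <= alpha, 0.5 * x ** 2, alpha * (absx - 0.5 * alpha))
```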

3.2.2. Correlation Loss

The detection loss is mainly concerned with intra-frame information; in video sources, however, it does not consider inter-frame information. The correlation loss is a strategy that involves inter-frame information by measuring the feature difference between frames. As is known, traditional detection frameworks based on deep neural networks are discriminative algorithms: the main concern is how to plot a line or surface in the feature space, and a feature falling on the positive side of that line is regarded as generated by a positive sample. Given the well-constructed feature space of a convolutional neural network, a linear classifier (softmax) can easily differentiate negative features from positive features. Due to the characteristics of discriminative algorithms, features in the feature space are scattered. In video object detection, the object moves continuously; therefore, when the training dataset is complete, an intra-frame-based detection framework may have a favorable result, but consistency of the features across frames is not guaranteed.

The proposed correlation loss is inspired by the tracking task. In tracking tasks [16,34-37], feature consistency is the key point for judging the tracking result, especially in correlation-based tracking algorithms such as the correlation filter [34]. In deep feature flow [12] and flow-guided feature aggregation [38], according to the object's movement, features are copied or aggregated from key frames to the other frames; in this way, feature stability is guaranteed. Different from these approaches, we formulate a correlation loss to supervise feature consistency. As shown in Figure 1, the correlation loss is computed by the Siamese network, which is a two-path network built by replication. Since we construct a multiscale feature map, the correlation loss is generated from it. The correlation loss ($L_{corr}$) is the combination of the center value loss ($L_{center\_value}$) and the anchor coordinate loss ($L_{coordinate}$):

$$L_{corr}(f^t, f^{t+1}, map^{t+1}, a^{t+1}) = \frac{1}{n}\big(L_{center\_value}(f^t, f^{t+1}) + L_{coordinate}(f^t, map^{t+1}, a^{t+1})\big), \tag{9}$$

where n is the number of ground truths; $f^t$ and $f^{t+1}$ are the anchor center features of the neighboring frames t and t + 1, respectively; and $a^{t+1}$ is the positive anchor box in frame t + 1. Each positive anchor box has four parameters (bx, by, bw, bh). $map^{t+1}$ is the entire scale feature map of frame t + 1. Moreover, the positive anchor here is different from positive sample selection: in the correlation loss procedure, anchor selection takes the maximum Jaccard overlap between a ground truth box and the anchors, which means that the number of positive anchors equals the number of ground truths.

First, the center value loss compares the correlation distance between same-labelled object features in neighboring frames. The following is the formulation of $L_{center\_value}$:

$$L_{center\_value}(f^t, f^{t+1}) = \sum_{i \in positive}^{n} f_i^{t, label\_t, track\_t} \otimes f_j^{t+1, label\_{t+1}, track\_{t+1}}, \quad \text{if } (label\_t = label\_{t+1}) \text{ and } (track\_t = track\_{t+1}) \tag{10}$$

Here, $\otimes$ is the correlation operation, which measures the consistency of the features. label_t and track_t are the ground truth label and track ID, respectively, that relate to the anchor. By searching all positive anchors in the neighboring frame, we obtain one-to-one corresponding positive anchors with the same label and track ID in frame t and frame t + 1. $f_i^{t, label\_t, track\_t}$ is the center value of the positive anchor.

Second, the anchor coordinate loss measures the tracking result loss on the neighboring frame feature. Inspired by the correlation filter, we take the positive anchor in frame t as the first-frame tracking ground truth and obtain a response map in frame t + 1; the coordinate of the highest response is then the tracking result in frame t + 1. Figure 5 shows the $L_{coordinate}$ computation flow.

Figure 5. Coordinate loss computation flow. The red and green points are the positive anchors of frame t and frame t + 1, respectively. A score heat map is computed with the correlation operation, using a 3 x 3 kernel centered at the red point in frame t and the feature map of frame t + 1, similar to tracking in frame t + 1. After obtaining the max point coordinate, $L_{coordinate}$ can be computed from the max point coordinate and the green point coordinate.
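A hedged sketch of the center value loss. The paper leaves the exact form of the correlation operation open, so we realize it here as a Euclidean distance between the matched anchor-center features, consistent with the similarity metric of Equation (13); `matches` is a hypothetical structure pairing positive anchors by label and track ID.

```python
import torch

def center_value_loss(feat_t, feat_t1, matches):
    """A sketch of Eq. (10). feat_t / feat_t1: (C, H, W) maps from the two
    Siamese paths; matches: list of ((y_t, x_t), (y_t1, x_t1)) anchor-center
    pairs that share the same label and track ID."""
    loss = feat_t.new_zeros(())
    for (yt, xt), (y1, x1) in matches:
        # distance between the two center features of one matched object
        loss = loss + (feat_t[:, yt, xt] - feat_t1[:, y1, x1]).pow(2).sum().sqrt()
    return loss / max(len(matches), 1)
```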

Then, the anchor coordinate loss is formed as:

$$L_{coordinate}(f^t, map^{t+1}, a^{t+1}) = \sum_{i \in positive}^{n} \mathrm{Dis}\big(T(f_i^{t, label\_t, track\_t}, map^{t+1}),\; a_j^{t+1, label\_{t+1}, track\_{t+1}}\big) / map\_size, \quad \text{if } (label\_t = label\_{t+1}) \text{ and } (track\_t = track\_{t+1}) \tag{11}$$

where:

$$T(f_i^{t, label\_t, track\_t}, map^{t+1}) = \max\big(f_i^{t, label\_t, track\_t} \otimes map^{t+1}\big)[bx, by] \tag{12}$$

The T (Track) formulation obtains the center (bx, by) of the max response location. In $L_{coordinate}$, Dis computes the Euclidean distance between the centers of the tracking result and the positive anchor in frame t + 1. Moreover, because the coordinate loss can be large, we normalize it by the size of the feature map. By applying the correlation loss, object features in neighboring frames become more consistent.
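A sketch of Equations (11) and (12): correlate a 3 x 3 patch around the frame-t positive anchor with the one-pixel-padded frame t + 1 map, take the argmax as the tracked point, and normalize its distance to the frame t + 1 positive anchor by the map size. As written (and as in the paper's formulation), the argmax step is not differentiable, so this is a literal rendering rather than a training-ready implementation.

```python
import torch
import torch.nn.functional as F

def coordinate_loss(feat_t, feat_t1, pos_t, pos_t1):
    """feat_*: (C, H, W) Siamese feature maps; pos_*: (y, x) integer anchor
    centers of the matched positive anchors in frames t and t + 1."""
    C, H, W = feat_t.shape
    y, x = pos_t
    patch = F.pad(feat_t, (1, 1, 1, 1))[:, y:y + 3, x:x + 3]       # 3x3 kernel
    resp = F.conv2d(feat_t1[None], patch[None], padding=1)[0, 0]   # (H, W) heat map
    ty, tx = divmod(int(resp.flatten().argmax()), W)               # tracked point
    dy, dx = float(ty - pos_t1[0]), float(tx - pos_t1[1])
    return (dy ** 2 + dx ** 2) ** 0.5 / H                          # normalized distance
```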

4. Experiment and Results

We report results on the ImageNet VID validation dataset; the training set is the ImageNet VID training dataset. The performance of our method is compared to R-CNN [8], Fast R-CNN [5], the original SSD [10], T-CNN [15], TPN + LSTM [39], and the winner of the ImageNet VID 2015 competition [13]. The detailed evaluation metrics are described in Section 4.1. All methods in the experiments were implemented on the Pytorch deep learning framework. The computational resources include a TITAN X GPU, 128 GB of memory, and an Intel Xeon CPU (2.30 GHz); the operating system is Ubuntu. Moreover, to show the effectiveness of the proposed method, we also evaluate our model on the YouTube Objects (YTO) dataset, with the Unsupervised [40], YOLO [22], Context [41], a_lstm [42], T-CNN [15], and Base [10] models selected for comparison.

4.1. ImageNet Dataset

We evaluate our method using the 2015 ImageNet object detection from video (VID) [23] dataset, which contains 30 classes in 3862 training and 555 validation videos. The 30 object categories in ImageNet VID are a subset of the 200 categories in the ImageNet DET dataset. The objects have ground truth annotations for their bounding boxes and a tracking ID in a video. Since the ground truth for the test set is not publicly available, we measure performance as mean average precision (mAP) over the 30 classes on the validation set, following the protocols in [12,13,15,38,39], which is standard practice. Figure 6 shows the number of ground truths in each class, from which we can see that the training ground truth of each class is unbalanced. We therefore subsample the VID training set by using only 30 frames from each video.

Figure 6. The number of ground truths in each class.

Figure 7 shows the object area statistics of the dataset, from which we can see that more than half of the objects have relatively small areas. The positive samples are anchors with a Jaccard overlap with the ground truths of more than 0.5; if the Jaccard overlaps between all of the anchors and a ground truth are lower than 0.5, there would be no positive samples. Figure 8 shows the statistics of small objects: in this dataset, if the scale of the smallest anchor is 0.1, the framework will miss about 10% of the ground truths.

Figure 7. The object area characteristics of the ImageNet VID dataset.

Figure 8. The small object area characteristics of the ImageNet VID dataset.

4.2. Model Training

Baseline single shot multibox: The baseline is SSD300, which means the input is resized to 300 x 300. The multiscale feature map sizes are 38 x 38, 19 x 19, 10 x 10, 5 x 5, 3 x 3, and 1 x 1. There are six anchor shapes per pixel in the multiscale feature maps, containing $a_r = [1, 2, 3, 1/2, 1/3]$ plus an aspect ratio of 1 for a scale of $\sqrt{s_n s_{n+1}}$. The backbone is the VGG16 net with the fully-connected layers removed. Following SSD300, we change pool5 from 2 x 2-s2 to 3 x 3-s1 and use the a trous algorithm to fill the holes. The baseline uses conv4_3, conv6_2, conv7_2, conv8_2, conv9_2, and conv10_2 to predict both locations and confidences. The baseline sets the anchor scale on conv4_3 to 0.1, and the other layers are initialized by the Xavier method [43]. For all prediction layers, we include the six anchor shapes described in Section 3.1.3. The baseline uses a 10^-3 learning rate for 200,000 iterations, and training is continued for 200,000 more iterations with 10^-4 and 10^-5. Moreover, the optimization uses a momentum of 0.9 with weight decay.

Proposed network: Our network is established on SSD300, whose input image size is also resized to 300 x 300. The training differences between the proposed network and the baseline are as follows:

1. Input image

The input is a pair of neighboring frames (frame t and frame t + 1). The neighboring frames must have at least one object pair with the same label and track ID; if one of the neighboring frames is empty, the frame pair is removed.

2. Correlation loss computation

The correlation loss ($L_{corr}$) has two subsets: $L_{center\_value}$ and $L_{coordinate}$. $L_{center\_value}$ is computed from the center values of the positive anchors. In $L_{coordinate}$, when computing the response map, the kernel size is 3 x 3 and its center is the positive anchor center; furthermore, frame t + 1 is padded by one pixel.

3. Learning rate

Since the loss of the proposed network is larger than the baseline's, we first warm up the network with a learning rate of 10^-4 for 10,000 iterations. Then, the proposed network uses a 10^-3 learning rate for 200,000 iterations, and training is continued for 200,000 more iterations with 10^-4 and 10^-5.

4. Training time

The training time of the baseline is almost three days. Due to the correlation loss computation and a detection loss that covers two images, the training time of our method is longer than the baseline's: following the above training process, the entire training takes approximately seven days.

Apart from the above items, the training of the proposed network is mostly the same as the baseline's: it also predicts from six layers with six anchor shapes each, and the optimization uses a momentum of 0.9 with weight decay.

4.3. Testing and Results

We test the baseline and the proposed network on the validation dataset with subsampling. We adopt Soft-NMS [19] to accurately refine the candidate bounding boxes, and Seq-NMS [14], a post-processing method, is applied after the detection process.
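Since the raw detections are refined with Soft-NMS [19], a minimal sketch of the Gaussian variant may help; `box_iou` and the threshold values are our own illustrative choices, not the paper's settings.

```python
import torch

def box_iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes."""
    lt = torch.max(a[:2], b[:2])
    rb = torch.min(a[2:], b[2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[0] * wh[1]
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: instead of discarding boxes that overlap a
    higher-scoring box, decay their scores by exp(-iou^2 / sigma).
    boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns kept indices."""
    idxs = list(range(len(scores)))
    scores = scores.clone()
    keep = []
    while idxs:
        best = max(idxs, key=lambda i: float(scores[i]))
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            scores[i] *= torch.exp(-(box_iou(boxes[best], boxes[i]) ** 2) / sigma)
        idxs = [i for i in idxs if scores[i] > score_thresh]
    return keep
```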

Table 4 shows the mAP results on the validation dataset. From the table, the baseline SSD300 achieves a mAP of 67.9%, while our proposed method achieves 69.5%, a 1.6% improvement over the baseline. Table 4 also compares our proposed network with R-CNN [8], Fast R-CNN [5], T-CNN [15], TPN + LSTM [39], and the winner of the ImageNet VID 2015 competition [13] (multi-model). R-CNN and Fast R-CNN are baselines designed for still images; the last column is the ImageNet VID 2015 winner, whose result is a fusion of DeepID Net, CRAFT, and post-processing procedures. Among single models, the proposed network achieves the best score, a 1.1% improvement over TPN + LSTM. Moreover, because our proposed network is based on SSD300 and the testing process is similar, the test time per image is similar: on our equipment (Titan X), the testing speed is approximately 32 fps, somewhat slower than the original SSD because there are more anchors. During the network feed-forward, Figure 9 visualizes some intermediate features from conv4_3, conv6_2, conv7_2, conv8_2, conv9_2, and conv10_2.

Table 4. Average precision of each class and mAP on the ImageNet VID validation set, covering the 30 classes (airplane, antelope, bear, bicycle, bird, bus, car, cattle, dog, domestic cat, elephant, fox, giant panda, hamster, horse, lion, lizard, monkey, motorcycle, rabbit, red panda, sheep, snake, squirrel, tiger, train, turtle, watercraft, whale, and zebra). The bold values in the table are the best results among single models for a certain class. On mAP, the proposed network (69.5%) outperforms R-CNN [8] (45.3%), Fast R-CNN [5] (63.0%), TPN + LSTM [39] (68.4%), and the baseline (67.9%); only the multi-model winner [13] (73.8%) scores higher.

Figure 9. Multiscale feature representation of the network (columns: image, conv4_3, conv6_2, conv7_2, conv8_2, conv9_2, conv10_2). The first line is the multiscale feature representation of the proposed method, and the second line is the feature of the baseline.

To measure the effect of the correlation loss on the features, we extract the feature of the anchor with the maximum Jaccard overlap with the ground truth. Then, for computing the similarity between neighbouring frames, the similarity metric is the Euclidean distance. For each class, we choose 10 pairs of neighbouring frames, and the similarity metric is the average of those similarities. In order to avoid the effect of the channel size, the similarity index is normalized by the number of channels:

$$\mathrm{Similarity} = \sum_i^n \sqrt{\big(f_{i, frame\_t}^{a} - f_{i, frame\_{t+1}}^{a}\big)^2} \Big/ channels \Big/ n \tag{13}$$

Here, $f_{frame\_t}^{a}$ and $f_{frame\_{t+1}}^{a}$ are the features of a certain anchor in neighbouring frames, and channels is the channel number of the feature map. Figure 10 shows the feature similarity of the proposed network and the baseline; we can see that our proposed method maintains better feature similarity than the baseline.

Figure 10. The feature similarity of the proposed method. The horizontal axis is the index of the multiscale feature map and the vertical axis is the similarity index.
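Equation (13) in code form, for one anchor pair; averaging over the 10 pairs per class is left to the caller, and we read the inner square root as the Euclidean norm over channels, matching the prose above.

```python
import torch

def feature_similarity(f_t, f_t1):
    """Eq. (13) for one matched anchor: Euclidean distance between the two
    center-feature vectors (shape (channels,)), normalized by the channel
    count. Lower values mean more consistent features across frames."""
    return (f_t - f_t1).pow(2).sum().sqrt() / f_t.numel()
```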

Figure 11 shows detection results on the validation dataset.

4.4. Model Analysis

The number of anchor shapes and the number of multi-scale feature maps are key hyper-parameters in the proposed framework. To evaluate the effects of different numbers of anchor shapes and feature maps, additional comparison experiments are conducted on the ImageNet VID dataset.

4.4.1. Number of Anchor Shapes

In the proposed framework, the number of anchor shapes for each pixel in the multi-scale feature maps is one of the key hyper-parameters, and it affects both detection performance and speed. To analyze this effect, we conduct a comparison experiment between different numbers of anchors; Table 5 shows the details. The settings of this experiment are as follows (the anchor totals can be verified with the short script after Figure 11):

- Anchor-6: For each pixel in the multi-scale feature maps, there are six anchor shapes, containing $a_r = [1, 2, 3, 1/2, 1/3]$ plus an aspect ratio of 1 for a scale of $\sqrt{s_n s_{n+1}}$. In our framework, the total number of anchors is 11,640.
- Anchor-4: For each pixel in the multi-scale feature maps, there are four anchor shapes, containing $a_r = [1, 2, 1/2]$ plus an aspect ratio of 1 for a scale of $\sqrt{s_n s_{n+1}}$. In our framework, the total number of anchors is 7760.
- Anchor-4 and 6: In this setting, we follow the original SSD setting. In Conv4_3, Conv9_2, and Conv10_2, there are four anchor shapes, containing $a_r = [1, 2, 1/2]$ plus an aspect ratio of 1 for a scale of $\sqrt{s_n s_{n+1}}$. Then, in Conv6_2, Conv7_2, and Conv8_2, there are six anchor shapes, containing $a_r = [1, 2, 3, 1/2, 1/3]$ plus an aspect ratio of 1 for a scale of $\sqrt{s_n s_{n+1}}$. In our framework, the total number of anchors is 8732.

Figure 11. Results on the validation dataset. (a) Blue bounding boxes are ground truths. (b) Yellow bounding boxes are baseline results. (c) Red bounding boxes are our results.
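The anchor totals in these settings follow directly from the feature-map sizes in Table 1; a quick check:

```python
# Anchor-count arithmetic for the three settings in Table 5.
maps = {'conv4_3': 38, 'conv6_2': 19, 'conv7_2': 10,
        'conv8_2': 5, 'conv9_2': 3, 'conv10_2': 1}

def total_anchors(shapes_per_map):
    return sum(maps[k] ** 2 * s for k, s in shapes_per_map.items())

print(total_anchors({k: 6 for k in maps}))    # Anchor-6     -> 11640
print(total_anchors({k: 4 for k in maps}))    # Anchor-4     -> 7760
print(total_anchors({**{k: 6 for k in maps},
                     'conv4_3': 4, 'conv9_2': 4,
                     'conv10_2': 4}))         # Anchor-4 and 6 -> 8732
```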

Table 5. Detection performance for different numbers of anchor shapes. The bold values in the table are the fastest detection speed and the best performance.

Settings | Anchor-6 | Anchor-4 | Anchor-4 and 6
Anchor number | 11,640 | 7760 | 8732
Detection speed | 32 fps | 51 fps | 46 fps
Mean AP | 69.5 | 67.9 | -

From the table, we can see that more anchors give a better mean average precision, but more anchors also need more computational resources, so the detection speed is lower. If we remove the aspect ratios of 3 and 1/3, performance drops by 1.6% while the detection speed gains 18 fps.

4.4.2. Number of Multi-Scale Feature Maps

In our framework, the number of multi-scale feature maps is also a key hyper-parameter. To investigate the effect of the multi-scale feature representation, we conduct another comparison experiment between different numbers of multi-scale feature maps; Table 6 shows the details. In each setting, there are six anchor shapes per pixel, consisting of $a_r = [1, 2, 3, 1/2, 1/3]$ plus an aspect ratio of 1 for a scale of $\sqrt{s_n s_{n+1}}$. The following are the experimental settings:

1. Feature-6: We use six feature maps: Conv4_3, Conv6_2, Conv7_2, Conv8_2, Conv9_2, and Conv10_2. The total number of anchors is 11,640.
2. Feature-5: We use five feature maps: Conv4_3, Conv6_2, Conv7_2, Conv8_2, and Conv9_2. The total number of anchors is 11,634.
3. Feature-4: We use four feature maps: Conv4_3, Conv6_2, Conv7_2, and Conv8_2. The total number of anchors is 11,580.
4. Feature-3: We use three feature maps: Conv4_3, Conv6_2, and Conv7_2. The total number of anchors is 11,430.

Table 6. Detection performance for different numbers of feature maps. The bold values in the table are the fastest detection speed and the best performance.

Settings | Feature-6 | Feature-5 | Feature-4 | Feature-3
Anchor number | 11,640 | 11,634 | 11,580 | 11,430
Detection speed | 32 fps | 32 fps | 32 fps | 33 fps
Mean AP | 69.5 | - | - | 66.8

From the table, we can see that more feature maps significantly enhance the mean average precision. Since few anchors come from the lower-resolution feature maps, the speed does not differ greatly between settings. Compared with Feature-3, Feature-6 has a 2.7% bonus in mean average precision.

4.5. Evaluation on the YouTube Objects (YTO) Dataset

To show the effectiveness of the proposed network, we evaluate our model on a video object detection task with the YTO dataset [44].

4.5.1. YouTube Objects Dataset

The YTO dataset contains 10 object classes: airplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train. These 10 object classes are a subset of the ImageNet VID dataset classes, and these objects are also moving objects. Different from the VID dataset, which contains full annotations on all video frames, the YTO training dataset is weakly annotated, i.e., each video is only ensured to contain one object of the corresponding class, and only a few frames are annotated, whereas the objects in

the YTO test dataset are all annotated. In total, the YTO dataset contains 155 videos but only 6087 annotated frames, among which 4306 are for training and 1781 for testing. The weak annotation makes it infeasible to train the proposed network on the YTO dataset. Following [13,15], since the YTO classes are a subset of the VID dataset classes, we only use this dataset for evaluation and can directly apply the trained models on the YTO dataset. The evaluation metric is the same as for the ImageNet VID dataset.

4.5.2. Evaluation Results

We evaluate the model trained on ImageNet on the YTO test dataset, and several state-of-the-art methods are selected for comparison. Table 7 shows the detailed AP lists computed by Unsupervised [40], YOLO [22], Context [41], a_lstm [42], T-CNN [15], Base [10], and our own method on the YTO test dataset.

Table 7. Detection performance on the YTO dataset over the classes airplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train; the columns compare [40], [22], [41], [42], [15], the baseline, and our method. The bold values in the table are the best results among single models for a certain class.

From the table, we can see that our proposed framework outperforms the others by a large margin: compared with the baseline, our proposed method has around a 4.4% improvement, and compared with T-CNN, it has a 4.0% bonus.

5. Conclusions

In this paper, we propose a fast and accurate object detection network for video sources. The proposed network is a single-shot object detection network. Unlike traditional single-frame-based object detection networks, our proposed network involves frame-to-frame information by using object pair relations among neighboring frames. Following the Siamese network, we formulate a correlation loss to restrain the deep features; in this way, we incorporate correlation into the discriminative algorithm. The backbone network is the VGG16 model with the fully-connected layers removed. Based on the backbone network, we establish a multiscale feature representation to predict detections on multiple layers, and different anchor scales are applied to different feature maps for different object scales. Hard example mining and data augmentation are also used to balance the training samples and improve generalization. Our proposed model has been tested on the ImageNet VID dataset, the largest video object detection dataset. Compared to the baseline, our proposed network has a 1.6% bonus, and the test time does not increase. To show the effectiveness of the proposed framework, we also evaluate the model on the YTO dataset, where the proposed framework has a 4.4% bonus compared to the baseline.

Although our proposed network performs better than the baseline, it still has some limitations. The first limitation is the use of time-series information: the correlation loss is rigid with respect to the feature map. The second limitation is that we do not fully use the tracking information. The tracking

process operates only on the feature map and does not output an exact tracking result; if the network could output an exact tracking result, the testing time could be faster.

Acknowledgments: The work described in this paper is supported by the 111 Project of China under grant B14010 and the Chang Jiang Scholars Programme (grant no. T).

Author Contributions: Baojun Zhao, Boya Zhao and Linbo Tang created and designed the framework. Boya Zhao and Wenzheng Wang implemented the algorithm on Pytorch and debugged the code. Boya Zhao and Yuqi Han wrote the paper.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541-551. [CrossRef]
2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 1097-1105. [CrossRef]
3. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
4. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7-12 June 2015; pp. 1-9.
5. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Los Alamitos, CA, USA, 7-13 December 2015; pp. 1440-1448.
6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137-1149. [CrossRef] [PubMed]
7. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016.
8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014; pp. 580-587.
9. Zhong, J.; Lei, T.; Yao, G. Robust vehicle detection in aerial images based on cascaded convolutional neural networks. Sensors 2017, 17. [CrossRef] [PubMed]
10. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21-37.
11. Oh, S.I.; Kang, H.B. Object detection and classification by decision-level fusion for intelligent vehicle systems. Sensors 2017, 17, 207. [CrossRef] [PubMed]
12. Zhu, X.; Xiong, Y.; Dai, J.; Yuan, L.; Wei, Y. Deep feature flow for video recognition. arXiv 2016, arXiv:1611.07715.
13. Kang, K.; Li, H.; Yan, J.; Zeng, X.; Yang, B.; Xiao, T.; Zhang, C.; Wang, Z.; Wang, R.; Wang, X. T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 2017. [CrossRef]
14. Han, W.; Khorrami, P.; Paine, T.L.; Ramachandran, P.; Babaeizadeh, M.; Shi, H.; Li, J.; Yan, S.; Huang, T.S. Seq-NMS for video object detection. arXiv 2016, arXiv:1602.08465.
15. Kang, K.; Ouyang, W.; Li, H.; Wang, X. Object detection from video tubelets with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2016.
16. Lee, B.; Erdenee, E.; Jin, S.; Nam, M.Y.; Jung, Y.G.; Rhee, P.K. Multi-class multi-object tracking using changing point detection. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016.
17. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification.
In Proceedings of the IEEE CVPR Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, June 2005; pp. 539-546.
18. Weinberger, K.Q.; Saul, L.K. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 2009, 10, 207-244.

19. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving object detection with one line of code. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 2017; pp. 5561-5569.
20. Felzenszwalb, P.; McAllester, D.; Ramanan, D. A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE CVPR Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, June 2008; pp. 1-8.
21. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154-171. [CrossRef]
22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2016; pp. 779-788.
23. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211-252. [CrossRef]
24. Gkioxari, G.; Malik, J. Finding action tubes. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7-12 June 2015.
25. Peng, X.; Schmid, C. Multi-region two-stream R-CNN for action detection. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016.
26. Hou, R.; Chen, C.; Shah, M. Tube convolutional neural network (T-CNN) for action detection in videos. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, October 2017.
27. Li, C.; Stevens, A.; Chen, C.; Pu, Y.; Gan, Z.; Carin, L. Learning weight uncertainty with stochastic gradient MCMC for shape classification. In Proceedings of Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2016.
28. Luciano, L.; Hamza, A.B. Deep learning with geodesic moments for 3D shape classification. Pattern Recognit. Lett. 2018. [CrossRef]
29. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, June 2010; pp. 807-814.
30. Erhan, D.; Szegedy, C.; Toshev, A.; Anguelov, D. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014; pp. 2147-2154.
31. Hosang, J.; Benenson, R.; Schiele, B. How good are detection proposals, really? arXiv 2014, arXiv:1406.6962.
32. Shrivastava, A.; Gupta, A.; Girshick, R. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2016; pp. 761-769.
33. Huber, P.J. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73-101. [CrossRef]
34. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583-596. [CrossRef] [PubMed]
35. Baoxian, W.; Linbo, T.; Jinglin, Y.; Baojun, Z.; Shuigen, W. Visual tracking based on extreme learning machine and sparse representation. Sensors 2015, 15.
36. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional Siamese networks for object tracking. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 850-865.
37. Zhao, Z.; Han, Y.; Xu, T.; Li, X.; Song, H.; Luo, J. A reliable and real-time tracking method with color distribution. Sensors 2017, 17. [CrossRef] [PubMed]
38. Zhu, X.; Wang, Y.; Dai, J.; Yuan, L.; Wei, Y. Flow-guided feature aggregation for video object detection. arXiv 2017, arXiv:1703.10025.
39. Kang, K.; Li, H.; Xiao, T.; Ouyang, W.; Yan, J.; Liu, X.; Wang, X. Object detection in videos with tubelet proposal networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 2017.
40. Kwak, S.; Cho, M.; Laptev, I.; Ponce, J. Unsupervised Object Discovery and Tracking in Video Collections. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.

41. Tripathi, S.; Lipton, Z.; Belongie, S.; Nguyen, T. Context Matters: Refining Object Detection in Video with Recurrent Neural Networks. In Proceedings of the British Machine Vision Conference (BMVC), York, UK, September 2016.
42. Lu, Y.; Lu, C.; Tang, C.K. Online Video Object Detection Using Association LSTM. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.
43. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy, May 2010.
44. Ferrari, V.; Schmid, C.; Civera, J.; Leistner, C.; Prest, A. Learning object class detectors from weakly annotated video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, June 2012.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
