TrackNet: Simultaneous Detection and Tracking of Multiple Objects


Chenge Li, New York University
Gregory Dobler, New York University
Yilin Song, New York University
Xin Feng, Chongqing University of Technology
Yao Wang, New York University

Abstract

Object detection and object tracking are usually treated as two separate processes. Object detection in still images relies on spatial appearance features, whereas object tracking in videos relies on both spatial appearance and temporal motion features. Significant progress has been made for object detection in 2D images using deep learning networks such as region CNN and subsequent variants. The usual pipeline for object tracking requires that the object be successfully detected in the first frame or in every frame, and tracking is done by associating detection results. Performing object detection and object tracking through a single network remains a challenging open question. We propose a novel network structure that can directly detect a 3D tube enclosing a moving object in a video by extending the region-CNN framework for object detection in an image. The proposed TrackNet works over short video segments and outputs a bounding tube for each detected moving object, which includes shifted bounding boxes covering the detected object in successive frames. A Tube Proposal Network (TPN) inside the TrackNet is proposed to predict the objectiveness of each candidate tube and location parameters specifying the bounding tube with a high objectiveness score. We have trained and tested TrackNet on UA-DETRAC, the largest traffic video dataset available for multi-object detection and tracking, and obtained very promising results.

1. Introduction

Object detection and object tracking have been two long-standing challenges for the computer vision community, and much progress has been made on both fronts. For object detection, complex hand-crafted features plus shallow classifiers such as HOG+SVM [3] and multiple resolution image pyramids plus multiple filters such as DPM [6] were both popular detection pipelines.
In 2012, AlexNet [14] showed significant performance improvements over those traditional pipelines in the ImageNet competition [4] for a 1000-category image classification task. With this success, Convolutional Neural Networks (CNN) regained popularity in the research community for a variety of vision tasks, from image classification, object detection, and localization to dense image captioning, etc.

Object tracking, especially Multi-Object Tracking (MOT), has many real-world applications including autonomous vehicles, virtual/augmented reality, robot navigation, etc. Most existing MOT systems can be classified into two groups [16]: Detection Based Tracking (DBT) and Detection Free Tracking (DFT). DBT requires objects to be detected in every frame, followed by a tracker that associates or links the detection regions based either on object features or probabilistic movement characteristics. DFT, on the other hand, requires manual initialization of objects in the initial frame followed by association of those with automatically detected objects in subsequent frames (i.e., DFT trackers cannot handle objects that were not initialized). Both approaches rely on pre-computed object locations. A general MOT system entails an object detection step to find target locations in multiple (and sometimes every) video frames.

We argue that object detection and tracking should not be treated as two independent tasks, but rather, effective object detection in video should employ both spatial appearance and temporal motion features. In this work, we propose a unified network structure for simultaneous object detection and tracking. Our network treats a group of consecutive pictures (GoP) as a 3D volume, and detects moving objects in the GoP as tubes within it. As shown in Figure 1, our system uses the popular 3D convolutional network [23] and the VGG [21] net with pure 2D convolutions as the underlying networks for spatial and temporal feature extraction from raw input videos. A spatial transformer module [10] is used to deal with varying object orientations. A tube proposal network (TPN) generates bounding tube proposals for detected candidates. A post-TPN stage then further refines the tube classification scores and location parameters for tube proposals with higher objectiveness scores. We have explored multiple structures inside TPN for generating accurate tube proposals. We evaluate our method on the UA-DETRAC [26] dataset both for detection and tracking.

[Figure 1 diagram labels: 2D convolution; spatial-temporal convolution; Spatial Transformer ($\theta$, $T_\theta(G)$); U; V; Tube Proposal Network; Tube Pooling; Post-TPN; FC layers; $L_{cls}$; $L_{reg}$]

Figure 1. The TrackNet structure. A two-branched backbone produces both appearance-only and spatial-temporal features from a video GoP. A spatial transformer is inserted to introduce global feature warping. A tube proposal network is introduced to generate flexible bounding tube proposals. Proposal tubes are used to guide more specific pooling, and the tube-pooled features are used to further predict more specific class labels and refine tube locations. The final output will be bounding tubes for moving objects inside this video GoP. A short tracklet is also generated by connecting the centroids of bounding boxes at each frame.

Motivations for Tube Proposals. One obvious advantage of tube proposals over box proposals is the convenience of getting all objects' spatial-temporal locations in one shot. A bounding tube is readily available after one forward pass instead of N forward passes plus post-processing association. The other advantage of tube proposals is their role in providing a much stronger regularization during training. Suppose that there are B moving objects, each of which appears in all N frames. If we first detect B box proposals in each frame, we would have to examine B^N possible tubes, while only B of those tubes are correct. Therefore during training, the ground truth bounding tubes actually occur very rarely in this high-dimensional space. Each one of the ground truth tubes carries highly structured information implicitly.
This sparsity and implicit structure information will serve as very strong regularization: only tubes with certain spatial and motion features over the entire GoP are good candidates. Furthermore, the step from proposing boxes to proposing tubes is in fact a very natural extension. When a person sees two nearby frames (not necessarily adjacent), they will naturally fill in the gap between one object's two appearances. Therefore, when dealing with object detection or tracking in videos, proposing tubes is a more natural way than proposing boxes.

Existing tube-based algorithms such as [12, 13, 11] have some limitations in their tube generation processes. There, the proposed tubes are obtained from frame-by-frame detection results or by using some tracking algorithms to explicitly create a tube, which makes the tube proposal stage very time-consuming and computationally expensive. Kang et al. in [11] proposed to generate tube proposals from static anchors more efficiently; however, the features are pooled from the same box location across multiple frames. Hence only stationary straight anchor tubes are considered in their case.

The major innovations of this work include: i) We extend the region-CNN architecture to directly detect bounding tubes covering moving objects in a video, to simultaneously accomplish detection and tracking; ii) We propose to combine the features generated by a 3D convolution network and a 2D convolution network to capture both motion information and appearance information; iii) We propose to insert a spatial transformer module in the feature domain to make the input features to the tube proposal network and refinement network invariant to camera view angles; iv) We further propose to initialize candidate anchor tubes based on the optical flow motion vectors near the candidate location.

2. Related Work

Object Detection in Images. Object detection has progressed rapidly in recent years [8, 19, 17, 18]. Girshick et al. first introduced R-CNN [8] to identify and label objects in 2D images.
From object region proposals that are generated by an independent algorithm (e.g., selective search [24]), R-CNN runs a forward pass once for each proposed region to determine whether this region contains an object. The same authors further improved R-CNN to fast R-CNN [7] by sharing convolution feature maps among all object region proposals, and hence only one forward pass for all region proposals is needed. Faster R-CNN further improved upon fast R-CNN by introducing a region proposal network (RPN) that directly regresses fixed anchors to object region proposals from feature maps extracted using a 2D convolutional network. Our work is inspired by the structure of faster R-CNN [19], but we replace RPN by a tube proposal network (TPN) acting on a 3D convolutional network, and we utilize a 3D (as opposed to 2D) ROI pooling mechanism. The regression loss objective function is also extended to consider the difference between ground truth and detected tube locations in all frames, which is optimized end-to-end jointly with the tube classification loss.

Object Detection in Videos. Most systems proposed so far for identifying (and tracking) moving objects in videos rely on 2D object detection in each frame, which is computationally expensive and does not jointly consider the objects' motion information. Some systems (e.g., [12, 13]) have used explicit motion information (e.g. optical flow) as a linking feature to associate detection regions or to smooth detection scores as a post-processing step. However, this motion information is derived separately outside of the network and is not integrated organically with the network training. Recently, [11] proposed a tubelet generation module (also referred to as Tubelet Proposal Network); however, their tubelets are generated from 2D proposal boxes generated by a separate algorithm (e.g., selective search [24]). The object detection is based on pooled features from GoogLeNet [22] feature maps. Despite the powerful ability of GoogLeNet, this method pools from multiple 2D feature maps which are then concatenated, as opposed to our spatial-temporal features, which are generated directly from video segments.

Tran et al. in [23] proposed C3D, a 3D CNN structure, to extract spatio-temporal features to classify videos into different categories. C3D has shown good performance compared with traditional computer vision algorithms such as dense trajectories [25]. Shou et al. in [20] extended the C3D network to locate interesting action time slots in an untrimmed long video by targeting video segments. We argue that features extracted by this kind of 3D convolutional network are well suited for object detection in videos. Even though C3D was not specifically trained for accurate object localization, locations that are highly activated in the feature maps typically correspond to moving objects. Therefore we use the C3D network structure for feature extraction in the TrackNet, and we refine the network weights through end-to-end training.

3. TrackNet Model Architecture

3.1. Two Stream Feature Extraction and Transformation

In order to utilize both spatial and temporal features, our network is based on a VGG net trained on ImageNet and C3D trained on UCF101 for video classification. Figure 1 provides an overview of the TrackNet structure. We divide a video into groups of pictures (GoPs) of fixed length (8 frames in our implementation) and feed the raw video frames in each GoP into a two-headed backbone structure, where the first head is a VGG-like subnetwork with convolutions and spatial max-poolings and the second head is a C3D-like subnetwork with 3D convolutions and spatial-temporal max-poolings. These two kinds of features complement each other in that one focuses more on appearance whereas the other focuses more on motion. The resulting spatial-temporal feature maps and the 2D feature maps then separately go through a squashing convolutional layer with 1 × 1 kernel size and 128 output channels. These two squashing layers reduce the number of feature map channels. The squashed feature maps are concatenated afterwards.

[Figure 2 diagram labels: transformed features V; ($p_{i,1}$, $p_{i,t}$); interpolate; refine; cls; reg; Tube Classification Module; Tube Offset Regression Module]

Figure 2. Tube Proposal Network (TPN) in detail: based on the shared convolution feature maps, a TPN consists of two parts: the classification module and the tube offset regression module. We show a heat map of the predicted objectiveness scores from the classification module: regions where objects are moving have higher (warmer) values.

Spatial Transformer. In some videos, the network will observe the objects' frontal appearance, whereas in others, the side appearances of objects will be observed. Inspired by [10], we utilize a learnable module, the spatial transformer, to map the concatenated features from different viewing angles into a unified manifold. We use an affine transformation in our case; however, one can use more complicated transformations to suit other cases, as indicated in [10]. Our transformer has a very simple structure with only one convolutional layer and one fully connected layer, as shown in Figure 1.
Six affine transformation parameters θ are output from the fully connected layer and then used to transform the original concatenated feature maps U. Instead of sampling from the original feature maps using a regular mesh grid G, the sampling grid is transformed using θ to $T_\theta(G)$, which is applied to the original feature maps U to produce the warped output feature maps V. All feature maps are transformed channel-wise in the same way. The transformed features V are then fed into the tube proposal network (TPN) shown in Figure 2 to generate tube proposals, and into a post-TPN stage to further refine those proposals. We will describe the details of the different subcomponents in the following sections.
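To make the grid warping concrete, the following is a minimal pure-Python sketch of an affine spatial transformer applied to one feature channel. It is illustrative only and not the authors' implementation: the function names (`affine_grid`, `bilinear_sample`, `warp_channel`), the normalized [-1, 1] coordinate convention, and the zero padding outside the map are all assumptions borrowed from common spatial transformer practice.

```python
import math


def affine_grid(theta, height, width):
    """Build the warped sampling grid T_theta(G) in normalized [-1, 1] coordinates.

    theta holds the six affine parameters as two rows: ((a, b, c), (d, e, f)).
    """
    (a, b, c), (d, e, f) = theta
    grid = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1) if height > 1 else 0.0
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1) if width > 1 else 0.0
            row.append((a * x + b * y + c, d * x + e * y + f))
        grid.append(row)
    return grid


def bilinear_sample(u, xs, ys):
    """Bilinearly sample the 2D map u at normalized (xs, ys); zero outside the map."""
    height, width = len(u), len(u[0])
    px = (xs + 1.0) * (width - 1) / 2.0
    py = (ys + 1.0) * (height - 1) / 2.0
    x0, y0 = math.floor(px), math.floor(py)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                weight = (1.0 - abs(px - xx)) * (1.0 - abs(py - yy))
                value += weight * u[yy][xx]
    return value


def warp_channel(u, theta):
    """Produce one channel of V by sampling U on the transformed grid."""
    grid = affine_grid(theta, len(u), len(u[0]))
    return [[bilinear_sample(u, xs, ys) for (xs, ys) in row] for row in grid]


# Usage: the identity transform leaves the channel unchanged,
# while negating the x coefficient mirrors it horizontally.
channel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
mirror = ((-1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
same = warp_channel(channel, identity)
flipped = warp_channel(channel, mirror)
```

Because every channel is warped with the same θ, applying `warp_channel` to each channel of U in turn would give the full V under these assumptions.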


Figure 3. Example of the M = 9 objectiveness scores, shown on the right as heat maps. Each score heat map corresponds to a specific set of anchor size and aspect ratio, e.g. size = 40, aspect ratio = 0.81. It can be seen that score maps of smaller-sized anchor tubes are more sensitive to smaller objects, whereas score maps of larger-sized anchor tubes are more sensitive to larger objects.

Tube Proposal Network (TPN)

The TPN produces many initial tube proposals. Similar to faster R-CNN's region proposal network (RPN), the TPN generates multiple candidate anchor tubes at each pixel location. For each anchor tube, the classification module predicts the objectiveness score of the tube (i.e. the probability of having objects inside the tube) and the offset regression module generates the position offsets from the anchor tubes to the ground truth bounding tubes. Both the classification module and the offset regression module inside the TPN share the same transformed feature maps V of size 256 × W × H.

Stationary and Tilted Anchor Tubes

Faster R-CNN's RPN uses M predefined anchors with different base sizes and aspect ratios. Analogously, we start from a fixed set of anchor tubes. As discussed in the introduction, the tube space is very high-dimensional. It is neither possible nor necessary to consider every possible tube. Consider a short video segment with T frames (e.g. 8 frames in a 30 FPS video): an object's trajectory is usually very smooth and nearly linear in such a short time period. These quasi-linear tubelets may differ in size, direction or speed; however, they all live on a low-dimensional manifold with limited degrees of freedom. Because of the quasi-linearity of objects' short trajectories, we use straight anchor tubes as the initial candidates. A naive way to construct a 3D tube is to have the same bounding box positions in all frames, which we call stationary tubes $T_s$, because they correspond to non-moving objects. These simple tubes, however, are not always desirable, especially when dealing with videos with varying viewing points and camera positions.
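As a rough illustration of stationary anchor tubes, the sketch below repeats one box per anchor shape at every feature map location across all frames. It is a hedged example, not the paper's code: the helper names, the feature stride of 16, and the convention that an anchor of a given size has area size² with aspect ratio equal to height / width are assumptions made only for this example.

```python
import math


def anchor_shape(size, aspect_ratio):
    """Width and height of a box with area size**2 and height / width = aspect_ratio.

    This size and ratio convention is an assumption, not taken from the paper.
    """
    width = size / math.sqrt(aspect_ratio)
    height = size * math.sqrt(aspect_ratio)
    return width, height


def stationary_tubes(feat_w, feat_h, stride, anchors, num_frames):
    """Return tubes[m][h][w]: a list of num_frames identical (x1, y1, x2, y2) boxes."""
    tubes = []
    for size, ratio in anchors:
        box_w, box_h = anchor_shape(size, ratio)
        per_anchor = []
        for h in range(feat_h):
            row = []
            for w in range(feat_w):
                # Center of this feature map cell in image pixels (stride is assumed).
                cx, cy = (w + 0.5) * stride, (h + 0.5) * stride
                box = (cx - box_w / 2, cy - box_h / 2, cx + box_w / 2, cy + box_h / 2)
                row.append([box] * num_frames)  # same box in every frame
            per_anchor.append(row)
        tubes.append(per_anchor)
    return tubes


# Usage: one anchor shape (size 40, aspect ratio 0.81) on a 3 x 2 feature map, 8 frames.
tubes = stationary_tubes(feat_w=3, feat_h=2, stride=16, anchors=[(40, 0.81)], num_frames=8)
example_tube = tubes[0][1][2]
```

With M anchor shapes this yields M × W × H candidate tubes, each described by 4 coordinates per frame.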
Consider the typical scenes from the UA-DETRAC [26] dataset (see sample pictures in Figure 4): traffic can flow in an arbitrary direction, and a stationary tube would have very low overlap with the true tube, resulting in a very large offset, which is hard to regress. In order to have better initial tube candidates, we utilize motion vectors (MV) derived from optical flow fields and construct tilted tubes by modifying the initial stationary tubes using the average motion directions. First, the dominant motion direction (positive or negative) is decided based on the votes of all the motion vectors at this pixel location in this video GoP. If the dominant direction is positive, then the half of the motion vectors with the largest values is averaged as the mean MV, while if the dominant direction is negative, the half with the smallest values is averaged instead. For frame index t ∈ [1, N], box positions of the tilted tube T can be derived from the straight tubes $T_s$ as:

$$T[t, :, :, k] = T_s[t, :, :, k] + mv \cdot (t - 1)$$

where mv is either $mv_x$ or $mv_y$, the mean motion vectors in the x and y directions at each pixel location. [:, :] stands for parallel computation over all pixel locations in the feature map at the same time. (T[t, w, h, 0], T[t, w, h, 1]) is the upper left corner position whereas (T[t, w, h, 2], T[t, w, h, 3]) is the lower right corner position of the bounding box at (w, h) in frame t. During training or testing, both stationary and tilted tubes are used as the initial tube candidates.

Unlike faster R-CNN, we do not pre-define the size and aspect ratio for the anchors. Rather, we identify the typical sizes and aspect ratios of the ground truth bounding boxes in the training data using K-means clustering (inspired by YOLO [17]), and use the centroids of all clusters as the sizes and aspect ratios for the base stationary anchors.
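The tilted tube construction can be sketched in a few lines for a single pixel location. This is a minimal illustration under stated assumptions: `mean_motion` and `tilt_tube` are hypothetical names, ties in the direction vote are resolved as positive, and the motion vector components are assumed to be given per frame in pixels.

```python
def mean_motion(components):
    """Mean motion along one axis, following the voting rule described above.

    The dominant direction is chosen by counting positive and negative values;
    then the larger half (positive case) or the smaller half (negative case)
    of the values is averaged. Tie handling is an assumption.
    """
    positive = sum(1 for v in components if v > 0)
    negative = sum(1 for v in components if v < 0)
    ordered = sorted(components, reverse=(positive >= negative))
    half = ordered[: max(1, len(ordered) // 2)]
    return sum(half) / len(half)


def tilt_tube(stationary_tube, mv_x, mv_y):
    """Shift the box of frame t by mv * (t - 1), with t starting at 1."""
    tilted = []
    for t, (x1, y1, x2, y2) in enumerate(stationary_tube, start=1):
        shift = t - 1
        tilted.append((x1 + mv_x * shift, y1 + mv_y * shift,
                       x2 + mv_x * shift, y2 + mv_y * shift))
    return tilted


# Usage: a 3-frame stationary tube tilted by a motion of (+2, -1) pixels per frame.
mv_x = mean_motion([1, 2, 3, 4, -1, 0, 2, 3])   # positive direction dominates
mv_y = mean_motion([-4, -2, -1, 1])             # negative direction dominates
tube = tilt_tube([(0, 0, 10, 10)] * 3, 2, -1)
```

The first frame is left unchanged, so a tilted tube and its stationary counterpart always share the same starting box.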
We find that the network has difficulty regressing to large offset values if starting from poorly chosen initial anchors, and therefore an appropriate initialization of anchor sizes, aspect ratios and moving directions is rather important.

Tube Classification Module

Based on the shared feature maps V, M objectiveness scores are predicted in the classification module for the M anchor tubes at each feature map location by a convolutional layer with kernel size 3 × 3 (acting on K = 256 feature maps). Essentially, the objectiveness score for each anchor is determined from a 3 × 3 × K feature tensor. This module can be viewed as a fully convolutional network. During training, the tube overlap, i.e. the 3D intersection over union (3D-IoU) between the anchor tubes and ground truth tubes, is computed. Anchor tubes with high 3D-IoU are selected as positive proposals and assigned label +1, whereas anchor tubes with low 3D-IoU scores (partially overlapped) are assigned label −1, and the remaining anchor tubes (including pure background) are ignored. The classification module is trained with the cross-entropy classification loss $L_{clsTPN}$ with respect to the ground truth labels. The tube classification module is shown in Figure 2. The heat map is the objectiveness score output, where bigger (warmer) values indicate higher probabilities of containing objects in that location. Here we only show one heat map, corresponding to one set of anchor tube size. In our implementation, we have M = 9 sets of base tubes with different sizes or aspect ratios, as in Figure 3.

Tube Offset Regression Module

Anchor tubes are ranked based on their objectiveness scores. For tubes with higher scores, the offsets between the corner positions of the tubes and the ground truth positions are computed as the regression targets. The regression module is trained to generate these regression targets from the input 3 × 3 × K features. Given M candidate anchor tubes at each feature map location, the offsets are predicted so as to bend the straight anchor tubes into a shape closer to the ground truth tubes. Following R-CNN, we use the center position, width and height to parameterize the position of a rectangular bounding box in each frame. The offsets of these parameters between the bounding boxes of all frames in an anchor tube (ST) and those in the ground truth tube (GT) are our 3D tube regression target for this anchor tube. We adopt the parameterization of the 4 coordinates in [8], but similar to [17], we normalize the spatial coordinates by the actual width and height of the video frame, so that the normalized coordinates and hence the 4 parameters are all in the range [0, 1], which helps the convergence speed. The 3D tube regression target for positive anchor tube i at frame t is defined as:

$$
tar_{i,t} = [X_g, Y_g, W_g, H_g]_{i,t}, \qquad
\begin{aligned}
X_g &= \frac{GT_{center\_x} - ST_{center\_x}}{ST_w}, &
Y_g &= \frac{GT_{center\_y} - ST_{center\_y}}{ST_h}, \\
W_g &= \log\frac{GT_w}{ST_w}, &
H_g &= \log\frac{GT_h}{ST_h}
\end{aligned}
$$

By learning to regress to these targets, the system can derive the refined locations for all anchor tubes that have high overlap with ground truth bounding tubes. We have explored two ways to wire the tube offset regression module: (1) directly predicting the offsets of all frames and (2) utilizing linear interpolation.

Option 1: Directly predict tube parameters. In this structure, we directly estimate the offsets of every frame. Given a video GoP of length T, the regression network directly predicts 4 × T parameters for every tube.
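The regression targets above can be sketched as follows. This is an illustrative example rather than the authors' code: boxes are assumed to be given as (center_x, center_y, width, height) tuples in already-normalized coordinates, and the names `box_targets`, `tube_targets` and `apply_offsets` are hypothetical.

```python
import math


def box_targets(anchor, gt):
    """Offsets (X_g, Y_g, W_g, H_g) from an anchor box to a ground truth box.

    Both boxes are (center_x, center_y, width, height) tuples.
    """
    acx, acy, aw, ah = anchor
    gcx, gcy, gw, gh = gt
    return ((gcx - acx) / aw,
            (gcy - acy) / ah,
            math.log(gw / aw),
            math.log(gh / ah))


def tube_targets(anchor_tube, gt_tube):
    """Per-frame targets for one positive anchor tube: 4 values for each of T frames."""
    return [box_targets(a, g) for a, g in zip(anchor_tube, gt_tube)]


def apply_offsets(anchor, offsets):
    """Inverse mapping: recover a refined box from an anchor box and predicted offsets."""
    acx, acy, aw, ah = anchor
    dx, dy, dw, dh = offsets
    return (acx + dx * aw, acy + dy * ah, aw * math.exp(dw), ah * math.exp(dh))


# Usage: a 2-frame anchor tube and the ground truth tube it should regress to.
anchor_tube = [(10.0, 10.0, 4.0, 2.0), (10.0, 10.0, 4.0, 2.0)]
gt_tube = [(12.0, 9.0, 8.0, 2.0), (14.0, 8.0, 8.0, 2.0)]
targets = tube_targets(anchor_tube, gt_tube)
recovered = [apply_offsets(a, t) for a, t in zip(anchor_tube, targets)]
```

Applying the targets back to the anchor tube recovers the ground truth tube, which is the property the regression module relies on at test time.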
As our straight tube candidates are spread over all pixel locations, the regression network is implemented using a convolution layer with 4 × T × M output maps.

Option 2: Linear interpolation of bounding box offsets from offsets at two frames. Despite the fact that an object inside a video can have arbitrarily complex motions, most objects' motions are very smooth in real-world videos. Given a short enough time period, we can approximate the trajectory of each corner of the bounding tube with a straight line. This is particularly true for traffic videos containing moving vehicles. Motivated by this observation, instead of determining the offsets of the corner positions in all frames, the regression network only estimates the offsets in the beginning and ending frames, and linearly interpolates the offsets in the other frames. During training, the regression loss considers the difference between the true offsets (targets) and the estimated offsets for all frames, which are interpolated from the offsets in the beginning and ending frames. The advantage of this approach is that only 8 parameters are estimated for a given tube, as opposed to 4 × T parameters. Compared to directly estimating the offset at every frame, this approach also implicitly applies a smoothness constraint along the corner trajectories and prevents the network from generating erratic trajectories.

We implement the interpolation using a convolution layer with a spatial 1 × 1 kernel. For example, if we have a video segment with length T = 8 frames, $X_1, Y_1, W_1, H_1, X_T, Y_T, W_T, H_T$ are the predicted center offsets and width and height offsets at the first frame (t = 1) and the last frame (t = 8). The offset at time frame t can be easily implemented using a convolutional layer with kernel matrix:

$$
K = \begin{bmatrix}
1 & 6/7 & 5/7 & 4/7 & 3/7 & 2/7 & 1/7 & 0 \\
0 & 1/7 & 2/7 & 3/7 & 4/7 & 5/7 & 6/7 & 1
\end{bmatrix}
$$

If we view the first frame prediction result (with 4 channels for 4 parameters) and the last frame prediction result as 2 separate input feature maps, then convolving these 2 feature maps with this kernel produces 8 feature maps, corresponding to the predicted offsets for all 8 frames.
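The interpolation kernel can be checked with a short sketch. It is a plain-Python illustration under assumptions, not the convolutional implementation described above: `interpolation_kernel` and `interpolate_offsets` are hypothetical names, and offsets are handled for a single anchor tube rather than for whole feature maps.

```python
def interpolation_kernel(num_frames):
    """Two rows of weights: one for the first-frame offsets, one for the last-frame offsets.

    Assumes num_frames >= 2. For num_frames = 8 this reproduces the matrix K above.
    """
    last = num_frames - 1
    first_row = [(last - t) / last for t in range(num_frames)]
    last_row = [t / last for t in range(num_frames)]
    return [first_row, last_row]


def interpolate_offsets(first, last, num_frames):
    """Linearly interpolate 4-parameter offsets between the first and last frame."""
    w_first, w_last = interpolation_kernel(num_frames)
    return [tuple(a * f + b * l for f, l in zip(first, last))
            for a, b in zip(w_first, w_last)]


# Usage: 8 predicted parameters (two frames x four offsets) expand to 8 x 4 offsets.
kernel = interpolation_kernel(8)
offsets = interpolate_offsets((0.0, 0.0, 0.0, 0.0), (7.0, 14.0, 0.7, -7.0), 8)
```

Each column of the kernel sums to 1, so every interpolated offset is a convex combination of the two predicted end points, which is where the implicit smoothness constraint comes from.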
Note that we could implement higher order interpolation by using more than 2 input feature maps and setting the kernel matrix accordingly. We could also train the kernel matrix as part of the regression network to learn the appropriate interpolation kernel.

Post-TPN: Classification and Refinement

As shown in Figure 1, the tube proposal network generates many tube proposals, whose positions are determined by the original candidate tubes and the predicted offsets. Proposal tubes with high objectiveness scores go through a second stage of classification and regression. In this stage, tube proposals are further classified into different classes (such as car, bus, van, etc. for the UA-DETRAC dataset). The position offsets for the tube are also refined. Instead of using the features pooled from the 3 × 3 neighborhood on the feature map as in the TPN, features specific to the proposal tube regions are pooled using tube pooling.

Tube Pooling. ROI pooling was introduced in [9], which enables different proposal regions to be described by feature vectors of the same dimension. In our case, a proposal tube consists of bounding boxes in different frames that differ in size and location. Pooling based on one particular bounding box inside the tube would be deficient. Instead of pooling from the same feature map multiple times as in [11], the union of all bounding boxes in a proposal tube is found and features covering the union region are extracted from the transformed feature maps V. After the tube pooling, this feature vector is fed into a post-TPN subnetwork, which further assesses its class and refines the tube position information. There are two fully connected (fc) layers and another two fc layers for predicting classification scores and offsets separately. In our implementation, 256 proposal tubes (half positive, half negative) are considered and features are pooled from feature map V using the tube union ROI, leading to a total of [count missing] features. For the offset regression, similarly to the TPN regression module, either linear interpolation or directly predicting offsets at all frames can be chosen.

Figure 4. Examples of the predicted bounding tubes. The first and third rows show the bounding box in the middle frames, and the second and fourth rows show the whole bounding tube on the same middle frames with the centroids connected as the tracklets. TrackNet is trained to generate bounding tubes to cover both small and large vehicle sizes with different aspect ratios. It also generates sparser bounding tubes for fast-moving vehicles and denser bounding tubes for slower vehicles or vehicles that are farther away.

Multi-task Loss to Train the TrackNet

Both classification loss and regression loss are used to penalize the proposed tubes. For the TPN, the predicted objectiveness score for each anchor tube incurs the cross-entropy classification loss $L_{clsTPN}$ with respect to its ground truth label. Positive anchor tubes incur the regression loss $L_{regTPN}$ with respect to the offset targets. During the post-TPN stage, true class labels (e.g. background, car, bus, van) are used for the cross-entropy loss $L_{cls}$. Both regression losses $L_{regTPN}$ and $L_{reg}$ use the smooth l1 loss defined in [7]. The above losses are combined to form the total loss for a proposal tube:

$$
L(s_i, p_i) = \lambda_1 L_{cls}(l_i, s_i) + \lambda_2 \sum_{t=1}^{T} L_{reg}(tar_{i,t}, p_{i,t}) + \lambda_3 L_{clsTPN}(l_i, s_i) + \lambda_4 \sum_{t=1}^{T} L_{regTPN}(tar_{i,t}, p_{i,t}) + \lambda_5 L_{smooth} \tag{1}
$$

where $l_i$ is the ground truth label for anchor tube i, and $s_i$ is the predicted objectiveness score or the specific class score for anchor tube i. $tar_{i,t}$ is the ground truth target, a four-parameter vector representing the offsets between the ground truth location and the location of positive anchor tube i at time t, i.e. $tar_{i,t} = [X_g, Y_g, W_g, H_g]_{i,t}$. And $p_{i,t}$ is the predicted offset vector, i.e. $p_{i,t} = [X, Y, W, H]_{i,t}$. When we use the option of directly regressing box locations in each frame, we add a smoothness loss term $L_{smooth}$ to further enhance the smoothness (quasi-linearity) of the tube, which can be derived from the total variation of the tube positions or the average position change between two frames. The λs control the weights of the different losses. From experiments, we found that the results are not very sensitive to these hyperparameters. We set $\lambda_{1,2,3,4} = 1$ and $\lambda_5$ = [value missing] in all of the following experiments. The multi-task loss for training is defined as:

$$
L_{multi\text{-}task} = \sum_{i=1}^{N_{tubes}} L(s_i, p_i) \tag{2}
$$

[Table 1 body: numeric entries not recoverable. Option columns: $T_\theta(G)$, VGG, LP. Metric columns: AP by IoU threshold; AP by area (small, medium, large); AR by number of max detections. Rows: TrackNet (no VGG, no transformer, predict all); TrackNet (no VGG, no transformer, interpolate); TrackNet (no transformer); TrackNet Left view only; TrackNet Right view only; TrackNet Frontal view only; TrackNet (300 proposals during test); TrackNet (2000 proposals during test); Fine-tuned Faster RCNN]

Table 1. Detection results of TrackNet and the Faster RCNN [19] baseline on the UA-DETRAC dataset [26] (evaluated using the COCO API [15]). The average precision (AP) (%) and average recall (AR) (%) rates are reported under different settings (i.e. IoU thresholds; bounding box area; thresholds on max detections per image). TrackNet is our final model. All TrackNet variants reported here used the top 300 proposals during the test except the one indicated with 2000 proposals.

4. Experiments and Evaluation

Dataset. Most object detection datasets consist of 2D images, such as ImageNet [4], PASCAL VOC [5], Microsoft COCO [15], etc. In the ILSVRC2015 challenge, ImageNet introduced the VID task with 30 categories to attract attention to object detection in videos. However, most of the videos contain only very few dominant objects, whereas in the real world, multiple object detection and multiple object tracking need to be addressed simultaneously. In order to evaluate our model on both detection and tracking of multiple objects, the UA-DETRAC dataset [26] is used.
This dataset was introduced as a benchmark for both object detection and tracking, and consists of challenging video sequences captured from real-world traffic scenes with different viewing angles. We split the dataset into 45 training and 15 testing videos and made sure that both training and testing cover all the different camera views. The video lengths range from around 700 frames to around 2500 frames. The dataset spans a variety of weather and lighting conditions such as sunny, cloudy, rainy and night. We did not specifically sample the dataset to ensure that the training and testing sets each include samples taken under different weather conditions. However, the trained model turns out to be pretty robust to different weather conditions except for a few night videos, which have very different lighting conditions from the others. What is more, the cameras used to capture the night videos also had out-of-focus problems in low-light conditions, which resulted in blurry images that may have caused the performance drop.

Training. During training, 8 consecutive video frames are randomly selected from the training set without any data augmentation. We first fine-tune the VGG branch alone under the framework of faster R-CNN [19] using the whole training set as a warmup. The proposed TrackNet is then trained with the VGG branch frozen. We also fine-tune the last convolution layers (conv5a, conv5b) in the C3D backbone. The initial learning rate was [value missing] and was reduced by a factor of 10 after 10K iterations. We used the Adam optimizer and trained the model for a total of 50K iterations.

Fine-tune Faster RCNN as the baseline. We fine-tune the whole Faster RCNN model with VGG as the base network on the same training dataset for 70K iterations. After fine-tuning, the faster RCNN achieved a very high recall rate on the UA-DETRAC dataset for vehicle detection and is used as a very strong baseline.

We evaluate the TrackNet using two sets of metrics: (1) object detection performance in each frame using the COCO API [15], (2) object tracking performance using the MOT metrics [1]. Some visual detection and tracking results are shown in Figure 4.

Object Detection Performance.
In order to compare TrackNet with the 2D detection baseline, we consider all bounding tubes generated by TrackNet and evaluate all bounding boxes in each frame. Table 1 shows the average precision (AP) and average recall (AR) rates for different evaluation conditions. The criterion for labeling a detection as a true match becomes stricter as the IoU threshold increases. From the table we can see that: (1) TrackNet outperforms the strong baseline by approximately 10.8 percentage points in terms of mAP (31.53% versus 20.77%). It is particularly better at detecting larger objects (41.29% versus 25.54%).

(2) TrackNet has more confidence when the maximum number of detections is limited to only 1. It outperforms the baseline by around 9.5 percentage points (26.11% versus 16.64%). (3) TrackNet has a lower recall rate compared with the baseline, especially when the overlap criterion is stricter (49.61% versus 79.69%). The proposed model utilizes the concatenation of both spatial-temporal features and appearance features. It has higher precision (fewer false positives) for object detection, because a bounding box is generated only as part of a detected bounding tube. This, however, also has the consequence of reducing the recall rate at the frame level.

[Table 2 body: numeric entries not recoverable. Columns: Rcll, Prcn, FAR, MOTA, MOTP. Rows: TrackNet Left view only; TrackNet Right view only; TrackNet Frontal view only; TrackNet; Fine-tuned Faster RCNN+SORT [2]]

Table 2. Tracking results evaluated using MOT metrics [1]: recall, precision, FAR (false positive rate), MOTA and MOTP rates (%) with matching thresholds (Euclidean distance) of 0.8 and 1.0.

Object Tracking Performance. In order to compare the tracking performance, the bounding boxes produced by the baseline are linked using a real-time associator (tracker), SORT [2]. Since TrackNet already produces tracklets, no association needs to be done within one GoP of N frames. Tubes are simply connected based on the 3D-IoU across GoPs to form longer tracks. After association, the tracks generated from faster RCNN+SORT and from TrackNet are compared using the multiple object tracking metrics [1]. Performances under matching thresholds of 0.8 and 1.0 are reported in Table 2. The proposed model outperformed the faster RCNN+SORT baseline by 15% in terms of tracking precision (87.3% versus 72.0%) and by 1.1% in MOTA score, which agrees with the observation that TrackNet has a much lower false positive rate by utilizing both motion and appearance information. Note that TrackNet has a lower MOTP score, which implies bigger distances between the ground truth tracks and the matched tracks.
The lower MOTP is expected given that the feature input is at the GoP level, not at the frame level.

Ablation Analysis. In order to understand the roles played by the major design components, different versions of TrackNet are trained and tested on the same datasets. In Table 1, we compare TrackNet without the transformer, TrackNet without VGG feature concatenation, TrackNet without either the transformer or VGG (in both the predicting-all mode and the linear interpolation (LP) mode), and the full version of TrackNet. We further split the training and testing datasets by viewing angle into left view, right view and frontal view, and show the performance when training and testing on each sub-dataset separately. From the table it is clear that performance is boosted after VGG concatenation and after inserting the spatial transformer. The linear interpolation (LP) conveniently serves as an implicit smoothness regularization and improves performance with even fewer parameters.

5. Conclusion

We present TrackNet, which can jointly detect and track multiple objects in videos by generating bounding tubes. Using the spatial-temporal features detected by a 3D convolutional network in addition to the spatial features from the VGG network, TrackNet generates tube proposals, then further classifies them and refines their locations. TrackNet consists of three stages: (1) feature extraction and spatial transformation, (2) the Tube Proposal Network (TPN), and (3) post-TPN classification and refinement. We explored several ways to do tube proposal and offset regression. TrackNet was trained and tested on the challenging traffic video dataset UA-DETRAC and achieved very promising results. In future work we would like to improve TrackNet in terms of more precise localization. Pooling features from multiple scales in the spatial and temporal domains will be tested, and the linear interpolation structure will be relaxed to allow more complex motion patterns. Our current experimental results show that C3D features are not sufficient to accurately locate objects.
We suspect that it is possible to design a modified 3D convolutional network that can capture more detailed spatial information than the C3D architecture, so that TrackNet can provide good performance without requiring a separate 2D CNN for feature extraction. This will be another direction for our future research.

References

[1] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008(1):246309.
[2] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In Image Processing (ICIP), 2016 IEEE International Conference on. IEEE.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, CVPR, IEEE Computer Society Conference on, volume 1. IEEE.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, CVPR, IEEE Conference on. IEEE.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2).
[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9).
[7] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision. Springer.
[10] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems.
[11] K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan, X. Liu, and X. Wang. Object detection in videos with tubelet proposal networks. arXiv preprint.
[12] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos. arXiv preprint.
[13] K. Kang, W. Ouyang, H. Li, and X. Wang. Object detection from video tubelets with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer.
[16] W. Luo, X. Zhao, and T.-K. Kim. Multiple object tracking: A review. arXiv preprint.
[17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[18] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint.
[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99.
[20] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint.
[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9.
[23] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision.
[24] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2).
[25] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision.
[26] L. Wen, D. Du, Z. Cai, Z. Lei, M. Chang, H. Qi, J. Lim, M.-H. Yang, and S. Lyu. DETRAC: A new benchmark and protocol for multi-object tracking. arXiv preprint.
