arXiv:1704.07804v1 [cs.CV] 25 Apr 2017

Author contact: svnaras@google.com, ricco@google.com, sukthankar@google.com

SfM-Net: Learning of Structure and Motion from Video

Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki
Google Research; Inria, Grenoble, France; Carnegie Mellon University

Abstract

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates. The model can be trained with various degrees of supervision: 1) self-supervised by the reprojection photometric error (completely unsupervised), 2) supervised by ego-motion (camera motion), or 3) supervised by depth (e.g., as provided by RGBD sensors). SfM-Net extracts meaningful depth estimates and successfully estimates frame-to-frame camera rotations and translations. It often successfully segments the moving objects in the scene, even though such supervision is never provided.

Figure 1. SfM-Net: Given a pair of frames as input, our model decomposes frame-to-frame pixel motion into 3D scene depth, 3D camera rotation and translation, a set of motion masks and corresponding 3D rigid rotations and translations. It backprojects the resulting 3D scene flow into 2D optical flow and warps accordingly to match pixels from one frame to the next. Forward-backward consistency checks constrain the estimated depth. (Diagram labels: input frames, weight-sharing CNN, depth, flow, masks, photometric error, forward-backward constraints; outputs: estimated depth, camera motion, object motion and segmentation.)

1. Introduction

We propose SfM-Net, a neural network that is trained to extract 3D structure, ego-motion, segmentation, object rotations and translations in an end-to-end fashion in videos, by exploiting the geometry of image formation. Given a pair of frames and camera intrinsics, SfM-Net, depicted in Figure 1, computes depth, 3D camera motion, a set of 3D rotations and translations for the dynamic objects in the scene, and corresponding pixel assignment masks. Those in turn provide a geometrically meaningful motion field (optical flow) that is used to differentiably warp each frame to the next. Pixel matching across consecutive frames, constrained by forward-backward consistency on the computed motion and 3D structure, provides gradients during training in the case of self-supervision. SfM-Net can take advantage of varying levels of supervision, as demonstrated in our experiments: completely unsupervised (self-supervised), supervised by camera motion, or supervised by depth (from Kinect).

SfM-Net is inspired by works that impose geometric constraints on optical flow, exploiting rigidity of the visual scene, such as early low-parametric optical flow methods [2, 19, 23] or the so-called direct methods for visual SLAM (Simultaneous Localization and Mapping) that perform dense pixel matching from frame to frame while estimating a camera trajectory and depth of the pixels in the scene [9, 26]. In contrast to those, instead of optimizing directly over optical flow vectors, 3D point coordinates or camera rotation and translation, our model optimizes over neural network weights that, given a pair of frames, produce such 3D structure and motion. In this way, our method learns to estimate structure and motion, and can in principle improve as it processes more videos, in contrast to non-learning based alternatives. It can thus be made robust to lack of texture, degenerate camera motion trajectories or
dynamic objects (our model explicitly accounts for those), by providing appropriate supervision. Our work is also inspired by and builds upon recent works on learning geometrically interpretable optical flow fields for point cloud prediction in time [5] and backpropagating through camera projection for 3D human pose estimation [33] or single-view depth estimation [11, 35].

In summary, our contributions are:
• A method for self-supervised learning in videos in-the-wild, through explicit modeling of the geometry of scene motion and image formation.
• A deep network that predicts pixel-wise depth from a single frame along with camera motion, object motion, and object masks directly from a pair of frames.
• Forward-backward constraints for learning a consistent 3D structure from frame to frame and better exploiting self-supervision, extending the left-right consistency constraints of [13].

We show results of our approach on KITTI [12, 21], MoSeg [4], and RGB-D SLAM [27] benchmarks under different levels of supervision. SfM-Net learns to predict structure, object, and camera motion by training on realistic video sequences using limited ground-truth annotations.

2. Related work

Back-propagating through warps and camera projection. Differentiable warping [16] has been used to learn end-to-end unsupervised optical flow [34], disparity flow in a stereo rig [13] and video prediction [24]. The closest previous works to ours are SE3-Nets [5], 3D image interpreter [33], and Garg et al.'s depth CNN [11]. SE3-Nets [5] use an actuation force from a robot and an input point cloud to forecast a set of 3D rigid object motions (rotation and translations) and corresponding pixel motion assignment masks under a static camera assumption. Our work uses a similar representation of pixel motion masks and 3D motions to capture the dynamic objects in the scene.
However, our work differs in that 1) we predict depth and camera motion while SE3-Nets operate on given point clouds and assume no camera motion, 2) SE3-Nets are supervised with pre-recorded 3D optical flow, while this work admits diverse and much weaker supervision, as well as complete lack of supervision, 3) SE3-Nets consider one frame and an action as input to predict the future motion, while our model uses pairs of frames as input to estimate the intra-frame motion, and 4) SE3-Nets are applied to toy or lab-like setups whereas we show results on real videos.

Wu et al. [33] learn 3D sparse landmark positions of chairs and human body joints from a single image by computing a simplified camera model and minimizing a camera re-projection error of the landmark positions. They use synthetic data to pre-train the 2D to 3D mapping of their network. Our work considers dense structure estimation and uses videos to obtain the necessary self-supervision, instead of static images. Garg et al. [11] also predict depth from a single image, supervised by photometric error. However, they do not infer camera motion or object motion, instead requiring stereo pairs with known baseline during training. Concurrent work to ours [35] removes the constraint that the ground-truth pose of the camera be known at training time, and instead estimates the camera motion between frames using another neural network. Our approach tackles the more challenging problem of simultaneously estimating both camera and object motion.

Geometry-aware motion estimation. Motion estimation methods that exploit rigidity of the video scene and the geometry of image formation to impose constraints on optical flow fields have a long history in computer vision [2, 3, 19]. Instead of non-parametric dense flow fields [14] researchers have proposed affine or projective transformations that better exploit the low dimensionality of rigid object motion [23]. When depth information is available, motions are rigid rotations and translations [15].
Similarly, direct methods for visual SLAM having RGB [26] or RGBD [17] video as input perform dense pixel matching from frame to frame while estimating a camera trajectory and depth of the pixels in the scene, with impressive 3D point cloud reconstructions. These works typically make a static world assumption, which makes them susceptible to the presence of moving objects in the scene. Instead, SfM-Net explicitly accounts for moving objects using motion masks and 3D translation and rotation prediction.

Learning-based motion estimation. Recent works [7, 20, 29] propose learning frame-to-frame motion fields with deep neural networks supervised with ground-truth motion obtained from simulation or synthetic movies. This enables efficient motion estimation that learns to deal with lack of texture using training examples rather than relying only on smoothness constraints of the motion field, as previous optimization methods [28] do. Instead of directly optimizing over unknown motion parameters, such approaches optimize neural network weights that allow motion prediction in the presence of ambiguities in the given pair of frames.

Unsupervised learning in videos. Video holds great potential for learning semantically meaningful visual representations under weak supervision. Recent works have explored this direction by using videos to propagate semantic labels in time using motion constraints [25], impose temporal coherence (slowness) on the learned visual features [32], predict temporal evolution [30], learn temporal instance
level associations [31], predict temporal ordering of video frames [22], etc. Most of those unsupervised methods are shown to be good pre-training mechanisms for object detection or classification, as done in [22, 30, 31]. In contrast and complementary to the works above, our model extracts fine-grained 3D structure and 3D motion from monocular videos with weak supervision, instead of semantic feature representations.

Figure 2. SfM-Net architecture. For each pair of consecutive frames $I_t, I_{t+1}$, a conv/deconv sub-network predicts depth $d_t$ while another predicts a set of $K$ segmentation masks $m_t$. The coarsest feature maps of the motion-mask encoder are further decoded through fully connected layers towards 3D rotations and translations for the camera and the $K$ segmentations. The predicted depth is converted into a per frame point-cloud using estimated or known camera intrinsics. Then, it is transformed according to the predicted 3D scene flow, as composed by the 3D camera motion and independent 3D mask motions. Transformed 3D depth is projected back to the 2D next frame, and thus provides corresponding 2D optical flow fields. Differentiable backward warping maps frame $I_{t+1}$ to $I_t$, and gradients are computed based on pixel errors. Forward-backward constraints are imposed by repeating this process for the inverted frame pair $I_{t+1}, I_t$ and constraining the depths $d_t$ and $d_{t+1}$ to be consistent through the estimated scene motion. (Diagram labels: structure network, single frame, conv/deconv, depth, point cloud; motion network, pair of frames (384x128x6), object masks, camera motion, object motion; transformed point cloud, flow.)

3. Learning SfM

3.1. SfM-Net architecture

Our model is shown in Figure 2. Given frames $I_t, I_{t+1} \in \mathbb{R}^{w \times h}$, we predict frame depth $d_t \in [0, \infty)^{w \times h}$, camera rotation and translation $\{R_t^c, t_t^c\} \in SE3$, and a set of $K$ motion masks $m_t^k \in [0, 1]^{w \times h}$, $k \in 1, \ldots, K$ that denote membership of each pixel to $K$ corresponding rigid object motions $\{R_t^k, t_t^k\} \in SE3$, $k \in \{1, \ldots, K\}$.
Note that a pixel may be assigned to none of the motion masks, denoting that it is a background pixel and part of the static world. Using the above estimates, optical flow is computed by first generating the 3D point cloud corresponding to the image pixels using the depth map and camera intrinsics, transforming the point cloud based on camera and object rigid transformations, and back projecting the transformed 3D coordinates to the image plane. Then, given the optical flow field between initial and projected pixel coordinates, differentiable backward warping is used to map frame $I_{t+1}$ to $I_t$. Forward-backward constraints are imposed by repeating this process from frame $I_{t+1}$ to $I_t$ and constraining the depths $d_t$ and $d_{t+1}$ to be consistent through the estimated scene motion. We provide details of each of these components below.

Depth and per-frame point clouds. We compute per frame depth using a standard conv/deconv subnetwork operating on a single frame (the structure network in Figure 2). We use a RELU activation at our final layer, since depth values are non-negative. Given depth $d_t$, we obtain the 3D point cloud $\mathbf{X}_t^i = (X_t^i, Y_t^i, Z_t^i)$, $i \in 1, \ldots, w \times h$ corresponding to the pixels in the scene using a pinhole camera model. Let $(x_t^i, y_t^i)$ be the column and row positions of the $i$-th pixel in frame $I_t$ and let $(c_x, c_y, f)$ be the camera intrinsics, then
$$\mathbf{X}_t^i = \begin{bmatrix} X_t^i \\ Y_t^i \\ Z_t^i \end{bmatrix} = \frac{d_t^i}{f} \begin{bmatrix} \frac{x_t^i}{w} - c_x \\ \frac{y_t^i}{h} - c_y \\ f \end{bmatrix} \qquad (1)$$

where $d_t^i$ denotes the depth value of the $i$-th pixel. We use the camera intrinsics when available and revert to default values of (0.5, 0.5, 1.0) otherwise. Therefore, the predicted depth will only be correct up to a scalar multiplier.

Scene motion. We compute the motion of the camera and of independently moving objects in the scene using a conv/deconv subnetwork that operates on a pair of images (the motion network in Figure 2). We depth-concatenate the pair of frames and use a series of convolutional layers to produce an embedding layer. We use two fully-connected layers to predict the motion of the camera between the frames and a predefined number $K$ of rigid body motions that explain moving objects in the scene.

Let $\{R_t^c, t_t^c\} \in SE3$ denote the 3D rotation and translation of the camera from frame $I_t$ to frame $I_{t+1}$ (relative camera pose across consecutive frames). We represent $R_t^c$ using an Euler angle representation as $R_t^{cx}(\alpha) R_t^{cy}(\beta) R_t^{cz}(\gamma)$ where

$$R_t^{cx}(\alpha) = \begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad R_t^{cy}(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}, \quad R_t^{cz}(\gamma) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{bmatrix},$$

and $\alpha, \beta, \gamma$ are the angles of rotation about the x, y, z-axes respectively. The fully-connected layers are used to predict translation parameters $t^c$, the pivot points of the camera rotation $p^c \in \mathbb{R}^3$ as in [5], and $\sin\alpha, \sin\beta, \sin\gamma$. These last three parameters are constrained to be in the interval $[-1, 1]$ by using RELU activation and the minimum function.

Let $\{R_t^k, t_t^k\} \in SE3$, $k \in \{1, \ldots, K\}$ denote the 3D rigid motions of up to $K$ objects in the scene. We use similar representations as for camera motion and predict parameters using fully-connected layers on top of the same embedding $E$. While camera motion is a global transformation applied to all the pixels in the scene, the object motion transforms are weighted by the predicted membership probability of each pixel to each rigid motion, $m_t^k \in [0, 1]^{(h \times w)}$, $k \in \{1, \ldots, K\}$. These masks are produced by feeding the embedding layer through a deconvolutional tower.
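As an illustration of Equation 1 and of the Euler-angle parameterization, the following is a minimal sketch in plain Python. The function names are hypothetical, the clamp stands in for whatever activation the network uses to bound the sines, and recovering each cosine as the non-negative root of $1 - \sin^2$ is an assumption made here, not something the text specifies.

```python
import math

def backproject(x, y, depth, w, h, cx=0.5, cy=0.5, f=1.0):
    """Equation 1: lift pixel (x, y) with depth d to a 3D point (X, Y, Z).

    The defaults (0.5, 0.5, 1.0) are the fallback intrinsics quoted in the text.
    """
    scale = depth / f
    return (scale * (x / w - cx), scale * (y / h - cy), scale * f)

def clamp_unit(value):
    """Keep a predicted sine inside [-1, 1] (stand-in for the network's activation)."""
    return max(-1.0, min(1.0, value))

def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation_from_sines(sin_a, sin_b, sin_g):
    """Compose R = R_x(alpha) R_y(beta) R_z(gamma) from three predicted sines.

    Assumption: each cosine is taken as +sqrt(1 - sin^2).
    """
    sa, sb, sg = (clamp_unit(s) for s in (sin_a, sin_b, sin_g))
    ca, cb, cg = (math.sqrt(1.0 - s * s) for s in (sa, sb, sg))
    r_x = [[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]]
    r_y = [[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]]
    r_z = [[1.0, 0.0, 0.0], [0.0, cg, -sg], [0.0, sg, cg]]
    return matmul3(matmul3(r_x, r_y), r_z)
```

With the default intrinsics, the centre pixel of a 384x128 frame lies on the optical axis, so `backproject(192, 64, 10.0, 384, 128)` gives a point with zero X and Y and Z equal to the depth.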
We use sigmoid activations at the last layer instead of softmax in order to allow each pixel to belong to any number of rigid body motions. When a pixel has zero activation across all $K$ maps it is assigned to the static background whose motion is a function of the global camera motion alone. We allow a pixel to belong to multiple rigid body transforms in order to capture composition of motions, e.g., through kinematic chains, such as articulated bodies. Learning the required number of motions for a sequence is an interesting open problem. We found that we could fix $K = 3$ for all experiments presented here. Note that our method can learn to ignore unnecessary object motions in a sequence by assigning no pixels to the corresponding mask.

Optical flow. We obtain optical flow by first transforming the point cloud obtained in Equation 1 using the camera and object motion rigid body transformations, followed by projecting the 3D point onto the image plane using the camera intrinsics. In the following, we drop the pixel superscript $i$ from the 3D coordinates, since it is clear we are referring to the motion transformation of the $i$-th pixel of the $t$-th frame. We first apply the object transformations:

$$\mathbf{X}'_t = \mathbf{X}_t + \sum_{k=1}^{K} m_t^k(i) \left( R_t^k (\mathbf{X}_t - p_k) + t_t^k - \mathbf{X}_t \right).$$

We then apply the camera transformation:

$$\mathbf{X}''_t = R_t^c (\mathbf{X}'_t - p_t^c) + t_t^c.$$

Finally we obtain the row and column position of the pixel in the second frame $(x_{t+1}^i, y_{t+1}^i)$ by projecting the corresponding 3D point $\mathbf{X}''_t = (X''_t, Y''_t, Z''_t)$ back to the image plane as follows:

$$\begin{bmatrix} x_{t+1}^i / w \\ y_{t+1}^i / h \end{bmatrix} = \frac{f}{Z''_t} \begin{bmatrix} X''_t \\ Y''_t \end{bmatrix} + \begin{bmatrix} c_x \\ c_y \end{bmatrix}$$

The flow $U, V$ between the two frames at pixel $i$ is then

$$(U_t(i), V_t(i)) = (x_{t+1}^i - x_t^i,\; y_{t+1}^i - y_t^i).$$

3.2. Supervision

SfM-Net inverts the image formation and extracts depth, camera and object motions that gave rise to the observed temporal differences, similar to previous SfM works [1, 6]. Such inverse problems are ill-posed as many solutions of depth, camera and object motion can give rise to the same observed frame-to-frame pixel values.
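The optical flow construction of Section 3.1 (object transforms blended by the masks, then the camera transform, then projection) can be sketched for a single pixel as follows. This is an illustrative plain-Python sketch, not the authors' implementation: `pixel_flow`, `rigid`, and the tuple layout `(R, pivot, t)` are hypothetical names chosen here, and the intrinsics default to the fallback values quoted in the text.

```python
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ORIGIN = [0.0, 0.0, 0.0]

def rotate(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def rigid(R, pivot, t, point):
    """Apply R (point - pivot) + t, the form used in the text."""
    centered = [point[a] - pivot[a] for a in range(3)]
    rotated = rotate(R, centered)
    return [rotated[a] + t[a] for a in range(3)]

def pixel_flow(x, y, depth, w, h, masks, objects, camera,
               cx=0.5, cy=0.5, f=1.0):
    """Flow (U, V) of one pixel given mask weights and rigid motions.

    masks   : K mask values m_k in [0, 1] for this pixel
    objects : K tuples (R_k, p_k, t_k)
    camera  : tuple (R_c, p_c, t_c)
    """
    # Equation 1: back-project the pixel to a 3D point.
    point = [depth / f * (x / w - cx), depth / f * (y / h - cy), depth]

    # Object motions: X' = X + sum_k m_k (R_k (X - p_k) + t_k - X).
    moved = list(point)
    for m, (R, pivot, t) in zip(masks, objects):
        target = rigid(R, pivot, t, point)
        for a in range(3):
            moved[a] += m * (target[a] - point[a])

    # Camera motion: X'' = R_c (X' - p_c) + t_c.
    R_c, pivot_c, t_c = camera
    final = rigid(R_c, pivot_c, t_c, moved)

    # Project back to pixel coordinates and subtract the start position.
    x_next = w * (f * final[0] / final[2] + cx)
    y_next = h * (f * final[1] / final[2] + cy)
    return x_next - x, y_next - y
```

A pure camera translation of 0.1 along X at depth 5 in a 384-pixel-wide frame should yield a horizontal flow of 384 * 0.1 / 5 = 7.68 pixels; the same displacement results when the translation is instead assigned to an object whose mask value is 1 at that pixel.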
A learning-based solution, as opposed to direct optimization, has the advantage of learning to handle such ambiguities through partial supervision of its weights or appropriate pre-training, or simply because the same coefficients (network weights) need to explain a large abundance of video data consistently. We detail the various supervision modes below and explore a subset of them in the experimental section.

Self-Supervision. Given unconstrained video, without accompanying ground-truth structure or motion information, our model is trained to minimize the photometric error
between the first frame and the second frame warped towards the first according to the predicted motion field, based on well-known brightness constancy assumptions [14]:

$$\mathcal{L}_t^{color} = \frac{1}{w\,h} \sum_{x,y} \left\| I_t(x, y) - I_{t+1}(x', y') \right\|_1$$

where $x' = x + U_t(x, y)$ and $y' = y + V_t(x, y)$. We use differentiable image warping proposed in the spatial transformer work [16] and compute color constancy loss in a fully differentiable manner.

Supervising camera motion. If ground-truth camera pose trajectories are available, we can supervise our model by computing corresponding ground-truth camera rotation and translation $R_t^{c\text{-}GT}, t_t^{c\text{-}GT}$ from frame to frame, and constrain our camera motion predictions accordingly. Specifically, we compute the relative transformation between predicted and ground-truth camera motion $\{t_t^{err} = \mathrm{inv}(R_t^c)(t_t^{c\text{-}GT} - t_t^c),\; R_t^{err} = \mathrm{inv}(R_t^c) R_t^{c\text{-}GT}\}$ and minimize its rotation angle and translation norm [27]:

$$\mathcal{L}_t^{c\text{-}trans} = \left\| t_t^{err} \right\|_2, \qquad \mathcal{L}_t^{c\text{-}rot} = \arccos\left( \min\left( 1, \max\left( -1, \frac{\mathrm{trace}(R_t^{err}) - 1}{2} \right) \right) \right) \qquad (2)$$

Spatial smoothness priors. When our network is self-supervised, we add robust spatial smoothness penalties on the optical flow field, the depth, and the inferred motion maps, by penalizing the L1 norm of the gradients across adjacent pixels, as usually done in previous works [18]. For depth prediction, we penalize the norm of second order gradients in order to encourage not constant but rather smoothly changing depth values.

Forward-backward consistency constraints. We incorporate forward-backward consistency constraints between inferred scene depth in different frames as follows. Given inferred depth $d_t$ from frame pair $I_t, I_{t+1}$ and $d_{t+1}$ from frame pair $I_{t+1}, I_t$, we ask for those to be consistent under the inferred scene motion, that is:

$$\mathcal{L}_t^{FB} = \frac{1}{w\,h} \sum_{x,y} \left| \left( d_t(x, y) + W_t(x, y) \right) - d_{t+1}\left( x + U_t(x, y),\; y + V_t(x, y) \right) \right|$$

where $W_t(x, y)$ is the Z component of the scene flow obtained from the point cloud transformation. Composing scene flow forward and backward across consecutive frames allows us to impose such forward-backward consistency cycles across gaps of more than one frame; however, we have not yet seen empirical gain from doing so.
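To make the self-supervised terms above concrete, here is a small plain-Python sketch of backward warping with bilinear sampling, the photometric loss, and the forward-backward depth loss. It is only an illustration under stated assumptions: images are single-channel nested lists, samples that fall outside the image are clamped to the border (the text does not say how out-of-range pixels are handled), and all function names are hypothetical.

```python
def bilinear(img, x, y):
    """Sample a 2D grid (list of rows) at real-valued (x, y).

    Assumption: coordinates outside the grid are clamped to the border.
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def photometric_loss(frame_t, frame_next, flow_u, flow_v):
    """Mean absolute difference between I_t and I_{t+1} warped back by the flow."""
    h, w = len(frame_t), len(frame_t[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            warped = bilinear(frame_next, x + flow_u[y][x], y + flow_v[y][x])
            total += abs(frame_t[y][x] - warped)
    return total / (w * h)

def forward_backward_loss(depth_t, depth_next, flow_u, flow_v, flow_w):
    """Mean |(d_t + W_t) - d_{t+1}(x + U_t, y + V_t)| over all pixels."""
    h, w = len(depth_t), len(depth_t[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            target = bilinear(depth_next, x + flow_u[y][x], y + flow_v[y][x])
            total += abs(depth_t[y][x] + flow_w[y][x] - target)
    return total / (w * h)
```

For a one-row ramp image that shifts right by one pixel, a flow of one pixel explains every pixel except the last, whose sample falls outside the frame and is clamped; this is why occlusions and image borders need care in practice.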
Supervising depth. If depth is available on parts of the input image, such as with video sequences captured by a Kinect sensor, we can use depth supervision in the form of robust depth regression:

$$\mathcal{L}_t^{depth} = \frac{1}{w\,h} \sum_{x,y} dmask_t^{GT}(x, y) \cdot \left\| d_t(x, y) - d_t^{GT}(x, y) \right\|_1,$$

where $dmask_t^{GT}$ denotes a binary image that signals presence of ground-truth depth.

Supervising optical flow and object motion. Ground-truth optical flow, object masks, or object motions require expensive human annotation on real videos. However, these signals are available in recent synthetic datasets [20]. In such cases, our model could be trained to minimize, for example, an L1 regression loss between predicted $\{U(x, y), V(x, y)\}$ and ground-truth $\{U^{GT}(x, y), V^{GT}(x, y)\}$ flow vectors.

3.3. Implementation details

Our depth-predicting structure and object-mask-predicting motion conv/deconv networks share similar architectures but use independent weights. Each consists of a series of 3×3 convolutional layers alternating between stride 1 and stride 2, followed by deconvolutional operations consisting of a depth-to-space upsampling, concatenation with corresponding feature maps from the convolutional portion, and a 3×3 convolutional layer. Batch normalization is applied to all convolutional layer outputs. The structure network takes a single frame as input, while the motion network takes a pair of frames. We predict depth values using a 1×1 convolutional layer on top of the image-sized feature map. We use RELU activations because depths are positive and a bias of 1 to prevent small depth values. The maximum predicted depth value is further clipped at 100 to prevent large gradients. We predict object masks from the image-sized feature map of the motion network using a 1×1 convolutional layer with sigmoid activations. To encourage sharp masks we multiply the logits of the masks by a parameter that is a function of the number of steps for which the network has been trained. The pivot variables are predicted as heat maps using a softmax function over all the locations in the image followed by a weighted average of the pixel locations.
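Two of the output heads described above are simple enough to sketch directly. The plain-Python snippet below is an illustration rather than the authors' code: `depth_activation` assumes the bias of 1 is added after the RELU (the text does not state the order), and `soft_argmax` is a hypothetical name for the softmax-weighted average of pixel locations used for the pivot points.

```python
import math

def depth_activation(raw, bias=1.0, max_depth=100.0):
    """RELU output shifted by a bias of 1 and clipped at 100.

    Assumption: the bias is added after the RELU, so depths lie in [1, 100].
    """
    return min(max(raw, 0.0) + bias, max_depth)

def soft_argmax(heat):
    """Pivot location from a heat map (list of rows of logits).

    A softmax over all locations gives weights; the pivot is the weighted
    average of the (x, y) pixel coordinates.
    """
    peak = max(max(row) for row in heat)  # subtract the max for numerical stability
    weights = [[math.exp(v - peak) for v in row] for row in heat]
    total = sum(sum(row) for row in weights)
    px = sum(wgt * x for row in weights for x, wgt in enumerate(row)) / total
    py = sum(wgt * y for y, row in enumerate(weights) for wgt in row) / total
    return px, py
```

A uniform heat map places the pivot at the image centre, while a strongly peaked one moves it to the peak. Unlike a hard argmax, the weighted average stays differentiable with respect to the logits.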
4. Experimental results

The main contribution of SfM-Net is the ability to explicitly model both camera and object motion in a sequence, allowing us to train on unrestricted videos containing moving
objects. To demonstrate this, we trained self-supervised networks (using zero ground-truth supervision) on the KITTI datasets [12, 21] and on the MoSeg dataset [4]. KITTI contains pairs of frames captured from a moving vehicle in which other independently moving vehicles are visible. MoSeg contains sequences with challenging object motion, including articulated motions from moving people and animals.

KITTI. Our first experiment validates that explicitly modeling object motion is necessary to effectively learn from unconstrained videos. We evaluate unsupervised depth prediction using our models on the KITTI 2012 and KITTI 2015 datasets, which contain close to 200 frame sequence and stereo pairs. We use a scale-invariant error metric (log RMSE) proposed in [8], due to the global scale ambiguity in monocular setups, which is defined as

$$E_{scale\text{-}inv} = \frac{1}{N} \sum_{x,y} \bar{d}(x, y)^2 - \frac{1}{N^2} \left( \sum_{x,y} \bar{d}(x, y) \right)^2,$$

where $N$ is the number of pixels and $\bar{d} = (\log(d) - \log(d^{GT}))$ denotes the difference between the log of ground-truth and predicted depth maps. We pre-train our unsupervised depth prediction models using adjacent frame pairs on the raw KITTI dataset, which contains 42,000 frames, and train and evaluate on KITTI 2012 and 2015, which have depth ground truth.

We compare with the results of Garg et al. [11], who use stereo pairs to estimate depth. Their approach assumes the camera pose between the frames is a known constant (stereo baseline) and optimizes the photometric error in order to estimate the depth. In contrast, our model considers a more challenging "in the wild" setting where we are only given sequences of frames from a video, and camera pose, depth and object motion are all estimated without any form of supervision. Garg et al. report a log RMSE of […] on a subset of the KITTI dataset. To compare with our approach on the full set we emulate the model of Garg et al. using our architecture by removing object masks from our network and using stereo pairs with photometric error. We also evaluate our full model on frame sequence pairs with camera motion estimation both with and without explicit object motion estimation.
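The scale-invariant error above can be computed in a few lines. The sketch below follows the expression as written (mean of squared log differences minus the squared mean log difference); the function name is hypothetical, and skipping pixels whose ground-truth depth is not positive is an assumption made here for sparse ground truth, not a detail taken from the text.

```python
import math

def scale_invariant_error(pred, gt):
    """Scale-invariant log-depth error over flattened depth maps.

    pred, gt : sequences of depths; pixels with gt <= 0 are treated as
    missing ground truth and skipped (an assumption of this sketch).
    """
    diffs = [math.log(p) - math.log(g) for p, g in zip(pred, gt) if g > 0]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - mean * mean
```

Multiplying every predicted depth by the same constant shifts all log differences equally and leaves the error unchanged, which is what makes the metric suitable when depth is only recovered up to a global scale.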
Table 1 shows the log RMSE of the three approaches with respect to the ground-truth depth. When using stereo pairs we obtain a value of 0.31, which is on par with existing results on the KITTI benchmark (see [11]). When using frame sequence pairs instead of calibrated stereo pairs the problem becomes more difficult, as we must now infer the unknown camera and object motion between the two frames. As expected, the depth estimates learned in this scenario are less accurate, but performance is much worse when no motion masks are used. The gap between the two approaches is wider on the KITTI 2015 dataset, which contains more moving objects. This shows that it is important to account for moving objects when training on videos in the wild.

Table 1. RMSE of log depth with respect to ground truth for our model with stereo pairs and with and without motion masks on sequences in KITTI 2012 and 2015 datasets. When using stereo pairs the camera pose between the frames is fixed and the model is equivalent to the approach of Garg et al. [11]. Motion masks help improve the error on both datasets but more so on the KITTI 2015 dataset, which contains more moving objects. (Columns: Approach; Log RMSE on KITTI 2012; Log RMSE on KITTI 2015. Rows: with stereo pairs; seq. with motion masks; seq. without motion masks.)

Figure 3. Qualitative comparison of the estimated depth using our unsupervised model on sequences versus using stereo pairs in the KITTI 2012 benchmark. When using stereo pairs the camera pose between the pair is constant and hence the model is equivalent to the approach of Garg et al. [11]. For sequences, our model needs to additionally predict camera rotation and translation between the two frames. The first six rows show successful predictions even without camera pose information and the last two illustrate failure cases. The failure cases show that when there is no translation between the two frames depth estimation fails, whereas when using stereo pairs there is always a constant offset between the frames. (Columns: RGB frame; predicted depth (stereo pairs); predicted depth (sequence).)
Figure 3 shows qualitative examples comparing the depth obtained when using stereo pairs with a fixed baseline and when using frame sequences without camera pose information. When there is large translation between the frames, depth estimation without camera pose information is as good as using stereo pairs. The failure cases in the last two rows show that the network did not learn to accurately
predict depth for scenes where it saw little or no translation between the frames during training. This is not the case when using stereo pairs as there is always a constant offset between the frames. Using more data could help here because it increases the likelihood of generic scenes appearing in a sequence containing interesting camera motion.

Analysis of our failure cases suggests possible directions for improvement. Moving objects introduce significant occlusions, which should be handled carefully. Because our network has no direct supervision on object masks or object motion, it does not necessarily learn that object and camera motions should be different. These priors could be built into our loss or learned directly if some ground-truth masks or object motions are provided as explicit supervision.

Figure 4 provides qualitative examples of the predicted motion masks and flow fields along with the ground-truth in the KITTI 2015 dataset. Often, the predicted motion masks are fairly close to the ground truth and help explain part of the motion in the scene. We notice that object masks tended to miss very small, distant moving objects. This may be due to the fact that these objects and their motions are too small to be separated from the background. The bottom two rows show cases where the predicted masks do not correspond to moving objects. In the first example, although the mask is not semantically meaningful, note that the estimated flow field is reasonable, with some mistakes in the region occluded by the moving car. In the second failure case, the moving car on the left is completely missed but the motion of the static background is well captured. This is a particularly difficult example for the self-supervised photometric loss because the moving object appears in heavy shadow.

Figure 4. Ground truth segmentation and flow compared to predicted motion masks and flow from SfM-Net in KITTI 2015. The model was trained in a fully unsupervised manner. The top six rows show successful prediction and the last two show typical failure cases. (Columns: predicted motion masks; ground truth mask; predicted flow; ground truth flow.)

MoSeg.
The moving objecs in KITTI are primarily vehicles, which undergo rigid-body ransformaions, making i a good mach for our model. To verify ha our nework can sill learn in he presence of non-rigid moion, we rerained i from scrach under self-supervision on he MoSeg daase, using frames from all sequences. Because each moion mask corresponds o a rigid 3D roaion and ranslaion, we do no expec a single moion mask o capure a deformable objec. Insead, differen rigidly moving objec pars will be assigned o differen masks. This is no a problem from he perspecive of accurae camera moion esimaion, where he imporan issue is disinguishing pixels whose moion is caused by he camera pose ransformaion direcly from hose whose moion is affeced by indepen7

RGB frame Prediced flow Moion masks Seq. ransl [27] ro [27] ransl. ours ro ours 360 0.099 0.474 0.009 1.123 plan 0.016 1.053 0.011 0.796 eddy 0.020 1.14 0.0123 0.877 desk 0.008 0.495 0.

he nex and compare wih he prediced camera moion from our model, by measuring ranslaion and roaion error of heir relaive ransformaion, as done in he corresponding evaluaion scrip for

We repor camera roaion and ranslaion error in Table 2 for each of he Freiburg1 sequences compared o he error in he benchmark s baseline rajecories.

We observe ha our resuls beer esimae he frame-o-frame ranslaion and are comparable for roaion. Figure 5. Moion segmens compued from SfM-Ne in MoSeg [4].

Because MoSeg only conains groundruh annoaions for segmenaion, we canno quaniaively evaluae he esimaed deph, camera rajecories, or opical flow fields.

oal of six proposed segmens in each frame, wo from each of he hree moion masks), averaging across frames and groundruh objecs. We obain an IoU of 0.

See, for example, he las column of Figure 5 from [10], whose proposed mehods for moving objec proposals achieve IoU around 0.3 wih four proposals.

While he fully unsupervised resuls show promise, our nework can benefi from exra supervision of deph or camera moion when available.

We also experimened wih adding deph supervision o improve camera moion esimaion using he RGB-D SLAM daase [27].

Conclusion Curren geomeric SLAM mehods obain excellen egomoion and rigid 3D reconsrucion resuls, bu ofen come a a price of exensive engineering, low olerance o moving objecs which are

den object motions in the scene. Qualitative results on sampled frames from the dataset are shown in Fig. 5. Because MoSeg only contains ground-truth annotations for segmentation, we cannot quantitatively evaluate the estimated depth, camera trajectories, or optical flow fields. However, we did evaluate the quality of the object motion masks by computing Intersection over Union (IoU) for each ground-truth segmentation mask against the best matching motion mask and its complement (a total of six proposed segments in each frame, two from each of the three motion masks), averaging across frames and ground-truth objects. We obtain an IoU of 0.29, which is similar to previous unsupervised approaches for the small number of segmentation proposals we use per frame. See, for example, the last column of Figure 5 from [10], whose proposed methods for moving object proposals achieve IoU around 0.3 with four proposals. They require more than 800 proposals to reach an IoU above […].

Figure 5. Motion segments computed from SfM-Net in MoSeg [4]. The model was trained in a fully unsupervised manner. (Panels: RGB frame, predicted flow, motion masks.)

Kinect depth supervision. While the fully unsupervised results show promise, our network can benefit from extra supervision of depth or camera motion when available. The improved depth prediction given ground-truth camera poses on KITTI stereo demonstrates some gain. We also experimented with adding depth supervision to improve camera motion estimation using the RGB-D SLAM dataset [27]. Given ground-truth camera pose trajectories, we estimated relative camera pose (camera motion) from each frame to the next and compare with the predicted camera motion from our model, by measuring translation and rotation error of their relative transformation, as done in the corresponding evaluation script for relative camera pose error and detailed in Eq. 2. We report camera rotation and translation error in Table 2 for each of the Freiburg1 sequences compared to the error in the benchmark's baseline trajectories. Our model was trained from scratch for each sequence and used the focal length value provided with the dataset. We observe that our results better estimate the frame-to-frame translation and are comparable for rotation.

Table 2. Camera pose relative error from frame to frame for various video sequences of the Freiburg RGBD-SLAM benchmark. Columns: Seq., transl. [27], rot. [27], transl. ours, rot. ours. Sequences listed: plant, teddy, desk, desk (numeric entries not preserved).

5. Conclusion

Current geometric SLAM methods obtain excellent egomotion and rigid 3D reconstruction results, but often come at a price of extensive engineering, low tolerance to moving objects, which are treated as noise during reconstruction, and sensitivity to camera calibration. Furthermore, matching and reconstruction are difficult in low-textured regions. Incorporating learning into depth reconstruction, camera motion prediction and object segmentation, while still preserving the constraints of image formation, is a promising way to robustify SLAM and visual odometry even further. However, the exact training scenario required to solve this more difficult inference problem remains an open question. Exploiting long history and far-in-time forward-backward constraints with visibility reasoning is an important future direction. Further, exploiting a small amount of annotated videos for object segmentation, depth, and camera motion, and combining those with an abundance of self-supervised videos, could help initialize the network weights in the right regime and facilitate learning. Many other curriculum learning regimes, including those that incorporate synthetic datasets, can also be considered.

Acknowledgements. We thank our colleagues Tinghui Zhou, Matthew Brown, Noah Snavely, and David Lowe for their advice and Bryan Seybold for his work generating synthetic datasets for our initial experiments.
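The MoSeg mask evaluation described in the experiments (best IoU of each ground-truth segment over every motion mask and its complement, averaged across frames and ground-truth objects) could be sketched as follows. This is only an illustrative sketch, not the evaluation code used for the reported numbers: the function names are hypothetical, masks are assumed to be NumPy arrays of equal shape, and the 0.5 threshold used to binarize soft motion masks is an assumption.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def best_proposal_iou(gt_mask, motion_masks, threshold=0.5):
    """Best IoU of one ground-truth mask over every motion mask and its complement.

    With three motion masks this scores six proposals per frame.
    The binarization threshold is an assumption of this sketch.
    """
    best = 0.0
    for mask in motion_masks:
        foreground = np.asarray(mask) > threshold
        for proposal in (foreground, ~foreground):
            best = max(best, iou(gt_mask, proposal))
    return best

def mean_best_iou(frames):
    """Average best IoU over all frames and ground-truth objects.

    `frames` is an iterable of (ground_truth_masks, motion_masks) pairs.
    """
    scores = [best_proposal_iou(gt, masks)
              for gts, masks in frames
              for gt in gts]
    return float(np.mean(scores)) if scores else 0.0
```

Scoring the complement as well lets a mask that captures the static background still count as a proposal for the moving object.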

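The frame-to-frame camera pose comparison used in the Kinect depth supervision experiment can be illustrated with a small sketch. Eq. 2 is not reproduced in this part of the text, so the formulas below are an assumption: they follow the usual relative pose error convention, in which the error transform is the inverse of the ground-truth relative motion composed with the predicted one, the translation error is the norm of its translation part, and the rotation error is the angle of its rotation part. Poses are assumed to be 4x4 homogeneous matrices, and the function names are hypothetical.

```python
import numpy as np

def relative_pose(pose_a, pose_b):
    """Relative motion from pose a to pose b: inv(pose_a) @ pose_b."""
    return np.linalg.inv(pose_a) @ pose_b

def relative_pose_error(gt_rel, pred_rel):
    """Translation and rotation error between two relative motions.

    Returns (translation error in the input units, rotation error in radians).
    """
    err = np.linalg.inv(gt_rel) @ pred_rel
    trans_err = float(np.linalg.norm(err[:3, 3]))
    # Rotation angle from the trace; clip guards against round-off outside [-1, 1].
    cos_angle = (np.trace(err[:3, :3]) - 1.0) / 2.0
    rot_err = float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return trans_err, rot_err
```

In use, `relative_pose` would be applied to consecutive ground-truth poses, and the result compared against the network's predicted frame-to-frame motion with `relative_pose_error`.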
References

[1] I. Akhter, Y. A. Sheikh, S. Khan, and T. Kanade. Nonrigid structure from motion in trajectory space. In NIPS.
[2] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In ECCV.
[3] M. Black, Y. Yacoob, A. Jepson, and D. Fleet. Learning parameterized models of image motion. In CVPR.
[4] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV.
[5] A. Byravan and D. Fox. SE3-Nets: Learning rigid body motion using deep neural networks. CoRR, abs/….
[6] J. Costeira and T. Kanade. A multi-body factorization method for motion analysis. ICCV.
[7] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV.
[8] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS.
[9] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV.
[10] K. Fragkiadaki, P. A. Arbeláez, P. Felsen, and J. Malik. Spatio-temporal moving object proposals. CoRR, abs/….
[11] R. Garg, B. V. Kumar, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV.
[12] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR.
[13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. CoRR, abs/….
[14] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17.
[15] M. Hornacek, A. Fitzgibbon, and C. Rother. SphereFlow: 6 DoF scene flow from RGB-D pairs. In CVPR.
[16] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS.
[17] C. Kerl, J. Sturm, and D. Cremers. Dense visual SLAM for RGB-D cameras. In IROS.
[18] N. Kong and M. J. Black. Intrinsic depth: Improving depth transfer with intrinsic images. In ICCV.
[19] L. Z. Manor and M. Irani. Multi-frame estimation of planar motion. PAMI, 22(10).
[20] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR.
[21] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR.
[22] I. Misra, C. L. Zitnick, and M. Hebert. Unsupervised learning using sequential verification for action recognition. In ECCV.
[23] T. Nir, A. Bruckstein, and R. Kimmel. Over-parameterized variational optical flow. IJCV, 76(2).
[24] V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. CoRR, abs/….
[25] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR.
[26] T. Schöps, J. Engel, and D. Cremers. Semi-dense visual odometry for AR on a smartphone. In ISMAR.
[27] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IROS.
[28] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles. In CVPR.
[29] J. Thewlis, S. Zheng, P. H. Torr, and A. Vedaldi. Fully-trainable deep matching. In BMVC.
[30] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV.
[31] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV.
[32] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Comput., 14(4).
[33] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3D interpreter network. In ECCV.
[34] J. J. Yu, A. W. Harley, and K. G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In ECCV.
[35] T. Zhou, M. Brown, N. Snavely, and D. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR.
