WHILE estimating the depth of a scene from a single image


JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015

Monocular Depth Estimation using Multi-Scale Continuous CRFs as Sequential Deep Networks

Dan Xu, Student Member, IEEE, Elisa Ricci, Member, IEEE, Wanli Ouyang, Senior Member, IEEE, Xiaogang Wang, Senior Member, IEEE, Nicu Sebe, Senior Member, IEEE

arXiv:803.0089v [cs.CV] Mar 08

Abstract: Depth cues have proved very useful in various computer vision and robotic tasks. This paper addresses the problem of monocular depth estimation from a single still image. Inspired by the effectiveness of recent works on multi-scale convolutional neural networks (CNN), we propose a deep model which fuses complementary information derived from multiple CNN side outputs. Different from previous methods using concatenation or weighted average schemes, the integration is obtained by means of continuous Conditional Random Fields (CRFs). In particular, we propose two different variations, one based on a cascade of multiple CRFs, the other on a unified graphical model. By designing a novel CNN implementation of mean-field updates for continuous CRFs, we show that both proposed models can be regarded as sequential deep networks and that training can be performed end-to-end. Through an extensive experimental evaluation, we demonstrate the effectiveness of the proposed approach and establish new state-of-the-art results for the monocular depth estimation task on three publicly available datasets, i.e. NYUD-V2, Make3D and KITTI.

Index Terms: Monocular Depth Estimation, Convolutional Neural Networks (CNN), Deep Multi-Scale Fusion, Conditional Random Fields (CRFs).

1 INTRODUCTION

While estimating the depth of a scene from a single image is a natural ability for humans, devising computational models for accurately predicting depth information from RGB data is a challenging task. Many attempts have been made to address this problem in the past. In particular, recent works have achieved remarkable performance thanks to powerful deep learning models [], [], [30], [36].
Assuming the availability of a large training set of RGB-depth pairs, monocular depth prediction from single images can be regarded as a pixel-level continuous regression problem, and Convolutional Neural Network (CNN) architectures are typically employed. In the last few years significant efforts have been made in the research community to improve the performance of CNN models for pixel-level prediction tasks (e.g. semantic segmentation, contour detection). Previous works have shown that, for depth estimation as well as for other pixel-level classification or regression problems, more accurate estimates can be obtained by combining information from multiple scales [9], [], [46], [48]. This can be achieved in different ways, e.g. by fusing feature maps corresponding to different network layers or by designing an architecture with multiple inputs corresponding to images at different resolutions. Other works have demonstrated that, by adding a Conditional Random Field (CRF) in cascade to a convolutional neural architecture, the performance can be greatly enhanced and the CRF can be fully integrated within the deep model, enabling end-to-end training with back-propagation [5]. However, these works mainly focus on pixel-level prediction problems in the discrete domain (e.g. semantic segmentation). While complementary, these strategies have so far only been considered in isolation, and no previous works have exploited multi-scale information within a CRF inference framework.

Dan Xu and Nicu Sebe are with the Department of Information Engineering and Computer Science, University of Trento, Trento, Italy. E-mail: {dan.xu, niculae.sebe}@unitn.it
Elisa Ricci is with Fondazione Bruno Kessler. E-mail: eliricci@fbk.eu
Wanli Ouyang is with the School of Electrical and Information Engineering, The University of Sydney. E-mail: wanli.ouyang@sydney.edu.au
Xiaogang Wang is with the Department of Electronic Engineering, The Chinese University of Hong Kong. E-mail: xgwang@ee.cuhk.edu.hk
Manuscript received April 19, 2005; revised August 26, 2015.

Fig. 1. Monocular depth estimation results on three different benchmark datasets, i.e. NYUD-V2 (the 1st row), Make3D (the 2nd row) and KITTI (the 3rd row), using the proposed multi-scale CRF model with a pretrained CNN (e.g. VGG Convolution-Deconvolution [34]). From left to right, the columns show the original RGB images, the recovered depth maps and the ground truth, respectively.

In this paper we argue that, benefiting from the flexibility and the representational power of graphical models, we can optimally fuse representations derived from multiple CNN side-output layers using structured constraints, improving performance over traditional multi-scale strategies. By exploiting this idea, we introduce a novel framework to estimate depth maps from single still images. In contrast to previous work fusing multi-scale features by weighted averaging or concatenation, we propose to integrate multi-layer side-output information by designing a novel approach based on continuous CRFs. Specifically, we present two different methods. The first approach is based on a single multi-scale unified CRF model, while the other considers a cascade of scale-specific CRFs. We also show that, by introducing a common CNN implementation for mean-field updates in continuous CRFs, both models are equivalent to sequential deep networks and an end-to-end approach can be devised for training. Through extensive experimental evaluation we demonstrate that the proposed CRF-based approach produces more accurate depth maps than traditional multi-scale approaches for pixel-level prediction tasks [6], [46]. Moreover, by performing experiments on the publicly available NYU Depth V2 [43], Make3D [4] and KITTI [4] datasets, we show that our approach is able to robustly reconstruct depth with good visual quality (Fig. 1) and outperforms state-of-the-art methods for the monocular depth estimation task.

This paper extends our earlier work [50] by proposing and investigating different multi-scale connection structures for message passing, further enriching the related works, providing more details of the approach, and significantly expanding the experimental results and analysis. To summarize, the contribution of this paper is threefold:

Firstly, we propose a novel approach for predicting depth maps from RGB inputs which exploits multi-scale estimations derived from CNN inner semantic layers by structurally fusing them within a unified CNN-CRF framework.
Secondly, as the task of pixel-level depth prediction implies inferring a set of continuous values, we show how mean-field (MF) updates can be implemented as sequential deep models, enabling end-to-end training of the whole network. We believe that our MF implementation will be useful not only to researchers working on depth prediction, but also to those interested in other problems involving continuous variables. Therefore, our code is made publicly available.

Thirdly, our experiments demonstrate that the proposed multi-scale CRF framework is superior to previous methods integrating information from different semantic network layers by combining multiple losses [46] or by adopting feature concatenations [6]. We also show that our approach outperforms state-of-the-art monocular depth estimation methods on public benchmarks and that the proposed CRF-based models can be employed in combination with different pre-trained CNN architectures, consistently enhancing their performance.

The remainder of this paper is organised as follows. We first introduce related work in Section 2, and then the proposed multi-scale CRF models for monocular depth estimation are presented in Section 3. We further elaborate how the proposed models can be implemented as sequential neural networks for end-to-end joint optimization in Section 4. The experimental results and analysis are presented in Section 5, and we conclude the paper in Section 6.

2 RELATED WORK

Our approach is built upon recent successes of deep CNN architectures for image classification [7], [3], [44] and fully convolutional networks for dense semantic image segmentation [33], [34]. We briefly introduce the most related works by organizing them into three main aspects, i.e. monocular depth estimation, multi-scale CNNs, and dense pixel-level prediction via combination of CNN and CRFs.

Monocular depth estimation.
Previous approaches for depth estimation from single images can be grouped into three main categories: (i) methods operating on hand-crafted features, (ii) methods based on graphical models and (iii) methods adopting deep convolutional neural networks.

Earlier works addressing the depth prediction task belong to the first category. Hoiem et al. [8], [9] proposed photo pop-up, a fully automatic method for creating a basic 3D model from a single photograph by introducing an assumption of ground-vertical geometric structure. Karsch et al. [0] developed Depth Transfer, a non-parametric approach based on SIFT Flow, where the depth of an input image is reconstructed by transferring the depth of multiple similar images and then applying warping and optimization procedures. Instead of directly recovering depth from appearance features, Liu et al. [9] explored using semantic scene segmentation results to guide the 3D depth reconstruction. Similarly, Ladicky et al. [5] demonstrated the benefit of combining semantic object labels with depth features. However, hand-crafted representations are not robust enough for this challenging problem.

In the second category, some works exploited the flexibility of graphical models to reconstruct depth information. For instance, Delage et al. [0] proposed a dynamic Bayesian framework for recovering 3D information from indoor scenes. Discriminatively trained multi-scale Markov Random Fields (MRFs) were introduced in [39], [40] in order to optimally fuse local and global features. Depth estimation was treated as an inference problem in a discrete-continuous CRF model in [3]. However, these works did not employ deep networks.

More recent approaches for depth estimation are based on CNNs [], [7], [30], [38], [45]. For instance, Eigen et al. [] proposed a multi-scale approach for depth prediction, considering two deep networks, one performing a coarse global prediction based on the entire image, and the other refining predictions locally.
This approach was extended in [] to handle multiple tasks (e.g. semantic segmentation, surface normal estimation). Wang et al. [45] introduced a CNN for joint depth estimation and semantic segmentation, whose estimates were further refined with Hierarchical CRFs. The work most similar to ours is [30], where the representational power of deep CNNs and continuous CRFs is jointly exploited for depth prediction. However, the method proposed in [30] is based on superpixels, and the information associated to multiple scales is not exploited in their graphical model.

Fig. 2. Overview of the proposed deep architecture. Our model is composed of two main components: a front-end CNN and a fusion module. The fusion module uses continuous CRFs to integrate multiple side output maps of the front-end CNN. We consider two different CRF-based multi-scale models and implement them as sequential deep networks by stacking several elementary blocks, the C-MF blocks.

Fig. 3. Illustration of different multi-scale message passing structures for the integration of the multi-scale predictions $s_1$ to $s_5$ produced from the front-end convolutional network: (a) bottom up structure, (b) top down structure, (c) skip connection structure, (d) all to one structure. The arrows represent the direction of the message passing, and the numbers in circles represent the order. The dashed line box in Fig. 2 shows a bottom-up passing structure.

Multi-Scale CNNs. The problem of combining information from multiple scales has recently received considerable interest in various computer vision tasks. In [46] a deeply supervised fully convolutional neural network was proposed for edge detection by weighted combination of multiple side outputs. Skip-layer networks, where the feature maps derived from different semantic layers of a primary front-end network are jointly considered in an output layer, have also become very popular [3], [6], [33]. Other works considered multi-stream architectures, where multiple parallel networks receiving inputs at different scales are fused [4]. Cai et al. [5] proposed a multi-scale method for object detection that combines the predictions obtained from feature maps with different resolutions. Dilated convolutions (e.g. dilation or à trous) have also been employed in different deep network models in order to aggregate multi-scale contextual information [7]. However, in these works, the multi-scale representations or estimations are typically combined by simple concatenation or weighted averaging. We are not aware of previous works exploring the fusion of deep multi-scale information within a CRF framework.

Dense pixel-level prediction via combination of CNN and CRFs. The combination of CNNs and CRFs has proved very useful for dense pixel-level structured prediction [], [4]. Some existing works utilize CRFs as a post-processing module for further refining the predictions from the CNN [8], [35]. To benefit from end-to-end learning, Zheng et al. [5] proposed a CRF-RNN model which jointly optimizes a front-end deep network with a discrete CRF for semantic image segmentation. Xu et al. [47] proposed an attention-gated deep CRF framework for pixel-level contour prediction. However, as far as we know, this work is the first attempt to combine multi-scale continuous CRFs with a deep convolutional neural network in a unified model for end-to-end monocular depth estimation.

3 MULTI-SCALE CRF MODELS FOR MONOCULAR DEPTH ESTIMATION

In this section we introduce our deep model with the designed multi-scale continuous CRFs for monocular depth estimation from RGB images. We first formalize the problem of depth prediction and give a brief overview of the proposed approach. Then, we describe two different variants of the proposed multi-scale model, one based on a cascade of CRFs and the other on a single multi-scale unified CRF.

3.1 Problem Formulation and Overview

Following previous works we formulate the task of depth prediction from monocular RGB input as the problem of learning a non-linear mapping $F: \mathcal{I} \to \mathcal{D}$ from the image space $\mathcal{I}$ to the output depth space $\mathcal{D}$. More formally, let $\mathcal{Q} = \{(r_i, d_i)\}_{i=1}^{Q}$ be a training set of $Q$ pairs, where $r_i \in \mathcal{I}$ denotes an input RGB image with $N$ pixels and $d_i \in \mathcal{D}$ represents its corresponding real-valued depth map. For learning $F$ we consider a deep model made of two main building blocks (Fig. 2).

The first component is a CNN architecture with a set of intermediate side outputs $S = \{s_l\}_{l=1}^{L}$, $s_l \in \mathbb{R}^{N}$, produced from $L$ different layers with a mapping function $f_s(r; \Theta, \theta_l) \to s_l$. For simplicity, we denote with $\Theta$ the set of front-end network layer parameters and with $\theta_l$ the parameters of the network branch producing the side output associated to the $l$-th layer (see Section 5 for details of our implementation). In the following we denote this network as the front-end CNN.

The second component of our model is a fusion block. As shown in previous works [3], [33], [46], features generated from different CNN layers capture complementary information. The main idea behind the proposed fusion block is to use CRFs to effectively integrate the side output maps of our front-end CNN for robust depth prediction. Our approach develops from the intuition that these representations can be combined within a sequential framework, i.e. performing depth estimation at a certain scale and then refining the obtained estimates at the subsequent level. Specifically, we introduce and compare two different multi-scale models, both based on CRFs, and corresponding to two different versions of the fusion block. The first model is based on a single multi-scale unified CRF, which integrates information available from different scales and simultaneously enforces smoothness constraints between the estimated depth values of neighboring pixels and neighboring scales. The second model implements a cascade of scale-specific CRFs: at each scale $l$ a CRF is employed to recover the depth information from the side output map $s_l$, and the outputs of each CRF model are used as additional observations for the subsequent model. In Section 3.2
we describe the two models in detail, while in Section 4 we show how they can be implemented as sequential deep networks by stacking several elementary blocks. We call these blocks C-MF blocks, as they implement Mean-Field updates for Continuous CRFs (Fig. 2).

3.2 Multi-scale Fusion with Continuous CRFs

We now elaborate the proposed CRF-based models for fusing multi-scale side outputs derived from different semantic layers of the front-end deep convolutional neural network.

3.2.1 Multi-Scale Unified CRF Model

Given a vector $\hat{s}$ of dimension $LN$ obtained by concatenating the side output score maps $\{s_1, \ldots, s_L\}$ and a vector $d$ of dimension $LN$ of real-valued output variables, we define a CRF modeling the following conditional distribution:

$$P(d \mid \hat{s}) = \frac{1}{Z(\hat{s})} \exp\{-E(d, \hat{s})\}, \tag{1}$$

where $Z(\hat{s}) = \int_{d} \exp\{-E(d, \hat{s})\}\, \mathrm{d}d$ is the partition function [6] acting as a normalization factor for probabilities. The energy function is defined as:

$$E(d, \hat{s}) = \sum_{i=1}^{N} \sum_{l=1}^{L} \phi(d_i^l, \hat{s}) + \sum_{i,j} \sum_{l,k} \psi(d_i^l, d_j^k), \tag{2}$$

where $d_i^l$ indicates the hidden variable associated to scale $l$ and pixel $i$. The first term is the sum of quadratic unary terms defined as:

$$\phi(d_i^l, \hat{s}) = \left(d_i^l - s_i^l\right)^2, \tag{3}$$

where $s_i^l$ is the regressed depth value at pixel $i$ and scale $l$ obtained with $f_s(r; \Theta, \theta_l)$. The second term is the sum of pairwise potentials describing the relationship between pairs of hidden variables $d_i^l$ and $d_j^k$, and is defined as follows:

$$\psi(d_i^l, d_j^k) = \sum_{m=1}^{M} \beta_m\, w_m(i, j, l, k, r) \left(d_i^l - d_j^k\right)^2, \tag{4}$$

where $w_m(i, j, l, k, r)$ is a weight which specifies the relationship between the estimated depth of the pixels $i$ and $j$ at scales $l$ and $k$, respectively, and $M$ is the number of kernels.

To perform inference we rely on mean-field theory to approximate $P(d \mid \hat{s})$ with another distribution $Q(d \mid \hat{s}) = \prod_{i=1}^{N} \prod_{l=1}^{L} Q_{i,l}(d_i^l \mid \hat{s})$, expressed as a product of independent marginals. By minimizing the Kullback-Leibler divergence between $P$ and $Q$, we obtain the solution for $Q$. As the log distribution $\log Q_{i,l}(d_i^l \mid \hat{s})$ has a quadratic form w.r.t.
$d_i^l$ and can be represented as a Gaussian distribution, the following mean-field updates can be derived:

$$\gamma_{i,l} = 2\Bigl(1 + 2 \sum_{m=1}^{M} \beta_m \sum_{k} \sum_{j,i} w_m(i, j, l, k, r)\Bigr), \tag{5}$$

$$\mu_{i,l} = \frac{2}{\gamma_{i,l}} \Bigl(s_i^l + 2 \sum_{m=1}^{M} \beta_m \sum_{k} \sum_{j,i} w_m(i, j, l, k, r)\, \mu_{j,k}\Bigr). \tag{6}$$

Here $\gamma_{i,l}$ and $\mu_{i,l}$ are the variance and mean of the distribution $Q_{i,l}$, respectively.

To define the weights $w_m(i, j, l, k, r)$ we introduce the following assumptions. First, we assume that the estimated depth at scale $l$ only depends on the depth estimated at the previous scale. Second, for relating pixels at the same and at the previous scale, we set weights depending on $m$ kernel functions $K_m^{ij}$, which are Gaussian kernels of the form $\exp\bigl(-\lVert h_i^m - h_j^m \rVert_2^2 / (2\theta_m^2)\bigr)$. Here, $h_i^m$ and $h_j^m$ indicate features derived from the input image $r$ for pixels $i$ and $j$, and $\theta_m$ are user-defined bandwidth parameters []. Following previous works [], [5], we use pixel positions and color values as features. This leads to two kernel functions for modeling dependencies of pixels at scale $l$ (a bilateral appearance kernel using both the pixel positions and the color values, and a spatial smoothness kernel using only the pixel positions) and two more for relating pixels at neighboring scales. Under these assumptions, the mean-field updates (5) and (6) can be rewritten as:

$$\gamma_{i,l} = 2\Bigl(1 + 2 \sum_{m=1}^{2} \beta_m \sum_{j \neq i} K_m^{ij} + 2 \sum_{m=3}^{4} \beta_m \sum_{j,i} K_m^{ij}\Bigr), \tag{7}$$

$$\mu_{i,l} = \frac{2}{\gamma_{i,l}} \Bigl(s_i^l + 2 \sum_{m=1}^{2} \beta_m \sum_{j \neq i} K_m^{ij}\, \mu_{j,l} + 2 \sum_{m=3}^{4} \beta_m \sum_{j,i} K_m^{ij}\, \mu_{j,l-1}\Bigr). \tag{8}$$

The parameters $\beta_m$ need to be learned during training. We will present the details of the parameter optimization in Section 4. Given a new test image, the optimal $d^{*}$ can be computed by maximizing the log conditional probability [37], i.e. $d^{*} = \arg\max_{d} \log Q(d \mid S)$, where $d^{*} = [\mu_{1,1}, \ldots, \mu_{N,L}]$ is the vector of the $LN$ mean values associated to $Q(d \mid \hat{s})$. We take the estimated variables at the finest scale $L$ (i.e. $\mu_{1,L}, \ldots, \mu_{N,L}$) as our predicted depth map $d^{*}$.

Fig. 4. Detailed computing flow graph of the proposed C-MF block. $J$ represents a $W \times H$ matrix with all elements equal to one. The operator symbols in the graph indicate element-wise addition, subtraction, division and Gaussian convolution, respectively. $G_1$ and $G_2$ represent two gate functions for controlling the computing flow.

3.2.2 Multi-Scale Cascade CRF Model

The cascade model is based on a set of $L$ CRF models, each one associated to a specific scale, which are progressively stacked such that the depth estimated at the previous scale can be used as an observation for the CRF model at the following scale. Each CRF is used to compute the output vector $d^l$ and is constructed by considering the side output representation $s_l$ and the depth estimated at the previous step, $\tilde{d}^{\,l-1}$, as observed variables, i.e. $o^l = [s_l, \tilde{d}^{\,l-1}]$. The associated energy function of the CRF model is defined as:

$$E(d^l, o^l) = \sum_{i=1}^{N} \phi(d_i^l, o^l) + \sum_{i \neq j} \psi(d_i^l, d_j^l). \tag{9}$$

The unary and pairwise terms can be defined analogously to the above-introduced unified multi-scale model.
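The step from a quadratic energy to Gaussian mean-field updates of the form of Eq. (5) and Eq. (6) is stated without intermediate algebra. A possible sketch of that step is given below for a generic pixel $i$. It assumes symmetric weights and that the pairwise sum visits both ordered pairs $(i, j)$ and $(j, i)$; this assumption is made here only to account for the factor 2 and is not spelled out in the text. The shorthand $a_{ij}$ is hypothetical and introduced only for this sketch.

```latex
% Hypothetical shorthand: a_{ij} = \sum_m \beta_m w_m(i, j, \cdot), the aggregated pairwise weight.
% Collect the terms of the expected energy that involve d_i,
% using E[(d_i - d_j)^2] = d_i^2 - 2 d_i \mu_j + E[d_j^2]:
\begin{align*}
\log Q_i(d_i)
  &= -(d_i - s_i)^2
     - 2 \sum_{j} a_{ij}\, \mathbb{E}_{Q_j}\!\left[(d_i - d_j)^2\right] + \text{const} \\
  &= -\Bigl(1 + 2 \sum_{j} a_{ij}\Bigr) d_i^2
     + 2 \Bigl(s_i + 2 \sum_{j} a_{ij}\, \mu_j\Bigr) d_i + \text{const}.
\end{align*}
% Match against -\tfrac{\gamma_i}{2}(d_i - \mu_i)^2
%   = -\tfrac{\gamma_i}{2} d_i^2 + \gamma_i \mu_i d_i + \text{const}:
\begin{align*}
\gamma_i = 2 \Bigl(1 + 2 \sum_{j} a_{ij}\Bigr), \qquad
\mu_i = \frac{2}{\gamma_i} \Bigl(s_i + 2 \sum_{j} a_{ij}\, \mu_j\Bigr).
\end{align*}
```

Under the same assumptions, the algebra carries over to the cascade model with the side output $s_i$ replaced by the corresponding observation.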
In particular the unary term, reflecting the similarity between the observation $o_i^l$ and the hidden depth value $d_i^l$, is:

$$\phi(d_i^l, o^l) = \left(d_i^l - o_i^l\right)^2, \tag{10}$$

where $o_i^l$ is obtained by combining the regressed depth from the side output $s_l$ and the map $\tilde{d}^{\,l-1}$ estimated by the CRF at the previous scale. In our implementation we simply consider $o_i^l = s_i^l + \tilde{d}_i^{\,l-1}$, but alternative strategies can also be considered. The pairwise potentials, used to force neighboring pixels with similar appearance to have close depth values, are:

$$\psi(d_i^l, d_j^l) = \sum_{m=1}^{M} \beta_m K_m^{ij} \left(d_i^l - d_j^l\right)^2, \tag{11}$$

where we consider $M = 2$ Gaussian kernels, one for appearance features and the other accounting for pixel positions. Similar to the multi-scale CRF model, under the mean-field approximation, the following updates can be derived:

$$\gamma_{i,l} = 2\Bigl(1 + 2 \sum_{m=1}^{M} \beta_m \sum_{j \neq i} K_m^{ij}\Bigr), \tag{12}$$

$$\mu_{i,l} = \frac{2}{\gamma_{i,l}} \Bigl(o_i^l + 2 \sum_{m=1}^{M} \beta_m \sum_{j \neq i} K_m^{ij}\, \mu_{j,l}\Bigr). \tag{13}$$

At test time, we use the estimated depth variables corresponding to the cascade CRF model of the finest scale $L$ as our final predicted depth map $d^{*}$.

4 MULTI-SCALE MODELS AS SEQUENTIAL DEEP NETWORKS

In this section, we describe how the two proposed CRF-based models can be implemented as sequential deep networks, enabling end-to-end training of our whole deep network model (the front-end CNN and the fusion module). We first show how the mean-field iterations derived for the multi-scale and the cascade models can be implemented by designing a common structure, the continuous mean-field updating (C-MF) block, consisting of a stack of CNN operations. Then, we present the resulting sequential network structures and details of the training phase for optimizing the whole deep network.

4.1 C-MF: a Common CNN Implementation of Continuous Mean-Field Updating

By analyzing the two proposed CRF models, we can observe that the mean-field updates derived for the cascade and for the multi-scale models share common terms. As stated above, the main difference between the two is the way the depth estimated at the previous scale is handled at the current scale.
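To make the shared structure concrete, the single-scale (cascade) mean-field updates above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it builds dense $N \times N$ Gaussian kernels by brute force instead of using an approximate filtering scheme, and the function names `gaussian_kernel` and `mean_field_single_scale` are hypothetical.

```python
import numpy as np


def gaussian_kernel(features, bandwidth):
    """Dense N x N kernel K[i, j] = exp(-||h_i - h_j||^2 / (2 * bandwidth^2)).

    The diagonal is zeroed because the sums in the updates run over j != i.
    """
    diff = features[:, None, :] - features[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    np.fill_diagonal(kernel, 0.0)
    return kernel


def mean_field_single_scale(obs, kernels, betas, num_iters=5):
    """Iterate the continuous mean-field updates for one scale.

    obs:     (N,) observations o_i (side output plus previous-scale estimate).
    kernels: list of (N, N) kernel matrices K_m.
    betas:   list of scalar kernel weights beta_m.
    Returns the means mu (N,) and the normalization terms gamma (N,).
    """
    obs = np.asarray(obs, dtype=float)
    # Aggregate sum_m beta_m * K_m once; it does not change across iterations.
    weighted = np.zeros((obs.size, obs.size))
    for beta, kernel in zip(betas, kernels):
        weighted += beta * kernel
    # gamma_i = 2 * (1 + 2 * sum_m beta_m * sum_{j != i} K_m^{ij})
    gamma = 2.0 * (1.0 + 2.0 * weighted.sum(axis=1))
    mu = obs.copy()
    for _ in range(num_iters):
        # mu_i = (2 / gamma_i) * (o_i + 2 * sum_m beta_m * sum_{j != i} K_m^{ij} * mu_j)
        mu = (2.0 / gamma) * (obs + 2.0 * (weighted @ mu))
    return mu, gamma
```

Each update is a convex combination of a pixel's own observation and the current estimates of its neighbors, so a constant observation map is left unchanged and setting all weights to zero returns the observations themselves.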
In the multi-scale CRFs, the relationship among neighboring scales is modeled in the hidden variable space, while in the cascade CRFs the depth estimated at the previous scale acts as an observed variable. Starting from this observation, in this section we show how the computation of Eq. (8) and Eq. (13) can be implemented with a common structure. Fig. 4 describes these computations in detail.

Fig. 5. Description of the two proposed CRF models as sequential deep networks: (a) the proposed multi-scale cascade CRF model as a sequential neural network using the C-MF block; (b) the proposed multi-scale unified CRF model as a sequential neural network using the C-MF block. The blue and yellow boxes indicate the estimated variables and observations, respectively. The parameters $\beta_m$ are used for mean-field updates. As in the cascade model parameters are not shared among different CRFs, we use the notation $\beta_1^l, \beta_2^l$ to denote parameters associated to the $l$-th scale.

In the following, for the sake of clarity, we introduce matrix representation. Let $S_l \in \mathbb{R}^{W \times H}$ be the matrix obtained by rearranging the $N = WH$ pixels corresponding to the side output vector $s_l$, and $\mu_l^t \in \mathbb{R}^{W \times H}$ the matrix of the estimated output depth variables associated to scale $l$ and mean-field iteration $t$. To implement the multi-scale model, at each iteration $t$ the current estimates for scale $l$ and for the previous scale $l-1$ are convolved by two Gaussian kernels. Following [], we use a spatial and a bilateral kernel. As Gaussian convolutions represent the computational bottleneck in the mean-field iterations (requiring a complexity of $O(N^2)$), we adopt the permutohedral lattice implementation [] to approximate the filter response calculation, reducing the computational cost from quadratic to linear [37]. The weighting by the parameters $\beta_m$ is performed as a convolution with a $1 \times 1$ kernel.
Then, the outputs are combined and added to the side-output maps $S_l$. Finally, a normalization step follows, corresponding to the calculation of Eq. (7). The normalization matrix $\gamma \in \mathbb{R}^{W \times H}$ is also computed by considering convolutions with Gaussian kernels and weighting with the parameters $\beta_m$. It is worth noting that the normalization step in our mean-field updates for continuous CRFs is substantially different from that of discrete CRFs in CRF-RNN [5], which is based on a softmax function.

In the cascade CRF model, differently from the multi-scale unified CRF model, the estimate from the previous scale, $\mu_{l-1}$, acts as an observed variable. To design a common C-MF block for the two models, we introduce two gate functions $G_1$ and $G_2$ (Fig. 4) which control the computing flow and make it easy to switch between the two approaches. Both gate functions accept a user-defined boolean parameter. In our setting, the value 1 corresponds to the multi-scale CRF and the value 0 corresponds to the cascade model. Specifically, if $G_1$ is equal to 1, the gate function $G_1$ passes $\mu_{l-1}$ to the Gaussian filtering block; otherwise it passes it to the element-wise addition block with the computed message. Similarly, $G_2$ controls the computation of the normalization terms and switches between the computation of Eq. (7) and Eq. (12). In other words, if $G_2$ is equal to 0, the Gaussian filtering and weighting operations for $\gamma_3$ and $\gamma_4$ are disabled.

Importantly, for each step in the C-MF block we implement the calculation of error differentials for back-propagation as in [5]. There are two different types of CRF parameters, i.e. the bandwidth parameters $\theta_m$ and the Gaussian kernel weights $\beta_m$. Similar to [], the bandwidth values $\theta_m$ are pre-defined to simplify the calculation, and we implement the backward differential computation for the Gaussian kernel weights $\beta_m$. In this way the $\beta_m$ are learned automatically with back-propagation.

4.2 From Mean-Field Updates to Sequential Deep Networks

Fig. 5 illustrates the implementation of the two proposed CRF-based models using the C-MF block described above. In the figure, each blue-dashed box is associated to a mean-field iteration. The cascade model shown in Fig. 5(a) consists of $L$ single-scale CRFs. At the $l$-th scale, $t_l$ mean-field iterations are performed and then the estimated depth outputs are passed to the CRF model of the subsequent scale after a Rectified Linear Unit (ReLU) operation. The ReLU is used here for two reasons: first, the depth predictions should always be positive, and second, we want to increase the non-linearity of the sequential network for better mapping. To implement a single-scale CRF, we stack $t_l$ C-MF blocks and make them share their parameters, while we learn different parameters for different CRFs. For the multi-scale model, one full mean-field update involves $L$ scales simultaneously and is obtained by combining $L$ C-MF blocks. We further stack $T$ iterations for learning and inference. The parameters corresponding to different scales and different mean-field iterations are shared. In this way, by using the common C-MF layer, we implement the two proposed multi-scale continuous CRF models as deep sequential networks, enabling end-to-end training with the front-end network.
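The cascade described above can be read as a plain forward pass: a fixed number of parameter-sharing update steps per scale, followed by a ReLU before the result is handed to the next scale. The NumPy sketch below illustrates only that control flow. It assumes that all side outputs have already been brought to the same resolution and flattened to length $N$, and that the weighted kernel sum $\sum_m \beta_m K_m$ is precomputed per scale; `cmf_step` and `cascade_forward` are hypothetical names, and applying the ReLU to the final scale as well is a simplification made here.

```python
import numpy as np


def cmf_step(mu, obs, weighted_kernel):
    """One continuous mean-field update for a single scale.

    weighted_kernel is the (N, N) matrix sum_m beta_m * K_m with zero diagonal.
    """
    gamma = 2.0 * (1.0 + 2.0 * weighted_kernel.sum(axis=1))
    return (2.0 / gamma) * (obs + 2.0 * (weighted_kernel @ mu))


def cascade_forward(side_outputs, weighted_kernels, iters_per_scale):
    """Run L single-scale CRFs in sequence, coarse to fine.

    side_outputs:     list of L arrays of shape (N,), one per scale.
    weighted_kernels: list of L (N, N) matrices; parameters differ per scale.
    iters_per_scale:  list of L iteration counts t_l.
    """
    prev = np.zeros_like(np.asarray(side_outputs[0], dtype=float))
    for s_l, w_l, t_l in zip(side_outputs, weighted_kernels, iters_per_scale):
        # Observation for this scale: side output plus previous-scale estimate.
        obs = np.asarray(s_l, dtype=float) + prev
        mu = obs.copy()
        # The t_l stacked blocks reuse the same parameters (w_l).
        for _ in range(t_l):
            mu = cmf_step(mu, obs, w_l)
        # ReLU keeps the depth passed to the next scale non-negative.
        prev = np.maximum(mu, 0.0)
    return prev
```

With all kernel weights set to zero, each scale simply returns its observation, so the output reduces to the side outputs accumulated through the ReLUs; this makes a convenient sanity check for the wiring.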
4.3 Multi-Scale Message Passing Structures

Since the proposed work aims at multi-scale structured fusion and prediction, the connection structure used for message passing between the different multi-scale predictions plays an important role in the performance. In this section, we thus propose and investigate different message passing structures. Fig. 3 illustrates several structures, including the top down structure, the skip connection structure and the all to one structure. The top down structure is similar to the bottom up structure depicted in Fig. 2, which gradually refines the score maps from coarse to fine. The skip connection structure aims at utilizing more complementary information by skipping scales. The all to one structure uses all the other scales to refine the finest scale. Since all the message passing structures involve two scales at a time, we are able to build all of them using the C-MF block described above. These structures are investigated experimentally in the experimental part.

TABLE. The parameter details of the sub-network for generating the side output from the last-scale convolutional block of ResNet-50.

Name          Type            Kernel   Stride, Padding   Activation
conv_s5       conv                                       ReLU
deconv_s5_1   deconv                                     ReLU
deconv_s5_2   deconv                                     ReLU
deconv_s5_3   deconv                                     ReLU
deconv_s5_4   deconv                                     ReLU
pred          deconv & crop

4.4 Optimization of the Whole Network

We train the whole network using a two-phase scheme. In the first phase (pretraining), the parameters $\Theta$ of the base front-end network and the parameters $\vartheta = \{\theta_l\}_{l=1}^{L}$ of the side-output generation sub-branch networks are learned by minimizing the sum of $L$ distinct side losses as in [46], corresponding to the $L$ side outputs. We define the optimization objective using a square loss over $Q$ training samples as follows:

$$\{\Theta^{*}, \vartheta^{*}\} = \arg\min_{\Theta, \vartheta} \sum_{l=1}^{L} \sum_{i=1}^{Q} \bigl\lVert f_s(r_i; \Theta, \theta_l) - d_i \bigr\rVert_2^2, \tag{14}$$

where $d_i$ denotes the $i$-th ground-truth sample.
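As a small illustration of the pretraining objective above, the sum of side losses can be written directly in NumPy. This is a sketch that only evaluates the loss value for given predictions; it does not perform the minimization, and the function names `side_output_loss` and `pretraining_objective` are hypothetical.

```python
import numpy as np


def side_output_loss(side_predictions, ground_truth):
    """Sum over the L side outputs of the squared L2 error for one sample."""
    ground_truth = np.asarray(ground_truth, dtype=float)
    total = 0.0
    for pred in side_predictions:
        residual = np.asarray(pred, dtype=float) - ground_truth
        total += float(np.sum(residual ** 2))
    return total


def pretraining_objective(batch_side_predictions, batch_ground_truth):
    """Sum the per-sample side losses over all Q training samples."""
    return sum(
        side_output_loss(preds, depth)
        for preds, depth in zip(batch_side_predictions, batch_ground_truth)
    )
```

Because every side output is compared against the same ground-truth depth map, each branch receives its own supervision signal during pretraining.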
In the second phase (fine tuning), we initialize the front-end network with the parameters $\{\Theta^*, \vartheta^*\}$ learned in the first phase, and jointly fine-tune it with the proposed multi-scale CRF models to compute the optimal value of the parameters $\Theta$, $\vartheta$ and $\beta$, with $\beta = \{\beta_m\}_{m=1}^{M}$. The entire network is learned with Stochastic Gradient Descent (SGD) by minimizing a square loss:

$$\{\Theta^*, \vartheta^*, \beta^*\} = \arg\min_{\Theta, \vartheta, \beta} \sum_{i=1}^{Q} \left\| F(r_i; \Theta, \vartheta, \beta) - \bar{d}_i \right\|_2^2, \qquad (5)$$

where $\bar{d}_i$ is the $i$-th ground-truth sample. When the optimization of the whole network is finished, testing can be performed end-to-end, i.e. given a test RGB image as input, the network directly outputs an estimated depth map.

5 EXPERIMENTS

To demonstrate the effectiveness of the proposed multi-scale CRF models for monocular depth prediction, we performed experiments on three publicly available datasets: the NYU Depth V2 [43], the Make3D [39] and the KITTI [4] datasets. In the following we first describe the experimental setup and the implementation details, and then present the experimental results and analysis.

5.1 Experimental Setup

5.1.1 Datasets

The NYU Depth V2 dataset [43] contains 120K unique pairs of RGB and depth images captured with a Microsoft Kinect. The dataset consists of 249 scenes for training and 215 scenes for testing. The images have a resolution of 640 × 480. To speed up the training phase, following previous works [30], [53] we consider only a small subset of images. This subset has 1449 aligned RGB-depth pairs: 795 pairs are used for training and 654 for testing. Following [], we perform data augmentation for the training


samples. The RGB and depth images are scaled with a ratio $\rho \in \{1, 1.2, 1.5\}$ and the depths are divided by $\rho$. Additionally, we horizontally flip all the samples and randomly crop them to pixels. The data augmentation phase produces 4770 training pairs in total.

Fig. 6. Examples of qualitative depth prediction results of different methods on the NYU v2 test dataset. Different front-end deep network architectures are investigated. VGG-CD-MSCRF and ResNet-MSCRF represent our approach with the proposed multi-scale continuous CRF model plugged on the VGG-CD and ResNet-50 networks, respectively.

The Make3D dataset [39] contains 534 RGB-depth pairs, split into 400 pairs for training and 134 for testing. We resize all the images to a resolution of as done in [3] to preserve the aspect ratio of the original images. We adopted the same data augmentation scheme used for the NYU Depth V2 dataset but, for $\rho \in \{1.2, 1.5\}$, we randomly generate two samples each via cropping, obtaining 4K training samples.

The KITTI dataset [4] is built for various computer vision tasks within the context of autonomous driving, and contains depth videos captured through a LiDAR sensor deployed on a driving vehicle. For the training and testing split, we follow the protocol of Eigen et al. [] for a better comparison with existing works. Specifically, 61 scenes are selected from the raw data. In total, 22,600 images from 32 scenes are used for training, and 697 images from the other 29 scenes are used for testing. Following [3], the ground-truth depth maps are generated by reprojecting the 3D points collected by the Velodyne laser into the left monocular camera. The resolution of the RGB images is reduced to half of the original for training and testing.

5.1.2 Evaluation Metrics

Following previous works [], [], [45], we adopt the following evaluation metrics to quantitatively assess the performance of our depth prediction model.
Specifically, we consider:

- mean relative error (rel): $\frac{1}{P} \sum_{i=1}^{P} \frac{|\bar{d}_i - d_i|}{\bar{d}_i}$;
- root mean squared error (rms): $\sqrt{\frac{1}{P} \sum_{i=1}^{P} (\bar{d}_i - d_i)^2}$;
- mean log10 error (log10): $\frac{1}{P} \sum_{i=1}^{P} \left| \log_{10}(\bar{d}_i) - \log_{10}(d_i) \right|$;
- scale invariant rms log error as used in [], rms (sc-inv.);
- accuracy with threshold $t$: percentage (%) of $d_i$ subject to $\max\left(\frac{\bar{d}_i}{d_i}, \frac{d_i}{\bar{d}_i}\right) = \delta < t$, with $t \in [1.25, 1.25^2, 1.25^3]$.

Here $\bar{d}_i$ and $d_i$ are the ground-truth depth and the estimated depth at pixel $i$, respectively, and $P$ is the total number of pixels of the test images.

5.2 Implementation Details

We implemented the proposed deep model using the popular Caffe framework [5] on a single Nvidia Tesla K80 GPU with 12 GB memory. More details on the front-end CNN architectures, the generation of multi-scale side outputs and the parameter settings are given below.

5.2.1 Front-end CNN Architectures

To study the influence of the front-end CNN, we consider several network architectures: (i) AlexNet [3], (ii) VGG-16 [44], (iii) a fully convolutional encoder-decoder network derived from VGG-16, referred to as VGG-ED [], (iv) a Convolution-Deconvolution network based on VGG-16, referred to as VGG-CD [34], and (v) ResNet-50 [7]. For AlexNet, VGG-16 and ResNet-50, we obtain the side outputs from the last semantic convolutional layer of the different convolutional blocks, within each of which the layers produce feature maps with the same shape. The scheme used for the generation is introduced in the next section. The number of side outputs considered in our experiments is 5, 5 and 4 for AlexNet, VGG-16 and ResNet-50, respectively. As VGG-ED and VGG-CD have been widely used for dense pixel-level prediction tasks, we also investigate them in the experimental analysis. Both VGG-ED and VGG-CD have a

symmetric network structure, and five side outputs are then generated from the different blocks of the decoder or the deconvolutional network part.

TABLE 2
Quantitative performance comparison of different front-end deep network architectures and the proposed two multi-scale CRF models associated with the pretrained front-end networks on the NYU Depth V2 dataset.
Columns: Error (lower is better): rel, log10, rms; Accuracy (higher is better): $\delta < 1.25$, $\delta < 1.25^2$, $\delta < 1.25^3$.
Rows: AlexNet (pretrain); VGG-16 (pretrain); VGG-ED (pretrain); VGG-CD (pretrain); ResNet-50 (pretrain); AlexNet + cascade-CRFs; VGG-16 + cascade-CRFs; VGG-ED + cascade-CRFs; VGG-CD + cascade-CRFs; ResNet-50 + cascade-CRFs.

5.2.2 Generation of multi-scale CNN side-outputs

Our approach can be applied with any multi-scale front-end CNN model, including those with skip-connections. We here briefly describe the scheme we adopt to build CNN side outputs from the front-end CNN for the multi-scale fusion with CRFs. In [46] a convolutional layer is first used to generate a score map from the feature map, and then a deconvolutional (deconv) layer is adopted as a bilateral upsampling operator to enlarge the score map to the size of the input image. However, we noticed that with the approach in [46] the side outputs generated from the smaller feature maps are very coarse, so that many scene details are missing. To address this problem, after the convolutional layer we stack several deconv layers, each of them enlarging the output map by a factor of two. A Rectified Linear Unit (ReLU) is applied after each deconv layer. After the last deconv layer we use a crop layer to cut the extra margin and obtain a side output with the same resolution as the ground-truth image.
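The stack-then-crop scheme amounts to simple size arithmetic. The sketch below assumes, as a simplification, that each deconv layer exactly doubles the spatial size (kernel and padding effects are ignored); the function name `upsampling_plan` and the sizes in the usage note are illustrative and are not taken from the paper.

```python
def upsampling_plan(feature_size, target_size):
    """Plan the deconv stack for one spatial dimension.

    Returns (stages, margin): the number of x2 upsampling layers needed
    to reach at least target_size, and the extra margin that the final
    crop layer has to remove.
    """
    stages = 0
    size = feature_size
    while size < target_size:
        size *= 2  # each deconv layer doubles the map size (assumption)
        stages += 1
    return stages, size - target_size
```

For example, `upsampling_plan(8, 240)` returns `(5, 16)`: five doublings take a map of size 8 to 256, and the crop layer removes the remaining 16 pixels. This is consistent in spirit with a sub-network that chains several deconv layers and ends with a crop.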
We employ this scheme to obtain side outputs for AlexNet, VGG-16 and ResNet-50, while for VGG-CD and VGG-ED we use the same setting as in [46], since their decoder or deconvolutional part is able to produce more fine-grained side outputs. Table 1 shows the detailed network parameters used to obtain the side output from the last convolutional block of ResNet-50 (i.e. from the layer res5c).

5.2.3 Parameters settings

As described in Section 4.4, training consists of a pretraining and a fine-tuning phase. In the first phase, we train the front-end CNN with parameters initialized from the corresponding ImageNet pretrained models. For AlexNet, VGG-16, VGG-ED and VGG-CD, the batch size is set to and for ResNet-50 to 8. The learning rate is initialized at 0 and decreases by 0 times around every 50 epochs. 80 epochs are performed for pretraining in total. The momentum and the weight decay are set to 0.9 and , respectively. When the pretraining is finished, we connect all the side outputs of the front-end CNN to our CRF-based multi-scale deep models for end-to-end training of the whole network. In this phase, the batch size is reduced to 6 and a fixed learning rate of 0 is used. The same momentum and weight decay as in the pretraining phase are used. The bandwidth weights for the Gaussian kernels are obtained through cross validation. The number of mean-field iterations is set to 5 for efficient training of both the cascade CRFs and the multi-scale CRFs. We do not observe a significant improvement using more than 5 iterations. Training the whole network takes around 5 hours on the Make3D dataset, 8 hours on the KITTI dataset and 3 hours on the NYU v2 dataset.

5.3 Experimental Results

To present the experimental results, we start from an ablation study investigating the performance impact of different front-end network architectures, the effectiveness of the proposed CRF-based multi-scale fusion models and the influence of the stacking orders used to build the sequential neural network.
Then we compare the overall performance with state of the art methods, and finally we analyze the qualitative results and the running time.

5.3.1 Evaluation of different front-end CNN architectures

As discussed above, the proposed multi-scale CRF-based fusion models are general, and different deep architectures can be used for the front-end network. In this section we evaluate the impact of this choice on the depth estimation performance. We consider both the case of the pretrained front-end models (i.e. only side losses are employed and the multi-scale CRF models are not plugged in), indicated with pretrain, and the case of the fine-tuned models, which include the front-end network with the multi-scale cascade CRFs (cascade-CRFs). The results of the experiments are shown in Table 2. As expected, in both cases deeper CNN architectures produce more accurate predictions, and ResNet-50 achieves the best performance among all the front-end networks. Moreover, VGG-CD is slightly better than VGG-ED, and both these models outperform VGG-16, showing that the symmetric network structure is beneficial for the

TABLE 3
Quantitative baseline comparison with different multi-scale fusion schemes, and with the continuous CRF as a post-processing module, on the NYU Depth V2 dataset. The number of scales is investigated for both multi-scale models with a bottom-up message passing structure.
Columns: Error (lower is better): rel, log10, rms; Accuracy (higher is better): $\delta < 1.25$, $\delta < 1.25^2$, $\delta < 1.25^3$.
Rows: HED [46]; Hypercolumn [6]; C-CRF; Ours (single-scale); Ours - cascade (3-scale); Ours - cascade (5-scale); Ours - unified (3-scale); Ours - unified (5-scale).

TABLE 4
Quantitative performance evaluation of different message passing structures for the cascade CRF model, obtained by building the sequential deep network with the proposed C-MF block, on the NYU Depth V2 dataset.
Columns: Error (lower is better): rel, log10, rms; Accuracy (higher is better): $\delta < 1.25$, $\delta < 1.25^2$, $\delta < 1.25^3$.
Rows: Top-down structure; Bottom-up structure; Skip-connection structure; All-to-one structure.

TABLE 5
Overall performance comparison with state of the art methods on the NYU Depth V2 dataset. Our approach achieves the best results on most of the metrics, while the runners-up Eigen and Fergus [] and Laina et al. [7] employ more training data than ours. ResNet-50-unified means using the ResNet-50 front-end network with the proposed multi-scale unified CRF model.
Columns: Error (lower is better): rel, log10, rms, rms (sc-inv.); Accuracy (higher is better): $\delta < 1.25$, $\delta < 1.25^2$, $\delta < 1.25^3$.
Rows: Karsch et al. [4]; Ladicky et al. [0]; Liu et al. [3]; Ladicky et al. [5]; Zhuo et al. [53]; Liu et al. [30]; Wang et al. [45]; Eigen et al. []; Roy and Todorovic [38]; Eigen and Fergus []; Laina et al. [7]; Ours (ResNet-50-unified-4.7K-bottom up); Ours (ResNet-50-unified-95K-bottom up); Ours (ResNet-50-unified-95K-all to one).

TABLE 6
Overall performance comparison with state of the art methods on the Make3D dataset. Our approach outperforms all the competitors w.r.t. the C Error, and performs only slightly worse on the rel metric of the C Error than Laina et al. [7] using Huber loss and significantly larger training data.
Columns: C1 Error: rel, log10, rms, rms (sc-inv.); C2 Error: rel, log10, rms.
Rows: Karsch et al. [0]; Liu et al. [3]; Liu et al. [30]; Li et al. [8]; Laina et al. [7] ($\ell_2$ loss); Laina et al. [7] (Huber loss); Ours (ResNet-50-cascade-bottom up); Ours (ResNet-50-unified-bottom up); Ours (ResNet-50-unified-0K-bottom up); Ours (ResNet-50-unified-0K-all to one).


dense pixel-level prediction problems. Importantly, for all the considered front-end networks there is a significant increase in performance when applying the proposed CRF-based models. Figure 6 depicts some examples of predicted depth maps using different front-end networks on the NYU Depth V2 test dataset. As we can see from the figure, the qualitative results confirm that the deeper architecture leads to better depth recovery. By comparing the reconstructed depth maps obtained with pretrained models (e.g. using only the front-end networks VGG-CD and ResNet-50) with those generated with our multi-scale models, it is clear that our approach remarkably improves prediction accuracy and visual quality.

Fig. 7. Examples of depth prediction results on the Make3D dataset. The four rows from top to bottom are the input test RGB images, the results produced by Laina et al. [7], the results of our ResNet50-MSCRF model and the ground-truth depth maps, respectively.

5.3.2 Evaluation of different multi-scale CRF fusion models

To evaluate the effectiveness of the proposed CRF-based multi-scale fusion models, we conduct experiments on the NYU Depth V2 dataset and consider the following baselines: (i) the HED method in [46], where multiple side outputs are fused with a weighted averaging scheme and the sum of multiple side output losses is jointly minimized as deep supervision with a cross-entropy loss, while we use the square loss since our problem involves continuous variables; (ii) the Hypercolumn method [6], where multi-scale feature maps generated from different semantic network layers are concatenated and fused; (iii) a continuous CRF (C-CRF) applied on the prediction of the front-end network, i.e. plugged in after the last output layer as a post-processing module without end-to-end training.
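To make the difference between the first two baselines concrete, the two fusion styles can be sketched as follows. This is a minimal sketch under stated assumptions: the maps are assumed to be already resized to a common resolution, the averaging weights are normalized to sum to one, and the concatenated features are fused with a per-pixel linear layer (equivalent to a 1×1 convolution). None of these details, nor the function names, are taken from [46] or [6].

```python
import numpy as np


def weighted_average_fusion(side_outputs, weights):
    """Fuse L side-output maps of shape (L, H, W) into one (H, W) map."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalization is an assumption
    # Contract the scale axis: sum_l weights[l] * side_outputs[l].
    return np.tensordot(weights, side_outputs, axes=1)


def concatenation_fusion(feature_maps, channel_weights):
    """Concatenate feature maps along channels, then fuse them linearly.

    feature_maps:    list of arrays of shape (C_l, H, W).
    channel_weights: array of shape (sum of C_l,), one weight per channel.
    """
    stacked = np.concatenate(feature_maps, axis=0)
    return np.tensordot(np.asarray(channel_weights, dtype=float), stacked, axes=1)
```

Both functions combine the scales independently at each pixel, with no interaction between neighbouring pixels; this is the main structural difference from a CRF-based fusion.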
With the first two baselines, we want to compare our models with other popular methods for fusing multi-scale CNN information, while the third one aims at demonstrating the effectiveness of the continuous CRF itself. In these experiments we consider VGG-CD as the front-end CNN architecture. The results of the comparison are shown in Table 3. It is evident that with our CRF-based fusion models (both the cascade CRFs and the unified CRFs) more accurate depth maps can be obtained, demonstrating that our idea of integrating complementary information derived from CNN side output maps within a graphical model framework is more effective than traditional fusion schemes. Table 3 also compares the proposed cascade and unified models. As expected, the unified model produces more accurate depth maps, at the price of an increased computational cost. This can also be observed from Table 2. The C-CRF (in Table 3) improves the depth estimation on all metrics over VGG-CD (pretrain) (in Table 2) with a clear gap, showing that the CRF model is very useful for refining the depth map predicted by the deep network. By jointly learning with the front-end (i.e. end-to-end training), ours (single-scale) further boosts the performance. Finally, we analyze the impact of adopting multiple scales and compare our complete models (5 scales) with their versions in which only a single side output layer or three side output layers are used. It is evident that the performance can be improved by increasing the number of scales.

5.3.3 Evaluation of multi-scale message passing structures

We evaluate the influence of different multi-scale message passing structures using the cascade CRF model. Four connection structures as depicted in Fig. 3 are compared. Table 4


shows the monocular depth estimation results on the NYUD-V2 dataset. The comparison confirms that the message passing structure indeed has an impact on the final performance. The bottom-up and top-down structures have similar performance, while the skip-connection structure slightly outperforms these two.

Fig. 8. Examples of depth prediction results on the KITTI raw dataset. A qualitative comparison with other depth estimation methods on this dataset is presented. The sparse ground-truth depth maps are interpolated for better visualization. Panels: RGB Image; GT Depth Map; Eigen et al. []; Zhou et al. [5]; Garg et al. [3]; Godard et al. [5]; Ours.

TABLE 7
Overall performance comparison with state of the art methods on the KITTI raw dataset. Our approach obtains very competitive performance over all the competitors w.r.t. all the evaluation metrics on the testing set given by Eigen et al. []. For the setting, caps means different gt/predicted depth range and stereo means using left and right images captured from two monocular cameras in the training phase. Ours uses a unified model considering both the bottom-up and the all-to-one network structure.
Columns: Setting: range, stereo; Error (lower is better): rel, sq rel, rms, rms (sc-inv.); Accuracy (higher is better): $\delta < 1.25$, $\delta < 1.25^2$, $\delta < 1.25^3$.
Rows (range, stereo): Saxena et al. [4] (0-80m, No); Eigen et al. [] (0-80m, No); Liu et al. [30] (0-80m, No); Zhou et al. [5] (0-80m, No); Kuznietsov et al. [4] (only supervised) (0-80m, No); Garg et al. [3] (0-80m, Yes); Garg et al. [3] L + Aug 8x (-50m, Yes); Godard et al. [5] (0-80m, Yes); Kuznietsov et al. [4] (0-80m, Yes); Ours (ResNet-50 Pretrain) (0-80m, No); Ours (ResNet-50 Fine-tune-bottom up) (0-80m, No); Ours (ResNet-50 Fine-tune-all to one) (0-80m, No).
The all-to-one structure performs the best, producing around.0% gain in terms of the rel metric over the top-down structure, which means that directly passing messages to the finest prediction scale from the remaining scales can absorb more complementary information than the gradual passing fashion used in the first three structures.

5.3.4 Comparison with state of the art

We also compare our approach with state of the art methods on all the datasets. For previous works we directly report results taken from the original papers. Table 5 shows the results of the comparison on the NYU Depth V2 dataset. For our approach we consider the cascade model and use two different training sets for pretraining: the small set of 4.7K pairs employed in all our experiments and a larger set of 95K images as in [7]. Note that for fine tuning we only use the small set. As shown in the table, our approach outperforms all competing methods and it is the second best model when we use only 4.7K images. This is remarkable considering that, for instance, in [] 120K image pairs are used for training. Our model achieves the best results on all the metrics when using 95K pretraining samples and the proposed all-to-one message passing structure.

We also perform a comparison with several state of the art methods on the Make3D dataset (Table 6). Following [3], the error metrics are computed in two different settings, i.e. considering (C1) only the regions with ground-truth depth less than 70 and (C2) the entire image. It is clear that the proposed approach is significantly better than previous methods. In particular, comparing with Laina et al. [7], the best performing method in the literature, it is evident that our approach, both in the case of the cascade and of the multi-scale models, outperforms [7] by a significant margin when Laina et al. also adopt a square loss. It is worth noting that in [7] a training set of 5K image pairs is considered, while we employ far fewer training samples. By increasing our training data (i.e.
0K in the pretraining phase), our multi-scale CRF model also outperforms [7] with Huber loss (log10 and rms metrics). The final performance is further boosted by considering the all-to-one structure, as on the NYUD-V2 dataset. Finally, it is very interesting to compare the proposed method with the approach of Liu et al. [30], since


in [30] a CRF model is also employed within a deep network trained end-to-end. Our method significantly outperforms [30] in terms of accuracy. Moreover, the time reported in [30] for performing inference on a test image does not take into account the time required by the superpixel calculations. Conversely, with our method computing the depth map for a single image takes about sec in total.

Fig. 9. Examples of depth prediction results on the KITTI raw dataset. The middle column and the right column show the pretrained and the fine-tuned estimation results, respectively.

Fig. 9 also shows a qualitative comparison between the pretrained front-end CNN and the fine-tuned whole model. It can be observed that our approach can recover more scene structures and details. We believe that this is probably due to the effective structured fusion of the coarse-to-fine multi-scale predictions of the deep network with the proposed CRF models. As for the influence of the variance in the CRF model on the prediction errors, the variance term actually acts as a normalization factor after the message passing. It may have some influence, but based on our observation of the experimental results the main influence comes from the predictions of the deep front-end CNN.

The state of the art comparison on the KITTI dataset is shown in Table 7. The competitors include Saxena et al. [39], Eigen et al. [], Liu et al. [3], Zhou et al. [5], Garg et al. [3], Godard et al. [5] and Kuznietsov et al. [4]. As in our setting, the first four methods use single monocular images in the training phase, while the last two consider two monocular images in a stereo setting for training. Among the first four competitors, Eigen et al. [] significantly outperforms the others in terms of the mean relative error (rel), due to the usage of large-scale training data (more than a million samples). In contrast, our model achieves much better performance than Eigen et al.
[] in all metrics with much less data (22.6K samples). Although the training of the last two methods (requiring two monocular images) is not equivalent to our setting, the proposed approach with both the bottom-up and the all-to-one structures still produces better results than them, with a clear performance gap in all metrics. Kuznietsov et al. [4] report results for both the stereo training and the monocular supervised training. Our approach is not directly comparable with the stereo training setting, which is significantly different as it requires both left and right images from a binocular camera. Our work focuses on monocular depth estimation and achieves lower errors than theirs under the same monocular setting. Fig. 8 also shows some qualitative comparison results with these methods, further demonstrating the advantageous performance of our approach.

5.3.5 Qualitative depth estimation results

Fig. 6, 7 and 9 show some examples of the qualitative depth estimation results and the comparison with the competing methods on the NYUD-V2, Make3D and KITTI datasets, respectively. It is clear that the proposed approach is able to produce sharper depth estimates with better visual quality than the classic CNN structures, which demonstrates the importance of the prediction aided by the CRFs with appearance and smoothness constraints.

5.3.6 Empirical run-time analysis

Computational run-time complexity is an important aspect of deep structured prediction models. In this paragraph we provide a short discussion of the computational cost of the proposed CRF-based models. As shown in the paper, the multi-scale CRF model achieves better accuracy and lower error than the cascade model in both the NYU Depth V2 and the Make3D experiments. However, as expected, the cascade model is more advantageous in terms of running time. For instance, considering ResNet-50 as the front-end CNN, the time required at test phase for one image is.0 seconds w.r.t. the cascade model and.45 seconds w.r.t.
the multi-scale model, and the image resolution is pixels. A higher resolution of the network input usually brings more computational overhead. We also test the running time given the input resolution of and it costs around.5 seconds for processing one image. We believe that if we reduce the receptive field of the CRF model from fully connected to partially connected, the computing time could be significantly reduced.

6 CONCLUSION

In this paper, we introduced a novel approach for predicting depth maps from a single RGB image. The core of the method is a novel framework based on continuous CRFs for fusing multi-scale score-level side outputs derived from different semantic CNN layers. We demonstrated that this framework can be used in combination with several common CNN architectures and can be implemented for end-to-end training. The extensive experiments confirmed the validity of the proposed multi-scale fusion approach. While this paper specifically addresses the problem of depth prediction, we believe that other computer vision tasks involving pixel-level predictions of continuous variables can also benefit from our implementation of the mean-field updates within the CNN framework. Currently, the multi-scale fusion is performed at the score level. Future research will investigate the integration of both feature-level and score-level multi-scale information within a unified graphical model. Moreover, the study of strategies for further improving the training and testing efficiency of the CNN-CRF models will also be an interesting direction for future work. Monocular depth estimation is particularly useful for various cross-modal recognition and detection tasks. A straightforward follow-up of this work would be designing a joint multi-task deep model to transfer the learned depth model for
