
SUBMITTED TO TRANSACTION ON IMAGE PROCESSING

Online Temporally Consistent Indoor Depth Video Enhancement via Static Structure

Lu Sheng, Student Member, IEEE, King Ngi Ngan, Fellow, IEEE, Chern-Loon Lim and Songnan Li, Member, IEEE

Abstract: In this paper, we propose a new method to enhance the quality of a depth video online, based on the intermediary of a so-called static structure of the captured scene. The static and dynamic regions of the input depth frame are robustly separated by a layer assignment procedure, in which the dynamic part stays in front while the static part fits this structure and helps to update it through a novel online variational generative model with added spatial refinement. The dynamic content is enhanced spatially, while the static region is substituted by the updated static structure so as to favor long-range spatio-temporal enhancement. The proposed method both enforces long-range temporal consistency in the static region and keeps the necessary depth variations in the dynamic content. Thus it can produce flicker-free and spatially optimized depth videos with reduced motion blur and depth distortion. Our experimental results reveal that the proposed method is effective in both static and dynamic indoor scenes and is compatible with depth videos captured by Kinect and Time-of-Flight cameras. We also demonstrate that excellent performance can be achieved by the proposed method in comparison with the existing spatio-temporal approaches. In addition, our enhanced depth videos and static structures can act as effective cues to improve various applications including depth-aided background subtraction and novel view synthesis, showing satisfactory results with few visual artifacts.

Index Terms: Static structure, temporally consistent depth video enhancement, online estimation, layer assignment.

I. INTRODUCTION

Acquiring high-quality and well-defined depth data from real scenes has been a key problem in computer vision with the prevalence of various 3D applications in manufacturing and the entertainment industry, in uses that include virtual reality, 3DTV and free-viewpoint TV, game controllers and robot vision.
[Footnote: Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. This work was partially supported by the University of Malaya, Malaysia (Project UM.C/625/1/HIR/MOHE/ENG/42). L. Sheng, K. Ngan and S. Li are with the Department of Electronic Engineering, the Chinese University of Hong Kong, Hong Kong (lsheng@ee.cuhk.edu.hk; knngan@ee.cuhk.edu.hk; snli@ee.cuhk.edu.hk). C.L. Lim is with the University of Malaya, Malaysia (limchernloon@gmail.com).]

Recently, a variety of systems have been proposed to obtain depth information of a real scene, from passive stereo vision systems to active sensors like real-time structured-light depth sensors (e.g., Kinect), Time-of-Flight (ToF) cameras or laser scanners. Unfortunately, most systems suffer from low quality of the acquired depth maps, typically in terms of low resolution, noise and outliers, and missing depth regions (or holes) without depth measurements. These shortcomings obstruct the direct usage of depth information of captured scenes for different 3D applications.

Even though spatial enhancement of depth maps has been extensively studied in recent years, in areas such as energy minimization methods [1]-[3], filtering methods based on high-dimensional filtering [4]-[6], and other methods like patch matching [7], [8], the temporal inconsistency problem has nevertheless been neglected, since the necessary temporal relationship between adjacent frames has not been taken into consideration. Severe flickering artifacts have thus become an urgent issue to tackle. However, due to various complex and even unpredictable dynamic contents, as well as outliers in a depth video, it is not easy to exactly locate the regions where temporal consistency should be enforced. Several existing methods [9], [10] employ the temporal texture similarity to extract 2D motion information, but correct depth variation cannot always be maintained, thus causing severe motion blur.
In addition, typical treatments apply temporal consistency over a short sequence (usually 2-3 frames), which is insufficient to generate stable and temporally consistent results over hundreds of frames. Furthermore, over-smoothing around the boundaries between dynamic objects and static scenes should be eliminated to produce high-quality and well-defined depth video.

In this paper, we present an alternative method to enhance a depth video both spatially and temporally by addressing two aspects of these problems: 1) efficiently and effectively enforcing the temporal consistency where it is necessary, and 2) enabling online processing. A common fact is that regions in one frame with various motion patterns (e.g., static, slowly or fast moving, etc.) belong to different objects or structures and require different levels of temporal consistency. For instance, the static region needs long-range temporal enhancement to ensure that it is static over a long duration, while dynamic regions with slow or rapid motions expect short-term or no temporal consistency. However, it is difficult to accurately enhance arbitrary and complex dynamic contents in the temporal domain without apparent motion blur or depth distortion. Thus we propose an intuitive compromise: cancel the temporal enhancement in the dynamic region as long as its spatial enhancement is sufficiently satisfactory. In this case the necessary depth variation is not distorted, while the temporal artifacts are not as easily perceived as those in the static region. Therefore, we aim at strengthening long-range temporal consistency around the static region whilst maintaining necessary depth variation in the dynamic content.

To accurately separate the static and dynamic regions, we track and incrementally refine, online, a probabilistic model called the static structure, which acts as a medium to indicate the region that is static in the current frame. By fusing the static region of the current frame into the static structure online with an efficient variational fusion scheme, this structure implicitly gathers all the temporal data at and before the current frame that belong to it. When the static region is substituted by the updated static structure, it is thus temporally consistent and stable over a long range. Moreover, the scheme is suitable for online processing of streaming depth videos (3D teleconference, 3DTV, etc.) without the need to store large numbers of adjacent frames, and is thus memory- and computationally efficient.

Overall, the temporally consistent depth video enhancement is performed at two layers: 1) the static region of the input frame revealing the static structure is enhanced spatially and temporally by an online fusion technique combining it with the static structure, and 2) the dynamic content is enhanced spatially without temporal smoothness. In addition to the advantages stated above, enhancing the static and dynamic regions separately also effectively eliminates artifacts that frequently occur in conventional depth video enhancement, like blurring artifacts or unreliable depth propagation across the boundaries between dynamic objects and static objects or background. Furthermore, when the depth video contains severe holes, the static structure can fill static holes convincingly and leave the remaining holes to be filled by the dynamic content so as to avoid inpainting artifacts. Since fully dynamic depth videos usually have weak temporal consistency, our proposed algorithm reduces in that case to a spatial enhancement approach, which does not force the enhanced depth video to bear unnecessary temporal smoothness.
Based on the conference version [11] of this work, more technical details and theoretical analysis about the formulation of the static structure, effective layer assignment as well as a sound spatial enhancement of the static structure are discussed in this paper. Furthermore, a complete framework for temporally consistent depth video enhancement, a thorough experimental evaluation as well as discussions about its applications and limitations are also provided.

The rest of the paper is organized as follows. Section II reviews existing works in spatial and temporal depth video enhancement, as well as approaches to static scene reconstruction, which is related to our formulation of the static structure. Section III describes our proposed framework for online estimation of the static structure and the approach to temporally consistent depth video enhancement. Experimental results and discussions of our method can be found in Section IV. Discussions about its limitations and applications are presented in Section V. Concluding remarks and discussion on future work are given in Section VI.

II. RELATED WORK

Spatial enhancement. On the aspect of global optimization, the pioneering work was done by Diebel et al. [1], utilizing a pixel-wise MRF model with the guidance of texture to denoise the depth map. Several augmented models were also proposed to handle inpainting and super-resolution [2], [3], [12]-[14], with special choices of the data and smoothness terms as well as additional regularization terms (TV-$\ell_1$ norm [14], etc.), enabling a reasonable performance even without texture information [14]. But the high computational cost of these methods hinders real-time applications. Another choice is high-dimensional filtering. One variant is high-dimensional average filtering [4]-[6], [9], [15], whose weights are defined by the spatial nearness and feature proximity. The feature can be texture/depth intensities or patches [6], [16] and other user-defined ones. The main problems here are edge blurring and texture mapping.
Another variant uses the median of the depth candidate histogram instead [17], [18], producing more robust results but also suffering from quantization error and slower speed. Weighted mode filtering [10], [19] instead looks for the histogram's global mode, and has similar artifacts. In addition, spatial enhancement, especially super-resolution and inpainting, can be performed by patch matching throughout the depth map, which achieves satisfactory visual results [7], [8] but with high computational complexity.

Temporal enhancement. Existing temporal enhancement approaches usually employ the guidance of temporal texture consistency, especially by fusing the previous depth frame onto the current one according to the motion vectors estimated between the corresponding adjacent color frames [9], [10]. However, the neglect of additional motion vectors along the z-axis reduces the warping accuracy. 3D motion estimation is typically adopted to solve this problem in [20]-[22]. Following them, the temporal fusion between the current and warped previous frames is usually based on weighted average or weighted median filters, or on energy minimization [9], [10], [23], [24]. Therefore the performance, on one hand, relies heavily on the accuracy of motion estimation, which is difficult to guarantee. On the other hand, the temporal continuity is only preserved among a few adjacent frames, which does not meet the demand of constraining long-range temporal consistency. To fix this issue, Lang et al. [25] proposed to filter, offline, the paths, which are the vectors of all the pixels that correspond to the motion of one scene point over time. It provides a practical and remarkable solution to enhance a depth video with long-range temporal consistency both effectively and efficiently.

Our work is related to, but has essential differences from, the layer denoising and completion proposed by Shen et al. [26], which trained background layer models offline beforehand to label the foreground and background of the input depth frame, and in which no temporal consistency was strengthened.
Conversely, our method estimates the static structure in an online fashion and there is no need to have a series of depth frames capturing purely static scenes. Moreover, the temporal consistency is maintained where it is required, whereas [26] takes only spatial enhancement into consideration.

Static scene reconstruction. The static structure estimation is related to static scene reconstruction by fusing a series of depth maps. A majority of these works are offline methods [27]-[31] which fuse a set of depth maps to output a single geometric structure, while the rest are online approaches that receive depth measurements sequentially and incrementally estimate the current geometric structure. Offline methods always extract a batch of depth frames together, so that the complexity becomes unbearable when the number of frames is large. One of the offline approaches, by Zitnick et al. [31], employed the consistency of both the multiple-view color and disparity, which is analogous to our constraint of temporal consistency, to regularize the disparity space distribution so as to bring about the refined disparity map. Most online methods quantize the 3D space into grids [32]-[35] to reduce the memory and computational cost. Thus they are always deficient in sub-grid accuracy, but one additional approach exploits a weighted sum of truncated signed distance functions (TSDF) [33], [34] over depth measurements. However, it is sensitive to outliers and thus not robust enough to estimate a static scene containing dynamic objects and heavy outliers.

[Fig. 1. The illustration of the static structure in comparison with the input depth frame. (a) shows the input depth frame (blue curve) lying on the captured scene; (b) represents the static structure (black curve). The depth sensor is above the captured scene. The static structure includes the static objects as well as the static background. Labels in the figure: a static object, a moving object, static background, input depth frame, static structure.]

To robustly estimate the static scene captured by noisy and cluttered data, some researchers have proposed a variety of measurement models with parameters describing the nature of the noise and outliers. Several methods [32], [36] need parameters learned from ground truth data or tuned empirically. One successful model that requires fewer manually tuned parameters is the generative model, which has the ability to derive the model of the noise and clutter characteristics from the input data. Vogiatzis et al.
[37] proposed a generative Gaussian plus uniform model that simultaneously infers the depth and outlier ratio per pixel using an efficient online variational scheme, which meets the clutter characteristics of depth maps generated by stereo. Our static structure estimation is similarly an online generative model that considers both noise and outliers, with a special treatment of dynamic scenes.

III. APPROACH

The static structure can be regarded as an intrinsic depth structure (and texture structure when the registered color video is available) underneath the captured scene (see Footnote 1), which always lies on or behind the surface of the input depth frame. As shown in Figure 1, any moving or foreground object stays in front of the static structure whereas the static objects or visible static background are usually on it, i.e., the depth value of the static structure at one pixel is always deeper than that of a dynamic object at the same place. But it is different from the background of a scene, because we focus more on the static geometric structure rather than the distance from the camera. Since temporal consistency around static or slowly moving regions is required to be enforced, the static nature is more useful than the idea of background.

[Footnote 1: Within the scope of this paper, we assume the target depth video is captured by a static depth sensor, hence the captured scene is static except for the dynamic objects. Although the enhancement of depth video captured by moving cameras is a more general topic, we leave it to our future work.]

[Fig. 2. Flowchart of the overall framework of the proposed method for the estimation of the static structure and depth video enhancement. Please refer to the text for the detailed description. Block labels visible in the figure include: Online Static Structure Updating Scheme, Enhanced depth frame.]

To handle artifacts like noise, outliers and holes as well as complex dynamic contents in the input depth frame, we propose a probabilistic generative mixture model to describe the static structure as well as the characteristics of noise and outliers (Section III-A).
We also define an efficient layer assignment leveraging dense conditional random fields to accurately label the input depth frame into dynamic and static regions (Section III-D). For the sake of memory and computational efficiency, as well as the ability to process streaming data, the static structure is updated online (Section III-E) via a variational approximation (Section III-B) governed by a first-order Markov chain, which effectively fuses the labeled static region in the current depth frame with the previously estimated structure. It is further refined spatially to fill holes and regularize the structure (Section III-E). The updated static structure in turn substitutes the static region of the input depth frame, resulting in a temporally consistent depth video enhancement (Section III-F). The framework of the online static structure update scheme and temporally consistent depth video enhancement is shown in the flowchart in Figure 2.

Notation. The data sequence is denoted as $\mathcal{S}$ and formed by a depth video $\mathcal{D} = \{D^t \mid t = 1, 2, \ldots, T\}$ as $\mathcal{S} = \mathcal{D}$, or as a pair of aligned depth plus color videos as $\mathcal{S} = \{\mathcal{D}, \mathcal{I}\}$, where $\mathcal{I} = \{I^t \mid t = 1, 2, \ldots, T\}$. The data in each frame is $S^t = D^t$ or $\{D^t, I^t\}$. The pixel location is defined as $x$, its depth value at time $t$ is $d_x^t$ and its corresponding color is $I_x^t$. The parameter set for the probabilistic model at each frame is denoted as $P^{S,t}$, and $P_x^{S,t}$ is defined for each pixel $x$, whose elements are defined in detail in the following sections.


[Fig. 3. Illustration of the three states of input depth measurements with respect to the static structure on one line-of-sight. The current static structure refers to the blue stick in the middle. Decision boundaries are marked as blue dotted lines. The depth measurement $d$ is categorized into state-I when it is placed around the static structure. When $d$ is in front of this structure, we denote it as state-F. When it is far behind the static structure, the state is state-B. Labels in the figure: camera center, static structure, $d$, State-I, State-F, State-B.]

A. A Probabilistic Generative Mixture Model

At the very beginning, we only consider the case where $\mathcal{S} = \mathcal{D}$. Denote the sequentially incoming depth samples of pixel $x$ on and before time $t$ as forming a set $D_x^t = \{d_x^\tau \mid \tau = 1, 2, \ldots, t\}$. The depth value of the static structure at pixel $x$ is $Z_x$, whose noise is conveniently governed by a Gaussian distribution. We also propose two individual outlier distributions to describe the outliers in front of and behind the static structure respectively. Hence, they not only describe the depth distribution but also provide evidence to indicate the state to which the current depth sample belongs.

1) State Description: The three states $\Psi = \{I, F, B\}$ are illustrated in Figure 3 and listed as follows.

State-I: Fitting the static structure. If $d_x^t$ belongs to the static structure, we assume that it follows a Gaussian distribution centered at $Z_x$ as $\mathcal{N}(d_x^t \mid Z_x, \xi_x^2)$, where $\xi_x$ denotes the noise standard deviation, and is predefined based on the systematic error of the depth sensor. For instance, the noise variance of Kinect is related to the depth, so it is appropriate to set $\xi_x$ depth-dependently.

State-F: Forward outliers. On the other hand, the depth measurements from moving objects or outliers in front follow a clutter distribution $U_f(d_x^t \mid Z_x) = U_f \cdot 1_{[d_x^t < Z_x]}$, where $1_{[\cdot]}$ is an indicator function that equals 1 when the input argument is true, and 0 otherwise. This state is activated when $d_x^t$ is smaller than $Z_x$, and switched off if it is larger than $Z_x$.
It can be inferred from this state not only that outliers are in front, but also that dynamic objects may be present at the given location.

State-B: Backward outliers. Furthermore, it is possible that the input depth measurements are outliers lying behind the current estimation of the static structure. Another similar indicator distribution is introduced as $U_b(d_x^t \mid Z_x) = U_b \cdot 1_{[d_x^t > Z_x]}$. It can naturally represent outliers that have larger depth values than a given structure. Meanwhile, it provides a cue to infer the risk that the current static structure estimation is incorrect.

An additional hidden variable $m_x = [m_x^I, m_x^F, m_x^B]$ is introduced as the state indicator to represent these states, where $m_x^k \in \{0, 1\}$, $k \in \Psi$. Only one specific state has $m_x^k = 1$ and the rest are 0, thus $\sum_{k \in \Psi} m_x^k = 1$.

2) A Generative Model: The reason to introduce the generative model is that it can simulate the static structure as well as its noise and outliers; thus, in case there are no observed measurements at the current frame (e.g., depth holes), we can still give a reasonable static structure. Moreover, given suitable parametric forms of these distributions, the generative model can be estimated and refined online by updating the parameters with sequentially incoming depth samples.

Likelihood. Appending the state indicator $m_x$, the likelihood of $d_x^t$ conditioned on $m_x$ and the static structure $Z_x$ is a product of the distributions of the three states:

$$p(d_x^t \mid m_x, Z_x) = \mathcal{N}(d_x^t \mid Z_x, \xi_x^2)^{m_x^I} \, U_f(d_x^t \mid Z_x)^{m_x^F} \, U_b(d_x^t \mid Z_x)^{m_x^B}.$$

It reduces to the distribution of the required state when the corresponding indicator is switched on, $m_x^k = 1$, $k \in \Psi$.

Prior. Let the prior for $Z_x$ also be a Gaussian distribution with mean $\mu_x$ and standard deviation $\sigma_x$, written as $p(Z_x) = \mathcal{N}(Z_x \mid \mu_x, \sigma_x^2)$. $\sigma_x$ is different from $\xi_x$ since it represents the possible range of the static structure rather than its noise level. The prior of the chance to activate one state is a categorical distribution $\mathrm{Cat}(m_x \mid \omega_x)$ [38], where $\omega_x = [\omega_x^I, \omega_x^F, \omega_x^B]$, $\sum_{k \in \Psi} \omega_x^k = 1$ and $\omega_x^k \in (0, 1)$. This parameter reveals the chances of inducing these states in advance of the input depth samples.
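The likelihood of Section III-A can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names and the dictionary encoding of the indicator $m_x$ are hypothetical, and the clutter densities `u_f` and `u_b` are passed in as plain constants because their values are not given in this section.

```python
import math


def gaussian_pdf(d, mean, std):
    """Density of N(d | mean, std^2)."""
    return math.exp(-0.5 * ((d - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def state_likelihoods(d, z, xi, u_f, u_b):
    """Per-state likelihoods of a depth sample d given the static depth z."""
    return {
        "I": gaussian_pdf(d, z, xi),   # state-I: sample fits the static structure
        "F": u_f if d < z else 0.0,    # state-F: clutter in front of the structure
        "B": u_b if d > z else 0.0,    # state-B: clutter behind the structure
    }


def likelihood(d, m, z, xi, u_f, u_b):
    """p(d | m, Z): product of state densities raised to the one-hot indicator m."""
    value = 1.0
    for k, p in state_likelihoods(d, z, xi, u_f, u_b).items():
        value *= p ** m[k]  # p ** 0 == 1.0, so inactive states drop out
    return value
```

With a one-hot `m`, the product collapses to the density of the active state, which is the behavior the likelihood formula describes.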
The weight vector $\omega_x$ is further modeled by a Dirichlet distribution $p(\omega_x) = \mathrm{Dir}(\omega_x \mid \alpha_x)$, where $\alpha_x = [\alpha_x^I, \alpha_x^F, \alpha_x^B]$, $\alpha_x^k \in \mathbb{R}^+$ and each $\alpha_x^k$ corresponds to $\omega_x^k$.

Posterior. Two posteriors are essential for the static structure estimation. One is $p(Z_x, \omega_x \mid D_x^t)$, which jointly presents the depth distribution of the static structure and the popularity densities of the three states given the current and all previous depth frames. The other is the posterior of the state indicator $p(m_x \mid D_x^t)$, which represents the possible states at the current frame. Based on the estimated posteriors, we can evaluate the most probable depth value of the static structure by calculating the expectation $\mathbb{E}_{p(Z_x \mid D_x^t)}[Z_x]$. The reliability of the current estimation refers to $\mathbb{E}_{p(\omega_x \mid D_x^t)}[\omega_x^I]$, which means that the larger the portion of input depth samples that agree with the model, the more reliable the estimation is. The most probable state that $d_x^t$ should occupy is calculated straightforwardly from $\arg\max_{m_x} p(m_x \mid D_x^t)$.

B. Variational Approximation

However, it is almost infeasible to solve these posteriors analytically, because $Z_x$ and $\omega_x$ are not independent in $p(Z_x, \omega_x \mid D_x^t)$, and $p(Z_x \mid D_x^t)$ and $p(\omega_x \mid D_x^t)$ no longer exactly follow Gaussian and Dirichlet distributions. Therefore, a variational approximation [38] of the posteriors is introduced to provide sufficiently accurate approximated posteriors efficiently. It minimizes the Kullback-Leibler divergence between the approximated and the original posteriors. The variationally approximated posteriors are required to have the same parametric forms as the priors, thus they also produce analytical solutions to approximate $\mathbb{E}_{p(Z_x \mid D_x^t)}[Z_x]$ and $\mathbb{E}_{p(\omega_x \mid D_x^t)}[\omega_x^I]$.

The approximation starts from factorizing the posterior $p(Z_x, \omega_x \mid D_x^t)$ into the product of an independent Gaussian distribution $q^t(Z_x) = \mathcal{N}(Z_x \mid \mu_x^t, (\sigma_x^t)^2)$ and a Dirichlet distribution $q^t(\omega_x) = \mathrm{Dir}(\omega_x \mid \alpha_x^t)$ as

$$q^t(Z_x, \omega_x) = q^t(Z_x)\, q^t(\omega_x) \approx p(Z_x, \omega_x \mid D_x^t). \quad (1)$$

In addition, the exact estimation depends on all the previous depth samples $D_x^t$. Too many frames would bring about unbearable complexity and memory requirements.
We introduce a first-order Markov chain into our framework so as to favor the online estimation. It means that we can estimate the current posterior based only on the current likelihood and the posterior of the last frame, therefore it is memory- and computationally efficient. We reformulate the posterior as a sequential parameter estimation problem

$$q^t(Z_x, \omega_x) \approx p(Z_x, \omega_x \mid D_x^t) \approx \frac{p(d_x^t \mid Z_x, \omega_x)\, q^{t-1}(Z_x, \omega_x)}{q^t(d_x^t)} = Q(Z_x, \omega_x \mid d_x^t), \quad (2)$$

where the parameters of the left-hand side are estimated by matching moments between the distributions of the left- and right-hand sides [38]. This only considers the current data samples and the previously estimated parameters to approximate the current parameters. We define the parameter set estimated at $t-1$ as $P_x^{D,t-1} = \{\mu_x^{t-1}, \sigma_x^{t-1}, \alpha_x^{t-1}\}$, while the required parameter set is $P_x^{D,t}$.

[Fig. 4. Variational approximation of the parameter set of the static structure for a 1D depth sequence. The number of frames is T = 5. (a) $Z_x^t$ vs. $D_x^t$: the expected depth sequence of the static structure versus the raw depth sequence, where the ideal $Z_x$ = 5. (b) The confidence interval of $Z_x^t$; the interval is centered at $\mu_x^t$ and lies within $\mu_x^t \pm 2\sigma_x^t$ with 95% confidence. (c) The evolution of the portions of the three states (defined by the expected value of $\omega_x$ at frame $t$, denoted by $[\omega_x^{I,t}, \omega_x^{F,t}, \omega_x^{B,t}]$). The ideal portions are $\omega_x$ = [.89, .1, .1]. (d) The estimated distribution $q^T(d_x \mid P_x^{D,T})$ versus the normalized histogram estimated from $D_x^T$ when T = 5. The estimated depth of the static structure reaches the ideal value with only a few samples. Its confidence interval shrinks rapidly, which means the uncertainty is reduced very fast. The portion of each state evolves with the raw depth sequence, and the portions match their ideal values given enough depth samples. When T = 5, the estimated data distribution fits the data histogram compactly. Axes: number of frames (a-c), depth range (d). Numeric values in this caption are reproduced as extracted and may have lost digits.]
By matching the first and the second moments between $Q(Z_x \mid d_x^t)$ and $q^t(Z_x)$ as well as those between $Q(\omega_x \mid d_x^t)$ and $q^t(\omega_x)$ [39], we can obtain a closed-form solution for any parameter in $P_x^{D,t}$. Please refer to the supplementary materials for their detailed derivations. Hence, recalling the problem addressed in Section III-A, the approximated posterior with respect to the state indicator $m_x$ is $q^t(m_x^k = 1 \mid d_x^t)$, $k \in \Psi$, which is a suitable approximation of $p(m_x \mid D_x^t)$ and also has a closed-form solution. Apart from that, the most probable depth value of the static structure at pixel $x$ and time $t$ is

$$Z_x^t = \mathbb{E}_{p(Z_x \mid D_x^t)}[Z_x] \approx \mu_x^t, \quad (3)$$

and the reliability of the current estimation of the static structure is the expectation of $\omega_x^I$:

$$r_x^t = \mathbb{E}_{p(\omega_x \mid D_x^t)}[\omega_x^I] \approx \alpha_x^{I,t} \Big/ \sum_{k \in \Psi} \alpha_x^{k,t}. \quad (4)$$

As shown in Figure 4, an example of the variational approximation of the parameter set for a 1D depth sequence illustrates the potential of the proposed method to capture the nature of the input depth sequence.

C. Improvement with Color Video

The above discussion only considers the estimation and update of the static structure with the depth video. A more complete treatment uses the registered color video as well, in which case an improved probabilistic generative model can be formulated as follows.

Prior. We introduce another prior over $C_x$, the color value of the static structure at $x$, as $p(C_x) = \mathcal{N}(C_x \mid U_x, \Sigma_x)$ with two parameters: the mean $U_x$ and the variance $\Sigma_x$.

Likelihood. The likelihood of the input depth and color samples $d_x^t$ and $I_x^t$ conditioned on $m_x$, given $Z_x$ and $C_x$, is

$$p(d_x^t, I_x^t \mid m_x, Z_x, C_x) = U_f(d_x^t \mid Z_x)^{m_x^F} \, U_b(d_x^t \mid Z_x)^{m_x^B} \left[ \mathcal{N}(d_x^t \mid Z_x, \xi_x^2)\, \mathcal{N}(I_x^t \mid C_x, \Xi_x) \right]^{m_x^I}, \quad (5)$$

where $\Xi_x$ denotes the variance matrix for the color noise. One step further, we have the likelihood of $d_x^t$ and $I_x^t$ conditioned on $Z_x$ and $C_x$ accordingly. This formulation improves the inference since the input depth sample will belong to the static structure only when both the depth and color samples agree with the previous model. Therefore, the risk of false estimation is reduced.
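For the depth-only case of Section III-B, the sequential update in Eq. (2) can be illustrated numerically. The sketch below is not the closed-form solution referred to in the text (which is deferred to the supplementary materials); it assumes a brute-force grid integration over $Z_x$ and a simple summed second-moment rule for the Dirichlet strength, and the function name `update_pixel`, the grid size, and the clutter constants `u_f`, `u_b` are all hypothetical.

```python
import numpy as np


def update_pixel(d, mu, sigma, alpha, xi, u_f, u_b, grid_size=2001):
    """One moment-matching update of (mu, sigma, alpha) from a depth sample d.

    alpha is ordered as [alpha_I, alpha_F, alpha_B]. Returns
    (mu_new, sigma_new, alpha_new, resp), where resp approximates
    q(m^k = 1 | d) for k in (I, F, B).
    """
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()

    # Grid over Z wide enough to cover both the prior and the new sample.
    lo = min(mu - 6 * sigma, d - 6 * xi)
    hi = max(mu + 6 * sigma, d + 6 * xi)
    z = np.linspace(lo, hi, grid_size)
    dz = z[1] - z[0]
    prior = np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # Per-state likelihood of d as a function of Z (rows: I, F, B).
    lik = np.stack([
        np.exp(-0.5 * ((d - z) / xi) ** 2) / (xi * np.sqrt(2 * np.pi)),
        u_f * (d < z),
        u_b * (d > z),
    ])

    # Joint over (state, Z) with omega integrated out: E[omega_k] = alpha_k / a0.
    joint = (alpha / a0)[:, None] * lik * prior[None, :]
    evidence = joint.sum() * dz
    resp = joint.sum(axis=1) * dz / evidence

    # Match the first two moments of Z with a Gaussian.
    qz = joint.sum(axis=0) / evidence
    mu_new = (z * qz).sum() * dz
    var_new = ((z - mu_new) ** 2 * qz).sum() * dz

    # Q(omega) is a mixture of Dir(alpha + e_k) with weights resp_k.
    first = (alpha + resp) / (a0 + 1.0)
    second = (alpha + 1.0) * (alpha + 2.0 * resp) / ((a0 + 1.0) * (a0 + 2.0))
    # Assumed rule: pick the Dirichlet strength from summed second moments.
    a0_new = (first - second).sum() / (second - first ** 2).sum()
    return mu_new, np.sqrt(var_new), first * a0_new, resp
```

The returned `resp` plays the role of the approximated state posterior $q^t(m_x^k = 1 \mid d_x^t)$, and `alpha_new[0] / alpha_new.sum()` gives the reliability of Eq. (4). When one state takes all of the responsibility, the update reduces to adding one count to that state's $\alpha$.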
Posterior and variational approximation. In a similar fashion to Section III-B, we can derive the approximated posterior when color video exists. The parameter set $P_x^{S,t} = \{\mu_x^t, \sigma_x^t, U_x^t, \Sigma_x^t, \alpha_x^t\}$, $\mathcal{S} = \{\mathcal{D}, \mathcal{I}\}$, can also be estimated online and analytically. Furthermore, the most probable depth $Z_x^t$ and color $C_x^t$ of the static structure are obtained from $\mu_x^t$ and $U_x^t$. The approximate posteriors $q^t(m_x^k \mid d_x^t, I_x^t)$, $k \in \Psi$, are also derived accordingly.

D. Layer Assignment

In this section, we would like to find the static region of the input depth frame so as to robustly update the model of the static structure and find the dynamic region. Specifically, we label the input depth frame in three layers $\mathcal{L} = \{l_{iss}, l_{dyn}, l_{occ}\}$:

$l_{iss}$: agree with the estimated static structure;
$l_{dyn}$: belong to dynamic objects in front of it; or
$l_{occ}$: refer to the once occluded structure behind it.

The additional label $l_{occ}$ is essential because the regions belonging to the once occluded structure do not fit the current model, but they reveal the hidden structure behind the current estimated static structure. It also points out that the current estimation is biased in these regions, in which the depth structure from the input depth frame $D^t$ would be a more reasonable substitution to rectify the previous estimation. One toy example is shown in Figure 5, where $D^t$ provides a different layout from the current static structure. Intuitively, $l_{occ}$ occurs when the input depth frame provides larger depth values and exposes the hidden static structure. $l_{dyn}$, on the contrary, corresponds to smaller depth values. Furthermore, the failure of inference due to depth holes, noise and outliers can be eliminated by the introduction of texture information, which also provides additional cues to regularize their spatial layout.

[Fig. 5. One toy example illustrating the layer assignment. The cyan dotted line indicates the current estimated depth structure of the static structure, and the red solid line is from the input depth frame. If color frames are available, they provide additional constraints to regularize the assignment, where the upper line corresponds to the current estimated texture structure of the static structure, and the lower one refers to the input color frame. Labels in the figure: input depth frame, static structure, color frames.]

To improve the expressive power to label the complex structures that occur frequently in our case, we exploit a fully connected conditional random field (fully-connected CRF) [40] to strengthen the spatial long-range relationship. Assume a random field $L = \{l_x \in \mathcal{L} \mid \forall x\}$ conditioned on the input data $S^t$ and the previous model parameter set $M^t = P^{S,t-1}$. The Gibbs energy of a label assignment $L$ is

$$E(L \mid S^t, M^t) = \sum_{x} \psi_u(l_x \mid S^t, M^t) + \sum_{x \neq y} \psi_p(l_x, l_y \mid S^t, M^t), \quad (6)$$

where $x$ and $y$ are pixel locations, $\psi_u(\cdot)$ and $\psi_p(\cdot, \cdot)$ indicate the unary and pairwise potentials, and $S^t = D^t$ or $\{D^t, I^t\}$.
1) Definition of unary and pairwise potentials: We define the unary potentials and pairwise potentials as follows.

Unary potentials. The unary potentials are negative logarithms of the approximated posteriors $q^t(m_x \mid S_x^t)$, indicating the chance that the current depth samples follow the previous estimation (i.e., $l_{iss}$ requires $m_x^I = 1$), or lie in front of it (i.e., $l_{dyn}$ needs $m_x^F = 1$) or behind it (i.e., $l_{occ}$ refers to $m_x^B = 1$). In detail, we have $\psi_u(l_x = l_k \mid S^t, M^t) = -\ln q^t(m_x^k = 1 \mid S_x^t)$, where $l_k$ and $m_x^k$ follow the correspondences listed above.

Pairwise potentials. The pairwise potential between pixels $x$ and $y$ is a weighted mixture of Gaussian kernels:

$$\psi_p(l_x, l_y \mid S^t, M^t) = 1_{[l_x \neq l_y]} \left\{ w_s \exp\left(-\tau_\alpha \|x - y\|^2 / 2\right) + w_r \exp\left(-\|\Delta f_x - \Delta f_y\|^2_{\Sigma_\beta} / 2 - \tau_\gamma \|x - y\|^2 / 2\right) \right\}. \quad (7)$$

Algorithm 1: Online Static Structure Update Scheme
Input: data sequence $\mathcal{S} = \{S^\tau \mid \tau = 0, 1, 2, \ldots\}$; initial parameter set $P^S_{init}$
Output: current parameter set $P^{S,t}$
   // initialization
1  $t \leftarrow 0$, $P^{S,t} \leftarrow$ param_init($S^t$, $P^S_{init}$);
2  while $S^{t+1}$ is available do
3      $t \leftarrow t + 1$;
       // 1. layer assignment
4      $M^t \leftarrow P^{S,t-1}$, $L^t \leftarrow \arg\min_L E(L \mid S^t, M^t)$;
       // 2. parameter update
5      for all $x$ do
6          if $l_x = l_{iss}$ then $P_x^{S,t} \leftarrow$ vari_approx($S_x^t$, $P_x^{S,t-1}$)
           else if $l_x = l_{occ}$ then $P_x^{S,t} \leftarrow$ param_init($S_x^t$, $P^S_{init}$)
           else if $l_x = l_{dyn}$ then $P_x^{S,t} \leftarrow P_x^{S,t-1}$
       // 3. spatial enhancement
7      $Z_x^t \leftarrow \mu_x^t$, $\forall x$;
8      $Z^t \leftarrow$ spatial_enhance($Z^t$, $P^{S,t}$), $\mu_x^t \leftarrow Z_x^t$, $\forall x$;

We define $\Delta f_x = f_x^{I,t-1} - f_x^t$ to measure the difference between the features of the static structure and those of the input data. When $S^t = D^t$, $f_x^t$ and $f_x^{I,t-1}$ are the normalized $d_x^t$ and $Z_x^{t-1}$, obtained by a whitening process with the overall variance $(\bar{\xi}_x)^2 = (\sigma_x^{t-1})^2 + \xi_x^2$. If $S^t = \{D^t, I^t\}$, let $f_x^t$ and $f_x^{I,t-1}$ be the concatenations of the normalized vectors $[d_x^t; I_x^t]$ and $[Z_x^{t-1}; C_x^{t-1}]$. The color features are normalized with the variance $\bar{\Xi}_x = \Xi_x + \Sigma_x^{t-1}$. The indicator function $1_{[l_x \neq l_y]}$ makes the pairwise potentials a Potts model. It imposes a penalty on nearby pixels that are assigned different labels but have similar features. The first kernel is a smoothness kernel that removes small isolated regions and is adjusted by $\tau_\alpha$.
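The unary and pairwise potentials can be sketched in a few lines of NumPy. This is a brute-force illustration that builds the full $N \times N$ kernel matrix, so it only suits small inputs; the function names are hypothetical, the feature differences $\Delta f_x$ are assumed to be precomputed, and the pairwise sum runs over ordered pairs $x \neq y$ (counting each pair once instead would only rescale $w_s$ and $w_r$).

```python
import numpy as np


def unary(posteriors, eps=1e-12):
    """psi_u = -ln q(m^k = 1 | S_x); posteriors has shape (N, 3) for (iss, dyn, occ)."""
    return -np.log(np.clip(posteriors, eps, 1.0))


def pairwise_kernel(pos, dfeat, w_s, w_r, tau_alpha, tau_gamma, sigma_beta_inv):
    """Kernel part of Eq. (7) for all pixel pairs; returns an (N, N) matrix.

    pos: (N, 2) pixel coordinates; dfeat: (N, F) feature differences Delta f_x;
    sigma_beta_inv: (F, F) inverse covariance for the Mahalanobis distance.
    """
    dp = pos[:, None, :] - pos[None, :, :]
    dist2 = (dp ** 2).sum(-1)
    df = dfeat[:, None, :] - dfeat[None, :, :]
    maha2 = np.einsum("ijk,kl,ijl->ij", df, sigma_beta_inv, df)
    smooth = w_s * np.exp(-0.5 * tau_alpha * dist2)                   # smoothness kernel
    rng = w_r * np.exp(-0.5 * maha2 - 0.5 * tau_gamma * dist2)        # range kernel
    return smooth + rng


def gibbs_energy(labels, psi_u, kernel):
    """Eq. (6): unary terms plus Potts-weighted kernel over ordered pairs x != y."""
    n = len(labels)
    e_unary = psi_u[np.arange(n), labels].sum()
    differ = labels[:, None] != labels[None, :]  # diagonal is False, so x == y is excluded
    return e_unary + (kernel * differ).sum()
```

A labeling that assigns different labels to two nearby pixels with similar $\Delta f$ pays the full kernel value, which is the Potts behavior described in the text.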
The second kernel is a range kernel rying o force nearby pixels wih similar deph and/or color variaion o share he same label, wih a given parameer τ γ o se he degree of nearness. Δ f x Δ f y 2 Σ β is he Mahalanobis disance beween Δ f x and Δ f y, where he covariance marix Σ β encodes he feaure proximiy. The weigh of he range kernel is se as w r. If we only have he range kernel, he resul ends o be noisy, while if we only have he smooh kernel, he srucure canno be well regularized. 2) Inference: We exploi an efficien mean field inference mehod for fully-conneced CRF when he pairwise poenials are Gaussian [4]. I urns ou o be an ieraive esimaion process convolving several runs of real-ime high dimensional filering characerized by he pairwise poenials (7). E. Online Saic Srucure Updae Scheme The online saic srucure updaing scheme is acually a sequenial variaional parameer esimaion problem wih a layer assignmen o exclude he dynamic objecs and include he once occluded saic srucure. A spaial enhancemen is appended o regularize he spaial layou of he srucure. The skech of he algorihm is given in Algorihm 1. An iniializaion of he parameer se P S is necessary. We se he iniial μ x = d x, where d x D from he firs frame of he deph video. Similarly, le U x = I x, where I x I from he color video. The noise parameers ξ x and

7 SUBMITTED TO TRANSACTION ON IMAGE PROCESSING 7 Ξ x are user-specified consans which should be large enough o enable sufficien variance of inpu daa. σx and Σ x will be iniialized as large values as well. The parameers of ω x are also se up wih given consans α x. A convenien seup is αx I, = αx F, = αx B,. The user-given iniializaion parameer se is Pini S = {ξ x,σx, α x x} when S = D and Pini S = {ξ x,σx, Ξ x, Σ x, α x x} when S = {D, I}. In addiion, he layer assignmen is no applied in he iniializaion sep. A he h frame, he layer assignmen is applied a firs based on he previous parameer se P S, 1 and he inpu daa S. The region in which l x = l iss will perform he variaional parameer esimaion o obain a renewed Px S,.Ifl x = l dyn, i belongs o a dynamic objec so ha Px S, = Px S, 1. Bu on he oher hand, if l x = l occ, he parameer se of his pixel is reiniialized as in he iniializaion sep, bu μ x = d x, U x = I x. Furhermore, i is a common phenomenon ha he inpu deph frames conain holes wihou deph measuremens. In his case, μ x and λ x will no be updaed in hese special regions. The spaial enhancemen, including hole filling, smoohing and regularizaion, is necessary o generae a spaially refined saic srucure. I is performed afer he parameer esimaion in each frame, where we have obained he mos probable deph map Z (Zx Z ). A variaional inpaining mehod incorporaing a TV-Huber norm and a daa erm by Mahalanobis disance wih he variance ( ξ x) 2 is employed for spaial enhancemen, which is ieraively solved by a primaldual approach [14]. Since he solver requires hundreds of runs o converge, a rade-off beween speed and accuracy is adoped by fixing he number of ieraions and borrowing he spaially enhanced resul in he las frame Z 1 as he iniializaion. To reduce error propagaion, unreliable pixels in he inpu deph map Z are deleed according o he reliabiliy check rx >.5 (c.f., equaion (4)). 
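The per-pixel dispatch in step 2 of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the actual implementation: `update_static_structure`, `toy_vari_approx` and `toy_param_init` are hypothetical names, the running mean is only a placeholder for the variational update (which is not reproduced here), and keeping the whole parameter set unchanged at depth holes is a simplification of the rule that $\mu_x^t$ and $\lambda_x^t$ are not updated there.

```python
def toy_vari_approx(sample, old):
    """Placeholder for the variational update: a running mean of the samples.
    This is NOT the update derived in the paper, only a stand-in."""
    n = old["n"] + 1
    return {"mu": old["mu"] + (sample - old["mu"]) / n, "n": n}

def toy_param_init(sample):
    """Placeholder initialization: restart the model at the current sample."""
    return {"mu": sample, "n": 1}

def update_static_structure(params, labels, frame,
                            vari_approx=toy_vari_approx,
                            param_init=toy_param_init):
    """One parameter-update pass (step 2 of Algorithm 1).

    params: dict pixel -> parameter set from frame t-1
    labels: dict pixel -> "iss", "dyn" or "occ" (layer assignment result)
    frame:  dict pixel -> depth sample at frame t, or None for a depth hole
    """
    new_params = {}
    for x, old in params.items():
        sample = frame.get(x)
        if sample is None:
            new_params[x] = old                       # hole: nothing to fuse (assumption)
        elif labels[x] == "iss":
            new_params[x] = vari_approx(sample, old)  # static: fuse the new sample
        elif labels[x] == "occ":
            new_params[x] = param_init(sample)        # once occluded: reinitialize
        else:
            new_params[x] = old                       # dynamic: keep previous model
    return new_params
```

Only the current parameter set is carried from frame to frame, which is the property that keeps the memory footprint independent of the sequence length.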
Given the most probable color image $C^t$ of the current static structure, the spatial enhancement of $Z^t$ can absorb the texture information to guide the propagation of the local structures. In the end, the enhanced depth map $Z_x^t$ substitutes $\mu_x^t$ in $P_x^{S,t}$.

F. Temporally Consistent Depth Video Enhancement

Apart from spatial enhancement, it is preferable to employ temporal enhancement to produce a flicker-free depth video. To enable long-range temporal consistency and allow online processing, we exploit the static structure of the captured scene as a medium to find the region in the input frame exhibiting long-range temporal connection. The static region is enhanced by fusing the input depth measurements with the static structure according to the online static structure update scheme in Section III-E. Thus the static regions are well preserved and incrementally refined over time. The idea behind this is that we restrict the temporal consistency to be enforced only around static regions or slowly moving objects. This assumption is somewhat restrictive but is still suitable for processing normal depth videos. One additional advantage of the proposed method is that it can prevent bleeding artifacts that propagate depth values from moving objects into the static background, as long as the layer assignment is robust.

Given the resulting layer assignment of the current frame, the static region is where $l_x \in \{l_{iss}, l_{occ}\}$, including the regions referring to the static structure and those belonging to the once occluded static structure. They both expose the currently visible static structure of the captured scene, and thus shall be enhanced separately from the dynamic objects. The enhanced version is obtained by substituting the static region with its counterpart in the static structure, which has already been updated in the temporal domain and enhanced in the spatial domain (see Section III-E). The dynamic region can be enhanced by various approaches explored in the literature, while in this paper we exploit a conventional joint bilateral filter, both to fill holes and to perform edge-preserving filtering in the dynamic region.

The proposed method is both memory- and computationally efficient.
The memory required by the proposed method goes only to storing the parameter set for each pixel, so it is efficient for processing streaming videos or long high-quality sequences. Excepting the cost of the spatial enhancement, the complexity of temporal enhancement hinges on that of the online static structure update scheme, in which all the required parameters have analytical solutions, whilst the layer assignment is efficient thanks to the constant-time implementations for solving the fully-connected CRF model. Provided with an efficient spatial enhancement approach, for example, the domain transform filter [41] or the proposed one with the help of multi-thread techniques or GPGPUs [42], the entire temporally consistent depth video enhancement procedure can be achieved in real time.

IV. EXPERIMENTS AND DISCUSSIONS

In this section, we present our experiments on synthetic and real data to demonstrate the effectiveness and robustness of our static structure estimation and depth video enhancement. Section IV-A numerically evaluates the performance of our method for static structure estimation using synthetic depth videos (see footnote 2) generated from the Middlebury dataset [43], [44]. Our method is not sensitive to the user-given parameters, and outperforms various methods for static scene estimation with a running time comparable to temporal median filtering. In Section IV-B, we evaluate the performance on real data captured by Kinect and ToF cameras. Both static and dynamic indoor scenes are taken into consideration. Apart from the estimation of the static structure, we also evaluate the performance of the static scene reconstruction and, most importantly, the temporally consistent depth video enhancement in Section IV-C. Initial parameters are simply set as $\alpha_x^0 = [1, 1, 1]$, and $\sigma_x^0$ is 1% of the depth range of the input scene. The initial parameter $\Sigma_x^0$ is a diagonal matrix with each diagonal entry the square of 1% of the color range.

A. Numerical Evaluation of the Static Structure Estimation by Synthesized Data

We used two types of noise and outliers, which are illustrated in Figure 6, to contaminate the depth video so that we could evaluate the performance of our method with respect to different kinds of errors from different types of depth sensors.

Footnote 2: The depth of one pixel in the depth frame is proportional to the reciprocal of the disparity at the same place in the corresponding disparity frame.


Fig. 6. Sample frames of the input depth video with two types of noise and outliers. (a) is the sample color frame (Reindeer); (b) and (c) are the contaminated depth frames with $\sigma_n = 2$ and $\omega_n = 10^{-2}$. (b) is type-I while (c) is type-II. Type-II error is worse than type-I error with the same parameters.

Fig. 7. Score maps with varying $u$ and $\sigma$ under different noise and outlier parameter pairs $(\omega_n, \sigma_n)$: (a) I: $(10^{-3}, 1)$, (b) I: $(10^{-2}, 2)$, (c) I: $(10^{-1}, 4)$, (d) II: $(10^{-3}, 1)$, (e) II: $(10^{-2}, 2)$, (f) II: $(10^{-1}, 4)$. (a)-(c) were contaminated by type-I, while (d)-(f) were contaminated by type-II. Axes: sd param $\sigma$ and outlier param $u$.

Fig. 8. Performance comparisons between the constant and depth-dependent $\xi_x$ under different type-II noise and outlier parameter pairs $(\omega_n, \sigma_n)$: (a) $(10^{-1}, 4)$, (b) $(10^{-2}, 2)$, (c) $(10^{-3}, 1)$. The red curve is by depth-dependent $\xi_x$, and the blue curve is by constant $\xi_x$. Each curve is obtained at its own optimal parameter pair $(u, \sigma)$, as shown in the legends.

Type-I: We contaminated the depth map via $p(d_x \mid Z_x) = (1 - \omega_n)\, \mathcal{N}(d_x \mid Z_x, \sigma_n^2) + \omega_n\, \mathcal{U}(d_x)$, where $\mathcal{U}(d_x)$ is the reciprocal of the depth range. It is a general model of noise and outliers.

Type-II: We damaged the disparity map by $p(d_x^{disp} \mid Z_x^{disp}) = (1 - \omega_n)\, \mathcal{N}(d_x^{disp} \mid Z_x^{disp}, \sigma_n^2) + \omega_n\, \mathcal{U}(d_x^{disp})$ and rounded it. The disparity map was then transformed into the depth map. $\mathcal{U}(d_x^{disp})$ was the reciprocal of the disparity range. It mimicked the outliers in common depth videos captured by stereo or Kinect.
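The two contamination models can be sketched per pixel as below. This is a hedged illustration, not the experiment code: the function names, the argument layout and the product `fB` (focal length times baseline, used for the depth-disparity conversion) are assumptions, and clamping the rounded disparity to at least 1 is an extra safeguard added here to avoid division by zero.

```python
import random

def contaminate_type1(z, sigma_n, omega_n, depth_range, rng):
    """Type-I: d ~ (1 - omega_n) N(z, sigma_n^2) + omega_n U(depth range)."""
    lo, hi = depth_range
    if rng.random() < omega_n:
        return rng.uniform(lo, hi)      # outlier, uniform over the depth range
    return rng.gauss(z, sigma_n)        # inlier with Gaussian noise

def contaminate_type2(z, sigma_n, omega_n, disp_range, fB, rng):
    """Type-II: the same mixture applied to the disparity, which is then
    rounded and converted back to depth via depth = fB / disparity."""
    lo, hi = disp_range
    disp = fB / z
    if rng.random() < omega_n:
        noisy = rng.uniform(lo, hi)
    else:
        noisy = rng.gauss(disp, sigma_n)
    noisy = max(1, round(noisy))        # quantization; clamp is an assumption
    return fB / noisy
```

Because the rounding happens in the disparity domain, the resulting depth error grows with depth, which is the non-uniform quantization effect that makes type-II contamination harder than type-I for the same parameters.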
1) Analysis of user-given parameters: We first evaluated the user-given parameters, namely the outlier parameters $U_f$, $U_b$ and the noise standard deviation $\xi_x$. In case-I, we set $\xi_x = \sigma$ as a constant throughout the pixel domain. For case-II, the choice of $\xi_x$ should be suited to handling the non-uniform quantization error due to disparity-depth conversion, as $\xi_x = \sigma d_x^2 / (fB)$ (see footnote 3). Meanwhile, we set $U_f = U_b = u$. The experiments were evaluated by the score with varying $u$ and $\sigma$ under different levels of noise ($\sigma_n$) and outliers ($\omega_n$). The results are shown in Figure 7, where the test video had 1 frames. We set $\sigma \in [0, 2]$ and $u \in [10^{-5}, 10^{-1}]$. Notice that the tested scene was static, thus there was no need to perform layer assignment. The spatial enhancement was also skipped.

The proposed method achieves satisfactory performance and is insensitive to $\xi_x$, but a slightly bigger $\xi_x$ turns out to be more robust. On the other hand, we obtain low scores when $u$ is around or smaller than the reciprocal of the depth range (about $10^{-3}$ in the test depth videos). Although smaller $u$ can still achieve good performance, its range tends to be narrower when the noise level is increased. In practice, setting $U_f$ and $U_b$ to be the reciprocal of the depth range is sufficient and convenient, since it actually means that the outliers may occur uniformly inside the depth range.

In addition, the depth-dependent noise parameter $\xi_x$ performs better than the constant $\xi_x$ in dealing with type-II error. As shown in Figure 8, comparisons of the results by the optimal parameter pairs $(u, \sigma)$ of both cases (see footnote 4) reveal that a larger constant $\xi_x$ is required to capture the severer noise present at larger depth values, due to the property of type-II error. In comparison with the depth-dependent noise, a constant $\xi_x$ might be sufficient for slightly noisy depth videos, as shown in Figure 8(c), but lacks the capability to capture severe noise, as shown in Figure 8(a) and (b).

Footnote 3: $f$ is the focal length and $B$ is the baseline, both of which are provided in the Middlebury dataset. The conversion relationship is derived in the supplementary materials.
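The depth-dependent noise level used for case-II can be illustrated with a small sketch. It assumes the usual pinhole stereo relation, depth = fB / disparity, so that a disparity error of $\sigma$ maps to a first-order depth error of $\sigma d^2 / (fB)$; the function names and the numeric values of `f` and `B` below are hypothetical and not taken from the dataset.

```python
def depth_from_disparity(disp, f, B):
    """Pinhole stereo relation: depth = f * B / disparity."""
    return f * B / disp

def depth_dependent_xi(d, sigma, f, B):
    """First-order depth noise for a disparity noise level sigma:
    |dd / d(disp)| = d^2 / (f * B), hence xi = sigma * d^2 / (f * B)."""
    return sigma * d * d / (f * B)

# Illustrative values only: f * B = 1000 in arbitrary units.
f, B = 1000.0, 1.0
near, far = 50.0, 100.0
xi_near = depth_dependent_xi(near, 1.0, f, B)   # noise level at the near depth
xi_far = depth_dependent_xi(far, 1.0, f, B)     # noise level at the far depth
```

Doubling the depth quadruples the noise level, which is why a single constant $\xi_x$ has to be chosen large to cover distant pixels.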
2) Comparison of synthetic static scenes: As some online 3D scene reconstruction methods can also successfully perform static scene estimation in an online fashion, we numerically compared several state-of-the-art candidates with our method, i.e., the truncated signed distance function (TSDF) [33], [34] in KinectFusion, the temporal median filter (t-MF) and the generative model for depth fusion (g-DF) [35]. The grid number per pixel was set as 1, for both TSDF and g-DF. The temporal window size of t-MF was 5 in our experiments. As shown in Figure 9, as with all the other methods, the score of our method tends to decrease progressively as more frames are included. However, our method is robust to the noise and outliers for both the type-I and type-II errors, and has a faster convergence rate, i.e., it uses a smaller number of frames to converge and achieve a stable performance. The severer the noise, the greater the advantage of the proposed method. Because TSDF is always slower to converge and g-DF suffers from quantization errors, they usually cannot achieve the performance of our method. In fact, with a very large window size, t-MF might obtain scores even lower than those of our method, but it would require more memory and tend to be slower. Furthermore, t-MF does not provide the confidence of its output as our method does. Due to the quantization artifact of g-DF, even in an optimal setting, g-DF will generally exhibit a lower performance than that of the proposed method. The occupancy grid prevents g-DF from obtaining sub-grid accuracy [35]. The per-frame running time comparison is listed in Table I, where our method is comparable with t-MF.

Footnote 4: The optimal results were obtained by exhaustive search of 4 uniformly-sampled parameter pairs in the range $\sigma \in [0, 2]$ and $u \in [10^{-5}, 10^{-1}]$.

Fig. 9. Comparison with other methods (Ours, TSDF, t-MF, g-DF, Input) on static structure estimation of the synthetic static scenes. Three levels of noise and outlier parameter pairs $(\omega_n, \sigma_n)$ were tested: (a) I: $(10^{-3}, 1)$, (b) II: $(10^{-3}, 1)$, (c) I: $(10^{-2}, 2)$, (d) II: $(10^{-2}, 2)$, (e) I: $(10^{-1}, 4)$, (f) II: $(10^{-1}, 4)$. (a), (c) and (e) were of type-I. (b), (d) and (f) were of type-II. The x-axis marks the frame order, and the y-axis is the score.

TABLE I. PER-FRAME RUNNING TIME COMPARISON (MATLAB PLATFORM). Algorithms: t-MF (w = 5/1), g-DF, TSDF, Ours. Running time (s): .188 / ...

Fig. 10. Visual evaluation on real indoor static scenes. (a) and (b) are the results of two sequences, Indoor_Scene_1 and Indoor_Scene_2, captured from two real indoor scenes. The first row shows the raw depth sequences and color sequences. The second row shows the selected results of the estimated static structures without spatial enhancement at frame t = , 5, 1, respectively. The third row shows the corresponding spatially enhanced static structure without texture information, while the last row exhibits the results with the guidance of texture information. The yellow color in the second row marks missed depth values (holes). Gray represents the depth value, lighter meaning a nearer distance from the camera. Best viewed in color.
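For reference, the sliding-window temporal median baseline (t-MF) discussed above can be sketched as follows. It is a minimal sketch under stated assumptions: the class name, the list-of-rows frame layout and the convention that `None` marks a depth hole (and is skipped) are choices made here for illustration, not details of the compared implementation.

```python
from collections import deque
from statistics import median

class TemporalMedianFilter:
    """Per-pixel median over the last `window` frames."""

    def __init__(self, window):
        # The buffer keeps `window` full frames, so memory grows with the window.
        self.frames = deque(maxlen=window)

    def update(self, frame):
        """frame: list of rows of depth values; None marks a hole (assumption)."""
        self.frames.append(frame)
        out = []
        for i, row in enumerate(frame):
            out_row = []
            for j in range(len(row)):
                samples = [f[i][j] for f in self.frames if f[i][j] is not None]
                out_row.append(median(samples) if samples else None)
            out.append(out_row)
        return out
```

The sketch makes the trade-off visible: a larger window rejects more outliers but requires buffering and sorting more frames per pixel, whereas a recursive estimator only carries a fixed-size parameter set.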
With a window size of 5, t-MF has a slightly smaller computational cost, but when the window size is 1, its running time exceeds that of our method. g-DF and TSDF require much more time to process a single frame, yet their performance is still not comparable to that of our method.

B. Evaluation of the Static Structure Estimation by Real Data

To validate our algorithm with real data, we picked several depth video sequences captured by Kinect and ToF cameras. Both static and dynamic scenes were tested.

1) Static scenes: Figure 10 shows the results of two real indoor scenes captured by Kinect. The first row shows the raw depth and color video sequences. Notice that severe holes are present, and fine details of the scene are liable to be missed or given faulty depth values. Nevertheless, the corresponding color frames are always well defined everywhere and provide enough cues to regularize the structures.


Fig. 11. Reliability maps of two test sequences of indoor static scenes: (a) Indoor_Scene_1, (b) Indoor_Scene_2.

We first estimate the static structure just from raw depth frames without spatial enhancement; see the second row in Figure 10. Our method can robustly fill holes as long as sufficient depth samples in previous frames are available. In the case where only depth video is available, spatial enhancement is constrained only by the depth information. Even though the results are more spatially regular than those without spatial enhancement, inpainting artifacts occur inside sufficiently large holes, and edges are blurred. Furthermore, wrong measurements in the depth frames will be retained in the static structure and cannot be eliminated. As illustrated in the last row of Figure 10, spatial enhancement based on both depth and texture information produces refined static structures which are both reliable and user-acceptable. The results in green boxes show the differences between the two types of spatial enhancement. Directly employing spatial enhancement on raw depth frames cannot obtain stable results, since randomly occurring holes and outliers destroy the consistency between frames and prevent the regularizing of the depth map into a temporally stable one. The static structure, in contrast, enforces the long-range temporal connection and incrementally refines the static scene. As shown in the red circles in Figure 10, the missed structures cannot be inferred satisfactorily just by conventional methods, but they are refined and converge as time goes on. The reliability of the estimated static structure (shown in Figure 11) is measured by the proportion of samples that agree with the static structure as per equation (4), which indicates that flat or smooth surfaces in the static structure are of high reliability. When unreliable pixels are simply marked by $r_x^t \le 0.5$, many of them lie around discontinuities or occlusions.
It is reasonable that measurements around such regions tend to be unreliable due to the systematic limitations of Kinect and related depth sensors. The static structure can be spatially regularized further in conjunction with the reliability map by reducing the data confidence in the unreliable region. Our reliability map is data-driven, unlike those by heuristic methods [15] that need user-tuned parameters.

2) Dynamic Scenes: Our method can effectively extract the dynamic content from a static scene and further estimate and refine the static structure in the static region. Two videos were evaluated. One was captured by Kinect, a real indoor scene with people moving around (dyn_kinect_tl). The second was a hand sequence by a ToF camera (dyn_tof_tl).

Fig. 12. Static structure estimation on dyn_kinect_tl. (a) and (b) are the first five frames of the input sequence. (c) shows the layer assignment results. Red, green and blue denote $l_{iss}$, $l_{dyn}$ and $l_{occ}$, respectively. (d) represents the depth map of the static structure, and (e) shows the corresponding color map. The first frame is for initialization.

Fig. 13. Static structure estimation on dyn_tof_tl. (a) shows the first 5 frames of the input sequence. (b) shows the layer assignment results. Red, green and blue denote $l_{iss}$, $l_{dyn}$ and $l_{occ}$, respectively. (c) represents the depth map of the static structure. The first frame is for initialization.

Kinect sequence: dyn_kinect_tl is a time-lapse (3×) Kinect sequence. Figure 12 shows the results of the first five frames. The parameter set for layer assignment is $w_r = 5$, $w_s = 1$, $\tau_\alpha = 16^{-2}$, $\tau_\gamma = 3^{-2}$, $\Sigma_\beta = I$. Our proposed method can rapidly capture the static structure (both the depth and color) with very few frames. The artifact in Figure 12(d) is partially due to unreliable initialization, and partially because of the limited number of iterations of hole filling in the spatial enhancement. The latter can be solved gradually after a few frames, as shown in the 3rd and 4th frames in (d).
The former problem will be relieved by deleting the unreliable area in future frames according to the reliability map.

ToF sequence: The ToF sequence dyn_tof_tl [9] is time-lapse (1×) and has no color sequence embedded, as shown in Figure 13. The parameter set for layer assignment is $w_r = 2$, $w_s = 1$, $\tau_\alpha = 5^{-2}$, $\tau_\gamma = 1^{-2}$, $\Sigma_\beta = I$. Similar to the results from dyn_kinect_tl, the layer assignment can effectively exclude depth values from dynamic foregrounds ($l_x = l_{dyn}$) and include those from once occluded static structures ($l_x = l_{occ}$). Nevertheless, the blur around boundaries and the high noise level in the raw depth frames lead to halo artifacts in the resultant static structures in the first few frames, because in this case the layer assignment cannot definitively point out the exact boundaries between layers. Fortunately, later frames provide more reliable depth samples in such regions, thus eliminating these artifacts. See the difference from the 3rd to the 5th frame in Figure 13(c).

Fig. 14. Comparison on depth video enhancement. (a) and (b) are selected frames from the test RGB-D video sequences. From left to right: the 113th, 133rd, 153rd, 173rd, 193rd and 213th frames. (c) shows the results by CSTF [9], (d) by WMF [1], and (e) by Lang et al. [25]. (f) is generated by the proposed method. (g) compares the performances among these methods in the enlarged sub-regions (shown in raster-scan order). Best viewed in color.

C. Temporally Consistent Depth Video Enhancement

Our depth video enhancement works in conjunction with the online static structure update scheme. The quality of the static structure determines the resulting performance of enhancing the tested frame spatially and temporally. Thanks to the robustness and effectiveness of our proposed method, this temporally consistent enhancement outperforms most existing representative approaches and shows results comparable with the current state-of-the-art long-range temporally consistent depth video enhancement [25]. We tested several RGB-D sequences to verify our conclusion and highlight the advantages of the proposed method. These videos and their results by the proposed method and the reference approaches are available in the supplementary materials.

As shown in Figure 14, the selected frames from the sequence dyn_kinect_1 are the 113th, 133rd, 153rd, 173rd, 193rd and 213th, from left to right. Severe holes occurring in each frame are partially because of occlusion and partially due to the absorbent or reflecting materials in the captured scene. Worse still, the depth values around the boundaries of captured objects tend to be erratic. The raw depth and color frames are shown in Figure 14(a) and (b). The reference methods are coherent spatio-temporal filtering [9] (CSTF), weighted mode filtering [1] (WMF) and temporally consistent depth upsampling by Lang et al. [25]. Their parameters were set to the default values given in their papers.
The reference results are shown in (c), (d) and (e) of Figure 14, and the results of the proposed method are listed in Figure 14(f). CSTF tends to produce more blur than the rest of the methods, especially inside the holes around the boundaries between the foreground objects and the background scene.


WMF needs to quantize the depth frame into finite bins (in this experiment, 256 bins were applied), thus resulting in quantization artifacts even though it encourages sharper boundaries without blurring. Referring to any frame in Figure 14(c) and Figure 14(d), neither of these two methods can fill the depth holes with satisfactory accuracy, and the latter performs worse in stabilizing these holes. On one hand, the reason is that they are not able to fill large holes without propagating wrong depth structure when the texture is less informative. On the other hand, the temporal consistency is enhanced only within a small temporal window, thus the structure inside the holes cannot be preserved over a long time.

Fig. 15. Comparison of depth video enhancement. (a) dyn_kinect_2 and (b) dyn_kinect_3 are selected frames from two different RGB-D video sequences. From top to bottom: the RGB frames, the raw depth frames, results by Lang et al. [25] and results by the proposed method. Best viewed in color.

A recent and remarkable improvement attributable to Lang et al. [25] is a practical long-range temporal consistency enhancement. Its results, shown in Figure 14(e), demonstrate its superiority over the previous two reference methods in both structure regularization and temporal stabilization. Not only does the method by Lang et al. temporally stabilize the static objects and/or background, but it also enforces the long-range temporal consistency of the dynamic objects. In comparison with it, the proposed method cannot preserve the temporal consistency inside the dynamic objects. However, with the method of Lang et al., the bleeding artifacts in the hole regions still cannot be eliminated immediately and are liable to be propagated over the adjacent frames. Although this method is efficient in calculation thanks to the approximate solver based on constant-time domain transform filtering [41], it is globally optimized and thus often requires storing all frames in memory. In comparison with the prior art, the proposed method outperforms CSTF and WMF both spatially and temporally.
Furthermore, it generally has a performance comparable to that of Lang et al., sometimes even superior around static holes between dynamic objects and the static background, and in stabilizing the static region of each frame. Figure 14(g) compares the results of the enlarged sub-regions denoted by the red boxes in the original frames, in which our method features superior performance in regularizing these depth structures. In addition, by observing the static background behind the moving people, the proposed method offers much more stable results around regions where there were large holes, e.g., the black computer cases and monitors placed on and under the white tables. It both preserves the long-range stability of the depth structure in the holes of the static region and at the same time prevents depth propagation from the dynamic objects to the static background. Meanwhile, the spatially enhanced static structure by the proposed method can incrementally refine itself by following the guidance of the corresponding color map, and gradually converges to a stable output, just as discussed in Section IV-B1.

Two additional results by the proposed method and Lang et al. [25] are presented in Figure 15, in which the proposed method provides comparable quality while encouraging even more delicate details around the hands and heads, as well as blur-free boundaries between the human and the background, owing to the success of the layer assignment in Section III-D. However, because the proposed method cannot extract a static foreground object from the static background, blurring artifacts or false depth propagation may happen around their boundaries, just as with the aforementioned state-of-the-art method by Lang et al. and the filtering-based approaches like CSTF and WMF. Consider the standing person near the background in Figure 15(b): both the proposed method and that by Lang et al. falsely propagated the depth values from his left arm into the computer case in the background.

Fig. 16. Failure cases of the proposed method. (a) and (b) are two representative results. From left to right: color frame, raw depth frame and the enhanced depth frame. Artifacts are bounded by the red dotted boxes.

Fig. 17. Examples of the background subtraction. (a) RGB frame, (b) raw depth frame, (c) ours, (d) Lang et al. [25], (e) CSTF [9], (f) WMF [1]. Best viewed in color.

V. LIMITATIONS AND APPLICATIONS

A. Limitations

One limitation is that the proposed method has only been tested with indoor Kinect and ToF depth videos. To verify the reliability and generality of the proposed method, more diverse sources of depth data, e.g., depth videos capturing indoor or outdoor scenes, by Kinect, ToF or laser scanners, as well as stereo vision, should be evaluated thoroughly. For RGB-D video enhancement, the proposed method is constrained by the assumption that the static structure is static in both the depth and color channels. The static structure estimation may thus fail if the captured scene has varying illumination, in which case the spatio-temporal enhancement turns into a conventional spatial enhancement approach. Another possible drawback of the proposed method is that a false estimation in the static structure cannot be eliminated if future frames cannot provide enough reliable depth samples at the same location. For example, the artifacts marked by the red dotted boxes in the enhanced depth frames (cf. Figure 16) correspond to the holes in the input depth frames.
The input depth frames cannot provide effective and reliable depth samples in these regions, thus the artifacts cannot be explicitly detected by the proposed model. One possible improvement might heuristically define a threshold to delete such regions from the static structure when no reliable depth samples are received within a sufficiently long time.

The proposed method only models the captured scene with dynamic and static layers, and cannot be immediately extended to multiple (e.g., more than 3) layers. Although it is a tough question to define and model these layers properly, we believe that more accurate results are possible by introducing such an extension. For instance, the relationship between different dynamic objects can be well-defined if multiple dynamic layers compactly represent the local statistics of these objects. In this case, the spatial enhancement of each object can be handled separately and/or hierarchically, while the temporal enhancement can be adjusted to fit their distinctive motion patterns. Therefore, this meaningful extension is worth exploring in depth as a future topic.

Fig. 18. Examples of the novel view synthesis. (a) and (b) are the input RGB and depth frames. (c) is the enhanced depth frame by the proposed method. (d) is the synthesized view by the raw depth frame and the RGB frame. Image holes in (d) are filled by the static structure, as shown in (e). (f) is the synthesized view based on the enhanced depth frame and the image holes are also filled by the estimated static structure. Best viewed in color.

B. Applications

A high quality depth video improves various applications in the fields of image and graphics processing, and computer vision as well. In the following two successful applications, the enhanced depth videos by the proposed method act as an effective cue to improve performance.

1) Background Subtraction: We can use the processed RGB-D videos to improve the segmenting of the foreground objects from the background.
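As a rough illustration of depth-based background subtraction, the sketch below keeps the pixels whose depth lies below a constant threshold and paints everything else blue. It is not the authors' implementation: the function name `subtract_background`, the nested-list image layout, the treatment of zero depth as a hole, and the toy threshold in the usage lines are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
BLUE: Pixel = (0, 0, 255)  # assumed RGB fill color for the background


def subtract_background(
    rgb: Sequence[Sequence[Pixel]],
    depth: Sequence[Sequence[float]],
    threshold: float,
    fill: Pixel = BLUE,
) -> Tuple[List[List[Pixel]], List[List[bool]]]:
    """Keep pixels whose depth is valid and below `threshold`; paint the rest."""
    # Assumption: a depth of 0 marks a hole and is treated as background.
    mask = [[0 < d < threshold for d in row] for row in depth]
    out = [
        [px if keep else fill for px, keep in zip(rgb_row, mask_row)]
        for rgb_row, mask_row in zip(rgb, mask)
    ]
    return out, mask


if __name__ == "__main__":
    gray: Pixel = (200, 200, 200)
    rgb = [[gray, gray], [gray, gray]]
    # Toy depth values in the same unit as the threshold (hypothetical data).
    depth = [[500.0, 3000.0], [0.0, 1200.0]]
    out, mask = subtract_background(rgb, depth, threshold=1500.0)
    print(mask)
    print(out)
```

Because the segmentation is a plain per-pixel comparison with no boundary matting, its quality depends directly on the depth map: holes and flickering depth values show up as missing or noisy foreground regions.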
As shown in Figure 17, we tested one pair of RGB-D frames for background subtraction by simply extracting the region with depth values smaller than a constant threshold (in this case, we set the threshold as 15mm) and replacing the background with blue color. Note that no boundary matting was applied in any of the cases. The proposed method (cf. Figure 17(c)) shows a much more refined and complete foreground segment than those by the reference methods.

2) Novel View Synthesis: A variant of novel view synthesis, named depth image-based rendering (DIBR) [45], applies the depth information to guide the warping of the texture map of one view to another synthesized view. It is a popular technique for immersive telecommunication or 3D and free-view TVs. However, the performance is hampered by the quality of the depth video. As presented in Figure 18, the novel view generated by the raw depth frame and the registered RGB frame contains severe holes and cracks, as well as structure distortion. The static structure is appropriate to fill the image holes, but it may replace the structure of the foreground objects by mistake. The enhanced depth frame by the proposed method can preserve the depth structures well so that less structure distortion occurs in its synthesized view. Thus the synthesized view is visually plausible without apparent artifacts.

VI. CONCLUSION AND FUTURE WORK

In this paper, we present a novel method for robust temporally consistent depth enhancement by introducing the static structure of the captured scene, which is estimated online by a probabilistic generative mixture model with efficient parameter estimation, spatial enhancement and update scheme. After segmenting the input frame with an efficient fully-connected CRF model, the dynamic region is enhanced spatially while the static region is substituted by the updated static structure so as to favor a long-range spatio-temporal enhancement. Quantitative evaluation shows the robustness of the parameter estimation on the static structure and illustrates a superior performance in comparison to various static scene estimation approaches. Qualitative evaluation demonstrates that our method operates well on various indoor scenes and two kinds of sources (Kinect and ToF camera), and shows that the proposed temporally consistent depth video enhancement works satisfactorily in comparison with existing methods.

As our future work, an extension to deal with moving cameras will be a meaningful topic for study. Furthermore, we will improve the algorithm to reduce the effect of wrong estimation and design an efficient reliability check to increase the accuracy of the estimated static structure. Last but not least, a more general probabilistic framework that handles multiple dynamic and static layers needs to be explored to inherently increase the performance of the proposed method.

REFERENCES

[1] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," in Advances in Neural Information Processing Systems, vol. 18. MIT Press, 2005.
[2] J. Yang, X. Ye, K. Li, and C. Hou, "Depth recovery using an adaptive color-guided auto-regressive model," in Proc. Eur. Conf. Comput. Vis. Springer, 2012.
[3] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, "High quality depth map upsampling for 3D-ToF cameras," in Proc. IEEE Int. Conf. Comput. Vis., 2011.
[4] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint bilateral upsampling," in ACM Trans. Graph., vol. 26, no. 3, 2007, p. 96.
[5] J. Dolson, J. Baek, C. Plagemann, and S. Thrun, "Upsampling range data in dynamic environments," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010.
[6] B. Huhle, T. Schairer, P. Jenke, and W. Straßer, "Fusion of range and color images for denoising and resolution enhancement with a non-local filter," Comput. Vis. Image Understanding, vol. 114, no. 12, 2010.
[7] L. Sheng and K. N. Ngan, "Depth enhancement based on hybrid geometric hole filling strategy," in Proc. IEEE Int. Conf. Image Process., Sep. 2013.
[8] J. Lu, H. Yang, D. Min, and M. Do, "Patch match filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2013.
[9] C. Richardt, C. Stoll, N. A. Dodgson, H.-P. Seidel, and C. Theobalt, "Coherent spatiotemporal filtering, upsampling and rendering of RGBZ videos," Computer Graphics Forum (Proceedings of Eurographics), vol. 31, no. 2, May 2012.
[10] D. Min, J. Lu, and M. Do, "Depth video enhancement based on weighted mode filtering," IEEE Trans. Image Process., vol. 21, no. 3, March 2012.
[11] L. Sheng, K. N. Ngan, and S. Li, "Temporal depth video enhancement based on intrinsic static structure," in Proc. IEEE Int. Conf. Image Process., Paris, France, Oct. 2014.
[12] D. Herrera, J. Kannala, J. Heikkilä et al., "Depth map inpainting under a second-order smoothness prior," in Image Analysis. Springer, 2013.
[13] C. D. Herrera, J. Kannala, P. Sturm, and J. Heikkilä, "A learned joint depth and intensity prior using Markov random fields," in Proc. IEEE 3DTV-CON, 2013.
[14] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," Journal of Mathematical Imaging and Vision, vol. 40, no. 1, 2011.
[15] F. Garcia, B. Mirbach, B. Ottersten, F. Grandidier, and A. Cuesta, "Pixel weighted average strategy for depth sensor data fusion," in Proc. IEEE Int. Conf. Image Process., 2010.
[16] E. S. L. Gastal and M. M. Oliveira, "Adaptive manifolds for real-time high-dimensional filtering," ACM Trans. Graph., vol. 31, no. 4, pp. 33:1–33:13, 2012.
[17] Z. Ma, K. He, Y. Wei, J. Sun, and E. Wu, "Constant time weighted median filtering for stereo matching and beyond," in Proc. IEEE Int. Conf. Comput. Vis., 2013.
[18] Q. Yang, N. Ahuja, R. Yang, K.-H. Tan, J. Davis, B. Culbertson, J. Apostolopoulos, and G. Wang, "Fusion of median and bilateral filtering for range image upsampling," IEEE Trans. Image Process., vol. 22, no. 12, Dec. 2013.
[19] Q. Yang, R. Yang, J. Davis, and D. Nister, "Spatial-depth super resolution for range images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2007.
[20] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, "Three-dimensional scene flow," in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, 1999.
[21] C. Vogel, K. Schindler, and S. Roth, "3D scene flow estimation with a rigid motion prior," in Proc. IEEE Int. Conf. Comput. Vis., 2011.
[22] C. Vogel, K. Schindler, and S. Roth, "Piecewise rigid scene flow," in Proc. IEEE Int. Conf. Comput. Vis., 2013.
[23] S.-Y. Kim, J.-H. Cho, A. Koschan, and M. Abidi, "Spatial and temporal enhancement of depth images captured by a time-of-flight depth sensor," in Proc. IEEE Int. Conf. Pattern Recognit., Aug. 2010.
[24] J. Zhu, L. Wang, J. Gao, and R. Yang, "Spatial-temporal fusion for high accuracy depth maps using dynamic MRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, 2010.
[25] M. Lang, O. Wang, T. Aydin, A. Smolic, and M. Gross, "Practical temporal consistency for image-based graphics applications," ACM Trans. Graph., vol. 31, no. 4, pp. 34:1–34:8, Jul. 2012.
[26] J. Shen and S.-C. S. Cheung, "Layer depth denoising and completion for structured-light RGB-D cameras," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013.
[27] R. Szeliski, "A multi-view approach to motion and stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, 1999.
[28] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nister, and M. Pollefeys, "Real-time visibility-based fusion of depth maps," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007.
[29] S. Liu and D. Cooper, "A complete statistical inverse ray tracing approach to multi-view stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2011.
[30] Y. M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Miscusik, and S. Thrun, "Multi-view image and ToF sensor fusion for dense 3D reconstruction," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2009.
[31] C. Zitnick, S. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High-quality video view interpolation using a layered representation," in ACM SIGGRAPH, vol. 23, no. 3, August 2004.
[32] K. Pathak, A. Birk, J. Poppinga, and S. Schwertfeger, "3D forward sensor modeling and application to occupancy grid based sensor fusion," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2007.
[33] B. Curless and M. Levoy, "A volumetric method for building complex models from range images," in Proc. ACM SIGGRAPH, 1996.
[34] R. A. Newcombe, A. J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in Proc. IEEE Int. Symp. Mixed Augmented Reality, 2011.


[35] O. J. Woodford and G. Vogiatzis, "A generative model for online depth fusion," in Proc. Eur. Conf. Comput. Vis. Springer, 2012, pp. 144–157.
[36] S. Thrun, "Learning occupancy grids with forward models," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., vol. 3, 2001.
[37] G. Vogiatzis and C. Hernández, "Video-based, real-time multi-view stereo," Image and Vision Computing, vol. 29, no. 7, pp. 434–441, 2011.
[38] C. M. Bishop and N. M. Nasrabadi, Pattern recognition and machine learning. Springer, 2006.
[39] T. P. Minka, "A family of algorithms for approximate Bayesian inference," Ph.D. dissertation, Massachusetts Institute of Technology, 2001.
[40] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems. MIT Press, 2011.
[41] E. S. L. Gastal and M. M. Oliveira, "Domain transform for edge-aware image and video processing," ACM Trans. Graph., vol. 30, no. 4, pp. 69:1–69:12, Jul. 2011.
[42] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proc. IEEE Int. Conf. Comput. Vis., December 2013.
[43] D. Scharstein and C. Pal, "Learning conditional random fields for stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2007.
[44] H. Hirschmuller and D. Scharstein, "Evaluation of cost functions for stereo matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2007.
[45] C. Fehn, "Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV," in Electronic Imaging. International Society for Optics and Photonics, 2004.

Chern-Loon Lim (M'15) received the B.Eng., M.Eng.Sc. and Ph.D. degrees from the Department of Electrical Engineering, University of Malaya, Malaysia, in 2005, 2007 and 2013, respectively. He was a visiting scholar with the Chinese University of Hong Kong, Hong Kong. He is currently a postdoctoral researcher with the University of Malaya, Malaysia. His research interests include visual quality assessment, video processing, and pattern recognition.

Songnan Li (M'13) received his B.Sc. and M.Phil. degrees in Computer Science and Technology from the Harbin Institute of Technology (China) in 2004 and 2006, respectively. He joined the Chinese University of Hong Kong (CUHK) as a Research Assistant in 2007, and obtained his Ph.D. in Electronic Engineering (CUHK) in 2012. From 2012 to 2014, he was appointed as a Post-doctoral Fellow in the Department of Electronic Engineering (CUHK). Currently, he is a Research Assistant Professor in the same department. His research interests include image and video processing, RGB-D computer vision, and visual quality assessment.

Lu Sheng (S'13) is currently pursuing the Ph.D. degree in the Image and Video Processing Laboratory (IVP) of the Department of Electronic Engineering, the Chinese University of Hong Kong (CUHK). Before that he received his B.E. degree in Information Science and Electronic Engineering from Zhejiang University (ZJU) in 2011. His current research interests include 3D image/video processing and computer vision, especially RGB-D video enhancement, 3D reconstruction and novel view synthesis.

King Ngi Ngan (M'79, SM'91, F'00) received the Ph.D. degree in electrical engineering from Loughborough University, Loughborough, U.K. He is currently a chair professor with the Department of Electronic Engineering, Chinese University of Hong Kong, Shatin, Hong Kong. He was previously a full professor with Nanyang Technological University, Singapore, and with the University of Western Australia, Australia. He has been appointed Chair Professor at the University of Electronic Science and Technology, Chengdu, China, under the National Thousand Talents Program since 2012. He holds honorary and visiting professorships with numerous universities in China, Australia, and South East Asia. Prof. Ngan served as associate editor of IEEE Transactions on Circuits and Systems for Video Technology, Journal on Visual Communications and Image Representation, EURASIP Journal of Signal Processing: Image Communication, and Journal of Applied Signal Processing. He chaired and co-chaired a number of prestigious international conferences on image and video processing including the 2010 IEEE International Conference on Image Processing, and served on the advisory and technical committees of numerous professional organizations. He has published extensively including 3 authored books, 7 edited volumes, over 350 refereed technical papers, and edited 9 special issues in journals. In addition, he holds 15 patents in the areas of image/video coding and communications. Prof. Ngan is a Fellow of IEEE (U.S.A.), IET (U.K.), and IEAust (Australia), and an IEEE Distinguished Lecturer in
