Joint Example-based Depth Map Super-Resolution

Yanjie Li 1, Tianfan Xue 2,3, Lifeng Sun 1, Jianzhuang Liu 2,3,4
1 Information Science and Technology Department, Tsinghua University, Beijing, China
2 Department of Information Engineering, The Chinese University of Hong Kong
3 Shenzhen Key Lab for CVPR, Shenzhen Institutes of Advanced Technology, China
4 Media Lab, Huawei Technologies Co. Ltd., China
carol.lyj@gmail.com, xtf009@ie.cuhk.edu.hk, sunlf@tsinghua.edu.cn, jzlu@ie.cuhk.edu.hk

Abstract

The fast development of time-of-flight (ToF) cameras in recent years enables the capture of high frame-rate 3D depth maps of moving objects. However, the resolution of a depth map captured by a ToF camera is rather limited, and thus it cannot be directly used to build a high quality 3D model. To handle this problem, we propose a novel joint example-based depth map super-resolution method, which converts a low resolution depth map to a high resolution depth map, using a registered high resolution color image as a reference. Different from previous depth map SR methods, which have no training stage, we learn a mapping function from a set of training samples and enhance the resolution of the depth map via a sparse coding algorithm. We further use a reconstruction constraint to make object edges sharper. Experimental results show that our method outperforms state-of-the-art methods for depth map super-resolution.

Keywords: Depth map; Super-resolution (SR); Registered color image; Sparse representation

I. INTRODUCTION

A depth map, representing the relative distance of each object point to the video camera, is widely used in 3D object presentation. Current depth sensors for capturing depth maps can be grouped into two categories: passive sensors and active sensors. Passive sensors, such as stereo vision camera systems, are time-consuming and inaccurate in textureless or occluded regions. In contrast, active sensors generate more accurate results, and the two most popular active sensors are laser scanners and Time-of-Flight (ToF) sensors.
Laser scanners, despite the high-quality depth maps they generate, are limited to static environments, as they can only measure a single point at a time. Compared with these sensors, ToF sensors are much cheaper and can capture depth maps of fast moving objects, and they have drawn increasing interest in recent years [1], [2]. However, a depth map captured by a ToF sensor has very low resolution. For example, the resolution of the PMD CamCube 3.0 is 200 × 200 and the resolution of the MESA SR 4000 is 176 × 144. To improve the quality, a post-processing step is needed to enhance the resolution of the depth map [3], [4], which is called depth map super-resolution (SR) in the literature.

Some previous SR approaches [3] recover a high resolution depth map from multiple depth maps of the same static object (taken from slightly displaced viewpoints). For the situation with only one depth map captured from a single viewpoint, most state-of-the-art methods focus on recovering a high resolution depth map from the low resolution depth map with the help of a registered high resolution color image, as shown in Figure 1. A common approach is to apply a joint bilateral filter with color information to raise the resolution [4]. This approach can obtain a depth map with sharper boundaries. However, since the joint bilateral filter does not have a training stage, it is sensitive to noise in the color image, and the recovered depth map often contains some false edges. Other algorithms, such as detecting edges of the registered high resolution color image to direct SR [5] and utilizing color values to calculate weights for SR to achieve sharp boundaries [6], follow a similar principle and share the problems of the bilateral filter method.

Figure 1. Depth map super-resolution. (a) Raw depth map captured by a ToF camera. (b) Corresponding registered color image. (c) Recovered high resolution depth map.

Recently, example-based 2D image SR has developed rapidly.
In this approach, the algorithm learns the fine details that correspond to different low resolution image patches from a set of low resolution and high resolution training pairs, and then uses the learned correspondence to predict the details of a low resolution testing image [7]. Sun et al. propose a Bayesian approach to sharpen the boundaries of the interpolation result by inferring high resolution patches from input low resolution patches based on primal sketch priors [8]. Chang et al. propose to learn a mapping function from low resolution patches to high resolution ones using locally linear embedding (LLE) [9]. Glasner et al. propose to directly collect training patch pairs from the single low resolution input image, instead of a pre-collected data set [10]. Yang et al. propose an example-based 2D image SR approach using sparse signal representation [11], and Dong et al. extend this work using multiple sets of

bases learned from the training data set to adapt to the content variation across different images [12]. However, all these example-based methods take only a single image as input and do not fit our application well, where the input includes a depth map and a registered 2D color image.

In this paper, we propose a novel joint example-based SR method to enhance the resolution of a depth map captured by a ToF camera. Unlike traditional example-based SR methods, which only utilize a single 2D image as input, our joint example-based SR uses both a low resolution depth map and a registered color image to recover a high resolution depth map. We design a mapping function from a low resolution depth patch and a color patch to a high resolution depth patch, according to their sparse representations on three related dictionaries. Furthermore, we use a reconstruction constraint to ensure the accuracy of reconstruction and make object edges sharper. Experimental results demonstrate that our SR method obtains a high resolution depth map with clearer boundaries and fewer false edges than state-of-the-art methods.

Figure 2. Pipeline of example-based SR.

II. DEPTH MAP SR

In our work, depth map SR reconstructs a high resolution depth map $I_h$ from a low resolution depth map $I_l$ and a high resolution color image $I_c$. First, the reconstructed high resolution depth map $I_h$ should be consistent with the low resolution depth map $I_l$:

$$I_l = D I_h \tag{1}$$

where $D$ is a downsampling operator.
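To make the constraint (1) concrete, the following minimal Python sketch measures how well a candidate high resolution depth map agrees with the low resolution input. The text does not specify the operator D at this point; block averaging is assumed here purely for illustration, and the function names `downsample` and `reconstruction_residual` are hypothetical.

```python
import numpy as np


def downsample(depth_hr: np.ndarray, factor: int = 2) -> np.ndarray:
    """Assumed downsampling operator D: average each factor x factor block."""
    h, w = depth_hr.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by the factor")
    # Split the image into (factor x factor) blocks and average inside each block.
    blocks = depth_hr.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


def reconstruction_residual(depth_lr: np.ndarray,
                            depth_hr: np.ndarray,
                            factor: int = 2) -> float:
    """Root mean square of I_l - D I_h; zero means constraint (1) holds exactly."""
    diff = depth_lr - downsample(depth_hr, factor)
    return float(np.sqrt(np.mean(diff ** 2)))
```

A residual of zero only says that the candidate is consistent with the input under the assumed D; many high resolution maps satisfy (1), which is why the color image is needed as additional evidence.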
Furthermore, as the reconstructed high resolution depth map corresponds to the registered color image $I_c$ at the pixel level, there is a correlation between the two images $I_c$ and $I_h$. Based on the analysis above, we model the correlation between $I_l$, $I_c$ and $I_h$ as a mapping function $r(\cdot,\cdot)$ from a low resolution depth patch and its corresponding color patch to a high resolution depth patch:

$$H = r(L, C) \tag{2}$$

where $L$ is a patch in $I_l$, $C$ is the corresponding color patch in $I_c$, and $H$ is the recovered high resolution depth patch in $I_h$. In this paper, we define the mapping function $r(\cdot,\cdot)$ using the sparse representation of patches over pre-trained dictionaries, which will be discussed in Section II-B.

A. Feature Representation for Patches

To enhance robustness, we do not use the raw depth map patches and color patches as input for (2). Instead, we extract features from them and use these features to represent the raw patches, as shown in Figure 2. For a low resolution depth patch $L_{raw}$, we use the first and second derivatives of its bicubic interpolation result to form a feature vector to represent this patch:

$$L = \left[ \frac{\partial}{\partial x} H_{int},\ \frac{\partial}{\partial y} H_{int},\ \frac{\partial^2}{\partial x^2} H_{int},\ \frac{\partial^2}{\partial y^2} H_{int} \right] \tag{3}$$

where $H_{int}$ is the bicubic interpolation result of $L_{raw}$. According to [11], the first and second derivatives of patches can better reflect the similarity between patches.

For a color patch $C_{raw}$, we use its edge map as features. Since some edges in the color image are caused by the texture of the object and do not correspond to edges in the depth map, we need to remove them to enhance the correlation between high resolution depth patches and color patches. Therefore, we first upsample the low resolution depth patch using bicubic interpolation and extract the edges of both the color image and the upsampled depth image.
Then the pixel-wise product of these two edge maps is used as the feature to represent a color patch, which can efficiently remove the texture edges of the color image:

$$C = \nabla(C_{raw}) \odot \nabla(H_{int}) \tag{4}$$

where $\nabla$ is the edge extraction operator, $\odot$ denotes the pixel-wise product, $H_{int}$ is the bicubic interpolation result of $L_{raw}$, and $C_{raw}$ is the color patch.

The feature to represent a high resolution depth patch $H_{raw}$ is:

$$H = H_{raw} - \mathrm{mean}(H_{raw}) \tag{5}$$

where $\mathrm{mean}(H_{raw})$ is the vector with all its elements equal to the mean depth value of $H_{raw}$. In the practical SR procedure, $\mathrm{mean}(H_{raw})$ is unknown before reconstruction, so we use $\mathrm{mean}(H_{int})$ in its place. As $H_{int}$ and $H_{raw}$ share similar mean depth values, this replacement is reasonable.

B. Mapping Function via Sparse Representation

The mapping function $r(\cdot,\cdot)$ in (2) is defined using the sparse representation of the low resolution depth patch and the color patch. We first co-train three dictionaries: $D_h$ consists of high resolution patches; $D_l$ consists of low resolution patches; $D_c$ consists of color patches. Notice that patches in

$D_l$, $D_c$ and $D_h$ are in correspondence, i.e., the $i$-th low resolution patch in $D_l$ corresponds to the $i$-th high resolution patch in $D_h$ and the $i$-th color patch in $D_c$. The details of training will be introduced in Section II-C.

Then, for each new input low resolution patch $L$ and its corresponding color patch $C$, we find the output high resolution patch $H$ as follows. We first find a sparse representation of these two patches on the dictionaries $D_l$ and $D_c$ respectively. Here we enforce that $L$ and $C$ have the same sparse coefficients on dictionaries $D_l$ and $D_c$. Then the high resolution patch is recovered using the same coefficients. The mapping function is defined as:

$$H = r(L, C) = D_h \alpha^*, \quad \alpha^* = \arg\min_{\|\alpha\|_0 \le \epsilon} \left\{ \lambda_l \|D_l \alpha - L\|_2^2 + \lambda_c \|D_c \alpha - C\|_2^2 + f(H, L) \right\} \tag{6}$$

where $\alpha$ is the coefficient vector consisting of all the coefficients, each of $D_h$, $D_l$ and $D_c$ is a matrix with each prototype patch being a column vector, $\lambda_l$ and $\lambda_c$ are two balance parameters, and $\|\cdot\|_0$ denotes the $l_0$-norm. $f$ is a function that enforces the reconstruction constraint defined in (1). The detailed definition of $f(H, L)$ is given in Section II-D.

We enforce the sparsity constraint $\|\alpha\|_0 \le \epsilon$ for two reasons. First, with the sparsity constraint, it is reasonable to reconstruct $H$ using the coefficients $\alpha$, which are the linear coefficients for representing $L$ (or $C$) using patches in $D_l$ (or $D_c$). Second, as discussed in [13], if a high resolution patch $H$ can be represented as a sufficiently sparse linear combination of patches in $D_h$, it can be perfectly recovered from a low resolution patch.

As in previous works on sparse representation, we replace the $l_0$-norm by the $l_1$-norm in (6) for computational efficiency. As discussed in [13], the $l_1$-norm constraint still ensures the sparsity of the coefficients $\alpha$. Ignoring $f$, (6) becomes

$$\alpha^* = \arg\min_{\|\alpha\|_1 \le \epsilon} \left\{ \lambda_l \|D_l \alpha - L\|_2^2 + \lambda_c \|D_c \alpha - C\|_2^2 \right\} \tag{7}$$

C. Dictionary Training

The dictionaries $D_l$, $D_c$ and $D_h$ are trained from a set of corresponding low resolution depth patches, high resolution depth patches (ground truth) and color patches.
Training minimizes the following estimation error:

$$E = \|r(L, C) - H\|_2^2 \tag{8}$$

Combining (7) and (8), we have:

$$\min_{D_l, D_c, D_h, \alpha} \ \lambda_l \|D_l \alpha - L\|_2^2 + \lambda_c \|D_c \alpha - C\|_2^2 + \|D_h \alpha - H\|_2^2 \quad \text{subject to } \|\alpha\|_1 \le \epsilon \tag{9}$$

In our work, we use about $10^5$ patches for training and, after training, each dictionary contains only 1024 patches. We use a dictionary size much smaller than the number of training samples for robustness and efficiency. The dictionary training formulation (9) is common in sparse representation and can be solved using an iterative optimization method [14], [11].

D. Mapping Function with a Reconstruction Constraint

The reconstruction constraint defined in (1) is important for SR. Without this constraint, the downsampled version of the recovered high resolution depth map is not guaranteed to be close to the input low resolution depth map, and serious artifacts will appear when the mapping function fails to produce the correct high resolution patch.

Directly combining the reconstruction constraint (1) with the mapping function defined in the previous section is not easy. So we first apply an upsampling operator $U$ to both sides of (1), resulting in:

$$U I_l = U D I_h \approx I_h \tag{10}$$

The simplest way of upsampling is bicubic interpolation. However, there is an obvious blurring effect on the object boundaries in the depth map $H_{int}$ obtained from the interpolation. To remove this effect, we apply the joint bilateral filter proposed in [4], which generates clearer object boundaries than $H_{int}$:

$$H_b(x) = \frac{1}{Z} \sum_{x' \in N(x)} \exp\left( -\frac{\|x - x'\|^2}{\theta_s} - \frac{\|C(x) - C(x')\|^2}{\theta_c} \right) H_{int}(x') \tag{11}$$

where $Z$ is the normalization factor, $N(x)$ is a neighborhood of $x$, $C(x)$ is the RGB color vector of the pixel at position $x$ in the registered color image $I_c$, $H_{int}(x)$ is the depth value of the pixel at position $x$ in $H_{int}$, and $\theta_s$ and $\theta_c$ are the spatial and color bandwidth parameters. After filtering, pixels with similar colors tend to have similar depth values. Therefore, the filtering result $H_b$ normally has sharper boundaries than the interpolation result $H_{int}$. Then we use $H_b = U I_l \approx I_h$ as the reconstruction constraint. With a slight abuse of notation, let $H_b$ also denote a patch of the filtered map $H_b$.
Then the recovered high resolution depth patch $H$ in $I_h$ should be as close to $H_b$ as possible. We use the $l_2$-norm to model this reconstruction constraint $f$ and add it to the mapping function (7):

$$H = D_h \alpha^*, \quad \alpha^* = \arg\min_{\|\alpha\|_1 \le \epsilon} \left\{ \lambda_l \|D_l \alpha - L\|_2^2 + \lambda_c \|D_c \alpha - C\|_2^2 + \lambda_r \|D_h \alpha - H_b\|_2^2 \right\} \tag{12}$$

where $\lambda_r$ is the weight of the reconstruction constraint. (12) is a LASSO linear regression problem and can be efficiently solved by [15]. In the experiments, we simply set all the weighting parameters $\lambda_l$, $\lambda_c$ and $\lambda_r$ to 1.

E. Proposed Joint Example-based Depth Map SR

Using the final mapping function (12), the proposed example-based depth map SR algorithm is summarized in Algorithm 1. To remove blocking effects, we divide the low resolution depth map into overlapping patches and obtain the high resolution patch for each of them using the mapping function (12). We then combine these patches into a whole high resolution depth map by averaging the depth values over the overlapping regions.
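As an illustration of how the mapping (12) might be evaluated for a single patch, the sketch below stacks the three weighted dictionaries into one matrix, so that (12) becomes a single LASSO problem, and solves its penalized form with iterative soft-thresholding (ISTA). This is not the solver of [15] referred to above: the penalty weight `mu`, the iteration count and all function names are assumptions, and the penalty term stands in for the constraint on the $l_1$-norm. The feature vectors `L`, `C` and `H_b` are assumed to be flattened, with `H_b` given in the same mean-removed form as the atoms of `D_h`.

```python
import numpy as np


def soft_threshold(v: np.ndarray, t: float) -> np.ndarray:
    """Proximal operator of t * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def joint_sparse_code(D_l, D_c, D_h, L, C, H_b,
                      lam_l=1.0, lam_c=1.0, lam_r=1.0,
                      mu=0.1, n_iter=500):
    """Approximately minimize 0.5 * ||A a - y||^2 + mu * ||a||_1 with ISTA.

    A and y stack the three weighted terms of (12), so the shared
    coefficient vector a has to explain L, C and H_b at the same time.
    """
    A = np.vstack([np.sqrt(lam_l) * D_l,
                   np.sqrt(lam_c) * D_c,
                   np.sqrt(lam_r) * D_h])
    y = np.concatenate([np.sqrt(lam_l) * L,
                        np.sqrt(lam_c) * C,
                        np.sqrt(lam_r) * H_b])
    # Step size 1 / Lipschitz constant of the gradient of the quadratic term.
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)
    alpha = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ alpha - y)
        alpha = soft_threshold(alpha - step * grad, step * mu)
    return alpha


def recover_patch(D_h, alpha, mean_depth):
    """High resolution patch: D_h * alpha plus the mean removed in (5)."""
    return D_h @ alpha + mean_depth
```

In use, `mean_depth` would be the mean of the interpolated patch, as suggested in Section II-A, and the recovered patches would then be averaged over their overlaps as described above.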

Algorithm 1 Proposed example-based depth map SR
Input: A low resolution depth map $I_l$, the corresponding color image $I_c$, and dictionaries $D_l$, $D_c$ and $D_h$
1) Upsample $I_l$ to the same resolution as $I_c$, and apply the bilateral filter (11) to it to get the filtered image $H_b$
2) for each patch $L$ in $I_l$
3) Get the corresponding color patch $C$ and depth patch $H_b$ from $I_c$ and $H_b$ respectively
4) Get the high resolution depth patch $H$ from $L$, $C$ and $H_b$ by solving the optimization problem (12)
5) end for
6) Recover the whole high resolution depth map by combining all patches $H$ obtained in step 4

III. EXPERIMENTS AND RESULTS

A. Comparison with other approaches

We collect 34 pairs of color images and depth maps from 4 videos of Philips 3DTV. The low resolution depth maps for training are obtained from high resolution depth maps by Gaussian blurring and downsampling. In the training stage, we only train dictionaries that increase the resolution of a depth map by a factor of 2. Examples of training patches are shown in Figure 3. For larger magnifying factors, such as 8, we get the high resolution depth map by applying the SR three times. For each training patch triple (low and high resolution depth patches and color patch), we randomly select a 4 × 4 patch from the low resolution depth maps and its corresponding 8 × 8 patches from the high resolution depth maps and color images. We extract 100,000 patch triples from the 34 groups of images to train the dictionaries $D_l$, $D_h$ and $D_c$, each of which contains 1024 patches after training.

To evaluate the performance of our algorithm, we compare it with five other SR methods. Three of them use no registered color image information: bicubic interpolation (Bic), bilateral filtering using only depth information (D-B), and the sparse representation method proposed by Yang et al. in [11] (Sc-Y).
The other two methods are state-of-the-art methods designed specifically for depth map SR, which take the registered color image information into account: joint bilateral filtering as defined in [4] (J-B) and the algorithm that uses the registered color image to direct the calculation of interpolation weights as in [6] (C-W). From another angle, among the five methods, Sc-Y is an example-based algorithm, just like our method. The same training set, dictionary size, patch size, and overlap size are used in both our algorithm and Sc-Y. The other four methods do not have a training stage.

The testing set consists of 13 images randomly selected from 6 videos. Four of these videos are from the Philips 3DTV website, like the training set (Girl, Football, Dandelion and Frisbee), and two are from the Microsoft Research Asia website (Ballet and Breakdancers).

Figure 3. Examples of training patches. The first row shows color patches, the second row shows the corresponding low resolution depth patches, and the third row shows the corresponding high resolution depth patches.

Table I
ROOT MEAN SQUARE ERROR (RMSE) FOR SIX TESTING VIDEOS (AVERAGE RMSE PER VIDEO)

Test video     Ours    Sc-Y    J-B     C-W     D-B     Bic
Girl           6.5     7.03    7.3     9.4     7.35    7.36
Football       4.94    5.7     6.5     7.1     6.37    6.39
Dandelion      5.09    5.49    6.18    5.91    6.3     6.3
Frisbee        8.4     8.84    9.37    10.89   9.46    9.47
Ballet         8.83    9.63    9.98    9.4     10.16   10.5
Breakdancers   5.44    5.68    5.7     6.79    5.74    5.76

We first evaluate the performance of these six algorithms with a magnifying factor of 8. Quantitative results are given in Table I. They show that our algorithm achieves the best results on all the testing sequences. The RMSE also shows that the example-based algorithms (our algorithm and Sc-Y) outperform the non-training methods, and that the methods taking the registered color image information into account (J-B and C-W) perform better than the ones using only the low resolution depth map (D-B and Bic).

To further evaluate these six methods, we compare their visual performance. Figure 4 shows examples from one MSRA video (Ballet) and one Philips video (Frisbee).
We can see that the depth maps recovered by our algorithm clearly outperform those of Bic, D-B, J-B and Sc-Y, with sharper boundaries and smoother surfaces, which can be easily seen in the enlarged parts. Although the depth map obtained by C-W has sharper boundaries than ours, it contains some obvious artifacts (marked by circles) and its RMSE is much higher than ours (see Table I). This is because C-W is not robust, and a small amount of noise in the color image may greatly affect the upsampled depth map. For instance, the woman's hand in Ballet is clearly incorrect with C-W because the color of the woman's hand is very similar to that of the wooden bar behind her, and C-W mainly uses the background depth value to fill the unknown pixels. Another example is that the dandelion contains many errors because the lines are thin and are easily affected by the surrounding background.

In conclusion, applying machine learning to depth map SR can improve its performance significantly, and taking the registered color image information into consideration can further improve its accuracy. Also, our algorithm is robust to errors in the color image and is not noticeably affected by them, whereas such errors are a major problem for some state-of-the-art depth map SR methods.
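For reference, an RMSE of the kind reported in Table I can be computed as in the following sketch. The depth range and data type of the maps are not stated in the text; the code converts both maps to floating point first, so that 8-bit inputs (an assumption) do not wrap around when subtracted. The function name `rmse` is hypothetical.

```python
import numpy as np


def rmse(depth_est, depth_gt) -> float:
    """Root mean square error between an estimated and a ground-truth depth map."""
    est = np.asarray(depth_est, dtype=np.float64)
    gt = np.asarray(depth_gt, dtype=np.float64)
    if est.shape != gt.shape:
        raise ValueError("depth maps must have the same shape")
    # Mean of squared per-pixel differences, then square root.
    return float(np.sqrt(np.mean((est - gt) ** 2)))
```

Averaging this value over the test images of one video would give a per-video figure comparable in form to the rows of Table I.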

Figure 4. Visual comparison between different algorithms for (a) Ballet (left) and (b) Frisbee (right). Each panel shows the registered color image, the ground truth, and the results of Bic, D-B, Sc-Y [11], J-B [4], C-W [6] and our algorithm. Because a depth map has little texture and its quality is mainly determined by the quality of object boundaries, we enlarge some parts with many elaborate boundaries to show them more clearly.

B. Different magnifying factors

Based on the analysis and experiments above, we use Sc-Y to represent previous example-based algorithms and J-B to represent the algorithms using the registered color image. We then further test the performance of four algorithms (our algorithm, Sc-Y, J-B and Bic) with different magnifying factors. Figure 5 shows the magnifying factor-RMSE curves for the sequences Football and Frisbee. It shows that our algorithm outperforms J-B and Bic under all the factors. Our algorithm has similar performance to Sc-Y under magnifying factor 2. However, as the magnifying factor increases, our algorithm performs better than Sc-Y. Figure 6 shows the visual comparison between the algorithms under two magnifying factors, 8 and 16. The visual quality of the depth maps obtained by Sc-Y and J-B is seriously degraded as the magnifying factor increases, while the depth maps obtained by our method are still clear. This demonstrates that our algorithm performs well even for a large magnifying factor. This advantage is significant for practical applications, since the depth map captured by ToF has a very low resolution and has to be reconstructed at a high resolution for 3D modeling or 3DTV representation.

Another point worth mentioning is that the training patches are all taken from four synthesized videos, similar to the Frisbee image in Figure 4, but the testing set contains both synthesized and real videos, such as the Ballet image in Figure 4.

IV. CONCLUSION

In this paper, we have proposed a joint example-based depth map super-resolution method, using a registered high resolution color image as a reference. Previous example-based SR methods only use a single low resolution image as input and do not fit our application well, where the input includes a depth map and a registered 2D color image. We propose to learn a mapping function from both

a patch in the low resolution depth map and a patch in the color image to a patch in the high resolution depth map. We also utilize a high resolution depth map reconstructed by the joint bilateral filter as a reconstruction constraint, which can generate sharp object edges. Our experiments have shown that our algorithm outperforms state-of-the-art algorithms.

Figure 5. Magnifying factor-RMSE curves for Football and Frisbee (horizontal axis: magnifying factor; vertical axis: RMSE; curves: Ours, Sc-Y [11], J-B [4], Bic).

Figure 6. Comparison on Girl when the magnifying factor is equal to 8 and 16. (a) Magnifying factor is 8. (b) Magnifying factor is 16. Panels: registered color image, ground truth, Ours, Sc-Y [11], J-B [4].

V. ACKNOWLEDGEMENT

This work is supported by the Development Plan of China (973) under Grant No. 011CB3006, the National Natural Science Foundation of China under Grant No. 60833009/60933013/6097509/61070148, and Science, Industry, Trade, and Information Technology Commission of Shenzhen Municipality, China (JC00903180635A, JC0100570378A, ZYC01006130313A).

REFERENCES

[1] D. Chan, "Noise vs. feature: Probabilistic denoising of time-of-flight range data," Technical report, Stanford University, 2008.
[2] A. Kolb, E. Barth, R. Koch, and R. Larsen, "Time-of-flight sensors in computer graphics," Eurographics State of the Art Reports, pp. 119-134, 2009.
[3] S. Schuon, C. Theobalt, J. Davis, and S. Thrun, "LidarBoost: Depth superresolution for ToF 3D shape scanning," CVPR, 2009.
[4] Q. Yang, R. Yang, J. Davis, and D. Nistér, "Spatial-depth super resolution for range images," CVPR, 2007.
[5] E. Ekmekcioglu, M. Mrak, S. Worrall, and A. Kondoz, "Utilisation of edge adaptive upsampling in compression of depth map videos for enhanced free-viewpoint rendering," ICIP, pp. 733-736, 2009.
[6] Y. Li and L. Sun, "A novel upsampling scheme for depth map compression in 3DTV system," Picture Coding Symposium, pp. 186-189, 2010.
[7] W. Freeman, T. Jones, and E. Pasztor, "Example-based super-resolution," IEEE Computer Graphics and Applications, pp. 56-65, 2002.
[8] J. Sun, N. Zheng, H. Tao, and H. Shum, "Image hallucination with primal sketch priors," CVPR, 2003.
[9] H. Chang, D. Yeung, and Y. Xiong, "Super-resolution through neighbor embedding," CVPR, 2004.
[10] D. Glasner, S. Bagon, and M. Irani, "Super-resolution from a single image," ICCV, pp. 349-356, 2009.
[11] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Processing, vol. 19, no. 11, pp. 2861-2873, 2010.
[12] W. Dong, L. Zhang, G. Shi, and X. Wu, "Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization," IEEE Trans. Image Processing, 2011.
[13] D. Donoho, "For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution," Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797-829, 2006.
[14] H. Lee, A. Battle, R. Raina, and A. Ng, "Efficient sparse coding algorithms," NIPS, 2007.
[15] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B, vol. 58, no. 1, pp. 267-288, 1996.