Multiview plus depth video coding with temporal prediction view synthesis

Andrei I. Purica, Elie G. Mora, Beatrice Pesquet-Popescu, Fellow, IEEE, Marco Cagnazzo, Senior Member, IEEE, and Bogdan Ionescu, Senior Member, IEEE

Abstract—Multiview video plus depth formats use view synthesis to build intermediate views from existing adjacent views at the receiver side. Traditional view synthesis exploits the disparity information to interpolate an intermediate view by taking into account inter-view correlations. However, temporal correlation between different frames of the intermediate view can also be used to improve the synthesis. We propose a new coding scheme for 3D-HEVC that allows us to take full advantage of temporal correlations in the intermediate view and improve the existing synthesis from adjacent views. We use optical flow techniques to derive dense motion vector fields from the adjacent views and then warp them at the level of the intermediate view. This allows us to construct multiple temporal predictions of the synthesized frame. A second contribution is an adaptive fusion method that judiciously selects between temporal and inter-view prediction in order to eliminate artifacts associated with each prediction type. The proposed system is compared against the state-of-the-art VSRS-1DFast technique used in 3D-HEVC standardization. Three intermediary views are synthesized. Gains of up to 1.21 dB Bjontegaard Delta PSNR are shown when evaluated on several standard multiview video test sequences.

Index Terms—multiview video plus depth, 3DV, temporal and inter-view prediction, view synthesis, 3D-HEVC.

I. INTRODUCTION

RECENT advances in video acquisition, compression and transmission technologies have brought significant market potential for immersive communications. Common examples [1] [2] include immersive teleconference systems, 3D video, holography and Free Viewpoint Television (FTV).
A typical format for some of these applications is MultiView Video (MVV), composed of a set of N video sequences representing the same scene, referred to as views, acquired simultaneously by a system of N cameras positioned under different spatial configurations. An alternative representation is the Multiview-Video-Plus-Depth format (MVD) [3], where depth information is used in addition to texture for each viewpoint. This allows for a less costly synthesis of many more virtual views, using for example Depth-Image-Based-Rendering (DIBR) methods [4]. View synthesis is the process of extrapolating or interpolating a view from other available views. It is a popular research topic in computer vision, and numerous methods have been developed in this field over the past four decades. View synthesis techniques can be mainly classified into three categories [5]. The methods in the first category, like DIBR, require explicit geometry information such as depth or disparity maps to warp the pixels in the available views to the correct position in the synthesized view [6] [7]. Methods in the second category require only implicit geometry, like some pixel correspondences between the available and synthesized views, which can be computed using optical flow [8] [9] for instance. Finally, methods in the third category require no geometry at all. They appropriately filter and interpolate a pre-acquired set of samples (examples of tools in this category include light field rendering [10], the lumigraph [11], and concentric mosaics [12]). A common problem in view synthesis is areas that are occluded in the available views but should be visible in the virtual ones. These areas appear as holes in virtual views, also referred to as disocclusions. This problem is currently resolved by using inpainting algorithms such as the ones described in [13] and [14]. Two of the most popular inpainting algorithms were developed by Bertalmio and Sapiro [15] and Criminisi et al. [16].
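The core DIBR operation for a 1D-parallel camera setup can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names, the 8-bit inverse-quantized depth convention, and the integer-pixel z-buffered forward warp are our assumptions.

```python
import numpy as np

def depth_to_disparity(depth, f, B, z_min, z_max):
    """Convert 8-bit inverse-quantized depth to horizontal disparity (pixels)."""
    z = depth.astype(np.float64) / 255.0
    return f * B * (z * (1.0 / z_min - 1.0 / z_max) + 1.0 / z_max)

def warp_view(texture, depth, f, B, z_min, z_max):
    """Forward-warp a texture row-wise; -1 marks disocclusion holes."""
    h, w = texture.shape
    disp = np.rint(depth_to_disparity(depth, f, B, z_min, z_max)).astype(int)
    out = -np.ones_like(texture)
    out_depth = -np.ones_like(texture)  # z-buffer: keep the pixel closest to the camera
    for y in range(h):
        for x in range(w):
            xs = x - disp[y, x]          # horizontal shift toward the virtual viewpoint
            if 0 <= xs < w and depth[y, x] > out_depth[y, xs]:
                out[y, xs] = texture[y, x]
                out_depth[y, xs] = depth[y, x]
    return out
```

A uniform depth plane simply shifts the whole image; depth discontinuities produce the disocclusion holes discussed above.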
Recently, the Moving Pictures Experts Group (MPEG) expressed a significant interest in MVD formats for their ability to support 3D video applications. This new activity is mainly focused on developing a 3D extension of the HEVC [17] video coding standard, after a first standardization activity finalized with Multiview Video Coding (MVC) [18]. An experimental framework was developed as well, in order to conduct the evaluation experiments [19]. This framework defined a View Synthesis Reference Software (VSRS) as part of the 3D-HEVC test model [20], which would later become an anchor for several new rendering techniques. Furthermore, establishing whether encoding all views or synthesizing some from coded views is better for multiview video sequences is still an open matter. Recently MPEG decided to dedicate 6 months to comparing the two schemes [21]. Traditionally, view synthesis methods, and VSRS in particular, only use inter-view correlations to render virtual views. However, temporal correlations can also be exploited to improve the quality of the synthesis. In general, this type of method synthesizes or improves the synthesis of a frame by extracting additional information from different time instants, as opposed to DIBR methods, which only use adjacent views at the same time instant. For instance, in [22] the authors use motion vector fields between frames of the intermediate views to improve the view synthesis in the MVC standard. Chen et al. [23] use motion vector fields computed through block-based motion estimation in the reference views and then warp both the start and end points of the vectors into the synthesized view. The motion vectors are then used to retrieve information about disoccluded regions from other frames. Sun et al. [24] and Kumar et al. [25] use adjacent views to extract background information from multiple time instants,

used for hole filling in a DIBR synthesis. In [26] the authors use information from the current and other frames of the synthesized video to fill hole regions. Other studies use the inter-view correlations directly during coding, via view-synthesis prediction (VSP) [27] [28] [29], or take advantage of multiview format redundancies to deal with network packet loss [30]. Yuan et al. [31] use a Wiener filter to improve the synthesis by eliminating distortions caused by coding. In this paper, we propose a new coding scheme for 3D-HEVC built around a novel view synthesis method that fully exploits temporal and inter-view correlations. Our method is designed to complement the synthesis method used in the 3D-HEVC standardization process in order to improve the quality of the synthesis. We use the optical flow to derive dense motion vector fields between frames in the adjacent views which are available at the decoder side, then warp them at the level of the intermediate view. This allows us to build different temporal predictions from the left and right adjacent views using reference frames at two time instants (past and future). Other motion estimation techniques that are less computationally intensive can also be used, at the cost of prediction accuracy [32] [33] [34]. However, since our scheme does not require sending any residual information, we prefer using an optical flow motion estimation technique, as it offers a more accurate prediction [35]. The reference frames used for motion compensation are previously encoded and sent as an additional frame per GOP in the intermediate view; we will refer to these frames as key frames in the rest of this paper. The four predictions are then merged into a single one, with the aim of reducing the number of holes in the final synthesis. Due to the large temporal distance between reference and synthesized frames, the motion vector fields may be imprecise, especially for frames with intense motion.
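One of the less computationally intensive alternatives to optical flow mentioned above is block-based motion estimation. A minimal exhaustive-search block-matching sketch, producing a dense (per-pixel) motion vector field by assigning each block's vector to all of its pixels, might look as follows; block and search-range sizes are arbitrary choices here.

```python
import numpy as np

def block_matching_mvf(ref, cur, block=4, search=4):
    """Dense MVF: for each block of `cur`, find the best SAD match in `ref`.
    Every pixel of a block inherits the block's motion vector (dy, dx)."""
    h, w = cur.shape
    mvf = np.zeros((h, w, 2), dtype=int)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            cb = cur[by:by + block, bx:bx + block].astype(int)
            best, best_v = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y0, x0 = by + dy, bx + dx
                    if 0 <= y0 and y0 + block <= h and 0 <= x0 and x0 + block <= w:
                        sad = np.abs(ref[y0:y0 + block, x0:x0 + block].astype(int) - cb).sum()
                        if best is None or sad < best:
                            best, best_v = sad, (dy, dx)
            mvf[by:by + block, bx:bx + block] = best_v
    return mvf
```

Unlike a true optical flow, the field is piecewise constant over blocks, which illustrates the accuracy trade-off discussed above.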
We mitigate these effects by using the so-called Hierarchical synthesis scheme, in which temporal layers are used to perform symmetric synthesis (where each frame is synthesized from either a key frame or a previously synthesized frame), and we compare it with a Direct scheme (where each frame is directly synthesized from a past and a future key frame). To further improve the quality of the synthesis, we introduce an adaptive fusion method that selects between inter-view and temporal prediction. The remaining disocclusions in the synthesized image are then filled by a linear inpainting method. The remainder of this paper is organized as follows. The second section presents the state of the art of view synthesis techniques. The proposed method is described in the third section. The results obtained are summarized in Section IV with a detailed interpretation, and finally conclusions and future work directions are presented in Section V.

II. STATE OF THE ART OF VIEW SYNTHESIS TECHNIQUES

In this state of the art, we focus on the first class of view synthesis methods, also referred to as DIBR techniques. We first discuss the rendering technique used in the reference software for view synthesis, and in the rendering software used by the Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) [36]. Then, an overview of other rendering techniques found in the literature is presented.

A. Reference software

1) View Synthesis Reference Software: VSRS inputs two texture views and their two associated depth maps, along with intrinsic and extrinsic camera parameters. The output is a synthesized intermediate view. VSRS allows synthesizing frames using two operational modes: a general mode and a 1D mode, respectively used for non-parallel (e.g., cameras aligned in an arc) and 1D-parallel (cameras aligned in a straight line perpendicularly to their optical axes) camera settings. Figure 1 illustrates the rendering process in the general mode of VSRS.
First, the left and right reference depth maps (s_D,l and s_D,r) are warped to the virtual view position, giving s'_D,l and s'_D,r. Occlusions are handled by keeping the highest depth value (closest to the camera); the depth values are usually inverse quantized from 0 to 255 such that the highest value in the depth map corresponds to the lowest depth of the scene [1]. s'_D,l and s'_D,r are then median filtered to fill small holes, giving s''_D,l and s''_D,r. A binary mask is maintained for each view to keep track of large holes caused by disocclusions. s''_D,l and s''_D,r are then used to warp the texture views s_T,l and s_T,r to the virtual view position, giving s'_T,l and s'_T,r (this reverse warping process, wherein the depths are warped first and then used to warp the texture, is reported to give a higher rendering quality [19]). Holes in one of the warped views are filled with collocated non-hole pixels from the other warped view, if available. This gives s''_T,l and s''_T,r, which are then blended together to form a single representation. The blending can be a weighted average according to the distance of each view to the virtual viewpoint (Blending-On mode), or it can simply consist in taking the closest view to the virtual viewpoint and discarding the other (Blending-Off mode). The binary masks of each view are merged together at this stage, and the remaining holes are filled at the final stage of the algorithm by propagating the color information inward from the region boundaries.

Fig. 1. Flow diagram for View Synthesis Reference Software (VSRS) general mode [19].

The 1D mode of VSRS works a bit differently. In this mode, the camera setup is assumed to be 1D parallel. This allows a number of simplifications to the warping process, which is reduced to a simple horizontal shift. First, the color video is up-sampled for half-pixel or quarter-pixel accuracy.
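The complementation-and-blending step of the general mode described above can be sketched as follows. This is a simplified stand-in, not the VSRS code: the hole masks, the -1 hole marker, and the single weight alpha (the normalized distance of the virtual viewpoint from the left camera) are our assumptions.

```python
import numpy as np

def blend_views(left, right, holes_l, holes_r, alpha):
    """Blend two warped views (Blending-On style weighted average).
    Holes in one view are filled from the other when possible; -1 = still a hole."""
    return np.where(~holes_l & ~holes_r,
                    (1 - alpha) * left + alpha * right,   # both available: weighted average
                    np.where(~holes_l, left,              # only left available
                             np.where(~holes_r, right,    # only right available
                                      -1.0)))             # disocclusion in both views
```

Pixels left at -1 after blending correspond to the merged binary masks and are handed to the inpainting stage.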

A CleanNoiseOption and a WarpEnhancementOption avoid warping unreliable pixels. The process gives two warped images, two warped depth maps and two binary masks from the left and right reference views. Each pair is then merged together. When a pixel gets mapped from both the left and the right reference views, the final pixel value is either the pixel closest to the camera or an average of the two. Remaining holes are filled by propagating the background pixels into the holes along the horizontal row. Finally, the image is downsampled to its original size.

2) View Synthesis Reference Software 1D Fast: Each contribution to the 3D-HEVC standardization that proposes to modify the coding of dependent views or depth data is required to present coding results on synthesized views. The software used for synthesizing the intermediate views is a variant of VSRS, called View Synthesis Reference Software 1D Fast (VSRS-1DFast). This software is included in the HTM package, and is documented in the 3D-HEVC test model [20]. VSRS-1DFast allows inputting two or three texture and depth views along with their corresponding camera parameters, and can synthesize an arbitrary number of intermediate views. Just like the 1D mode of VSRS, VSRS-1DFast assumes that the camera setup is 1D parallel. Figure 2 illustrates the different steps of the rendering algorithm used in VSRS-1DFast.

Fig. 2. Flow diagram for View Synthesis Reference Software 1D Fast (VSRS-1DFast) [20].

The texture views s_T,l and s_T,r are first upsampled to obtain ŝ_T,l and ŝ_T,r: the luma component is upsampled by a factor of four in the horizontal direction, and the chroma by a factor of eight in the horizontal direction and two in the vertical direction, thus yielding the same resolution for all components. The warping, interpolation and hole filling are carried out for ŝ_T,l and ŝ_T,r line-wise. This gives two representations of the synthesized frame: s'_T,l and s'_T,r. Then, two reliability maps s_R,l and s_R,r are determined, indicating which pixels correspond to disocclusions (reliability of 0).
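The resolution alignment described above (luma x4 horizontally, chroma x8 horizontally and x2 vertically for 4:2:0 content) can be sketched as follows. Nearest-neighbour repetition is used here as a stand-in for the actual interpolation filters, which this sketch does not reproduce.

```python
import numpy as np

def upsample_1dfast(luma, cb, cr):
    """Bring all planes of a 4:2:0 frame to the same working resolution:
    luma x4 horizontally, chroma x8 horizontally and x2 vertically."""
    l = np.repeat(luma, 4, axis=1)
    u = np.repeat(np.repeat(cb, 8, axis=1), 2, axis=0)
    v = np.repeat(np.repeat(cr, 8, axis=1), 2, axis=0)
    return l, u, v
```

After this step every plane has the same shape, so warping, interpolation and hole filling can be carried out line-wise on all components uniformly.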
A similarity enhancement stage then adapts the histogram of s'_T,l to that of s'_T,r. Finally, s'_T,l and s'_T,r are combined. If the interpolative rendering option is activated, the combination depends on the warped depth maps and the two reliability maps created. If not, the synthesized view is mainly rendered from one view and only the holes are filled from the other view. The resulting combination is later down-sampled to the original size of the texture views.

B. Rendering techniques in the literature

In [37], a rendering technique called View Synthesis using Inverse Mapping (VSIM) is introduced. It operates at full-pel accuracy and assumes a 1D-parallel camera setting. The left and right texture views are warped to the synthesized view position using simple horizontal shifts, also called column shifts. A table is maintained for the left and right interpretations of the synthesized frame which records the column shift of each pixel. Holes in these two tables are filled using a median filter. Then, the two representations are merged and the remaining holes are filled by checking the collocated value in the tables and inverse mapping the pixel back to its original value in the left or right view. Residual holes are filled by simply assuming that their depth is the same as the depth of the collocated pixels in the original views. VSIM outperforms VSRS, on average, by 0.41 dB at quarter-pel accuracy and by 1.35 dB at full-pel accuracy on 5 sequences. However, the rendering runtime is not provided, making it difficult to assess the complexity of the method. In [38], the depth maps are pre-processed with an adaptive smoothing filter in order to reduce holes after synthesis. The filter is only applied to edges in the depth map (corresponding to an abrupt transition in depth values) since these are the main cause of holes. The method is thus less complex than methods which apply a symmetric or asymmetric smoothing filter to the entire depth map.
Furthermore, if hole regions correspond to vertical edges, an asymmetric Gaussian smoothing filter is used to further pre-process the depth map. No objective gains are reported, but a perceptual improvement is noticed on some synthesized sequences. A technique that does not require pre-processing the depth map is introduced in [39]. A hole in the synthesized texture image is filled by the color of the neighboring pixel (among the 8 direct neighboring pixels) with the smallest depth value in the synthesized depth map (this is referred to as Horizontal, Vertical and Diagonal Extrapolation (HVDE)). The two warped texture images are complemented (holes in one are filled with available pixel values in the other), and later blended, giving a final image W. The same process (HVDE, complementation, and blending) can also be performed in case the depth maps were pre-processed with a bilateral smoothing filter, giving an image A, which would then be used to fill remaining holes in W. This technique is reported to outperform basic DIBR by 1.78 dB on one sequence. Another method for improving the quality of the synthesis is to apply a non-linear transformation to the depth maps [40]. Specifically, the depth range of points in the background is compressed, such that these points would have the same or slightly different depths. This reportedly reduces holes in the synthesis. The transformation depends on the depth map

histogram. Objective gains are not presented, but a visible improvement is noticed on the shown images. Another desired feature is the possibility to freely change the quality of a synthesized view. Since the quality of DIBR rendering depends on the actual synthesis process, additional boundary artifact processing can be used to adjust the quality of the synthesis. Zhao et al. analyze and reduce the boundary artifacts from a texture-depth alignment perspective in [41]. In [42] Cheung et al. tackle the problem of bit allocation for DIBR multiview coding. The authors use a cubic distortion model based on DIBR properties and demonstrate that the optimal selection of QPs for texture and depth maps is equivalent to the shortest path in a specially constructed 3D trellis. Xiao et al. [43] propose a scalable bit allocation scheme, where a single ordering of depth and texture packets is derived. Furthermore, depth packets are ordered based on their contribution to the reduction of the synthesized view distortion. Other works also exploit pixel-based processing with dense MVFs with the end goal of improving the synthesis at the decoder side. Li et al. compute dense MVFs on texture in [44]. Time-consuming optical flow computations are limited to the areas around the edges of objects. Additional depth predictors are obtained by mapping the MVs computed on texture to depth. The depth map improvement is reflected in a high increase of quality for synthesized views.

C. Remarks

The rendering techniques used in the reference software, and in most contributions in the literature, are all based on 3D image warping using depth maps. Pixels from reference views are mapped to pixels in the virtual view using the disparity information that the depth maps convey. However, we show that the synthesis can be improved by extending DIBR to the temporal axis. In the remaining work, we present a rendering method where temporal correlations between different frames in the synthesized views are exploited to improve the quality of the synthesis. Our method is detailed in the next section.

III. PROPOSED METHOD

Traditional rendering techniques synthesize an intermediate frame only from the left and right reference views at the same time instant. By exploiting the temporal correlations in the multiview sequence, we are able to obtain additional predictions from past and future frames and merge them together to obtain the synthesized frame. We refer to our synthesis method as View Synthesis exploiting Temporal Prediction (VSTP). In this section, we describe the epipolar constraint for disparity maps and optical flows, on which the proposed method is based. We then provide a description of the algorithm and propose two synthesis schemes for a Group Of Pictures (GOP) that exploit this idea.

A. Epipolar constraint

Figure 3 shows the relation between the positions of a real-world point projection in different views and at different time instants. Let us consider I^r_{t-1}, I^r_t, I^s_{t-1}, I^s_t, which are, respectively, the reference (r) view frames and the synthesized (s) view frames at time instants t-1 and t. Let M x N be the size of the image, with M being the height and N the width. Let k = (x,y) be a point in I^r_{t-1}, v_r(k) its associated motion vector (I^r_t is the reference frame for I^r_{t-1}), pointing to a corresponding point in I^r_t, and d_{t-1}(k) its associated disparity vector, pointing to a corresponding point in I^s_{t-1}. Let v_s(k + d_{t-1}(k)) be the motion vector of the projection of k in I^s_{t-1}, and d_t(k + v_r(k)) the disparity vector of the projection of k in I^r_t. If the point is not occluded, there is only one projection of k in I^s_t, so the two vectors will point to the same position. This defines a so-called epipolar constraint [45] on k, which can be written as:

v_r(k) + d_t(k + v_r(k)) = d_{t-1}(k) + v_s(k + d_{t-1}(k))    (1)

Fig. 3. Epipolar constraint: the relation between the disparity fields d_{t-1} and d_t at time instants t-1 and t respectively, and the motion vector fields in the synthesized and reference views, v_s and v_r respectively, for a position k in the reference frame I^r_{t-1}.

B. Method description

The goal of the method is to synthesize I^s_t from a past and a future key frame in the synthesized view. Knowing v_r, d_t, and d_{t-1}, v_s can be derived using Equation (1) for every pixel in I^s_{t-1} that has a correspondence in I^r_{t-1}:

v_s(k + d_{t-1}(k)) = v_r(k) + d_t(k + v_r(k)) - d_{t-1}(k)    (2)

v_r can be obtained by inputting I^r_{t-1} and I^r_t into an optical flow algorithm [46]. The result is a dense motion vector field v_r where each pixel in I^r_{t-1} is associated with a motion vector. The disparity maps d_t and d_{t-1} can be obtained by simply converting the values in the depth maps Z_t and Z_{t-1}, associated with I^r_t and I^r_{t-1} respectively, into disparity values. We assume that we are dealing with a 1D-parallel camera setup, and that only horizontal disparities exist. In this simple setup, the disparity value for a point k of coordinates (x,y)

in I^r_{t-1} can be written as:

d^x_t(k) = f B [ (Z_t(x,y) / 255) (1/Z_min - 1/Z_max) + 1/Z_max ]
d^y_t(k) = 0    (3)

where f is the focal length of the camera, B the baseline between the reference and synthesized views, and Z_min and Z_max the extremal depth values. The same formula can be applied to obtain d_{t-1}. If we decompose Equation (2) for the x and y components separately, we obtain:

v^x_s(x + d^x_{t-1}(x,y), y) = v^x_r(x,y) + d^x_t(x + v^x_r(x,y), y + v^y_r(x,y)) - d^x_{t-1}(x,y)
v^y_s(x + d^x_{t-1}(x,y), y) = v^y_r(x,y)    (4)

There will be holes in v_s that coincide with disocclusions created when warping I^r_{t-1} with the d_{t-1} disparity vector field. If two or more positions in I^r_{t-1}, k_1 and k_2 for instance, are warped to the same position k_3 in I^s_{t-1} (occlusion), the vector v_s(k_3) retained is the one which corresponds to the pixel with the highest depth value, as shown in Equation (5); the motion vectors for occluded points of the scene are thus ignored.

v_s(k_3) = v_r(k_1) + d_t(k_1 + v_r(k_1)) - d_{t-1}(k_1)   if Z_{t-1}(k_1) > Z_{t-1}(k_2)
v_s(k_3) = v_r(k_2) + d_t(k_2 + v_r(k_2)) - d_{t-1}(k_2)   otherwise    (5)

Using the motion vector field v_s and I^s_{t-1}, a prediction of I^s_t can be made, although it will contain holes due to disoccluded areas in v_s. A total of four predictions can be made by exploiting the epipolar constraint, one for each reference view (left and right, L and R) and at each time instant (past and future, p and f); they will be denoted by P^(i)(I^s_t), where i ∈ {0,1,2,3}. This is shown in Figure 4.

Fig. 4. Four predictions using the epipolar constraint: dotted lines represent the new temporal predictions introduced by our method.

The four predictions are then merged into a single one, Ĩ^s_t, where the value of each pixel equals the average of the non-disoccluded pixel values in the four predictions, as shown in the following equation.
When all four predictions contain the same disocclusion, the pixel value is computed by inpainting. Indeed, while the four predictions contain disocclusions, the majority of these holes are not the same in all predictions, and thus they will be filled after the merging step:

Ĩ^s_t(k) = (1/A(k)) Σ_i P^(i)(I^s_t(k))   if A(k) ≠ 0  (sum over the available predictions)
Ĩ^s_t(k) = inpainted                       if A(k) = 0    (6)

where A(k) is the number of existing predictions for position k. Disocclusions (A(k) = 0) are filled using the same inpainting method used in VSRS-1DFast, which is a simple line-wise interpolation. Figure 5 illustrates the steps of the VSTP algorithm. In order to generate a temporal prediction, the algorithm inputs two frames of the reference view at two time instants, i.e., a current and a future or past time instant, denoted by I_{t,L} and I_{p,L} respectively in the figure, and computes a dense motion vector field between the two (v_{r,p,L}). The dense MVF is then warped at the level of the synthesized view using the corresponding disparity maps (d_{t,L} and d_{p,L}). We also retain a disparity map corresponding to the new MVF (d'). Thus, each pixel has an associated motion vector and disparity. The next step is the backward motion compensation, in which we use a key frame (I^s_p) as reference in order to obtain a first temporal prediction; in case of overlapping values we use d' to select the foreground pixel. Î^s_{p,R}, Î^s_{f,L} and Î^s_{f,R} are obtained using the same steps in the right reference view at the same time instant, and at a future time instant in the left and right reference views respectively, as described in Figure 4. The final synthesis is obtained by performing a simple merge between the four temporal predictions, or an inter-view/temporal fusion as described in Section III-D. The inter-view prediction is denoted by Î_i in Figure 5.

C. Prediction schemes in a GOP

The synthesized view is rendered GOP-wise in our algorithm. The GOP structure is the one used to code the left and right reference views.
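The MVF warping of Equations (2) and (5) and the merging of Equation (6) can be sketched as follows. This is a simplified illustration under our own conventions: motion and disparity are restricted to integer horizontal components (the vertical motion handling of Equation (4) is omitted for brevity), and NaN marks disocclusion holes.

```python
import numpy as np

def warp_mvf(v_r, d_t, d_tm1, z_tm1):
    """Warp a dense MVF from the reference view to the synthesized view
    (Equation (2)), keeping the foreground vector on overlaps (Equation (5)).
    All inputs are (h, w) integer horizontal fields; z_tm1 is the depth map."""
    h, w = z_tm1.shape
    v_s = np.full((h, w), np.nan)        # NaN marks disocclusion holes
    z_buf = np.full((h, w), -1.0)
    for y in range(h):
        for x in range(w):
            xs = x + d_tm1[y, x]         # projection of (x, y) into the synthesized view
            xm = x + v_r[y, x]           # motion-compensated position in the reference view
            if 0 <= xs < w and 0 <= xm < w and z_tm1[y, x] > z_buf[y, xs]:
                v_s[y, xs] = v_r[y, x] + d_t[y, xm] - d_tm1[y, x]
                z_buf[y, xs] = z_tm1[y, x]
    return v_s

def merge_predictions(preds):
    """Equation (6): average the available (non-hole) predictions per pixel;
    pixels with no prediction at all stay NaN, to be inpainted afterwards."""
    stack = np.stack(preds)
    cnt = np.sum(~np.isnan(stack), axis=0)
    s = np.nansum(stack, axis=0)
    return np.where(cnt > 0, s / np.maximum(cnt, 1), np.nan)
```

In the full method this warp is applied four times (past/future, left/right), and the four motion-compensated predictions are passed to merge_predictions.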
In addition to the reference views (as required by VSRS-1DFast), we send a first frame per GOP of the synthesized view in the bitstream (at the encoder side we require this view; it can be either original, or synthesized from uncompressed adjacent views if not available). These frames, referred to in the rest of this work as key frames, are efficiently coded using 3D-HEVC, with the left view serving as inter-view reference (the base view). The rest of the frames are synthesized using our method with one of the temporal prediction schemes described below. For the first frame actually synthesized in a GOP, the key frame of the current GOP and the one of the next GOP are respectively the past and future reference frames, I^s_p and I^s_f. Figure 6 shows the difference between the two temporal prediction schemes. The Direct scheme uses the key frame of the current GOP and the one of the next GOP as past and future reference frames for all remaining frames to synthesize in the GOP. This results in an asymmetric prediction, with

two different temporal distances between each of the two key frames and the current frame. The temporal distance can be as high as the GOP size minus one, and an optical flow computation with such large temporal distances can give imprecise motion vector fields, thus making the Direct scheme inefficient. An alternative scheme, called the Hierarchical scheme, can be used, in which temporal layers are used to perform symmetric predictions (with equal temporal distances). In each layer, the past and future references for the current frame are either the key frames or already synthesized frames in lower layers. The maximal temporal distance in this scheme equals half of the GOP size.

Fig. 5. Flow diagram for View Synthesis exploiting Temporal Prediction (VSTP). The dotted dash line is the temporal prediction block, which is applied four times, i.e., past and future time instants (p and f) in the left and right (L and R) reference (r) views.

Fig. 6. Temporal prediction schemes inside a GOP of the synthesized view: (a) Direct, (b) Hierarchical.

D. Adaptive Fusion

In the proposed method, the synthesized frame is obtained by merging our four temporal predictions as described in Equation (6). When dealing with fast moving objects, the optical flow computation between frames with a high temporal distance may give imprecise motion vector fields, which lead to an inconsistent positioning of the objects in the four temporal predictions. In this case, a simple average-based merging would result in a bad representation of objects with high motion intensity. In what follows, we refer to the traditional disparity-based synthesis used in VSRS-1DFast as the inter-view prediction.
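The reference-frame selection of the two GOP prediction schemes described above can be sketched as follows; this is our own illustration of the scheduling, with key frames at positions 0 and GOP size.

```python
def gop_references(gop_size):
    """Return {frame_index: (past_ref, future_ref)} inside one GOP for the
    Direct and Hierarchical schemes. Key frames sit at 0 and gop_size; the
    Hierarchical scheme synthesizes middle frames first and reuses them as
    references, halving the maximal temporal distance."""
    direct = {t: (0, gop_size) for t in range(1, gop_size)}
    hier, pending = {}, [(0, gop_size)]
    while pending:
        a, b = pending.pop(0)            # process temporal layers breadth-first
        if b - a > 1:
            m = (a + b) // 2
            hier[m] = (a, b)             # frame m is synthesized from frames a and b
            pending += [(a, m), (m, b)]
    return direct, hier
```

For a GOP of 8 this reproduces the structure of Figure 6: frame 4 is predicted from the key frames 0 and 8, frames 2 and 6 from a key frame and frame 4, and so on, with symmetric temporal distances at every layer.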
We introduce a different merging algorithm called Adaptive Fusion, which uses the inter-view prediction and our temporal prediction alternatively for different parts of the image. The idea of this method is to generate a binary fusion map in which we mark the bad pixels from the temporal prediction, to be replaced by the inter-view prediction. The first step of this algorithm is to estimate which areas will select the inter-view prediction and which ones will select the temporal prediction. The next step is the actual fusion, where each pixel value is computed as an average between either the temporal or inter-view predictions, depending on the previously computed binary map. In order to describe our selection process for a pixel, let us consider î^t_pL, î^t_fL, î^t_pR, î^t_fR, the four temporal predictions of a pixel at position k, and î^i, the blend between the left and right inter-view predictions obtained from VSRS-1DFast. It is safe to assume that good temporal predictions of a pixel are similar, i.e., the values are close to each other (have a low spread). On the contrary, imprecise motion vector fields might lead to dissimilar values that span over a large range (have a wide spread), and in this case the inter-view prediction should be used. Note that in some cases î^i is worse than the temporal prediction even if we have a wide spread. The challenge is to remove artifacts in the temporal prediction without introducing new ones from the inter-view prediction. By comparing the value of î^i to our four temporal predictions we can identify four cases. In the following, the maximum and minimum values of the temporal predictions are denoted by î^t_max and î^t_min respectively:

Case 1: wide spread and î^i ∈ [î^t_min, î^t_max]
Case 2: wide spread and î^i ∉ [î^t_min, î^t_max]
Case 3: low spread and î^i ∈ [î^t_min, î^t_max]
Case 4: low spread and î^i ∉ [î^t_min, î^t_max]

We consider Case 1 and Case 4 as the typical situations in which we should select the inter-view and temporal predictions respectively.
Indeed, in Case 1, a wide spread means there is a bad match between the four temporally predicted values, which indicates an imprecise optical flow computation. An inter-view prediction inside this range is probably the best value. Case 4 indicates a good temporal prediction, and we should use the average of the four points. In Case 2 the inter-view predicted value is either good or very bad, depending on how far away it is from î^t_min or î^t_max. In Case 3 the two prediction values are close, and we prioritize the temporal one. When dealing with disocclusions, the number of available temporal or inter-view predictions for a pixel can vary, i.e., a certain position (x,y) can be a disocclusion in one or more temporal or inter-view predictions. In situations when only one type of prediction is available we select it, and if we have no prediction at all, we mark the pixel to be filled later.
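The spread-based selection described above can be sketched per pixel as follows. This is our reading of the criterion, using mean absolute deviation as the spread measure: the inter-view value is treated as an outlier when appending it inflates the deviation of the temporal predictions, and the threshold alpha and the ratio test are assumptions of this sketch.

```python
def mad(vals):
    """Mean absolute deviation around the mean."""
    m = sum(vals) / len(vals)
    return sum(abs(v - m) for v in vals) / len(vals)

def fuse_pixel(temporal, inter_view, alpha=0.5):
    """Pick the temporal average when the inter-view value is an outlier
    relative to the temporal predictions; otherwise use the inter-view value.
    Returns (value, used_inter_view); the flag is the binary-map entry.
    None entries stand for disoccluded (missing) predictions."""
    p_t = [v for v in temporal if v is not None]
    if not p_t:                          # only the inter-view prediction exists
        return inter_view, True
    if inter_view is None:               # only temporal predictions exist
        return sum(p_t) / len(p_t), False
    m_t, m_ti = mad(p_t), mad(p_t + [inter_view])
    if m_ti > 0 and m_t / m_ti < alpha:  # inter-view value inflated the spread: outlier
        return sum(p_t) / len(p_t), False
    return inter_view, True
```

Collecting the returned flags over all pixels yields the binary fusion map used to mix the two prediction types.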

Considering the vectors p_t = [î^t_pL, î^t_fL, î^t_pR, î^t_fR] and p_t&i = [î^t_pL, î^t_fL, î^t_pR, î^t_fR, î^i], the selection between the inter-view and temporal prediction for a pixel is done as follows:

î = î_t   if mean(|p_t - mean(p_t)|) / mean(|p_t&i - mean(p_t&i)|) < α
î = î^i   if mean(|p_t - mean(p_t)|) / mean(|p_t&i - mean(p_t&i)|) > α    (7)

where î_t = mean(p_t) and α is a threshold used to control the selection process (by increasing α we favor the temporal prediction). Adding an outlying value to the p_t vector will increase its mean absolute deviation; on the contrary, an inlying value will maintain a similar mean absolute deviation. In our model we select the temporal prediction when î^i is an outlier; this corresponds to Case 4. For Case 2 and Case 3 we favor the temporal prediction, and for Case 1 we favor the inter-view prediction. The value for α used in this work was empirically found to be optimal at 0.5. From this process, we deduce a binary selection map:

B(k) = 0 if î = î_t
B(k) = 1 if î = î^i    (8)

which indicates the selected prediction type for each pixel.

E. Discussion on the method

In dense camera rig systems, a high number of views are available at the encoder side. Typically, only a subset is coded and sent in the bitstream, the rest being synthesized at the receiver side [20]. Our prediction method uses the synthesized view at the encoder side, since one frame per GOP of that view is transmitted in the bitstream. Indeed, synthesizing the intermediate views instead of sending them is a more efficient alternative, as shown in [47]. Our method can be seen as in between these two scenarios: we only send some information on the synthesized views, which we exploit to improve the synthesis. Consequently, in this work, we do not only propose a rendering method, but also a change in the design of the transmission stage.
Note that we could have proposed a method where the key frames in the synthesized view are rendered from the left and right reference views using VSRS, for instance, but then the rendering artifacts created in these key frames would be propagated to the rest of the frames in the motion compensation stage. Furthermore, we use a backward motion compensation stage in our method: the vectors in v_s point from I_p^s (or I_f^s) to I_t^s. We can have a v_s that points from I_t^s to a past or future reference if the vectors in v_r point in the same direction (e.g., from I_t^r to I_p^r or I_f^r). This can easily be done if the inputs of the optical flow algorithm that outputs v_r are reversed. In this case, and if k = (x,y) is a point in I_t^r, Equation (2) becomes:

v_s(k + d_t(k)) = v_r(k) + d_{t−1}(k + v_r(k)) − d_t(k)    (9)

From Equation (9), we can see that v_s is now defined for every pixel in I_t^r that has a correspondence in I_t^s. The holes in v_s (and in the corresponding prediction) correspond to disocclusions when warping from I_t^r to I_t^s. Even if we use a different time instant (f), the holes in the corresponding prediction would still come from warping I_t^r to I_t^s and would thus coincide with the holes of the first prediction. The merging process would not be able to fill in these holes, and they would eventually have to be inpainted. In our method, the holes correspond to disocclusions when warping from I_p^r to I_p^s in the first prediction, and from I_f^r to I_f^s in the second. The holes do not necessarily coincide, and thus pixels can be efficiently predicted from one or the other frame during the merging process. In comparison to other pixel-based methods such as [44], which improve the encoding of the depth map using dense MVFs computed on texture, our method warps the dense MVFs at the level of the intermediate view and uses them to motion compensate texture images, as shown in this section. Furthermore, boundary artifact reduction methods such as [41] can be used in parallel with VSTP.
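The warping of a dense motion field in Equation (9) can be sketched as below. This is a simplified illustration under assumptions of our own: rectified views with purely horizontal disparity, nearest-neighbor rounding of vectors and disparities, and last-writer-wins on mapping conflicts (a full implementation would resolve conflicts with a disparity z-buffer).

```python
import numpy as np

def warp_mvf(v, d_t, d_prev):
    """Warp a dense backward motion field from the reference view to the
    synthesized view, following Eq. (9):
        v_s(k + d_t(k)) = v(k) + d_{t-1}(k + v(k)) - d_t(k).

    v      : (H, W, 2) motion field of the reference view, (dy, dx) per pixel
    d_t    : (H, W) horizontal disparity of the reference view at time t
    d_prev : (H, W) horizontal disparity at the reference time instant (t-1)
    Returns (v_s, mask); pixels with mask == False are disocclusion holes.
    """
    H, W = d_t.shape
    v_s = np.zeros_like(v)
    mask = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            # destination column in the synthesized view: k + d_t(k)
            xs = x + int(round(d_t[y, x]))
            if not (0 <= xs < W):
                continue
            # endpoint of the motion vector in the reference view: k + v(k)
            ym = min(max(y + int(round(v[y, x, 0])), 0), H - 1)
            xm = min(max(x + int(round(v[y, x, 1])), 0), W - 1)
            # vertical component is unchanged; the horizontal component is
            # corrected by the disparity difference between the two instants
            v_s[y, xs, 0] = v[y, x, 0]
            v_s[y, xs, 1] = v[y, x, 1] + d_prev[ym, xm] - d_t[y, x]
            mask[y, xs] = True
    return v_s, mask
```

With zero disparity the field is copied unchanged; with a uniform disparity of one pixel, the leftmost column of the synthesized view receives no vector, which is exactly the disocclusion-hole behavior discussed above.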
Since our final synthesis is a blend of DIBR rendering and the temporal predictions, reducing the artifacts in the DIBR synthesis will increase the quality of our method. Also, a better texture-depth alignment can benefit the warping of the dense MVFs. However, our method also gives the possibility to adjust the QP of the key frames, which will in turn affect all frames inside a GOP, or to modify the frequency of the key frames, which will reduce or increase the temporal distance of the prediction, resulting in a higher quality rendering and a variation of the transmission rate. As discussed above, our method provides new possibilities to control the rate and distortion in comparison to VSRS-1DFast: modifying the QP of key frames or adjusting their frequency. The bit allocation optimization scheme for DIBR multiview coding presented in [42] can be employed with our method as well. However, a study towards the integration of the additional rate and distortion control options provided by VSTP within such schemes should be performed. For simplicity reasons, in our experiments we use the recommended depth and texture QPs for 3D-HEVC testing, as discussed in Section IV-A.

IV. EXPERIMENTAL RESULTS

A. Experimental setting

Our algorithm takes as input two coded left and right views with their associated depth videos and camera parameters, and one frame per GOP of the intermediate view, and outputs the whole intermediate view after synthesizing the rest of the frames. The synthesis results are compared against the original intermediate sequences to measure the PSNR. We thus consider a five-view scenario in these experiments, in which we code two views (left and right) and key frames from the 1/2 view, and synthesize three intermediary views at the 1/4, 1/2 and 3/4 positions between the two base views. We assume that one of the three intermediary views (1/2) is available at the encoder side.
The coding configuration described in the Common Test Conditions (CTCs) defined by JCT-3V for conducting experiments with the reference software of 3D-HEVC [48] is used for coding the left and right views. The recommended texture and depth QPs are 25, 30, 35, 40 and 34, 39, 42, 45,

respectively. The optical flow algorithm used in our method can be downloaded from [46]; the configuration parameters are reported in Table I and more details can be found in [49].

TABLE I
OPTICAL FLOW PARAMETERS

Parameter            Description                                        Value
Alpha                Regularization weight
Ratio                Downsampling ratio                                 0.4
MinWidth             Width of the coarsest level                        20
nOuterFPIterations   Number of outer fixed point iterations             7
nInnerFPIterations   Number of inner fixed point iterations             1
nSORIterations       Number of Successive Over Relaxation iterations    30

We test our method on four sequences of the test set in the CTCs: Balloons, Kendo, Newspaper and PoznanHall2. Each sequence is composed of three real views, and we also consider two virtual views. The CTCs indicate to use the middle view as base view, and the left and right views as dependent views. However, here we want the left and right views to be decodable without the middle view, because only the first frame in each GOP of that view will be sent in the bitstream. We thus set the left view as base view, and the others as dependent views. Also, we code roughly 10 seconds of video of each sequence. Note that the number of frames is lower in PoznanHall2 because its frame rate is lower as well (cf. Table II).

TABLE II
SEQUENCES USED IN OUR EXPERIMENTS

Class      Sequence       Frames per second    Number of frames
class A    PoznanHall2
class C    Balloons
class C    Kendo
class C    Newspaper

We compare our synthesis method to the reference VSRS-1DFast in the 3D-HEVC test model, HTM. We evaluate the performance of the reference and the proposed methods using the Bjontegaard delta-PSNR (BD-PSNR) [50] metric on the synthesized views. The PSNR is evaluated against the original intermediate views. Evaluating our synthesis against frames synthesized from uncompressed views, as indicated by the CTCs, would penalize the lack of artifacts that arise from disparity warping, which are present in both compressed and uncompressed VSRS synthesis. The rate in the reference method is the sum of the rates needed to code the left and right views with their associated depth videos.
The same rate is considered in the proposed method, to which is added the rate needed to code the first frame in each GOP of the intermediate view. We use the BD-PSNR metric to measure the improvement (see Figure 7).

B. Synthesis results

Table III gives the BD-PSNR values obtained with the two prediction schemes with simple merging ("Direct" and "Hierarchical") and with Adaptive Fusion applied in the Hierarchical scheme ("HierarchicalAF"), when considering only the PSNR of the 1/2 intermediary view synthesized with VSTP. In Table IV we show the BD-PSNR for the 3 intermediary views. Here, the PSNR is computed as the average over the 3 views (1/4 and 3/4 synthesized with VSRS-1DFast and 1/2 with VSTP). A positive value in this table indicates a gain. On average, our method brings a 0.53 dB, 0.59 dB and 0.87 dB BD-PSNR increase with the Direct and Hierarchical schemes with simple temporal prediction merging, and the Hierarchical scheme with the Adaptive Fusion method, respectively, compared to the reference VSRS-1DFast method. In the last column of the table (HierAF+HierSynth) we show the BD-PSNR obtained if we synthesize the 1/4 and 3/4 virtual views from the left base view and our VSTP synthesis, and from the VSTP synthesis and the right base view, respectively. The depth map for the 1/2 view is synthesized from the right and left base views. By employing this hierarchical synthesis we take advantage of the higher quality of our rendering method to improve the 1/4 and 3/4 views without modifying the rate. The delta-PSNR between the reference and ours for the 1/4 and 3/4 views is -0.09 dB, -0.01 dB and 1.58 dB for the Balloons, Kendo and Newspaper sequences, respectively, on average over all QPs. As expected, these results are consistent with the BD-PSNR reported in Table IV (HierAF+HierSynth compared to HierarchicalAF), since the rate is not modified. Note that the 5-view test case scenario no longer contains the PoznanHall2 sequence. This is due to using original views as reference for evaluating the PSNR of the 1/4 and 3/4 views, which in the case of the PoznanHall2 sequence are not available.
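The BD-PSNR metric used throughout these comparisons can be sketched as follows. This is the commonly used formulation of Bjontegaard's metric (a cubic polynomial fit of PSNR against log-rate, integrated over the common rate range); it is a generic sketch, not the exact script used in our experiments.

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Bjontegaard delta-PSNR: average PSNR gain of the test curve over
    the reference curve across their overlapping rate range.

    Each curve is a set of (rate, PSNR) points, one per QP. PSNR is
    modeled as a cubic polynomial in log10(rate), and the two fits are
    integrated over the common log-rate interval.
    """
    lr_ref = np.log10(np.asarray(rate_ref, dtype=float))
    lr_test = np.log10(np.asarray(rate_test, dtype=float))
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)
    p_test = np.polyfit(lr_test, psnr_test, 3)
    # integrate both fits over the overlapping log-rate interval
    lo = max(lr_ref.min(), lr_test.min())
    hi = min(lr_ref.max(), lr_test.max())
    P_ref, P_test = np.polyint(p_ref), np.polyint(p_test)
    int_ref = np.polyval(P_ref, hi) - np.polyval(P_ref, lo)
    int_test = np.polyval(P_test, hi) - np.polyval(P_test, lo)
    return (int_test - int_ref) / (hi - lo)
```

A curve identical to the reference yields 0 dB, and a curve uniformly 1 dB above it yields a BD-PSNR of 1 dB, which is the sense in which the gains in Tables III and IV should be read.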
As discussed in Section III-E, synthesis is proven to be more efficient. However, the quality of an encoded view is always higher than that of a synthesis: we obtained 38.50 dB PSNR for direct 3D-HEVC encoding, compared to 35.81 dB PSNR for VSTP synthesis and 32.99 dB PSNR for VSRS-1DFast synthesis, on average over all sequences and all QPs.

TABLE III
BD-PSNR VALUES FOR A 3 VIEW TEST CASE, OBTAINED WITH BOTH PREDICTION SCHEMES AND ADAPTIVE FUSION IN THE PROPOSED METHOD COMPARED WITH THE REFERENCE VSRS-1DFAST METHOD.

Sequence       BD-PSNR (in dB)
               Direct    Hierarchical    HierarchicalAF
Balloons
Kendo
Newspaper
PoznanHall2
Average

The Rate Distortion (RD) curves for the reference and the proposed method (for both schemes and merging methods) are given in Figure 7. We can see that while both schemes with simple merging outperform the reference method for Balloons and Newspaper, our method outperforms the reference only with the Hierarchical scheme with adaptive fusion for Kendo. This is also reflected in the BD-PSNR values for this sequence, which are only positive for the Hierarchical scheme with adaptive fusion, as shown in Table III. Using the Adaptive Fusion method with the Hierarchical scheme brings high

additional gains for the Kendo sequence and moderate additional gains for the Balloons and Newspaper sequences. This is expected, because the fusion method was designed with the main goal of correcting bad temporal predictions caused by high intensity motion, as is the case in the Kendo sequence. To better evaluate our method we perform an additional test. Since VSTP synthesis requires information to be sent through the bitstream, mainly one frame per GOP, we perform a direct comparison between the encoding of a dependent view and our VSTP synthesis. The results indicate that we are able to outperform the encoding at low bitrates. This is possible because encoding errors at low bitrates have a greater impact on image quality than synthesis errors, while at the same time synthesis provides a better rate. The tests were performed on the Balloons, Kendo and Newspaper sequences for QPs ranging from 50 to 35, and we obtained 1.33, 1.061 and 0.62 dB BD-PSNR gains over 3D-HEVC, respectively, for each sequence. Figure 8 shows, for the four tested sequences, the variation of the PSNR of the synthesized view over time with the reference and the proposed method (both schemes and the Hierarchical scheme with Adaptive Fusion). Only one QP (25) is represented for simplicity, as the behavior of any curve is similar across all QPs. In the proposed method and for all sequences, we notice periodic peaks in the synthesized view PSNR, which correspond to the first frame of each GOP. Since these frames are not synthesized but rather decoded, their PSNR is higher than that of any other frame in the GOP. For the Balloons, NewspaperCC and PoznanHall2 sequences, the proposed method outperforms the reference VSRS-1DFast rendering for most frames.

TABLE IV
BD-PSNR VALUES FOR A 5 VIEW TEST CASE, OBTAINED WITH BOTH PREDICTION SCHEMES, ADAPTIVE FUSION AND HIERARCHICAL SYNTHESIS IN THE PROPOSED METHOD COMPARED WITH THE REFERENCE VSRS-1DFAST METHOD.

Sequence     BD-PSNR (in dB)
             Direct    Hierarchical    HierarchicalAF    HierAF + HierSynth
Balloons
Kendo
Newspaper
Average
For the Kendo sequence, our method is better only in certain parts. Figure 9 shows two side-by-side examples of ideal and real fusion maps for the Kendo and NewspaperCC sequences. The ideal fusion map displayed here only shows, in black, the pixels that, if replaced by inter-view prediction, would have their absolute error decreased by at least 5 (we ignore small gains). We can see that our map is consistent with the ideal map for correcting high errors. This is also shown in Figure 10, where we display the difference between the absolute errors of the temporal and inter-view predictions for the same frame of the Kendo sequence. Positive values indicate that inter-view prediction is better, and we can see a correspondence between high values and our fusion map.

Fig. 7. RD curves of the reference and proposed method in the 5 view test scenario for the Balloons, Kendo and NewspaperCC sequences.

Fig. 8. Variation of the PSNR of the middle synthesized view over time for the reference and proposed method at QP 25.

Figure 11 shows parts of frames synthesized using the reference and the proposed method with the Hierarchical scheme, and Figure 12 shows parts of frames using the proposed method with and without adaptive fusion. For fairness of comparison, for our method, we show frames that are actually synthesized and not decoded. We can notice a clear improvement in the synthesis quality with our method: the artifacts obtained with VSRS-1DFast (highlighted in red in the figures) are efficiently removed, and artifacts in our method are also removed when using the adaptive fusion.

C. Results interpretation

The Adaptive Fusion method with the Hierarchical scheme brings high gains in BD-PSNR. To better describe our results, we refer to an ideal case where we use the original frames to create a fusion map in which we mark all the pixels that have a lower error in the inter-view prediction compared to the temporal one; for simplicity we only test 3 seconds from each sequence. As a means of verifying the quality of our obtained fusion map, we compute the difference between the mean absolute errors (MAE) of the pixels marked by a fusion map for the temporal and inter-view predictions, referred to as ΔMAE, as shown in the following equation, where Î is either the temporal or the inter-view prediction, B is the binary fusion map, and Î_t, Î_i and I are the temporal prediction, the inter-view prediction and the original frame, respectively.
MAE(Î, B) = ( Σ_{x=1}^{M} Σ_{y=1}^{N} B(x,y) |Î(x,y) − I(x,y)| ) / ( Σ_{x=1}^{M} Σ_{y=1}^{N} B(x,y) ),  or 0 when B(x,y) = 0 everywhere

ΔMAE(Î_t, Î_i, B) = MAE(Î_t, B) − MAE(Î_i, B)    (10)

Table V shows the percentages of replaced pixels and the ΔMAE reduction for our method and the ideal case. The values in Table V are the averages over all QPs. For example, let us

consider the Kendo sequence at QP 25. On average for this case, 25.39% of the pixels in a frame are better predicted with inter-view prediction; our method selects 3.48% of the pixels to be replaced by inter-view prediction, out of which 1.6% is a bad selection (the temporal prediction was actually giving better results and we replaced it with the inter-view prediction). Note that the 25.39% ideally selected pixels include predicted areas which are better only by a small margin. Our selection, however, focuses on correcting high errors. Even though parts of our replaced areas are actually worse predictions and increase the MAE, overall we still obtain a positive ΔMAE, which shows we are correcting the high errors, as also shown in Figures 9 and 10. For the Balloons and Newspaper sequences, where the introduction of Adaptive Fusion brings a small additional increase in BD-PSNR, we have a smaller percentage of replaced pixels with a small ΔMAE, in contrast to the Kendo sequence, where this method brings a high additional increase in BD-PSNR. For the PoznanHall2 sequence we have a similar result in BD-PSNR; the Direct and Hierarchical schemes already provide a very good result due to low intensity motion.

Fig. 9. Fusion maps for frame 4 in the Kendo and Newspaper sequences, at QPs 30 and 25 respectively. Pixels in the temporal prediction that are replaced with inter-view prediction are black. Figures 9(a) and 9(c) are the ideal maps, in which inter-view prediction is only selected if it corrects high temporal errors (the original view was used for this computation). Figures 9(b) and 9(d) are obtained with the Adaptive Fusion method.

TABLE V
ADAPTIVE FUSION RESULTS: PERCENTAGE OF REPLACED PIXELS AND ΔMAE GAINS FOR OUR METHOD AND THE IDEAL CASE IN WHICH THE FUSION MAP IS DETERMINED USING THE ORIGINAL VIEW.

Sequence       Inter-view predicted pixels (%)    ΔMAE
               Real         Ideal                 Real         Ideal
Balloons
Kendo
Newspaper
PoznanHall2
Average
Here, the Adaptive Fusion method corrects some small temporal prediction errors but also introduces inter-view prediction errors, which explains why we have a negative ΔMAE over the replaced pixels in this sequence. Note that the number of replaced pixels is smaller compared to the other sequences, only 0.81% of a frame on average; thus the quality of the entire image is affected only by a small margin. The results of Table III and the RD curves in Figure 7 show that the Hierarchical scheme outperforms the Direct

scheme, which was expected, since the temporal prediction distances are shorter in the first scheme. Note that in a GOP of 8 frames, the fifth frame is synthesized in the same way in both schemes, which is why the curves of Figure 8 corresponding to the two schemes intersect not only at the first frame of each GOP but also at the fifth frame.

Fig. 10. Difference between inter-view and temporal prediction errors (ΔMAE) for frame 4 of the Kendo sequence, QP 30.

Our method improves the quality of the synthesis on three levels. First, it accounts for a difference in illumination between the coded reference views and the synthesized view, which rendering techniques such as VSRS-1DFast cannot do. Indeed, while VSRS-1DFast cannot warp a different illumination level from the reference views into the synthesized view, our method propagates the correct illumination level of the sent key frames across the rest of the frames using motion compensation. Second, our method fills holes due to disocclusions more efficiently than VSRS-1DFast. Indeed, these holes are filled using inpainting in the latter, hence creating artifacts such as the ones highlighted in Figure 11. In our method, the disocclusion areas can be found in previously synthesized frames. Third, foreground objects are better rendered because the method is less sensitive to depth distortions: we use disparity to warp dense MVFs rather than directly warping the texture (cf. Figures 11(e), 11(f), 11(g), 11(h)). In addition, VSTP brings texture information from different time instants that cannot be obtained from inter-view prediction. The fusion between the two prediction types reduces the chance of having residual holes in the final synthesis. This explains how our method efficiently removes the aforementioned artifacts, as shown in Figure 11. Also, subjective viewing of the sequences has shown that there are no flickering effects with our method.
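The masked ΔMAE criterion of Equation (10), used throughout this interpretation of the fusion maps, can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def delta_mae(pred_t, pred_i, orig, B):
    """Eq. (10): masked mean absolute error of each prediction over the
    pixels marked in the binary fusion map B, and their difference.

    A positive result means the marked pixels are better predicted
    inter-view than temporally.
    """
    def masked_mae(pred):
        n = B.sum()
        if n == 0:
            # no marked pixels: the masked MAE is defined as 0
            return 0.0
        return float((B * np.abs(pred - orig)).sum() / n)

    return masked_mae(pred_t) - masked_mae(pred_i)
```

Applied with the fusion map produced by Adaptive Fusion, a positive value confirms that the replaced pixels were, on the whole, better served by the inter-view prediction, which is the check reported in Table V.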
Fig. 11. Parts of frames synthesized with the reference VSRS-1DFast and the proposed method: (a) Balloons - VSRS - QP 25, (b) Balloons - VSTP Direct - QP 25, (c) Kendo - VSRS - QP 30, (d) Kendo - VSTP Hierarchical - QP 30, (e) Newspaper - VSRS - QP 35, (f) Newspaper - VSTP Hierarchical - QP 35, (g) PoznanHall2 - VSRS - QP 30, (h) PoznanHall2 - VSTP Adaptive Fusion - QP 30. Highlighted artifacts in VSRS-1DFast (Figures 11(a), 11(c), 11(e) and 11(g)) are efficiently removed in our method (Figures 11(b), 11(d), 11(f) and 11(h)).

A synthesis example can be downloaded for viewing at the following links: cagnazzo/vss.zip and cagnazzo/vstp.zip for VSRS-1DFast and VSTP, respectively. Our method is inherently more complex than VSRS-1DFast due to the dense motion estimation / compensation stage. Shortcuts that can reduce the complexity of our method, at the

price of losing some prediction accuracy, include block-based motion estimation/compensation and uni-predictive motion compensation (predicting using only a past frame, or only a future frame).

V. CONCLUSION AND FUTURE WORK

In this paper, we presented a view synthesis technique that exploits temporal prediction in order to improve the quality of the synthesis. Namely, some key frames of the synthesized view are encoded in the bitstream, and the rest are interpolated using motion compensation with vectors warped from reference views. Four predictions, using the left and right reference views and a past and a future time instant, can be constructed and then merged together into a single prediction of the synthesized frame. Two prediction schemes, referred to as Direct and Hierarchical, have been presented in this work. The first synthesizes frames using motion compensation only with key frames, while the other motion compensates with previously synthesized frames, hence reducing the prediction distances. We also introduced a prediction merging method, referred to as Adaptive Fusion, that selects between inter-view and temporal prediction, thus removing some of the motion estimation errors. Our method brings 0.53 dB and 0.59 dB PSNR increases with the Direct and Hierarchical schemes respectively, and 0.87 dB PSNR with the Hierarchical scheme and Adaptive Fusion, on average over several test sequences, compared to the state-of-the-art VSRS-1DFast software under 3D-HEVC standards. Furthermore, the MVF precision on frames with high intensity motion can be improved by using a better motion estimation technique or an adaptive GOP size with respect to motion intensity. The Adaptive Fusion method can be further improved by finding a better inter-view/temporal selection criterion.
Additional adjacent views that are not available at the encoder side can be further improved by deriving the vector fields required to directly predict their frames from the key frames. Finally, the frequency at which key frames are sent in our method, which in the current version follows the GOP structure used for coding the reference views, can be modified: lower frequencies allow bitrate savings, since fewer key frames will be sent, but they also imply motion estimation between distant frames, which will decrease the prediction accuracy. Finding a good trade-off for this parameter is an interesting subject for future research.

Fig. 12. Parts of frames synthesized with and without Adaptive Fusion: (a) Balloons - VSTP - QP 25, (b) Balloons - VSTP Adaptive Fusion - QP 25, (c) Kendo - VSTP - QP 30, (d) Kendo - VSTP Adaptive Fusion - QP 30, (e) Newspaper - VSTP - QP 25, (f) Newspaper - VSTP Adaptive Fusion - QP 25, (g) PoznanHall2 - VSTP - QP 30, (h) PoznanHall2 - VSTP Adaptive Fusion - QP 30. Highlighted artifacts after merging the temporal predictions (Figures 12(a), 12(c), 12(e) and 12(g)) are efficiently removed when using Adaptive Fusion (Figures 12(b), 12(d), 12(f) and 12(h)).

REFERENCES

[1] F. Dufaux, B. Pesquet-Popescu, and M. Cagnazzo, Eds., Emerging Technologies for 3D Video: Content Creation, Coding, Transmission and Rendering. Wiley, May.
[2] M. Tanimoto, M. P. Tehrani, T. Fujii, and T. Yendo, "Free-Viewpoint TV," IEEE Signal Processing Magazine, vol. 28.
[3] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, "Multi-view video plus depth representation and coding," in IEEE International Conference on Image Processing, vol. 1.
[4] C. Fehn, "A 3D-TV approach using depth-image-based rendering," in 3rd IASTED Conference on Visualization, Imaging, and Image Processing, Benalmadena, Spain, 8-10 September 2003.
[5] H. Shum and S. B. Kang, "Review of image-based rendering techniques," SPIE Visual Communications and Image Processing, vol. 4067, pp. 2-13. [Online]. Available:


More information

Effective Missing Data Prediction for Collaborative Filtering

Effective Missing Data Prediction for Collaborative Filtering Effective Missing Data Pediction fo Collaboative Filteing Hao Ma, Iwin King and Michael R. Lyu Dept. of Compute Science and Engineeing The Chinese Univesity of Hong Kong Shatin, N.T., Hong Kong { hma,

More information

Lecture # 04. Image Enhancement in Spatial Domain

Lecture # 04. Image Enhancement in Spatial Domain Digital Image Pocessing CP-7008 Lectue # 04 Image Enhancement in Spatial Domain Fall 2011 2 domains Spatial Domain : (image plane) Techniques ae based on diect manipulation of pixels in an image Fequency

More information

A Full-mode FME VLSI Architecture Based on 8x8/4x4 Adaptive Hadamard Transform For QFHD H.264/AVC Encoder

A Full-mode FME VLSI Architecture Based on 8x8/4x4 Adaptive Hadamard Transform For QFHD H.264/AVC Encoder 20 IEEE/IFIP 9th Intenational Confeence on VLSI and System-on-Chip A Full-mode FME VLSI Achitectue Based on 8x8/ Adaptive Hadamad Tansfom Fo QFHD H264/AVC Encode Jialiang Liu, Xinhua Chen College of Infomation

More information

A Recommender System for Online Personalization in the WUM Applications

A Recommender System for Online Personalization in the WUM Applications A Recommende System fo Online Pesonalization in the WUM Applications Mehdad Jalali 1, Nowati Mustapha 2, Ali Mamat 2, Md. Nasi B Sulaiman 2 Abstact foeseeing of use futue movements and intentions based

More information

Fifth Wheel Modelling and Testing

Fifth Wheel Modelling and Testing Fifth heel Modelling and Testing en Masoy Mechanical Engineeing Depatment Floida Atlantic Univesity Boca aton, FL 4 Lois Malaptias IFMA Institut Fancais De Mechanique Advancee ampus De lemont Feand Les

More information

10/29/2010. Rendering techniques. Global Illumination. Local Illumination methods. Today : Global Illumination Modules and Methods

10/29/2010. Rendering techniques. Global Illumination. Local Illumination methods. Today : Global Illumination Modules and Methods Rendeing techniques Compute Gaphics Lectue 10 Can be classified as Local Illumination techniques Global Illumination techniques Global Illumination 1: Ray Tacing and Radiosity Taku Komua 1 Local Illumination

More information

Communication vs Distributed Computation: an alternative trade-off curve

Communication vs Distributed Computation: an alternative trade-off curve Communication vs Distibuted Computation: an altenative tade-off cuve Yahya H. Ezzeldin, Mohammed amoose, Chistina Fagouli Univesity of Califonia, Los Angeles, CA 90095, USA, Email: {yahya.ezzeldin, mkamoose,

More information

Point-Biserial Correlation Analysis of Fuzzy Attributes

Point-Biserial Correlation Analysis of Fuzzy Attributes Appl Math Inf Sci 6 No S pp 439S-444S (0 Applied Mathematics & Infomation Sciences An Intenational Jounal @ 0 NSP Natual Sciences Publishing o Point-iseial oelation Analysis of Fuzzy Attibutes Hao-En hueh

More information

Scaling Location-based Services with Dynamically Composed Location Index

Scaling Location-based Services with Dynamically Composed Location Index Scaling Location-based Sevices with Dynamically Composed Location Index Bhuvan Bamba, Sangeetha Seshadi and Ling Liu Distibuted Data Intensive Systems Laboatoy (DiSL) College of Computing, Geogia Institute

More information

Performance Optimization in Structured Wireless Sensor Networks

Performance Optimization in Structured Wireless Sensor Networks 5 The Intenational Aab Jounal of Infomation Technology, Vol. 6, o. 5, ovembe 9 Pefomance Optimization in Stuctued Wieless Senso etwoks Amine Moussa and Hoda Maalouf Compute Science Depatment, ote Dame

More information

IP Multicast Simulation in OPNET

IP Multicast Simulation in OPNET IP Multicast Simulation in OPNET Xin Wang, Chien-Ming Yu, Henning Schulzinne Paul A. Stipe Columbia Univesity Reutes Depatment of Compute Science 88 Pakway Dive South New Yok, New Yok Hauppuage, New Yok

More information

Frequency Domain Approach for Face Recognition Using Optical Vanderlugt Filters

Frequency Domain Approach for Face Recognition Using Optical Vanderlugt Filters Optics and Photonics Jounal, 016, 6, 94-100 Published Online August 016 in SciRes. http://www.scip.og/jounal/opj http://dx.doi.og/10.436/opj.016.68b016 Fequency Domain Appoach fo Face Recognition Using

More information

A modal estimation based multitype sensor placement method

A modal estimation based multitype sensor placement method A modal estimation based multitype senso placement method *Xue-Yang Pei 1), Ting-Hua Yi 2) and Hong-Nan Li 3) 1),)2),3) School of Civil Engineeing, Dalian Univesity of Technology, Dalian 116023, China;

More information

Fast quality-guided flood-fill phase unwrapping algorithm for three-dimensional fringe pattern profilometry

Fast quality-guided flood-fill phase unwrapping algorithm for three-dimensional fringe pattern profilometry Univesity of Wollongong Reseach Online Faculty of Infomatics - Papes (Achive) Faculty of Engineeing and Infomation Sciences 2010 Fast quality-guided flood-fill phase unwapping algoithm fo thee-dimensional

More information

Assessment of Track Sequence Optimization based on Recorded Field Operations

Assessment of Track Sequence Optimization based on Recorded Field Operations Assessment of Tack Sequence Optimization based on Recoded Field Opeations Matin A. F. Jensen 1,2,*, Claus G. Søensen 1, Dionysis Bochtis 1 1 Aahus Univesity, Faculty of Science and Technology, Depatment

More information

HISTOGRAMS are an important statistic reflecting the

HISTOGRAMS are an important statistic reflecting the JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 D 2 HistoSketch: Disciminative and Dynamic Similaity-Peseving Sketching of Steaming Histogams Dingqi Yang, Bin Li, Laua Rettig, and Philippe

More information

Multi-azimuth Prestack Time Migration for General Anisotropic, Weakly Heterogeneous Media - Field Data Examples

Multi-azimuth Prestack Time Migration for General Anisotropic, Weakly Heterogeneous Media - Field Data Examples Multi-azimuth Pestack Time Migation fo Geneal Anisotopic, Weakly Heteogeneous Media - Field Data Examples S. Beaumont* (EOST/PGS) & W. Söllne (PGS) SUMMARY Multi-azimuth data acquisition has shown benefits

More information

Color Interpolation for Single CCD Color Camera

Color Interpolation for Single CCD Color Camera Colo Intepolation fo Single CCD Colo Camea Yi-Ming Wu, Chiou-Shann Fuh, and Jui-Pin Hsu Depatment of Compute Science and Infomation Engineeing, National Taian Univesit, Taipei, Taian Email: 88036@csie.ntu.edu.t;

More information

Environment Mapping. Overview

Environment Mapping. Overview Envionment Mapping 1 Oveview Intoduction Envionment map constuction sphee mapping Envionment mapping applications distant geomety eflections 2 1 Oveview Intoduction Envionment map constuction sphee mapping

More information

ANALYTIC PERFORMANCE MODELS FOR SINGLE CLASS AND MULTIPLE CLASS MULTITHREADED SOFTWARE SERVERS

ANALYTIC PERFORMANCE MODELS FOR SINGLE CLASS AND MULTIPLE CLASS MULTITHREADED SOFTWARE SERVERS ANALYTIC PERFORMANCE MODELS FOR SINGLE CLASS AND MULTIPLE CLASS MULTITHREADED SOFTWARE SERVERS Daniel A Menascé Mohamed N Bennani Dept of Compute Science Oacle, Inc Geoge Mason Univesity 1211 SW Fifth

More information

A Minutiae-based Fingerprint Matching Algorithm Using Phase Correlation

A Minutiae-based Fingerprint Matching Algorithm Using Phase Correlation A Minutiae-based Fingepint Matching Algoithm Using Phase Coelation Autho Chen, Weiping, Gao, Yongsheng Published 2007 Confeence Title Digital Image Computing: Techniques and Applications DOI https://doi.og/10.1109/dicta.2007.4426801

More information

The Internet Ecosystem and Evolution

The Internet Ecosystem and Evolution The Intenet Ecosystem and Evolution Contents Netwok outing: basics distibuted/centalized, static/dynamic, linkstate/path-vecto inta-domain/inte-domain outing Mapping the sevice model to AS-AS paths valley-fee

More information

Extract Object Boundaries in Noisy Images using Level Set. Final Report

Extract Object Boundaries in Noisy Images using Level Set. Final Report Extact Object Boundaies in Noisy Images using Level Set by: Quming Zhou Final Repot Submitted to Pofesso Bian Evans EE381K Multidimensional Digital Signal Pocessing May 10, 003 Abstact Finding object contous

More information

A ROI Focusing Mechanism for Digital Cameras

A ROI Focusing Mechanism for Digital Cameras A ROI Focusing Mechanism fo Digital Cameas Chu-Hui Lee, Meng-Feng Lin, Chun-Ming Huang, and Chun-Wei Hsu Abstact With the development and application of digital technologies, the digital camea is moe popula

More information

Development and Analysis of a Real-Time Human Motion Tracking System

Development and Analysis of a Real-Time Human Motion Tracking System Development and Analysis of a Real-Time Human Motion Tacking System Jason P. Luck 1,2 Chistian Debunne 1 William Hoff 1 Qiang He 1 Daniel E. Small 2 1 Coloado School of Mines 2 Sandia National Labs Engineeing

More information

5 4 THE BERNOULLI EQUATION

5 4 THE BERNOULLI EQUATION 185 CHATER 5 the suounding ai). The fictional wok tem w fiction is often expessed as e loss to epesent the loss (convesion) of mechanical into themal. Fo the idealied case of fictionless motion, the last

More information

Topological Characteristic of Wireless Network

Topological Characteristic of Wireless Network Topological Chaacteistic of Wieless Netwok Its Application to Node Placement Algoithm Husnu Sane Naman 1 Outline Backgound Motivation Papes and Contibutions Fist Pape Second Pape Thid Pape Futue Woks Refeences

More information

Effective Data Co-Reduction for Multimedia Similarity Search

Effective Data Co-Reduction for Multimedia Similarity Search Effective Data Co-Reduction fo Multimedia Similaity Seach Zi Huang Heng Tao Shen Jiajun Liu Xiaofang Zhou School of Infomation Technology and Electical Engineeing The Univesity of Queensland, QLD 472,

More information

Improvement of First-order Takagi-Sugeno Models Using Local Uniform B-splines 1

Improvement of First-order Takagi-Sugeno Models Using Local Uniform B-splines 1 Impovement of Fist-ode Takagi-Sugeno Models Using Local Unifom B-splines Felipe Fenández, Julio Gutiéez, Gacián Tiviño and Juan Calos Cespo Dep. Tecnología Fotónica, Facultad de Infomática Univesidad Politécnica

More information

Any modern computer system will incorporate (at least) two levels of storage:

Any modern computer system will incorporate (at least) two levels of storage: 1 Any moden compute system will incopoate (at least) two levels of stoage: pimay stoage: andom access memoy (RAM) typical capacity 32MB to 1GB cost pe MB $3. typical access time 5ns to 6ns bust tansfe

More information

AN ANALYSIS OF COORDINATED AND NON-COORDINATED MEDIUM ACCESS CONTROL PROTOCOLS UNDER CHANNEL NOISE

AN ANALYSIS OF COORDINATED AND NON-COORDINATED MEDIUM ACCESS CONTROL PROTOCOLS UNDER CHANNEL NOISE AN ANALYSIS OF COORDINATED AND NON-COORDINATED MEDIUM ACCESS CONTROL PROTOCOLS UNDER CHANNEL NOISE Tolga Numanoglu, Bulent Tavli, and Wendi Heinzelman Depatment of Electical and Compute Engineeing Univesity

More information

INFORMATION DISSEMINATION DELAY IN VEHICLE-TO-VEHICLE COMMUNICATION NETWORKS IN A TRAFFIC STREAM

INFORMATION DISSEMINATION DELAY IN VEHICLE-TO-VEHICLE COMMUNICATION NETWORKS IN A TRAFFIC STREAM INFORMATION DISSEMINATION DELAY IN VEHICLE-TO-VEHICLE COMMUNICATION NETWORKS IN A TRAFFIC STREAM LiLi Du Depatment of Civil, Achitectual, and Envionmental Engineeing Illinois Institute of Technology 3300

More information

A Novel Image-Based Rendering System With A Longitudinally Aligned Camera Array

A Novel Image-Based Rendering System With A Longitudinally Aligned Camera Array EUOGAPHICS 2 / A. de Sousa, J.C. Toes Shot Pesentations A Novel Image-Based endeing System With A Longitudinally Aligned Camea Aay Jiang Li, Kun Zhou, Yong Wang and Heung-Yeung Shum Micosoft eseach, China

More information

Cellular Neural Network Based PTV

Cellular Neural Network Based PTV 3th Int Symp on Applications of Lase Techniques to Fluid Mechanics Lisbon, Potugal, 6-9 June, 006 Cellula Neual Netwok Based PT Kazuo Ohmi, Achyut Sapkota : Depatment of Infomation Systems Engineeing,

More information

FACE VECTORS OF FLAG COMPLEXES

FACE VECTORS OF FLAG COMPLEXES FACE VECTORS OF FLAG COMPLEXES ANDY FROHMADER Abstact. A conjectue of Kalai and Eckhoff that the face vecto of an abitay flag complex is also the face vecto of some paticula balanced complex is veified.

More information

DEADLOCK AVOIDANCE IN BATCH PROCESSES. M. Tittus K. Åkesson

DEADLOCK AVOIDANCE IN BATCH PROCESSES. M. Tittus K. Åkesson DEADLOCK AVOIDANCE IN BATCH PROCESSES M. Tittus K. Åkesson Univesity College Boås, Sweden, e-mail: Michael.Tittus@hb.se Chalmes Univesity of Technology, Gothenbug, Sweden, e-mail: ka@s2.chalmes.se Abstact:

More information

A Shape-preserving Affine Takagi-Sugeno Model Based on a Piecewise Constant Nonuniform Fuzzification Transform

A Shape-preserving Affine Takagi-Sugeno Model Based on a Piecewise Constant Nonuniform Fuzzification Transform A Shape-peseving Affine Takagi-Sugeno Model Based on a Piecewise Constant Nonunifom Fuzzification Tansfom Felipe Fenández, Julio Gutiéez, Juan Calos Cespo and Gacián Tiviño Dep. Tecnología Fotónica, Facultad

More information

(a, b) x y r. For this problem, is a point in the - coordinate plane and is a positive number.

(a, b) x y r. For this problem, is a point in the - coordinate plane and is a positive number. Illustative G-C Simila cicles Alignments to Content Standads: G-C.A. Task (a, b) x y Fo this poblem, is a point in the - coodinate plane and is a positive numbe. a. Using a tanslation and a dilation, show

More information

ADDING REALISM TO SOURCE CHARACTERIZATION USING A GENETIC ALGORITHM

ADDING REALISM TO SOURCE CHARACTERIZATION USING A GENETIC ALGORITHM ADDING REALISM TO SOURCE CHARACTERIZATION USING A GENETIC ALGORITHM Luna M. Rodiguez*, Sue Ellen Haupt, and Geoge S. Young Depatment of Meteoology and Applied Reseach Laboatoy The Pennsylvania State Univesity,

More information

Effects of Model Complexity on Generalization Performance of Convolutional Neural Networks

Effects of Model Complexity on Generalization Performance of Convolutional Neural Networks Effects of Model Complexity on Genealization Pefomance of Convolutional Neual Netwoks Tae-Jun Kim 1, Dongsu Zhang 2, and Joon Shik Kim 3 1 Seoul National Univesity, Seoul 151-742, Koea, E-mail: tjkim@bi.snu.ac.k

More information

MULTI-TEMPORAL AND MULTI-SENSOR IMAGE MATCHING BASED ON LOCAL FREQUENCY INFORMATION

MULTI-TEMPORAL AND MULTI-SENSOR IMAGE MATCHING BASED ON LOCAL FREQUENCY INFORMATION Intenational Achives of the Photogammety Remote Sensing and Spatial Infomation Sciences Volume XXXIX-B3 2012 XXII ISPRS Congess 25 August 01 Septembe 2012 Melboune Austalia MULTI-TEMPORAL AND MULTI-SENSOR

More information

POMDP: Introduction to Partially Observable Markov Decision Processes Hossein Kamalzadeh, Michael Hahsler

POMDP: Introduction to Partially Observable Markov Decision Processes Hossein Kamalzadeh, Michael Hahsler POMDP: Intoduction to Patially Obsevable Makov Decision Pocesses Hossein Kamalzadeh, Michael Hahsle 2019-01-02 The R package pomdp povides an inteface to pomdp-solve, a solve (witten in C) fo Patially

More information

Annales UMCS Informatica AI 2 (2004) UMCS

Annales UMCS Informatica AI 2 (2004) UMCS Pobane z czasopisma Annales AI- Infomatica http://ai.annales.umcs.pl Annales Infomatica AI 2 (2004) 33-340 Annales Infomatica Lublin-Polonia Sectio AI http://www.annales.umcs.lublin.pl/ Embedding as a

More information

A VECTOR PERTURBATION APPROACH TO THE GENERALIZED AIRCRAFT SPARE PARTS GROUPING PROBLEM

A VECTOR PERTURBATION APPROACH TO THE GENERALIZED AIRCRAFT SPARE PARTS GROUPING PROBLEM Accepted fo publication Intenational Jounal of Flexible Automation and Integated Manufactuing. A VECTOR PERTURBATION APPROACH TO THE GENERALIZED AIRCRAFT SPARE PARTS GROUPING PROBLEM Nagiza F. Samatova,

More information

Input Layer f = 2 f = 0 f = f = 3 1,16 1,1 1,2 1,3 2, ,2 3,3 3,16. f = 1. f = Output Layer

Input Layer f = 2 f = 0 f = f = 3 1,16 1,1 1,2 1,3 2, ,2 3,3 3,16. f = 1. f = Output Layer Using the Gow-And-Pune Netwok to Solve Poblems of Lage Dimensionality B.J. Biedis and T.D. Gedeon School of Compute Science & Engineeing The Univesity of New South Wales Sydney NSW 2052 AUSTRALIA bbiedis@cse.unsw.edu.au

More information

Signal integrity analysis and physically based circuit extraction of a mounted

Signal integrity analysis and physically based circuit extraction of a mounted emc design & softwae Signal integity analysis and physically based cicuit extaction of a mounted SMA connecto A poposed geneal appoach is given fo the definition of an equivalent cicuit with SMAs mounted

More information

Shortest Paths for a Two-Robot Rendez-Vous

Shortest Paths for a Two-Robot Rendez-Vous Shotest Paths fo a Two-Robot Rendez-Vous Eik L Wyntes Joseph S B Mitchell y Abstact In this pape, we conside an optimal motion planning poblem fo a pai of point obots in a plana envionment with polygonal

More information

Also available at ISSN (printed edn.), ISSN (electronic edn.) ARS MATHEMATICA CONTEMPORANEA 3 (2010)

Also available at  ISSN (printed edn.), ISSN (electronic edn.) ARS MATHEMATICA CONTEMPORANEA 3 (2010) Also available at http://amc.imfm.si ISSN 1855-3966 (pinted edn.), ISSN 1855-3974 (electonic edn.) ARS MATHEMATICA CONTEMPORANEA 3 (2010) 109 120 Fulleene patches I Jack E. Gave Syacuse Univesity, Depatment

More information

COLOR EDGE DETECTION IN RGB USING JOINTLY EUCLIDEAN DISTANCE AND VECTOR ANGLE

COLOR EDGE DETECTION IN RGB USING JOINTLY EUCLIDEAN DISTANCE AND VECTOR ANGLE COLOR EDGE DETECTION IN RGB USING JOINTLY EUCLIDEAN DISTANCE AND VECTOR ANGLE Slawo Wesolkowski Systems Design Engineeing Univesity of Wateloo Wateloo (Ont.), Canada, NL 3G s.wesolkowski@ieee.og Ed Jenigan

More information

Dense pointclouds from combined nadir and oblique imagery by object-based semi-global multi-image matching

Dense pointclouds from combined nadir and oblique imagery by object-based semi-global multi-image matching Dense pointclouds fom combined nadi and oblique imagey by object-based semi-global multi-image matching Y X Thomas Luhmann, Folkma Bethmann & Heidi Hastedt Jade Univesity of Applied Sciences, Oldenbug,

More information

Detection and tracking of ships using a stereo vision system

Detection and tracking of ships using a stereo vision system Scientific Reseach and Essays Vol. 8(7), pp. 288-303, 18 Febuay, 2013 Available online at http://www.academicjounals.og/sre DOI: 10.5897/SRE12.318 ISSN 1992-2248 2013 Academic Jounals Full Length Reseach

More information

XFVHDL: A Tool for the Synthesis of Fuzzy Logic Controllers

XFVHDL: A Tool for the Synthesis of Fuzzy Logic Controllers XFVHDL: A Tool fo the Synthesis of Fuzzy Logic Contolles E. Lago, C. J. Jiménez, D. R. López, S. Sánchez-Solano and A. Baiga Instituto de Micoelectónica de Sevilla. Cento Nacional de Micoelectónica, Edificio

More information

A New and Efficient 2D Collision Detection Method Based on Contact Theory Xiaolong CHENG, Jun XIAO a, Ying WANG, Qinghai MIAO, Jian XUE

A New and Efficient 2D Collision Detection Method Based on Contact Theory Xiaolong CHENG, Jun XIAO a, Ying WANG, Qinghai MIAO, Jian XUE 5th Intenational Confeence on Advanced Mateials and Compute Science (ICAMCS 2016) A New and Efficient 2D Collision Detection Method Based on Contact Theoy Xiaolong CHENG, Jun XIAO a, Ying WANG, Qinghai

More information

Structured Light Stereoscopic Imaging with Dynamic Pseudo-random Patterns

Structured Light Stereoscopic Imaging with Dynamic Pseudo-random Patterns Stuctued Light Steeoscopic Imaging with Dynamic Pseudo-andom Pattens Piee Payeu and Danick Desjadins Univesity of Ottawa, SITE, 800 King Edwad, Ottawa, ON, Canada, K1N 6N5 {ppayeu,ddesjad}@site.uottawa.ca

More information

On the Forwarding Area of Contention-Based Geographic Forwarding for Ad Hoc and Sensor Networks

On the Forwarding Area of Contention-Based Geographic Forwarding for Ad Hoc and Sensor Networks On the Fowading Aea of Contention-Based Geogaphic Fowading fo Ad Hoc and Senso Netwoks Dazhi Chen Depatment of EECS Syacuse Univesity Syacuse, NY dchen@sy.edu Jing Deng Depatment of CS Univesity of New

More information

Slotted Random Access Protocol with Dynamic Transmission Probability Control in CDMA System

Slotted Random Access Protocol with Dynamic Transmission Probability Control in CDMA System Slotted Random Access Potocol with Dynamic Tansmission Pobability Contol in CDMA System Intaek Lim 1 1 Depatment of Embedded Softwae, Busan Univesity of Foeign Studies, itlim@bufs.ac.k Abstact In packet

More information

An Optimised Density Based Clustering Algorithm

An Optimised Density Based Clustering Algorithm Intenational Jounal of Compute Applications (0975 8887) Volume 6 No.9, Septembe 010 An Optimised Density Based Clusteing Algoithm J. Hencil Pete Depatment of Compute Science St. Xavie s College, Palayamkottai,

More information

Hierarchically Clustered P2P Streaming System

Hierarchically Clustered P2P Streaming System Hieachically Clusteed P2P Steaming System Chao Liang, Yang Guo, and Yong Liu Polytechnic Univesity Thomson Lab Booklyn, NY 11201 Pinceton, NJ 08540 Abstact Pee-to-pee video steaming has been gaining populaity.

More information

9-2. Camera Calibration Method for Far Range Stereovision Sensors Used in Vehicles. Tiberiu Marita, Florin Oniga, Sergiu Nedevschi

9-2. Camera Calibration Method for Far Range Stereovision Sensors Used in Vehicles. Tiberiu Marita, Florin Oniga, Sergiu Nedevschi 9-2 Camea Calibation Method fo Fa Range Steeovision Sensos Used in Vehicles ibeiu Maita, Floin Oniga, Segiu Nedevschi Compute Science Depatment echnical Univesity of Cluj-Napoca Cluj-Napoca, 400020, ROMNI

More information

Approximating Euclidean Distance Transform with Simple Operations in Cellular Processor Arrays

Approximating Euclidean Distance Transform with Simple Operations in Cellular Processor Arrays 00 th Intenational Wokshop on Cellula Nanoscale Netwoks and thei Applications (CNNA) Appoximating Euclidean Distance Tansfom with Simple Opeations in Cellula Pocesso Aas Samad Razmjooei and Piot Dudek

More information

Image Registration among UAV Image Sequence and Google Satellite Image Under Quality Mismatch

Image Registration among UAV Image Sequence and Google Satellite Image Under Quality Mismatch 0 th Intenational Confeence on ITS Telecommunications Image Registation among UAV Image Sequence and Google Satellite Image Unde Quality Mismatch Shih-Ming Huang and Ching-Chun Huang Depatment of Electical

More information

Modelling, simulation, and performance analysis of a CAN FD system with SAE benchmark based message set

Modelling, simulation, and performance analysis of a CAN FD system with SAE benchmark based message set Modelling, simulation, and pefomance analysis of a CAN FD system with SAE benchmak based message set Mahmut Tenuh, Panagiotis Oikonomidis, Peiklis Chachalakis, Elias Stipidis Mugla S. K. Univesity, TR;

More information

UCLA Papers. Title. Permalink. Authors. Publication Date. Localized Edge Detection in Sensor Fields. https://escholarship.org/uc/item/3fj6g58j

UCLA Papers. Title. Permalink. Authors. Publication Date. Localized Edge Detection in Sensor Fields. https://escholarship.org/uc/item/3fj6g58j UCLA Papes Title Localized Edge Detection in Senso Fields Pemalink https://escholashipog/uc/item/3fj6g58j Authos K Chintalapudi Govindan Publication Date 3-- Pee eviewed escholashipog Poweed by the Califonia

More information

AUTOMATED LOCATION OF ICE REGIONS IN RADARSAT SAR IMAGERY

AUTOMATED LOCATION OF ICE REGIONS IN RADARSAT SAR IMAGERY AUTOMATED LOCATION OF ICE REGIONS IN RADARSAT SAR IMAGERY Chistophe Waceman (1), William G. Pichel (2), Pablo Clement-Colón (2) (1) Geneal Dynamics Advanced Infomation Systems, P.O. Box 134008 Ann Abo

More information

Conversion Functions for Symmetric Key Ciphers

Conversion Functions for Symmetric Key Ciphers Jounal of Infomation Assuance and Secuity 2 (2006) 41 50 Convesion Functions fo Symmetic Key Ciphes Deba L. Cook and Angelos D. Keomytis Depatment of Compute Science Columbia Univesity, mail code 0401

More information

Ranking Visualizations of Correlation Using Weber s Law

Ranking Visualizations of Correlation Using Weber s Law Ranking Visualizations of Coelation Using Webe s Law Lane Haison, Fumeng Yang, Steven Fanconei, Remco Chang Abstact Despite yeas of eseach yielding systems and guidelines to aid visualization design, pactitiones

More information

EYE DIRECTION BY STEREO IMAGE PROCESSING USING CORNEAL REFLECTION ON AN IRIS

EYE DIRECTION BY STEREO IMAGE PROCESSING USING CORNEAL REFLECTION ON AN IRIS EYE DIRECTION BY STEREO IMAGE PROCESSING USING CORNEAL REFLECTION ON AN IRIS Kumiko Tsuji Fukuoka Medical technology Teikyo Univesity 4-3-14 Shin-Katsutachi-Machi Ohmuta Fukuoka 836 Japan email: c746g@wisdomcckyushu-uacjp

More information

Mono Vision Based Construction of Elevation Maps in Indoor Environments

Mono Vision Based Construction of Elevation Maps in Indoor Environments 8th WSEAS Intenational onfeence on SIGNAL PROESSING, OMPUTATIONAL GEOMETRY and ARTIFIIAL VISION (ISGAV 08) Rhodes, Geece, August 0-, 008 Mono Vision Based onstuction of Elevation Maps in Indoo Envionments

More information

MapReduce Optimizations and Algorithms 2015 Professor Sasu Tarkoma

MapReduce Optimizations and Algorithms 2015 Professor Sasu Tarkoma apreduce Optimizations and Algoithms 2015 Pofesso Sasu Takoma www.cs.helsinki.fi Optimizations Reduce tasks cannot stat befoe the whole map phase is complete Thus single slow machine can slow down the

More information