Multi-Viewpoint Video Coding with MPEG-2 Compatibility

Belle L. Tseng and Dimitris Anastassiou
Columbia University, New York, N.Y. 10027, USA

Abstract

An efficient video coding scheme is presented as an extension of the MPEG-2 standard to accommodate the transmission of multiple viewpoint sequences on bandwidth-limited channels. With the goals of compression and speed, the proposed approach incorporates a variety of existing computer graphics tools and techniques. Construction of each viewpoint image is predicted using a combination of perspective projection of 3D models, texture mapping, and digital image warping. Immediate application of the coding specification is foreseeable in systems with hardware-based real-time rendering capabilities, thus providing fast and accurate constructions of multiple perspectives.

1 Introduction

Recent interest in 3D technologies prompts the addition of depth impression onto the otherwise common 2D video signals. Depth perception can be attributed to two major processes. The first is due to the two slightly different perspectives of the world offered simultaneously to our left and right eyes. The human visual system then converts these two stereo images into one single fused 3D perception. The other way to sense depth, even with only one eye-viewpoint, is through motion parallax. Due to motion from our head movements, the relative object displacements of the resulting perspective view are sufficient cues for deriving the 3D sensation. Accordingly, in our presentation, the depth appearance is contributed by both processes.

A multi-viewpoint video, multiview for short, is a 3D extension of the traditional movie sequence, in that there are multiple perspectives of the same scene at any one instant in time. Comparable to a movie made by a sequence of holograms, a multiview video offers a similar look-around capability. An ideal multiview system allows any user to watch a true 3D stereoscopic sequence from any perspective the viewer chooses.
Such a system has practical uses in interactive applications, medical technologies, educational and training demonstrations, remote sensing developments, and is a step towards virtual reality.

With the development of digital video technology, a video data compression standard, namely the second Moving Picture Experts Group specification (MPEG-2), has been adopted by the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU). MPEG-2 specifies the coding process for one video sequence; detailed descriptions can be found in [1]. Recently, MPEG-2 has also been shown to be applicable to two sequences of stereoscopic signals through the use of spatial and temporal scalability extensions [2, 3, 4, 5]. However, extending the number of video viewpoints beyond two cannot be done practically by using the same methodology. With this motivation, a novel multiview codec is presented to complement the MPEG-2 standard.

2 Geometric Definitions and Notations

An ordinary 2D video sequence offers only one perspective of the acquired scene. For a 3D viewing experience, at least 2 viewpoints are required to obtain the depth impression from the left and right perspectives of a stereoscopic signal. In a multiview system, multiple viewpoints are involved. Let $N$ be the number of viewpoint sequences, where the minimum number of viewpoints is equal to 3 for the presented specification, i.e., $N \geq 3$. The number of viewpoints must be extendable in the future, thus accommodation of additional viewpoints is an essential feature.

One central viewpoint image is designated to be the principal view from which the other multiview images are predicted. The central image is denoted by $I_C$ from viewpoint $V_C$. The central viewpoint, usually the middle viewpoint, is chosen so that its image has the highest collection of overlapping objects with each of the other viewpoint images. In this manner, the central viewpoint image can be used to interpolate and predict most of the other views. The other multiview images are designated by $I_X$ from viewpoints $V_X$. An example of four additional views is illustrated in Figure 1, where cameras positioned at viewpoints $V_X = \{V_L, V_R, V_T, V_B\}$ capture respective images $I_X = \{I_L, I_R, I_T, I_B\}$ corresponding to the Left, Right, Top, and Bottom.
These camera-captured images available at the encoder are referred to as real views, whereas images not directly taken by a camera, but derived by prediction or interpolation methods, are called virtual views. Virtual pictures include those viewpoint images seen between two real cameras; such virtual constructions allow the viewer to see a smoother video transition between two real views.

The image coordinate system corresponding to each viewpoint is defined as $(X_i, Y_i)$, where the viewpoint index $i \in \{C, L, R, T, B\}$. Let the global rectangular coordinate system $(X, Y, Z)$ be defined as corresponding to the image coordinate system $(X_C, Y_C)$ of the central viewpoint, where $Z$ defines the orthogonal axis from the central image plane. The camera position is represented by the global coordinates $(vx_i, vy_i, vz_i)$, and the camera zooming parameter is described by $va_i$. In addition, the camera rotations are given by the horizontal panning angle $vb_i$ and the vertical tilting angle $vc_i$. The total viewpoint vector for some viewpoint index $i$ is denoted as $V_i = [vx_i, vy_i, vz_i, va_i, vb_i, vc_i]$. The other viewpoint vectors are obtained relative to the central viewpoint, $V_C = [0, 0, 0, 1, 0, 0]$. Given the acquisition camera configurations, the other viewpoint vectors are relative translations and rotations with respect to the central viewpoint. If camera configurations are unknown, then locating a couple of fixed points between multiviews allows determination of the relative orientation transformation [6, 7].

3 Depth and Disparity Estimation

Possessing the depth of an object in one image allows for geometric prediction of the object location in all other viewpoint images. Consequently, determining the depth of every object in a scene permits construction of the scene from any viewpoint. The depth of an object can be geometrically calculated if two or more perspectives of the object are given, as in a collection of multiviews. First, the positions of the object in each of the available viewpoint images must be located; this problem is widely known as the correspondence problem. After locating the object positions from two views, the difference in image coordinates is termed disparity. It can then be shown mathematically that the depth is inversely proportional to the derived disparity [7].

Under the block-based motion characterization constraint of MPEG-2, one disparity vector corresponding to each block of an image can be incorporated for prediction of a second image. Block-based disparity compensation is the dominant approach for the coding of stereoscopic video sequences. These approaches are sufficient for coding and transmission of two stereo signals without conveying true depth details to the decoding system.
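The inverse relation between depth and disparity can be illustrated with a small sketch. It assumes the simplest case, a rectified parallel-axis stereo pair, for which $Z = f B / d$; the names `focal_px` (focal length in pixels) and `baseline` (camera separation) are illustrative parameters, not quantities defined in this paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline):
    """Depth Z = f * B / d for a rectified, parallel-axis stereo pair.

    Assumption: pinhole cameras with identical focal length (in pixels)
    separated by `baseline` along the horizontal axis.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_px * baseline / disparity_px


def block_depths(block_disparities, focal_px, baseline):
    """Turn one disparity per block into one depth per block."""
    return [depth_from_disparity(d, focal_px, baseline) for d in block_disparities]
```

Doubling the disparity halves the recovered depth, which is the inverse proportionality referred to above. A per-block depth of this kind is coarser than the per-pixel depth map that multiview prediction calls for.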
For acceptable predictions of multi-viewpoint images however, accurate depth information for every pixel is required for multiple spatially-continuous interpolations of intermediate viewpoint images. Thus a depth map is required covering every pixel, whose quantized values can be transmitted as the gray-level intensity of a second "image". Since the majority of the depth image is quite flat, the depth map can be considerably compressed. To obtain a dense depth map, many efficient techniques have been developed for stereo image pairs as well as for multiple views, including [7, 8, 9, 10].
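As a rough illustration of carrying depth as the gray-level intensity of a second image, the sketch below quantizes a depth map to 8-bit levels and maps it back. The uniform quantizer and the `z_near` / `z_far` clipping range are assumptions made for the example; the paper does not prescribe a particular quantization rule.

```python
def quantize_depth(depth_map, z_near, z_far, levels=256):
    """Map each depth to an integer gray level in [0, levels - 1].

    Assumption: uniform quantization over [z_near, z_far], with depths
    outside that range clipped to it.
    """
    scale = (levels - 1) / (z_far - z_near)
    gray = []
    for row in depth_map:
        gray.append([int(round((min(max(z, z_near), z_far) - z_near) * scale))
                     for z in row])
    return gray


def dequantize_depth(gray_map, z_near, z_far, levels=256):
    """Recover approximate depths from gray levels (decoder side)."""
    step = (z_far - z_near) / (levels - 1)
    return [[z_near + g * step for g in row] for row in gray_map]
```

For depths inside the clipping range, the round-trip error is at most half a quantization step. Large flat regions of the depth map become runs of identical gray levels, which is the property that makes the depth image highly compressible.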

4 MPEG-2 Compatible Multi-Viewpoint Video Coding

The delivery of two viewpoint video sequences is accomplished by the scalability extensions of the MPEG-2 video coding standard, where spatial predictions are obtained by disparity compensation on a macroblock basis. As mentioned in Section 3 however, for multiview video coding a dense depth map is required for accurate predictions of multiple views. Seeking compatibility with the compression standard, an ad hoc group of MPEG-2 has been established to investigate a new profile for multi-viewpoint systems [11]. For conformity, the multiview profile is to supplement the main profile of MPEG-2 so that all standardized systems are capable of deciphering at least one video sequence from one viewpoint. Consequently, the central viewpoint is selected to be processed and transmitted in accordance with the main profile. The remainder of this section describes the proposed encoding and decoding systems for the processing of the other viewpoints.

4.1 Multi-Viewpoint Encoding System

A block diagram of the proposed multiview encoding system is illustrated in Figure 2 for five multi-viewpoint sequences: $I_C$, $I_L$, $I_R$, $I_T$, and $I_B$. First, determine all viewpoint vectors, $V_C$, $V_L$, $V_R$, $V_T$, and $V_B$, from the acquisition camera configurations positioned in Figure 1. Find the central viewpoint image, in our case $I_C$, achieving the best predictions for the other views. The central image sequence $I_C$ is then processed in the main profile of MPEG-2. The encoded bitstream for $I_C$ is then ready for transmission, and in parallel is decoded to determine the receiver-reconstructed central images, denoted $\hat{I}_C$.

Starting with the central viewpoint image $I_C^t$ at time $t$, a depth value $z = D^t(x_C, y_C)$ is calculated for every pixel $I_C^t(x_C, y_C)$, thus forming a depth map image, named $D^t$, corresponding to the central viewpoint image. Afterwards, encode the depth map $D^t$ as a secondary, highly compressible image, and transmit the encoded bitstream for $D^t$. Simultaneously, the encoded bitstream for $D^t$ is decoded to determine the receiver-reconstructed depth map, denoted $\hat{D}^t$.
Obtain a 3D mesh model representation [12], called $M^t$, for the central image $\hat{I}_C^t$ of time $t$ by associating each coordinate pair $(x_C, y_C)$ with its corresponding depth value $\hat{D}^t(x_C, y_C)$. Consequently, the 3D surface texture is derived from the 2D image intensity $I_C^t(x_C, y_C)$, and a graphical model of the scene is obtained. This geometric representation offers ease in interpolating different viewpoints. Similar to rendering approaches in computer graphics [13, 14], given a 3D geometrical model and its associated texture intensity, every viewpoint can be effectively and efficiently constructed by simple geometric transformations followed by 3D texture mapping. Furthermore, interpolating virtual viewpoint images is facilitated and feasible.

Select a non-central viewpoint, designated as $V_X$, chosen from the set of original real viewpoints, for which the best construction of its image $I_X^t$ is desired for time $t$. The selection of viewpoint $V_X$ is based on a round-robin schedule where every viewpoint is successively selected in a cycle. In Figure 3, a round-robin rotational scheme for four non-central views is shown where the selection of each image is related to time in the following manner: $I_L^{t}$, $I_R^{t+1}$, $I_T^{t+2}$, $I_B^{t+3}$, $I_L^{t+4}$, $I_R^{t+5}$, $I_T^{t+6}$, etc., distinguished by a double border. Then transmit the selected viewpoint vector $V_X$ for time $t$.

The next step is to interpolate the selected non-central viewpoint image, referred to as the predicted image $PI_X^t$, by rendering the 3D mesh model $M^t$ in the specified viewpoint $V_X$. This interpolation step requires rendering a wireframe image of the desired viewpoint by simple geometric transformations with perspective projections of the mesh model, followed by texture mapping the corresponding areas of the central image onto the appropriate wireframe substructures. Subsequently, calculate the prediction errors $PE^t$ required for the final reconstruction of the selected view, by examining the difference between the original image $I_X^t$ and the predicted image $PI_X^t$. Finally, encode and transmit the residual prediction errors $PE^t$.

Before moving on to the next frame, determine if the allocated bit rate allows transmission of an additional non-central image. If bandwidth allows, the prediction errors for another view can be determined and sent. Alternatively, a new round-robin schedule can be adopted where two viewpoints are selected for every time $t$; in the limit of unlimited bandwidth, every viewpoint can be selected at every time and perfect reconstruction is always achievable. Subsequently, the entire process starts all over for the next frame.

4.2 Multi-Viewpoint Decoding System

A block diagram of the proposed multiview decoding system is illustrated in Figure 4 for construction of any viewpoint images.
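The encoder and the decoder share the same view interpolation step, so its geometric core may be worth sketching before the decoder steps. The example below shows where a single central-view pixel with known depth lands in another viewpoint, and how a prediction error is formed and added back. It is only an illustration under stated assumptions: a pinhole camera with a hypothetical focal length `focal` (in pixels), zoom $va$ modelled as a focal-length scale, and an assumed rotation order and sign convention for the pan and tilt angles. Mesh construction and texture mapping are not shown.

```python
import math


def project_to_view(x, y, z, view, focal=1000.0):
    """Project central-view pixel (x, y) with depth z into another viewpoint.

    `view` is a viewpoint vector [vx, vy, vz, va, vb, vc] as in Section 2.
    Pixel coordinates are measured from the image centre. The camera model
    (pinhole, zoom as focal scale, pan then tilt) is an assumption.
    """
    vx, vy, vz, va, vb, vc = view
    # Back-project the pixel to a 3D point in the global (central) frame.
    px, py, pz = x * z / focal, y * z / focal, z
    # Express the point relative to the camera position.
    px, py, pz = px - vx, py - vy, pz - vz
    # Rotate by the pan angle about the vertical axis (sign convention assumed).
    cb, sb = math.cos(vb), math.sin(vb)
    px, pz = cb * px - sb * pz, sb * px + cb * pz
    # Rotate by the tilt angle about the horizontal axis (sign convention assumed).
    cc, sc = math.cos(vc), math.sin(vc)
    py, pz = cc * py - sc * pz, sc * py + cc * pz
    if pz <= 0:
        raise ValueError("point is behind the camera")
    f = focal * va  # zoom modelled as a focal-length scale
    return f * px / pz, f * py / pz


def prediction_error(original, predicted):
    """PE = original image minus predicted image, pixel by pixel."""
    return [[o - p for o, p in zip(row_o, row_p)]
            for row_o, row_p in zip(original, predicted)]


def reconstruct(predicted, error):
    """Reconstruction = predicted image plus decoded prediction error."""
    return [[p + e for p, e in zip(row_p, row_e)]
            for row_p, row_e in zip(predicted, error)]
```

With the central viewpoint vector $[0, 0, 0, 1, 0, 0]$ every pixel maps onto itself, and a pure sideways translation shifts nearer points further than distant ones, which is the disparity behaviour described in Section 3.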
First, the received bitstream of the central view is decoded on the main profile of MPEG-2, whose decoded images, denoted $\hat{I}_C$, are stored in memory. Second, the bitstream of the depth map is decoded and hereafter referred to as $\hat{D}$. Next, obtain a 3D mesh model, called $M^t$, for the central image $\hat{I}_C^t$ as performed in the encoding system. Consequently, store the model $M^t$ for time $t$ in memory. Upon receiving the selected viewpoint vector $V_X$ for time $t$, store the vector in memory. Then the prediction errors associated with the selected viewpoint $V_X$ are decoded, denoted as $\widehat{PE}$.

Now, determine the viewpoint requested by the user at the decoding system for time $t$, denoted $V_U$. Subsequently, the user-requested viewpoint image (whether real or virtual) is interpolated, referred to as the predicted image $PI_U^t$, where the index $U$ designates the appropriate viewpoint $V_U$. The prediction $PI_U^t$ is constructed by rendering the 3D mesh model $M^t$ in the specified viewpoint $V_U$ by the same viewpoint image interpolation procedure as in the encoder.

Determine if the encoder-selected non-central viewpoint $V_X$ is the same as the user-requested viewpoint $V_U$. If so, $V_U = V_X$, then the user can immediately construct the desired image. The final reconstructed image $\hat{I}_X^t$ is obtained by combining the prediction errors $\widehat{PE}^t$ with the predicted image $PI_X^t$. This kind of reconstruction is hereafter defined as Type I Prediction. The reconstructed non-central image $\hat{I}_X^t$ is then stored in memory for later reference. Having satisfied the user's request for this viewpoint image, the whole process is repeated for the next time frame.

On the other hand, if the user did not request the same viewpoint as the one selected by the encoder, $V_U \neq V_X$, the following steps are carried out to generate such a view. Given the user-requested viewpoint $V_U$, retrieve from memory the image $\hat{I}_U^{t-f}$ corresponding to the nearest past image $I_U$ reconstructed by a Type I Prediction. Similarly, retrieve the image $\hat{I}_U^{t+b}$ corresponding to the nearest future image $I_U$ reconstructed by a Type I Prediction. For example, referring to Figure 3, if the user requests viewpoint $V_R$ at time $t+2$, then the front image is $I_R^{t+1}$ ($f = 1$) and the back image is $I_R^{t+5}$ ($b = 3$). Due to the round-robin schedule for one non-central viewpoint selection during each time frame, the maximum value of $f$ and $b$ is limited by $N - 2$, where $N$ is the number of viewpoints. Note that a delay may be required in order to access the future image.

Then load the 3D mesh model $M^{t-f}$ created at time $t - f$ from $\hat{I}_C^{t-f}$ and $\hat{D}^{t-f}$. In parallel, load the 3D mesh model $M^{t+b}$ created at time $t + b$ from $\hat{I}_C^{t+b}$ and $\hat{D}^{t+b}$. Next, generate the front mesh image $MI^{t-f}$ by locating a grid of fixed points on the image $\hat{I}_U^{t-f}$.
Also, generate the back mesh image $MI^{t+b}$ by locating a corresponding grid of fixed points on the image $\hat{I}_U^{t+b}$. Afterwards, generate the intermediate mesh-predicted image $MPI^t$ for time $t$ by digital image warping between the front mesh image $MI^{t-f}$ and the back mesh image $MI^{t+b}$. The image warping technique and the generation of mesh images by locating appropriate fixed points are well explained in [15]. Finally, a construction for $\hat{I}_U^t$ is obtained by combining the intermediate mesh-predicted image $MPI^t$ with the predicted image $PI_U^t$. The method for this final combination is left to the decoding system, using any image fusion technique [16, 17], e.g., XOR operator, average, etc. Construction of this final image $\hat{I}_U^t$ is termed Type II Prediction.

At this point, if other viewpoint images are requested by the users, the prediction steps may be repeated to generate other multiviews. Otherwise the decoding process starts at the beginning to decode the next time frame.
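The decoder's choice between the two prediction types, and the offsets $f$ and $b$ to the nearest Type I frames, follow directly from the round-robin schedule. The sketch below assumes the four-view schedule of Figure 3 with time counted from 0 at the first Left selection. The function names are illustrative, only real (scheduled) viewpoints are handled, and the start of the stream, where no past frame exists yet, is ignored. Averaging is used for the final fusion because it is one of the options mentioned above.

```python
VIEWS = ("L", "R", "T", "B")  # round-robin order assumed from Figure 3


def selected_view(t, views=VIEWS):
    """Non-central viewpoint selected by the encoder at time t."""
    return views[t % len(views)]


def prediction_type(t, requested, views=VIEWS):
    """Type I if the encoder selected the requested view at time t, else Type II."""
    return "Type I" if selected_view(t, views) == requested else "Type II"


def front_back_offsets(t, requested, views=VIEWS):
    """Return (f, b): distances to the nearest past and future Type I frames.

    Returns (0, 0) when the requested view is itself selected at time t.
    """
    period = len(views)
    f = (t - views.index(requested)) % period
    if f == 0:
        return 0, 0
    return f, period - f


def fuse_average(mpi, pi):
    """Combine the mesh-predicted image MPI and the predicted image PI by averaging."""
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mpi, pi)]
```

Requesting the Right view at time 2 gives `front_back_offsets(2, "R") == (1, 3)`, matching the example above. With $N = 5$ viewpoints the schedule has period $N - 1 = 4$, so neither offset exceeds $N - 2 = 3$.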

5 Conclusions

The proposed MPEG-2 compatible video coding specification for multi-viewpoint systems offers many advantages. In addition to providing a structural framework for the new multiview profile of the MPEG-2 coding standard, this codec retains the same encoding and decoding processing for one basic viewpoint sequence based on the main profile. Furthermore, because the standard only defines the bit-stream syntax and the decoding process, there is freedom in designing very high quality encoders and very low-cost decoders. Similarly, the presented specification offers the same flexibility, providing for creativity and enhancements in accordance with advanced technologies.

Our novel contribution to multiview coding research is mainly due to the concept of combining 2D image processing with 3D computer graphics. Many existing computer graphics tools and animation facilities, currently available in hardware, offer speedy rendering capabilities. Thus interpolation of any viewpoint image sequence can be generated quickly and efficiently. Also, construction of non-transmitted virtual viewpoints is possible due to the availability of 3D models.

In rendering an image of a 3D surface structure, the textured image data are mapped onto corresponding three-dimensional geometric meshed surfaces. Consequently, texture mapping automatically incorporates the foreshortening effect of 3D surface curvatures. In addition, since the texture is a 3D function of its position in the object, tracking of image points is facilitated by knowing the real-world 3D coordinates of each point. Furthermore, additional enhancements on the constructed images may be obtained by incorporating other graphical tools, e.g., illumination and shading. Thus, innovative ideas can be freely added to our system as technologies progress.

References

[1] "MPEG Draft International Standard. ISO/IEC 13818-2: Generic Coding of Moving Pictures and Associated Audio: Video," 1995.
[2] B. L. Tseng and D. Anastassiou, "Compatible video coding of stereoscopic sequences using MPEG-2's scalability and interlaced structure," in Int'l Workshop on HDTV '94, (Torino, Italy), Oct. 1994.
[3] A. Puri, R. Kollarits, and B. Haskell, "Stereoscopic video compression using temporal scalability," in Proceedings of SPIE Visual Communications and Image Processing '95, (Taipei, Taiwan), May 1995.
[4] T.-H. Chiang and Y.-Q. Zhang, "Stereoscopic video coding," in Symposium on Multimedia Communications and Video Coding, (New York City), to appear in Oct. 1995.
[5] B. L. Tseng and D. Anastassiou, "Perceptual adaptive quantization of stereoscopic video coding using MPEG-2's temporal scalability structure," in International Workshop on Stereoscopic and Three Dimensional Imaging IWS3DI '95, (Santorini, Greece), Sept. 1995.
[6] W. Burger and B. Bhanu, "Estimating 3-D egomotion from perspective image sequences," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 1040-1058, Nov. 1990.
[7] B. K. P. Horn, Robot Vision. Cambridge, Massachusetts: The MIT Press, 1986.
[8] S. Barnard and M. Fischler, "Computational stereo," ACM Computing Surveys, vol. 14, pp. 553-572, Dec. 1982.
[9] S. Mitra, "Vision models for 3D surfaces," SPIE Vol. 1826 Intelligent Robots and Computer Vision XI, pp. 182-188, 1992.
[10] B. L. Tseng and D. Anastassiou, "A theoretical study on an accurate reconstruction of multiview images based on the Viterbi algorithm," in International Conference on Image Processing ICIP '95, (Washington, D.C.), Oct. 1995.
[11] T. Homma, "MPEG Contribution 95/N0861: Report of the Ad Hoc Group on MPEG-2 Applications for Multi-Viewpoint Pictures," Mar. 1995.
[12] G. Bozdagi, A. M. Tekalp, and L. Onural, "3-D motion estimation and wireframe adaptation including photometric effects for model-based coding of facial image sequences," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, pp. 246-256, June 1994.
[13] J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes, Computer Graphics: Principles and Practice. Reading, Massachusetts: Addison Wesley Publishing Company, second ed., 1990.
[14] J. Neider, T. Davis, and M. Woo, OpenGL Programming Guide. Addison-Wesley Publishing, 1993.
[15] G. Wolberg, Digital Image Warping. Los Alamitos, California: IEEE Computer Society Press, 1990.
[16] C.-P. Yeh, "Depth perception based on fusion of stereo images," SPIE Vol. 1778 Imaging Technologies and Applications, pp. 221-226, 1992.
[17] Y. T. Zhou, "Multi-sensor image fusion," in International Conference on Image Processing, (Austin, Texas), Nov. 1994.

[Figure 1 layout: $I_T$ (Top) above, $I_L$ (Left), $I_C$ (Center), $I_R$ (Right) in a row, $I_B$ (Bottom) below.]

Figure 1: Camera Configuration of Multiple Viewpoint Images

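[Figure 2 block diagram: $I_C$ is encoded and locally decoded; the depth $D$ is calculated from $I_C$, encoded, and locally decoded; the decoded image and depth are used to build and accumulate the 3D structure and texture model $M$; the viewpoint $V_X$ is determined and used to interpolate the predicted image $PI_X$; the original $I_X$ (one of $I_L$, $I_R$, $I_T$, $I_B$, according to $V_X$) and $PI_X$ give the prediction error $PE$, which is encoded. Outputs of the encoding system: bitstream for $I_C$, bitstream for $D$, selected viewpoint $V_X$, and prediction error $PE$.]

Figure 2: Block Diagram of Multiview Encoding System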

[Figure 3 grid: columns are the images $I_C$, $I_L$, $I_R$, $I_T$, $I_B$ from viewpoints $V_C$, $V_L$, $V_R$, $V_T$, $V_B$; rows are the times $t$ through $t+5$; a legend distinguishes Type I and Type II images.]

Figure 3: A Round-Robin Rotational Schedule for Encoder Selection of One Viewpoint

[Figure 4 block diagram: the bitstreams for $I_C$ and $D$ are decoded and used to build and accumulate the 3D structure and texture model $M$, which is kept in the frame store memory together with $V_X$ and the reconstructed images; the user-requested viewpoint $V_U$ drives the interpolation of the predicted image $PI_U$; if $V_U = V_X$, the decoded prediction error $PE$ is added to give $\hat{I}_X$; otherwise mesh images $MI^{t-f}$ and $MI^{t+b}$ are generated by locating fixed points, warped between to give $MPI$, and combined with $PI_U$ to give $\hat{I}_U$.]

Figure 4: Block Diagram of Multiview Decoding System