Image Based Interactive Rendering with View Dependent Geometry


EUROGRAPHICS 2003 / P. Brunet and D. Fellner (Guest Editors), Volume 22 (2003), Number 3

Image Based Interactive Rendering with View Dependent Geometry

J.-F. Evers-Senne and R. Koch
Institute of Computer Science and Applied Mathematics, Christian-Albrechts-University of Kiel, Germany

Abstract

In this paper we present a novel approach for interactive rendering of virtual views from real image sequences. Combining the concepts of light fields, depth-compensated image warping and view dependent texture mapping, this plenoptic modeling approach can handle large and complex scenes. A portable, hand-held multi-camera system has been developed that allows multiple image streams to be recorded by simply walking around the scene. These image streams are automatically calibrated, and depth maps for all views are generated as input to the rendering stage. For rendering, a view dependent warping surface is constructed on the fly and depth-compensated image interpolation is applied with view-dependent texture mapping. Rendering quality is scalable to allow fast preview and to achieve high-end quality with the same approach. The system can handle large and geometrically complex scenes with hundreds of real images at interactive rates.

Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Viewing algorithms; I.4.1 [Image Processing and Computer Vision]: Digitization and Image Capture; I.4.8 [Image Processing and Computer Vision]: Scene Analysis

1. Introduction

One of the major goals in computer graphics is to display virtual worlds similar to real ones. For complex scenes, however, it is often not feasible to create them by hand with 3D construction tools. Even worse, such models are most often recognized as synthetic after just a few seconds due to the lack of realistic surface appearance. One well known approach to visualizing complex scenes is Image-Based Rendering (IBR). The idea behind it is to capture the appearance of a real scene with images and to use this material to generate and display new virtual views of the scene.

Modern CCD cameras allow fast and efficient capturing of the visual components of a scene (color, light), while the geometrical components are more difficult to obtain. The same is true for image-based rendering. In most cases it is obvious how to display the images, but geometrical information is needed for a correct synthesis of novel views. View-dependent local geometry information in the form of depth maps can be computed from image sequences by either range data scanners or stereoscopic image analysis algorithms. However, due to incorrect camera calibration, difficult lighting conditions and non-static scenes, it is often not possible to generate one globally consistent 3D model from hundreds of images and depth maps automatically.

In this work we present a rendering system which generates view-dependent local geometry on the fly from multiple depth maps. For each new view, the depth maps of the surrounding real views are fused in a scalable fashion to obtain a locally consistent 3D model. This geometrical representation is based on triangles and can then be textured with the images corresponding to the depth maps using hardware-accelerated techniques.

The first step in image-based rendering is the acquisition of images of the real scene from many different view points. Here we want to be able to scan the scene by simply walking around the area of interest and to automatically calibrate the cameras from the image data alone. To meet these requirements we have developed a flexible and mobile capturing system for efficient multi-view recording in indoor and outdoor environments, using standard laptops and four battery-powered synchronised cameras mounted on a rig. The synchronisation, in conjunction with the rigid coupling of the cameras, supports the calibration even for non-rigid and geometrically very complex scenes with occlusions.

In an offline modeling step, a set of dense depth maps is then computed from multi-viewpoint stereo analysis. These depth maps are then used as input to the proposed online rendering system for novel view synthesis.

In the next section an overview of related work is given to help classify this paper. In section 3 the multi-camera system and the necessary preprocessing of the images are presented. Then, in section 4, the online rendering system based on images and depth maps is introduced. Finally, in section 5, some results and conclusions are given.

2. Previous Work and Motivation

Image-based rendering is closely connected to the plenoptic function introduced by McMillan and Bishop in [12]. This function defines all radiance emitted from one point into every direction; for a dynamic scene the dimension of the plenoptic function is 7. Levoy and Hanrahan proposed in [9] an IBR system called the Light Field which interpolates new views using a 4D representation of the plenoptic function (for a static surface). To approximate the plenoptic function, a very dense mesh of images from cameras lying in a regularly sampled viewpoint plane is used. Less dense sampling of the viewing space results in visual artifacts when interpolating between views. The Lumigraph introduced by Gortler et al. in [4] uses a convex 3D shape approximation for depth-compensated interpolation. They also suggested an approach to allow the usage of a hand-held camera and used rebinning to map the original images. However, this intermediate step of interpolation reduces the quality of the images.

View-dependent texture mapping (VDTM) is an alternative way of rendering visual effects from different views. In [2], Debevec et al. describe a real-time VDTM algorithm which uses hardware-accelerated projective texture mapping. For VDTM a consistent 3D model is required, which is not always easy to obtain. In 1999, Heigl et al. presented in [5] a plenoptic modeling approach based on the images from a hand-held camera. They used depth maps as a local representation of the scene geometry and corrected the interpolation of each ray by using this information. Looking up the color for each ray in the three surrounding cameras is similar to VDTM with three blended textures. Buehler et al. in [1] proposed their unstructured lumigraph rendering, which is a hybrid design between VDTM and light field rendering. Unlike VDTM, they do not rely on a high-quality geometric model, but they do need a geometrical approximation of the scene. Pulli et al. in [15] also described view-based rendering as a method between purely model-based and purely image-based methods. Each real view consists of a colored range image; several partial models are then built and blended together.

All the mentioned rendering techniques share one issue: they need approximate geometry information. To create 3D models from range images, Pulli proposed a volume-based approach in [14], but for more complex scenes this can be very hard or even impossible. Point-based rendering systems like the one described in [17] are also useful for rendering from depth maps and images. But due to holes in the depth maps, it is often necessary to fill holes in the local geometry. This problem is better solved by using interpolating surfaces as rendering primitives instead of points.

The other open issue common to most IBR systems is the acquisition of images. In the beginning, static grids with many fully calibrated cameras were used. Alternatively, motion control systems scan the viewing space with one single camera. This results in very dense sampling, which is a good start for rendering, but which also results in large amounts of data. Levoy and Hanrahan first suggested a method for light field compression; since then this problem has been the focus of many publications [10, 16].
Gortler et al. [4] started using weakly calibrated hand-held cameras and used known markers for pose estimation. At last year's SIGGRAPH, Matusik et al. presented a system for Image-Based 3D Photography in [11]. They use several calibrated cameras, a turntable and rotating light sources, as well as known background images displayed on a large plasma screen. With this system they are able to capture the appearance of very detailed objects including specular reflections and fuzzy material. All these methods do not scale well with the size and complexity of the scene, and they are often specialized to sample single objects in controlled environments. Koch and Pollefeys, as in [8] and [13], used image sequences from uncalibrated hand-held cameras and Structure From Motion (SFM) algorithms. This approach scales well, despite the fact that scanning a large viewing volume with one single camera is time-consuming even at video frame rate. In addition, specular reflections, changing lighting conditions (clouds), unsteady movement of the camera and dynamic scenes may cause the SFM algorithms to fail.

The rendering method proposed in this paper is similar to the mesh creation of Heigl et al. [5], but the construction of the underlying model is different. The resulting mesh is textured in real time using VDTM similar to Debevec's approach, but because the geometry information of each surface area may result from up to three cameras, the texture is also chosen from these cameras.

3. Image Acquisition and Offline Geometric Modeling

In this section we describe the image acquisition and preprocessing modules. These modules operate offline and must be performed only once for each acquired scene. The steps are: image capture, camera calibration, multi-view depth estimation, and preparation of depth samples for real-time geometric modeling.

3.1. Multi-Camera Image Capture

For fast and efficient scene acquisition we introduce a hand-held multi-camera system which is scalable for different scenes. In outdoor environments, mobility is provided by using standard laptop computers; their limited performance results in a reduced frame rate. In a studio environment, on the other hand, the capturing system benefits from the high performance of modern PC systems with higher frame rates.

Figure 1: Multi-camera rig with synchronized image capture.

A prototype system consisting of four digital FireWire (IEEE 1394) cameras and two laptops in a rack, which can be carried by the operator, has been built. The cameras are mounted on a pole with adjustable position and orientation of each camera, so any desired configuration of the camera poses on the pole can be used. The sensor can easily be moved by hand without a tripod, in particular allowing horizontal, vertical and rotational motion to scan the scene by simply walking by. The cameras are time-synchronized so that non-static scenes can be sampled by obtaining "time-sliced", synchronized shots of multiple cameras in which all movements in the scene are frozen for the time slice. These frozen multiple views of the scene help to obtain 3D information and to perform plenoptic rendering of non-cooperative scenes, because limited motion in the scene can be tolerated.

Different scenes may require a different viewing space. By adjusting the cameras on the pole, or by using a pole of different length, the viewing space can be adapted to fit the scene. Even more cameras can be added to enhance the quality of the sample mesh. Hence, a 2D scanning of the viewpoint surface is obtained with a single walk-by. To handle data from four cameras with a mobile system, we decided to use two laptops, each connected to two cameras. The laptops are synchronized via standard Ethernet with a scalable network protocol [3] which serves two purposes: the distribution of parameters (e.g. exposure time) to all cameras, and capture synchronization in time. Using double buffering and three threads per node, the resulting frame rate for color images of size 1024x768 with four cameras and two computers is four fps. This is sufficient for dense view space sampling by a slowly walking operator. In a controlled studio environment, the synchronization protocol allows an arbitrary number of computers to be connected, and the cameras can operate at full frame rate.

3.2. Camera Tracking and Calibration

The images are acquired with a hand-held multi-camera system under arbitrary camera motion. Since we want to avoid putting markers into the scene, we have no control over the scene content. Therefore, all camera views are uncalibrated, and the calibration and the camera track must be estimated from the image sequence itself. Furthermore, the scene is most likely uncooperative, meaning that complex geometry, massive occlusion and moving objects may hinder the calibration. The only control we have is the specific configuration of our camera setup and, to a limited extent, some knowledge about the performed camera motion. Since all cameras operate synchronously, we know that, for one instant, all cameras view a time-frozen static scene. We also know that from one recording timestep to the next, the relative configuration between the cameras did not change; hence we can predict possible object positions in the images of the next timestep. Thus the relative orientation (the fundamental geometry between the images) is used to stabilize tracking. For tracking and calibration we have extended the SFM approach of Pollefeys and Koch [13, 8] to a multi-camera configuration. Some details can also be found in [6].
In this approach, the intrinsic and extrinsic camera parameters for each image are estimated. The standard description of a projective camera consists of two 3x3 matrices K and R and the 3-dimensional vector C. K contains the intrinsic camera parameters (focal length, aspect ratio and image center), R describes the rotation of the camera in space, and C is the translation vector of the camera center. The projection matrix M projects the homogeneous 3D point P onto a 2D image at image point p with

    z p = M P,

where z is the projective depth. M is defined as the following projective 3x4 matrix:

    M = K R^T [ I | -C ].

The SFM approach automatically tracks salient 3D features (intensity corners) throughout the images. The calibration results in a projection matrix for each camera of the sequence and a sparse point cloud of the tracked 3D feature points. The approach has been tested extensively with a wide variety of camera configurations and scenes and has proven to be very successful.

The camera tracking and calibration may suffer from projective skew and from error accumulation in the case of very long image sequences. Recording large scenes means that most images do not share a common region of the scene. This affects the quality of tracking and matching of features. Therefore we cannot guarantee a globally consistent scene model. These problems are greatly reduced with the multi-camera rig, since we can exploit the rigidity constraints of the camera configuration.
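To make the camera model concrete, the following minimal C++ sketch (not the authors' code; it assumes the Eigen linear algebra library as a stand-in for whatever the original system used) composes M from K, R and C as defined above and projects a Euclidean 3D point:

```cpp
// Minimal sketch of the projective camera model above, assuming Eigen.
// M = K * R^T * [I | -C]; projecting P yields z * (u, v, 1)^T = M * P.
#include <Eigen/Dense>

using Mat34 = Eigen::Matrix<double, 3, 4>;

// Compose the 3x4 projection matrix from intrinsics K, rotation R and
// camera center C, following the decomposition given in the text.
Mat34 projectionMatrix(const Eigen::Matrix3d& K,
                       const Eigen::Matrix3d& R,
                       const Eigen::Vector3d& C) {
    Mat34 M;
    M.leftCols<3>()  = K * R.transpose();
    M.rightCols<1>() = -K * R.transpose() * C;
    return M;
}

// Project a Euclidean 3D point P; returns the 2D image point p and the
// projective depth z.
Eigen::Vector2d project(const Mat34& M, const Eigen::Vector3d& P, double& z) {
    Eigen::Vector3d hp = M * P.homogeneous();  // homogeneous image point
    z = hp.z();                                // projective depth
    return hp.hnormalized();                   // (u, v) in pixels
}
```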

The problem is even less important in the case of image-based rendering. It is sufficient to render from local geometry that is supported by a local neighborhood only, because during rendering only parts of the scene are visible at once. Depth maps are appropriate local scene descriptions that may be used for this purpose.

3.3. Multi-Viewpoint Depth Estimation

With the calibrated image sequence at hand, one can obtain dense depth maps from multi-viewpoint disparity estimation. From the calibration, the epipolar geometry between pairs of images is known and can be used to restrict the correspondence search to a linear search. We extend the method of Koch et al. [7] for multi-viewpoint depth estimation to the multi-camera configuration. This method is ideally suited since we can exploit the 2D grid of linked depth maps for all cameras of the rig. This results in very dense depth maps of the local scene geometry. Only in very homogeneous image regions might it not be possible to extract sufficient scene depth; these regions are interpolated from neighboring depth values. Using more than one pair of cameras allows us to fill holes caused by occlusions and to enhance the precision of the depth maps.
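The restriction to a linear search can be illustrated with a hedged C++ sketch (Eigen assumed; the fundamental matrix F between an image pair follows from the calibration, and the similarity score is a placeholder for the window-based matching actually used):

```cpp
// Sketch of a correspondence search restricted to an epipolar line.
// For a pixel x = (u0, v0) in one image, l = F * x is the epipolar line
// a*u + b*v + c = 0 in the other image; matches are searched only there.
#include <Eigen/Dense>
#include <cmath>
#include <functional>

Eigen::Vector2d searchAlongEpipolarLine(
    const Eigen::Matrix3d& F, double u0, double v0, int width, int height,
    const std::function<double(int, int)>& similarity) {  // e.g. correlation
    Eigen::Vector3d l = F * Eigen::Vector3d(u0, v0, 1.0);
    double bestScore = -1.0;
    Eigen::Vector2d bestMatch(-1.0, -1.0);
    for (int u = 0; u < width; ++u) {
        if (std::abs(l.y()) < 1e-12) continue;    // (near-)vertical line
        int v = int(std::lround(-(l.x() * u + l.z()) / l.y()));
        if (v < 0 || v >= height) continue;       // outside the image
        double s = similarity(u, v);
        if (s > bestScore) {
            bestScore = s;
            bestMatch = Eigen::Vector2d(u, v);
        }
    }
    return bestMatch;
}
```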
To give an idea about the quality of the estimated calibration and the depth maps, a complex real scene was evaluated. 240 images were taken with a four-camera rig at the Natural History Museum, London, featuring a large dinosaur skeleton in the main entrance hall. The rig was moved alongside the dinosaur, looking at the skeleton and the back hall region. One may judge the high quality of the camera calibration and the density of the depth map even with this highly complex scene. In figure 2, left, we see an overview picture of the museum hall with the skeleton. To the right, the tracked camera positions (pyramids) and the 3D feature cloud with tracked points of the dinosaur and the hall back area are displayed. The dinosaur shape shows that even a globally consistent reconstruction of the scene was possible. Figure 3 shows one of the original camera images and the corresponding depth map. The density and resolution of the depth map are very detailed. The camera is rather near to the skeleton and the depth of the scene is very large, causing displacements of up to 80 pixels between adjacent images. Therefore, image interpolation alone will not suffice to render novel views of the scene, and depth compensation is necessary. The scene contains a lot of occluded regions (around the ribs), and one can see that in some of those regions no depth could be estimated. These regions are colored black.

Figure 2: 4-camera acquisition and calibration of the dinosaur scene. Left: overview of the scene to be captured. Right: SFM calibration with camera positions (pyramids) and 3D feature points of dinosaur and hall background (snapshot from the 3D scene model).

Figure 3: Left: one of the acquired original camera images. Right: corresponding depth map of the camera view (near = dark, far = light, undefined = black).

3.4. Sampling the Depth Maps

The depth maps serve as geometry input for the view-dependent online modeling that will be described in the next section. Using the depth maps directly at full resolution is not feasible due to the vast amount of data, and we do not need such high resolution geometry for view interpolation. Therefore, each depth map is subsampled on a regular grid with a spacing that is parametrized such that it can easily be adapted to specific needs. This grid is located in the image plane of the real camera. At each grid point, a 2D median filter is applied to reduce the effects of outliers and to find the most probable depth. The filtered depth value corresponds to the distance between the point in the scene and the camera center. In other words, it is the length of a ray originating from the camera center through the grid point in the image to the point in the scene which caused the image point. Taking the j-th 2D point p_j^i in the image plane of camera i and the corresponding distance z_j^i from the depth map, the Euclidean 3D scene point P_j^i can be calculated as:

    P_j^i = z_j^i (K_i R_i^T)^{-1} p_j^i + C_i.

The 3D point P_j^i of the 2D point p_j^i from camera i is called sample j of camera i. For each grid point in each camera, one sample is created. In untextured regions of the image it is often impossible to determine the depth. This results in holes in the depth map; samples in these regions are discarded. Later on, in the rendering stage, the geometry in these regions is constructed from the samples of other cameras if possible. The valid samples serve as geometric approximations that are used to define the view-dependent interpolation surface.
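A one-function C++ sketch of this back-projection (again assuming Eigen; z is the median-filtered value taken from the depth map of camera i as defined above):

```cpp
// Sketch: invert z * p = K * R^T * (P - C) to recover the Euclidean scene
// point for one depth sample, following P_j = z_j * (K R^T)^(-1) p_j + C.
#include <Eigen/Dense>

Eigen::Vector3d backProjectSample(const Eigen::Matrix3d& K,
                                  const Eigen::Matrix3d& R,
                                  const Eigen::Vector3d& C,
                                  const Eigen::Vector2d& p,  // grid point (pixels)
                                  double z) {                // filtered depth
    return z * (K * R.transpose()).inverse() * p.homogeneous() + C;
}
```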

For a better trade-off between performance and quality, a level-of-detail (LOD) scheme is introduced at this point. When creating the samples, the desired number of levels L_max can be chosen. After their generation, all samples belong to level zero (L_0) by default. To generate level L_{k+1}, every second sample in both directions from level L_k is moved to level L_{k+1}. Thus the number of samples in level L_{k+1} is 1/4 of the previous number of samples in level L_k, which is in turn reduced by that quarter. Filtering the samples before subsampling is not required, due to the median filter which was applied when generating the samples. While rendering with a specific LOD n, the samples of several levels are used in combination. Rendering with the coarsest level, only the samples of L_max are used. For the next level, L_{max-1}, the samples of L_max and of L_{max-1} are used. In general, if level L_n is requested, all levels L_x with x >= n are used.

4. Image-Based Interactive Rendering

The calibrated views and the preprocessed depth maps are used as input to the image-based interactive rendering engine. The user controls a virtual camera which views the scene from novel viewpoints. The novel view is interpolated from the set of real calibrated camera images and their associated depth maps. During rendering it must be decided which camera images are best suited to interpolate the novel view, how to compensate for depth changes, and how to blend the texture from the different images. For large and complex scenes, hundreds or even thousands of images have to be processed. All these operations must be performed at interactive frame rates of 10 fps or more. We address these issues in the following sections: selection of the best real camera views, fusion of multi-view geometry from the views, viewpoint-adaptive mesh generation, and viewpoint-adaptive texture blending.

4.1. Camera Ranking and Selection

For each novel view to render, it has to be decided which real cameras to use. Several criteria are relevant for this decision, and we have developed a ranking criterion for ordering the real cameras. In the following, N real cameras C_i, 0 <= i < N, are compared to the virtual camera C_v.

Distance: The first criterion concerns camera proximity. Taking the viewing direction A_v and the center C_v of the virtual camera, the orthogonal distance d_i of each real camera center C_i to this ray can be determined:

    d_i = ||(C_i - C_v) x A_v|| / ||A_v||.

For better evaluation, all distances d_i are normalized with the maximum distance d_max of all cameras. d_max is the maximum of all d_i for the current C_v:

    d_max = max_{0 <= i < N} d_i.

Viewing angle: The second criterion is the angle between A_v and the viewing direction A_i of camera i. Cameras looking into directions different from the virtual camera are penalized because they are less useful for view interpolation. The angular penalty a_i for camera i is defined as:

    a_i = arccos(A_i . A_v) / a_max.

For normalization, a maximum threshold angle a_max equal to the field of view is given. Cameras which are more than their field of view off axis do not share a common viewing range and cannot be used for proper geometry or texture interpolation; these are marked invalid.

Figure 4: Criteria evaluation. The ranking criteria are shown for two real cameras C_1 and C_2. C_v sees 3 samples from C_1 and 4 samples from C_2.

Visibility: The third criterion is a more complex one; we call it the visibility. It evaluates the scene volume that a given real camera covers, as seen in the context of the virtual camera. For this purpose, very few (n_max = 20) regular samples of the depth maps as described above are chosen such that the whole image of the real camera is covered.
These samples are then projected into the virtual camera and checked for visibility. The number of visible samples n_i divided by the number of possible samples n_max gives a rough approximation of the region covered by a real camera. Cameras for which this ratio is zero are marked invalid. To convert this value into a visibility penalty v_i, it is subtracted from one:

    v_i = 1 - n_i / n_max.

Figure 4 sketches the different selection criteria. Two real cameras, C_1 and C_2, are evaluated with respect to the virtual camera C_v. Distance and viewing angle are given in the figure. For visibility, the depth samples from camera 1 (circles) and camera 2 (crosses) are projected into the virtual camera and evaluated.

All three criteria are weighted and combined into one scalar value q_i which represents the inverse quality of the real camera for generating the new view:

    q_i = w_d d_i + w_a a_i + w_v v_i.

After calculating q_i for each camera, the list of valid cameras is sorted in ascending order. The interpolation mode finally decides how many of the best suited cameras are selected for view interpolation.
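A compact C++ sketch of the three penalties and their combination (assuming Eigen, unit-length viewing directions, and caller-chosen weights; cameras failing the angle or visibility validity tests would be filtered out before sorting):

```cpp
// Sketch of the camera ranking: q_i = w_d*d_i + w_a*a_i + w_v*v_i.
// Lower q_i means a better suited camera; the caller sorts ascending.
#include <Eigen/Dense>
#include <cmath>

double rankCamera(const Eigen::Vector3d& Ci, const Eigen::Vector3d& Ai,
                  const Eigen::Vector3d& Cv, const Eigen::Vector3d& Av,
                  int visible, int nMax,      // visibility probe counts
                  double dMax, double aMax,   // normalization terms
                  double wd, double wa, double wv) {
    // Distance: orthogonal distance of C_i to the virtual viewing ray,
    // normalized by the maximum distance over all cameras.
    double d = ((Ci - Cv).cross(Av)).norm() / Av.norm() / dMax;
    // Viewing angle between A_i and A_v, normalized by the field of view.
    double a = std::acos(Ai.normalized().dot(Av.normalized())) / aMax;
    // Visibility penalty: fraction of probe samples not visible in C_v.
    double v = 1.0 - double(visible) / double(nMax);
    return wd * d + wa * a + wv * v;
}
```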

4.2. Multi-View Depth Fusion and Mesh Creation

The ranked cameras are now used to interpolate novel views. Since the novel view may cover a field of view that is larger than any real camera view, we have to fuse views from different cameras into one locally consistent image. To efficiently warp images from different real views into the novel viewpoint, we generate a warping surface approximating the geometry of the scene. Starting from a regular 2D grid placed in the image plane of the virtual camera, this warping surface is updated for each camera motion. The spacing of this grid, S = s_x x s_y with s_x, s_y in pixels, can be scaled to the complexity of the scene. With each point in the grid, a 5-tuple g = (P_g, p_g, i_g, b_g, d_g), called a grid point, is associated: p_g is the 2D position in the image plane, P_g the 3D point to be constructed, i_g the number of the camera responsible for P_g, b_g a boolean marking this grid point valid or invalid, and d_g the distance from P_g to C_v. The b_g and d_g components of all grid points are set to default values, which are b_g = invalid and d_g = infinity.

To fuse 3D information from the first n ranked cameras, the following algorithm is used for each camera. Each valid sample P_j^i of camera i is projected into the virtual camera with p_j = M_v P_j^i, and the distance d_j = ||P_j^i - C_v|| is calculated. If p_j is not in the visible area, the sample is skipped and the next sample P_{j+1}^i is taken. If it is in the visible area, the nearest grid point g_n is selected. Due to the regularity of the grid, this is easily done with

    g_n = rnd(p_j / S).

If the current sample p_j has a smaller distance d_j to C_v than the selected grid point's depth d_{g_n}, then the grid point's data is updated from this sample; otherwise the update is skipped. When updating a grid point from a sample, P_{g_n} is set to P_j^i, p_{g_n} is adjusted to p_j, the camera's number is stored in i_{g_n}, and the new distance value d_j is assigned to d_{g_n}. The algorithm then proceeds with the next sample P_{j+1}^i.

Updating grid points only with samples which are nearer to the virtual camera (d_j < d_{g_n}) ensures that occlusions are handled correctly. The samples of the highest ranked cameras are projected first. This ensures that most parts of the new view are interpolated from the best cameras. Only in regions that are occluded for the best camera is a lower ranked camera's sample used. Figure 5 depicts this situation.

Figure 5: Geometry fusion from two cameras, with C_1 ranked higher than C_2. Based on the ranking, C_1 supplies most of the information. Only in parts that are occluded for C_1 is data from C_2 filled in.
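The fusion step can be summarized in a short C++ sketch (a simplified stand-in for the 5-tuple grid described above; Eigen assumed, and the visible-area test reduced to a bounds check):

```cpp
// Sketch of the z-buffer-like depth fusion: project each sample of the
// ranked cameras into the virtual view and keep, per grid point, the sample
// nearest to C_v, so that occlusions are handled correctly.
#include <Eigen/Dense>
#include <cmath>
#include <limits>
#include <vector>

struct GridPoint {                 // the 5-tuple (P_g, p_g, i_g, b_g, d_g)
    Eigen::Vector3d P = Eigen::Vector3d::Zero();  // fused 3D point
    Eigen::Vector2d p = Eigen::Vector2d::Zero();  // 2D position in image plane
    int camera = -1;               // camera that supplied P
    bool valid = false;
    double dist = std::numeric_limits<double>::infinity();
};

// Call once per camera, best-ranked camera first.
void fuseSamples(int cam, const std::vector<Eigen::Vector3d>& samples,
                 const Eigen::Matrix<double, 3, 4>& Mv,   // virtual camera
                 const Eigen::Vector3d& Cv, double spacing,
                 int gridW, int gridH, std::vector<GridPoint>& grid) {
    for (const Eigen::Vector3d& P : samples) {
        Eigen::Vector3d hp = Mv * P.homogeneous();
        if (hp.z() <= 0.0) continue;                  // behind the camera
        Eigen::Vector2d p = hp.hnormalized();
        int gx = int(std::lround(p.x() / spacing));   // nearest grid point
        int gy = int(std::lround(p.y() / spacing));
        if (gx < 0 || gx >= gridW || gy < 0 || gy >= gridH) continue;
        GridPoint& g = grid[gy * gridW + gx];
        double d = (P - Cv).norm();
        if (d < g.dist)                               // nearer sample wins
            g = GridPoint{P, p, cam, true, d};
    }
}
```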
After projecting samples and adjusting grid points, most grid points are valid and contain 3D information suitable to represent the part of the scene visible in the virtual camera. But because some grid points may still be invalid, and the 2D positions p_{g_n} are fitted to projected samples, the connectivity of the grid points cannot be taken from the grid itself. Using the 2D positions of all valid grid points, a 2D Delaunay triangulation in the image plane of the virtual camera is performed. Transferring this 2D mesh to the 3D points P_{g_n} gives us a scalable approximation of the 3D scene with triangles. The approximation can be scaled with respect to the sampling density of depth samples and the density of the grid points for triangulation. This surface mesh is recreated after each camera movement as viewpoint-adaptive geometry.

4.3. Texturing

The texturing step effectively maps the real camera images into the virtual view with the help of the viewpoint-adaptive surface mesh. Several slightly different methods for texturing are considered. The simplest one is to choose the best-ranked camera as the texture source.

If this real camera is not too far away from the virtual camera and both have a similar field of view, the results are good. This is the fastest texturing method, since switching between different textures in one render cycle is not necessary and each triangle has to be drawn only once. Problems arise when parts of the mesh are not seen from the selected camera; these parts remain untextured.

To texture all triangles properly, it is necessary to select the texture according to the cameras from which the geometry originated. The triangle vertices are depth sample points for which the originating real camera is known. However, since each vertex is generated independently, a triangle may have vertices from up to three cameras. Here one may decide either to select the best-ranked camera (single-texture mode) or to blend all associated camera textures on the triangle (multi-texture mode). Proper blending of all textures results in smoother transitions between views, but with higher rendering costs for multi-pass rendering.

To test the rendering quality, a synthetic scene was generated by composing a VRML model of the Arenberg Castle together with a VRML model of an entrance portal to simulate occlusion, as shown in figure 6 (right). This textured 3D model was then rendered from different views, and screenshots along with synthesized depth maps and corresponding projection matrices were saved. Based on this ground truth material, the rendering quality was verified. To measure the image quality, one view (shown in figure 6, right) from the sequence was taken as reference image. The corresponding camera was then removed, and this view was interpolated from the remaining images with different texturing methods. The most critical regions are the edges between the background wall and the foreground portal; due to the variation in depth, most artifacts are located here. For better comparison, this region is magnified (figure 6, left), and the reference image is then compared to the different interpolated views by image subtraction. The resulting difference image serves as a visual error measure, and additionally the mean absolute intensity difference (MAD) is given. To emphasize the visual errors, a gamma correction of 0.3 was applied to the difference images.
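The error measure itself is straightforward; a C++ sketch (8-bit intensity buffers of equal size assumed; the gamma correction mentioned above is applied for display only and is not part of the measure):

```cpp
// Sketch of the mean absolute intensity difference (MAD) between a rendered
// view and the held-out reference image, both as 8-bit intensity buffers.
#include <cstdint>
#include <cstdlib>
#include <vector>

double meanAbsoluteDifference(const std::vector<std::uint8_t>& rendered,
                              const std::vector<std::uint8_t>& reference) {
    double sum = 0.0;
    for (std::size_t i = 0; i < rendered.size() && i < reference.size(); ++i)
        sum += std::abs(int(rendered[i]) - int(reference[i]));
    return sum / double(rendered.size());
}
```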

Figure 6: An input view of a synthetic scene (right) and a magnified closeup of critical regions (left).

For comparison only, the simplest approximation of the desired view is used: the nearest image is taken without depth warping (nearest neighbour); in fact, this is equivalent to standard image interpolation. Figure 7 shows that there are large areas around the portal where this interpolation fails due to the large image displacements. The difference image shows that the objects are not rendered at the correct depth, and the MAD value is 26.5.

Figure 7: Rendering (left) and difference image (right) when a nearest neighbor selection without interpolation is used. The difference image shows that large image displacements occur, especially in the foreground regions. MAD = 26.5.

With our proposed rendering system we can apply depth compensation with different texturing modes. The depth maps were subsampled with 200x160 samples; the size of the grid was also 200x160 points. Figure 8 was rendered using only the highest-ranked camera for texturing (single-camera, single-texture mode). It is visible that the geometrical approximation removes most of the more serious errors, resulting in a MAD of 12.8. Due to the discretized sampling and geometrical construction, fine structures like the windows in the background are not modeled sufficiently, which causes the remaining errors. This rendering mode is very fast since all textures are taken from a single texture map.

Figure 8: Rendering with single-camera, single-texture mode: the depth-compensated interpolation removes the image displacements, but edge artifacts remain. MAD = 12.8.

Using more sophisticated texturing compensates some more errors, so that the final visual quality is very close to the original view. Figure 9 is rendered from multiple cameras, but each triangle uses one texture only (multi-camera, single-texture mode). From the three possible cameras, again the best-ranked one is taken to texture the triangle. Some texture switching is required, but recent graphics hardware is fast enough for this. In regions where most triangles are textured from one camera, the result is the same as with the first method. In regions textured from many different cameras, this can result in texture artifacts because adjacent fragments do not always match. This method gives a MAD of 12.4. Since each triangle has to be drawn only once, the slowdown compared to single-camera texturing is small.

Figure 9: Rendering with multi-camera, single-texture mode: the view-dependent texturing removes some edge artifacts at the right occluding border of the portal. MAD = 12.4.

Sharp edges between textures can be avoided by multi-texturing and blending. Each triangle is drawn three times using the textures associated with the three cameras (multi-camera, multi-texture mode). The previously mentioned edges between textures are blended smoothly, as shown in figure 10, resulting in a MAD of 11.8. On modern graphics hardware it is also possible to use single-pass multi-texturing: different texture units are loaded with the three textures, and the triangle is drawn only once. This gives a speed-up of approximately 30% compared to multi-pass texturing. The performance gain is not a factor of 3, as one might expect, because for each triangle the texture units have to be reloaded, which is quite expensive.

Figure 10: Rendering with multi-camera, three-texture mode: texture blending reduces some texture edge artifacts, although the improvement is minor in this scene. MAD = 11.8.

5. Performance Issues

We have implemented the described rendering system in C++ using OpenGL and tested it on a standard PC (Linux) with a 1.7 GHz AMD Athlon, 1 GB memory and a GeForce4 Ti 4400 with 64 MB memory. The following section discusses some of the performance issues that arise, covering memory and processing speed requirements.

5.1. Memory Handling

Expecting several hundred camera images to render from, some care has to be taken to handle such amounts of textures and depth maps. After creating the depth samples, the depth maps are released immediately. To avoid sample creation at each start of the program, the samples are stored and reloaded if necessary. Even if modern graphics boards have 128 MB of memory, this is far too little for our purpose, but the AGP interface allows the usage of main memory for textures without significant performance loss. When starting, the program reserves a given amount of memory for textures and can also be told to load as many images as possible in advance. Later on, images are loaded on demand when they are used for the first time. If the upper bound of memory for textures is reached, the least recently used texture is released.
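The texture budget logic can be sketched as a small LRU cache in C++ (the handle type and the two helper functions are placeholders for the OpenGL texture management actually used):

```cpp
// Sketch of on-demand texture loading with a fixed budget: textures are
// loaded on first use, and the least recently used one is released when
// the budget is exceeded.
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

using TexHandle = unsigned int;                  // e.g. an OpenGL texture id
TexHandle loadTexture(int imageId);              // placeholder loader
void releaseTexture(TexHandle h);                // placeholder release

class TextureCache {
    std::size_t budget_;                         // max resident textures
    std::list<int> lru_;                         // most recently used in front
    std::unordered_map<int,
        std::pair<TexHandle, std::list<int>::iterator>> resident_;
public:
    explicit TextureCache(std::size_t budget) : budget_(budget) {}

    TexHandle get(int imageId) {
        auto it = resident_.find(imageId);
        if (it != resident_.end()) {             // hit: refresh recency
            lru_.erase(it->second.second);
            lru_.push_front(imageId);
            it->second.second = lru_.begin();
            return it->second.first;
        }
        if (resident_.size() >= budget_) {       // evict least recently used
            int victim = lru_.back();
            lru_.pop_back();
            releaseTexture(resident_[victim].first);
            resident_.erase(victim);
        }
        lru_.push_front(imageId);                // miss: load on demand
        TexHandle h = loadTexture(imageId);
        resident_.emplace(imageId, std::make_pair(h, lru_.begin()));
        return h;
    }
};
```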
Texture compression is used to preserve memory, too. The fixed compression ratio of 1:6 allows handling six times more textures than without compression. The only problem is that the compression takes some time: when a texture is loaded on demand, which means the rendering stops until the texture is ready, compressing the texture takes around 5 times longer than loading it uncompressed. We have therefore precompressed the images, stored them on disk, and load them as already compressed textures on demand. Using these techniques, we are able to handle image sequences of nearly arbitrary size. In an initialization run, all images are prepared as compressed textures and all depth maps are sampled into an LOD hierarchy. This results in a space- and time-optimized storage of the plenoptic model. The synthetic castle scene, for example, consists of 216 images of size 768x512, and the required space for the original images and depth maps is about 566 MB. The compressed representation uses 132 MB for depth sample storage and only 109 MB for compressed textures. Tests with scenes of 1057 images (3 GB of raw data) on a PC with 1 GB memory showed that all textures could be preloaded. To handle even larger scenes, on-demand loading is also possible and does not reduce the frame rate significantly; the bottleneck of loading textures is thus removed.

Typically, the performance of a render engine is measured in displayed frames per second, which depend on the scene complexity and the visible area. Our rendering system is different in this respect, since the frame rate is roughly independent of the scene complexity but depends on the number and resolution of the real views that are used for interpolation. Each time the viewpoint is changed, the warping surface mesh is reconstructed from the depth samples of the best ranked n cameras. This operation is linear in the number of cameras times samples and is performed by the CPU. The necessary steps are camera ranking, projection of the samples, and Delaunay triangulation, as described above. Camera ranking is quite fast; the selected ranking criteria can be computed with little cost, and even 1000 cameras are ranked in 3.9 ms. The Delaunay triangulation is linear in time with respect to the number of points: a mesh for 100x75 grid points is created in 25 ms, while 200x150 grid points take around 101 ms.

5.2. Rendering Performance

Projecting samples from real cameras into the virtual camera is also quite expensive. A typical density of 200x150 samples results in 30,000 projections per camera, and using 10 real cameras gives 300,000 matrix-vector multiplications. These are calculated in approximately 77 ms. Adding the 101 ms for triangulation and the 4 ms for rendering, a total of 182 ms per geometry update is reached for a grid with 200x150 points and 10 cameras with 200x150 samples each.

For this reason, the level-of-detail and the scalable density of grid points are used to reduce the complexity. Each level-of-detail step, and also each stepping of the grid density, reduces the computation time by a factor of four. A proper selection of the relevant cameras also helps to reduce the number of projections needed. All three values can be adjusted by the user to trade quality against speed. Using only 3 cameras with 100x75 samples and a grid with 100x75 points, it took only 31 ms to generate a new view. The loss in quality depends on the complexity of the scene, but to enhance the quality significantly, a slowdown by a factor of 8 has to be tolerated for most scenes. In general, for closeup views of small parts of the scene, only a small number of cameras is required, but the depth granularity should be fine. On the other hand, to obtain an overview over a large scene, many real cameras are required simultaneously, but the level-of-detail can be chosen moderately because of the reduction in scale.

Finally, texture mapping and blending are performed. The texture mapping is fully hardware accelerated and mostly determined by the display resolution and the texturing engine of the graphics card. With this scalable approach, interactive frame rates of 5 to 10 fps can always be guaranteed, even when using scenes with 1000 cameras. Using a coarse resolution to plan a camera path, the final rendering of a virtual fly-by can then be done fully automated with highest accuracy at non-interactive frame rates.
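These trade-offs can be captured in a back-of-envelope cost model (the per-projection constant is derived from the 300,000 projections in 77 ms quoted above; everything else is an assumption for illustration, ignoring ranking and triangulation):

```cpp
// Rough cost model for one geometry update: work is linear in
// cameras * samples, and each LOD level (or grid-density step) divides the
// sample count, and thus the projection time, by four.
#include <cstdio>

double geometryUpdateMs(int cameras, int samplesPerCamera, int lodLevel) {
    const double msPerProjection = 77.0 / 300000.0;   // from the timing above
    double projections = double(cameras) * double(samplesPerCamera);
    for (int l = 0; l < lodLevel; ++l)
        projections /= 4.0;                           // LOD reduction
    return projections * msPerProjection;
}

int main() {
    // 10 cameras at 200x150 samples, full detail: about 77 ms of projections.
    std::printf("full detail:  %.1f ms\n", geometryUpdateMs(10, 200 * 150, 0));
    // One LOD step cuts that to roughly a quarter.
    std::printf("one LOD step: %.1f ms\n", geometryUpdateMs(10, 200 * 150, 1));
}
```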
6. Experiments

The system was tested with a large variety of real footage. We tested scenes of different complexity and very different spatial sample rates. Some scenes were taken with a hand-held still camera, and very few images were used; for other scenes we used two or four cameras of the rig to obtain many images.

As an example of a complex outdoor scene, we recorded a walk over a parking lot with trees, cars, and buildings. 214 images (2x107 stereo pairs) with 1024x768 pixels each were recorded. The covered track was about 35 m, resulting in a spatial sampling between the images of about 30 cm. Scene depth extended between 3 and 75 m, causing image disparities of about 100 pixels between adjacent images. Figure 11 shows two images of the original scene and a depth map of the right original image to document that a dense reconstruction of trees, cars and bushes could be obtained. The rendered novel view (bottom right) shows that even small details like the lamp post and the trees are rendered from a new perspective with high realism and little distortion.

Figure 11: Parking lot scene. Top: two original images. Bottom left: depth map; bottom right: novel viewpoint rendered from the parking lot scene in multi-camera, multi-texture mode.

Finally, we show rendering results for the dinosaur scene described in section 3.3. An original image (fig. 3, left) was removed and interpolated from the remaining images (fig. 12, left). Most image regions are rendered with high quality. The peaks of the spinal bones are distorted since a proper depth reconstruction was not feasible due to occlusions; these regions were marked black in the depth map (fig. 3, right).

Figure 12: Rendered new view of the dinosaur skeleton and museum hall. Main bones and hall background are reconstructed correctly. The peaks of the spinal bones are distorted due to improper depth compensation.

7. Conclusions

We have discussed a new image-based rendering approach that can handle uncalibrated multi-camera sequences from a hand-operated camera rig. In a preprocessing step, the camera calibration and depth reconstruction are performed automatically from the image sequence itself. These view-dependent data are then used for interactive image-based rendering by depth-compensated warping of view-dependent textures. The system can handle very large data sets of hundreds of images at interactive rates.

Experiments have shown that the surface-based rendering approach is successful if dense depth can be computed. However, in regions with many small and occluding objects, depth computation may fail and distortions occur. Also, if the disparity between adjacent images is very high, edge interpolation artifacts occur. We are currently investigating how to overcome these drawbacks. Possible solutions may be to selectively switch from surface-based to point-based rendering, or to use multi-layered depth images to account for the occluding regions. We will investigate further in this direction.

Acknowledgements

This work is being funded by the European project IST-2000-28436 ORIGAMI. The images for the dinosaur scene were supplied by Oliver Grau, BBC Research.

References

1. Chris Buehler, Michael Bosse, Leonard McMillan, Steven J. Gortler, and Michael F. Cohen. Unstructured lumigraph rendering. In Eugene Fiume, editor, SIGGRAPH 2001, Computer Graphics Proceedings, pages 425-432. ACM Press / ACM SIGGRAPH, 2001.
2. Paul Debevec, Yizhou Yu, and George Boshokov. Efficient view-dependent image-based rendering with projective texture-mapping. Technical Report CSD-98-1003, 1998.
3. Jan-Michael Frahm, Jan-Friso Evers-Senne, and Reinhard Koch. Network protocol for interaction and scalable distributed visualization. In 3DPVT 2002, 2002.
4. Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. The lumigraph. Computer Graphics, 30 (Annual Conference Series): 43-54, 1996.
5. Benno Heigl, Reinhard Koch, and Marc Pollefeys. Plenoptic modeling and rendering from image sequences taken by a hand-held camera. In Proceedings of DAGM 1999, 1999.
6. R. Koch, J.-M. Frahm, J.-F. Evers-Senne, and J. Woetzel. Plenoptic modeling of 3D scenes with a sensor-augmented multi-camera rig. In Tyrrhenian International Workshop on Digital Communication (IWDC) proceedings, September 2002.
7. R. Koch, M. Pollefeys, and L. Van Gool. Multi viewpoint stereo from uncalibrated video sequences. In Proc. ECCV'98, number 1406 in LNCS. Springer, 1998.
8. R. Koch, M. Pollefeys, B. Heigl, L. Van Gool, and H. Niemann. Calibration of hand-held camera sequences for plenoptic modeling. In Proceedings ICCV'99, Corfu, Greece, 1999.
9. Marc Levoy and Pat Hanrahan. Light field rendering. Computer Graphics, 30 (Annual Conference Series): 31-42, 1996.
10. M. Magnor and B. Girod. Data compression for light field rendering. IEEE Trans. Circuits and Systems, 2000.
11. Wojciech Matusik, Hanspeter Pfister, Addy Ngan, Paul Beardsley, Remo Ziegler, and Leonard McMillan. Image-based 3D photography using opacity hulls. In SIGGRAPH 2002.
12. Leonard McMillan and Gary Bishop. Plenoptic modeling: An image-based rendering system. Computer Graphics, 29 (Annual Conference Series): 39-46, 1995.
13. Marc Pollefeys, Reinhard Koch, and Luc J. Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. International Journal of Computer Vision, 32(1): 7-25, 1999.
14. K. Pulli, T. Duchamp, H. Hoppe, J. McDonald, L. Shapiro, and W. Stuetzle. Robust meshes from multiple range maps, 1997.
15. Kari Pulli, Michael Cohen, Tom Duchamp, Hugues Hoppe, Linda Shapiro, and Werner Stuetzle. View-based rendering: Visualizing real objects from scanned range and color data. In Julie Dorsey and Philipp Slusallek, editors, Rendering Techniques '97 (Proceedings of the Eighth Eurographics Workshop on Rendering), pages 23-34, New York, NY, 1997. Springer Wien.
16. X. Tong and R. M. Gray. Compression of light fields using disparity compensation and vector quantization. In Proc. IASTED Conf. Computer Graphics and Imaging, pages 300-305, October 1999.
17. Matthias Zwicker, Hanspeter Pfister, Jeroen van Baar, and Markus Gross. Surface splatting. In Eugene Fiume, editor, SIGGRAPH 2001, Computer Graphics Proceedings, pages 371-378. ACM Press / ACM SIGGRAPH, 2001.