Implementation of a Dynamic Image-Based Rendering System

Niklas Bakos¹, Claes Järvman² and Mark Ollila³
Norrköping Visualization and Interaction Studio, Linköping University

Abstract

Work in dynamic image-based rendering has previously been presented by Kanade et al. [4] and Matusik et al. [5]. We present an alternative implementation that provides a very inexpensive process for creating dynamic image-based renderings of digitally recorded, photorealistic, real-life objects. Together with computer vision algorithms, the image-based objects are visualized using the Relief Texture Mapping algorithm presented by Oliveira et al. [6]. As the relief engine requires depth information for all texels representing the recorded object in an arbitrary view, a recording solution that makes depth extraction possible is required. Our eyes use binocular vision to perceive depth from disparities, which is also the most effortless technique for producing stereo vision. Using two digital video cameras, the dynamic object is recorded in stereo in different views to cover its whole volume. Once the depth information for all views has been generated, the different views of the image-based object are textured onto a pre-defined bounding box and relief textured into a three-dimensional representation by applying the known depth disparities.

1 System Prototype

The first step in the process is to record a dynamic object in stereo, which gives us the photo textures for the image-based object and the possibility to derive depth information from the stereo image pairs. To be able to use the recorded video as a texture when rendering, it is important that one camera (i.e. the left) is installed parallel to the normal of the side of the bounding box surrounding the object, and the other (i.e. the right) next to it, on a circular path so that both cameras have the same radius to the object. As we are interested in the recorded object only, the image background should be as simple as possible. By using a blue or green screen, the object can easily be extracted later on. A blue screen can easily be installed by hanging cheap blue matte fabric on the walls.
Depending on the number of cameras available, the dynamic object is recorded in stereo in up to five views (front, back, left, right and top). In this project, only two cameras were used, giving us only one view when filming the dynamic object. When the recording is finished, the video streams are sent via FireWire to a PC, where the resolution is rescaled to 256x256 pixels, the background is removed, and the depth maps are calculated, enhanced, cropped and sent to the relief rendering engine (pipeline in figure 1).

Figure 1: Prototype overview. A schematic view of the different stages required in the process of rendering new views of an image-based object. Stages shown: real scene with blue screen (Sony digital video cameras); recorded stereo video (DV-PAL 720x576); background removal and silhouette creation (256x256); correlation-based stereo depth maps (256x256); error removal and depth map smoothing; video stream with depth maps; bounding box (1-6 polygons); relief texturing (OpenGL); relief-textured bounding box; virtual camera; unique virtual viewpoints.

2 Depth approximation

When the stereo video has been recorded and streamed to the client computer, our algorithms start processing the data to create useful video frames and information about the scene. Once the objects have been extracted from the original video, the process of estimating the depth of the scene is initiated. When the approximated depth map for a certain frame has been generated, it is used together with the object image to render unique views, using the relief rendering engine. This section starts with a brief overview of the depth algorithm, followed by complete descriptions of all the steps, from the original video streams to sending a finalized depth map and video frame to the rendering process for virtually viewing the object from an arbitrary view.

¹ nikba@itn.liu.se  ² claja6@student.liu.se  ³ marol@itn.liu.se

2.1 Algorithm overview

A summary of the algorithm pipeline is shown in figure 2. From the N stereo video cameras, we have 2N video streams. From the left camera (which sees the scene straight from the front), the object-only video frames and the silhouette are created. As the scene is recorded against a blue screen background, both the silhouette and the object extraction are created rapidly. Simultaneously, both the left and the right video streams are segmented into frames and sent to our filter-based depth algorithm. At this stage, the frames can be downsized for optimization purposes, which results in faster depth map approximations of lower quality. For each frame, each pixel of the left image is analyzed and compared with a certain area of the right image to find the pixel correspondence. With this known, the depth can be estimated for each frame. Since this mathematical method outputs a relatively distorted image, it needs to be retouched to better suit the relief engine. First, the depth map is sent to an edge detection algorithm, where an edge can be thought of as noise distorting the depth map, and is removed by pasting in the intensity values of neighboring pixels. With the errors removed, the depth approximation of the image-based object contains less noise and fewer unnecessary holes, but disparities between contiguous object regions might still be rendered with too sharp intensity variations, which exaggerates the displacement of some object parts when applying the relief mapping. To solve this, the depth map is smoothed, and finally the silhouette is applied to remove approximated background depth elements.

Figure 2. Stages shown: video stream (left camera); video stream (right camera); image-based object; object silhouette; filter-based stereo scene depth map; error removal; smoothing; object depth map; relief rendering.
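The blue-screen silhouette extraction and the last two steps of the overview (smoothing, then masking with the silhouette) can be sketched as follows. This is a minimal illustration in plain Python, not the prototype's code: the function names, the `margin` threshold for detecting blue, and the `sigma` and `radius` defaults are all assumptions made for the example.

```python
import math

def blue_screen_silhouette(frame, margin=40):
    """Binary mask: 1 for object pixels, 0 for blue background.
    frame is a list of rows of (r, g, b) tuples in 0..255.
    margin is a hypothetical threshold, not a value from the paper."""
    return [[0 if b - max(r, g) > margin else 1 for (r, g, b) in row]
            for row in frame]

def gaussian_kernel(sigma=1.0, radius=2):
    """Normalized 1-D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Blur every row with the 1-D kernel, clamping at the image border."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def smooth_depth(depth, sigma=1.0, radius=2):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = blur_rows(depth, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))

def apply_silhouette(depth, mask):
    """Zero out depth values that fall on the background."""
    return [[d * m for d, m in zip(drow, mrow)]
            for drow, mrow in zip(depth, mask)]

# Tiny 2x2 example: two blue background pixels, two object pixels.
frame = [[(10, 20, 200), (180, 150, 120)],
         [(200, 190, 180), (5, 10, 220)]]
depth = [[90.0, 120.0], [130.0, 80.0]]
mask = blue_screen_silhouette(frame)
print(apply_silhouette(smooth_depth(depth), mask))
```

Applying the mask after the blur, as in the overview, keeps blurred depth values from leaking into the background.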
2.1.1 Filter-based stereo correspondence

The method implemented in our system prototype uses the filter-based stereo correspondence developed by Jones and Malik [2], a technique using a set of linear filters tuned to different rotations and scales to enhance the features of the input image pair for better correlation opportunities. A benefit of using spatial filters is that they preserve the information between the edges inside an image. The bank of filters is convolved with the left and the right image to create a response vector at a given point that characterizes the local structure of the image patch. Using this information, the correspondence problem can be solved by searching for pixels in the other image where the response vector is maximally similar. The reason for using a set of linear filters at various orientations is to obtain rich and highly specific image features suitable for stereo matching, with fewer chances of running into false matches. The set of filters F (fig. 3) used to create the depth map consists of rotated copies of filters generated by

$$G_{n,\theta}(x, y) = G_n(u)\,G_0(v); \qquad u = x\cos\theta - y\sin\theta, \quad v = x\sin\theta + y\cos\theta$$

where n = 1, 2, 3 and G_n is the n-th derivative of the Gaussian function, defined as

$$G_0(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-z^2/2}; \qquad z = \frac{x}{\sigma}$$

$$G_1(x) = -\frac{1}{\sigma}\, z\, G_0(x); \qquad G_2(x) = \frac{1}{\sigma^2}\,(z^2 - 1)\, G_0(x); \qquad G_3(x) = -\frac{1}{\sigma^3}\,(z^3 - 3z)\, G_0(x).$$

The matching process was performed using different filter sizes to find the optimal filter settings, resulting in an x-sized matrix with a standard deviation value of . The number of filters used depends on the required output quality. Using all filters would result in a highly detailed depth approximation, but the processing time would be immense. After testing different filters to balance speed and output quality, the resulting bank consisted of nine linear filters at equal scale, with some of them rotated, as shown in figure 3.
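As a hedged illustration of the definitions above, the sketch below builds oriented Gaussian-derivative kernels in plain Python. The composition of the nine filters (three derivative orders at three orientations) is an assumption made for the example, and the `size`, `sigma` and `thetas` defaults are placeholders, not the settings used in the prototype.

```python
import math

def gaussian_derivative(n, x, sigma):
    """n-th derivative (n = 0..3) of a 1-D Gaussian with standard deviation sigma."""
    z = x / sigma
    g0 = math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)
    if n == 0:
        return g0
    if n == 1:
        return -z * g0 / sigma
    if n == 2:
        return (z * z - 1.0) * g0 / sigma ** 2
    if n == 3:
        return -(z ** 3 - 3.0 * z) * g0 / sigma ** 3
    raise ValueError("only derivative orders 0..3 are defined here")

def oriented_filter(n, theta, size, sigma):
    """size x size kernel sampling G_n(u) * G_0(v) on a grid rotated by theta."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            u = x * math.cos(theta) - y * math.sin(theta)
            v = x * math.sin(theta) + y * math.cos(theta)
            row.append(gaussian_derivative(n, u, sigma) *
                       gaussian_derivative(0, v, sigma))
        kernel.append(row)
    return kernel

def filter_bank(size=11, sigma=2.0,
                thetas=(0.0, math.pi / 3.0, 2.0 * math.pi / 3.0)):
    """Hypothetical bank: orders 1..3 at each orientation (3 x 3 = 9 kernels)."""
    return [oriented_filter(n, theta, size, sigma)
            for n in (1, 2, 3) for theta in thetas]

bank = filter_bank()
print(len(bank), "kernels of size", len(bank[0]), "x", len(bank[0][0]))
```

Odd-order kernels are antisymmetric along their derivative axis and respond to edges, while the second-order kernels are symmetric and respond to bars and ridges.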

Figure 3: Spatial filter bank. Image plots of the nine filters generated as rotated copies of Gaussians.

Figure 4: Response vectors. An illustration of what the response vectors look like after convolution with different filters. In reality, a response vector never represents a whole image.

The disadvantage of using only one scaling level is the loss of accuracy when matching pixels near object boundaries or in occluded regions. But again, with more scales the rendering time increases proportionally. To search for pixel correspondence, an iterative process scans the left image horizontally, pixel by pixel, from left to right, and looks for similar intensity values inside a defined region surrounding the current pixel location. For each row, the set of linear filters is convolved with a region of the right image, determined by the image width and the filter height, to create a response vector that characterizes the features of this row. Along this row, a new response vector is created for each pixel by convolving the filter bank with a filter-sized region of the left image. Figure 4 illustrates what the convolved response vectors for a whole image would look like. (Note that the response vectors only represent a small region of the image in each iteration of the correspondence process.)

$$v_{i,\mathrm{right}} = r * F_i = \sum_{x'} \sum_{y'} r(x', y')\, F_i(x - x', y - y')$$

$$v_{i,\mathrm{left}} = l * F_i = \sum_{x'} \sum_{y'} l(x', y')\, F_i(x - x', y - y')$$

Here r and l are the right and left images and F_i is the i-th filter of the bank. The convolution returns only those parts that are computed without the zero-padded edges, which minimizes the size of the response vectors and speeds up the whole process of finding the correspondence. As soon as the images have been convolved with the filters, the matching process for finding the correlation is initiated. To restrict the search area, a one-dimensional region needs to be determined. If the region is too small, the corresponding pixel may not be found, as it is probably located outside the region. On the other hand, if the region is too large, a pixel unrelated to that area might be accepted as correct.
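The row-wise search can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the prototype's implementation: the responses are compared with the sum of squared differences that the text introduces next, the disparity is taken as k - i, and the function names and the `search_radius` parameter are hypothetical.

```python
def convolve_valid(image, kernel):
    """2-D convolution keeping only outputs computed without zero padding."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    # The kernel is flipped: true convolution, not correlation.
                    acc += image[y + j][x + i] * kernel[kh - 1 - j][kw - 1 - i]
            row.append(acc)
        out.append(row)
    return out

def response_vectors(image, bank):
    """responses[y][x] lists the output of every filter at valid position (y, x)."""
    maps = [convolve_valid(image, kernel) for kernel in bank]
    height, width = len(maps[0]), len(maps[0][0])
    return [[[m[y][x] for m in maps] for x in range(width)]
            for y in range(height)]

def match_row(left_row, right_row, search_radius):
    """For each left pixel i, find the right column k with the smallest SSD
    inside the one-dimensional window [i - search_radius, i + search_radius]."""
    matches = []
    for i, v_left in enumerate(left_row):
        lo = max(0, i - search_radius)
        hi = min(len(right_row) - 1, i + search_radius)
        best_k, best_e = lo, float("inf")
        for k in range(lo, hi + 1):
            e = sum((a - b) ** 2 for a, b in zip(v_left, right_row[k]))
            if e < best_e:
                best_k, best_e = k, e
        matches.append(best_k)
    return matches

def disparity_map(left, right, bank, search_radius=3):
    """Signed disparity k - i for every valid pixel of the left image."""
    resp_left = response_vectors(left, bank)
    resp_right = response_vectors(right, bank)
    return [[k - i for i, k in enumerate(match_row(lrow, rrow, search_radius))]
            for lrow, rrow in zip(resp_left, resp_right)]
```

With a single 1x1 identity kernel as the bank, the responses reduce to raw intensities, which is a convenient way to check the search logic before plugging in a real filter bank.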
When the region is established, it is used to crop the response vector v_{i,right} created from the right image. When the response vectors are defined at a given point, they need to be compared in some way to extract information about how the pixels are related. The length of their vector difference e, which equals zero if the response vectors are identical, can be used to solve the correspondence problem. This is done by taking the sum of squared differences (SSD) of the response vectors,

$$e = \sum_{i=1}^{|F|} \left( v_{i,\mathrm{left}} - v_{i,\mathrm{right}} \right)^2$$

where |F| is the number of filters used, and the pixel position (defined as k) with the value closest to zero is saved. When the correspondence has been established, the disparity has to be defined in order to create a depth map. For each pixel in the left image, we know the position of the matching pixel in the right image. To connect these data, the depth value d(i, j) for each pixel can be estimated by

$$d(i, j) = k - i$$

where k is the horizontal position of the corresponding pixel and i is the current pixel position (here a column index, not the filter index used above). The depth map (fig. 5) is approximated with intensity levels that depend on the size of the constant defining the size of the matching region: if a corresponding pixel is found to the left of the current pixel, the intensity is set to a value toward white, and vice versa, depending on the rotation of the image pair.

Figure 5. (Labels: matching region, k.)

2.1.2 Locating errors and noise

The primary depth map image generated by the filter-based stereo algorithm is a general approximation of the depth information for the objects in the video frames. As this algorithm has no knowledge of object connectivity or of how the scene is designed, unpredicted outputs might appear. They can be found by convolving the image with an edge detection filter [7]. The operator best suited to our needs turned out to be the Robinson filters h_1 and h_3. With the vertical and the horizontal Robinson filters defined, they are convolved with the depth map to find obvious edges in it, using the two-dimensional convolution formula. We now have two temporary depth map images, with the edges defined vertically and horizontally, shown in figure 6.

Figure 6.

From these, the edge magnitude of each pixel can be derived as

$$d(x, y) = \sqrt{d_1(x, y)^2 + d_3(x, y)^2}$$

where d_1 and d_3 denote the depth map convolved with h_1 and h_3, respectively. The result is shown in figure 7a and gives a better picture of how the errors are structured. To use this information effectively, the pixels identified as positions of possible errors need to be saved in an easily accessible form. Using a threshold value, we can decide which of the convolved edge pixels belong to the error pixels in the original depth map, shown in figure 7b. With the positions of the erroneous pixels known, they are replaced by neighboring pixel values, which creates a smoother depth map, although not a mathematically perfect one, since it is only assumed that the neighboring pixels have the same properties as the invalid ones they replace. On the other hand, the noise-free depth maps, shown in figure 8, generate tremendously enhanced renderings when used by the relief engine.

Figure 7a & b.

Figure 8.
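A minimal plain-Python sketch of this error-removal step follows. Several details are assumptions made for the example and are not taken from the prototype: the 3x3 mask entries are the Robinson masks as commonly tabulated in image-processing textbooks, the threshold is chosen by hand, border pixels are left unflagged, and a flagged pixel is replaced by the median of its unflagged 8-neighbours.

```python
import math

# Assumed Robinson compass masks (not reproduced from the paper).
H1 = [[1, 1, 1], [1, -2, 1], [-1, -1, -1]]
H3 = [[-1, 1, 1], [-1, -2, 1], [-1, 1, 1]]

def convolve3x3(image, mask):
    """Convolve with a 3x3 mask at interior pixels; border responses stay 0."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = float(sum(image[y + j - 1][x + i - 1] * mask[2 - j][2 - i]
                                  for j in range(3) for i in range(3)))
    return out

def edge_magnitude(depth):
    """Combine the two directional responses into one magnitude per pixel."""
    d1 = convolve3x3(depth, H1)
    d3 = convolve3x3(depth, H3)
    return [[math.hypot(a, b) for a, b in zip(r1, r3)]
            for r1, r3 in zip(d1, d3)]

def error_mask(magnitude, threshold):
    """True where the edge magnitude exceeds the (hand-picked) threshold."""
    return [[m > threshold for m in row] for row in magnitude]

def repair(depth, errors):
    """Replace each flagged pixel by the median of its unflagged neighbours."""
    height, width = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for y in range(height):
        for x in range(width):
            if not errors[y][x]:
                continue
            good = sorted(depth[yy][xx]
                          for yy in range(max(0, y - 1), min(height, y + 2))
                          for xx in range(max(0, x - 1), min(width, x + 2))
                          if not errors[yy][xx])
            if good:
                out[y][x] = good[len(good) // 2]
    return out

# A flat depth map with one outlier in the middle.
depth = [[100.0] * 5 for _ in range(5)]
depth[2][2] = 200.0
errors = error_mask(edge_magnitude(depth), threshold=200.0)
print(repair(depth, errors)[2])
```

The threshold trades off sensitivity: a low value also flags genuine depth discontinuities on the object, while a high value leaves isolated mismatches in place.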

2.1.3 Smoothing the depth map

The output of the edge detection process is a more or less error-free depth map with regard to hole filling and depth intensity interpretation. The intensity can, however, fluctuate significantly over connected and contiguous surfaces of the object. As some intensity values diverge in areas where they should actually be similar, the solution is to decrease the higher values and increase the lower ones to create more similar intensities over that specific area; in other words, to smooth the image. This might produce an intensity value that is incorrect for the true depth of that part of the object, but when the solution is applied to the whole image, the displacement acts as an intensity threshold only. The Gauss function is used to generate a smooth depth map, in the form of the well-known Gaussian blur filter [1]. We defined a Gaussian operator and convolved it with the depth map to obtain the smooth result, seen in figure 9.

Figure 9.

2.1.4 Rendering

A fully functional application for relief rendering of the image-based object and its depth maps was written in C++ using OpenGL. It was created in parallel with this project [3] and modified to meet the requirements of our system prototype. The number of polygons required for rendering equals the number of stereo cameras used. Because of the good depth information approximated with the filter-based stereo algorithm, the viewing angle was set to 45N degrees to either side of the center of the origin of the textured polygon box, illustrated in figure 10.

Figure 10: Viewing angles. (Labels: N = 1 polygon; stereo camera; -45°, +45°; 0°, 90°, 180°.)

3 Results

The resulting application consists of two demos (screenshots available on the last page):

Static demo (yellow pullover) - Requires two input textures with two depth maps, textured on two polygons. From two original views with 90 degrees of separation, new unique views can be created within 180 degrees. The polygons are mapped with textures of size 256x256 pixels and the frame rate is ~5 frames/sec.

Dynamic demo (pink pullover) - Represents a person walking around.
Textured on only one polygon, which restricts the viewing angle to 90 degrees. The amount of input data required depends on the frame rate. We used a frame rate of 0 frames/sec, with a video buffer of 40 images and 40 depth maps. The relief engine had no problems rendering a constantly updating image buffer, and the animated sequence showed no indications of flickering.

References

[1] BOGACHEV, V. 1998. Gaussian measures. Mathematical Surveys and Monographs 62.
[2] JONES, D., AND MALIK, J. 1992. A computational framework for determining stereo correspondence from a set of linear spatial filters. In ECCV, 395-410.
[3] JÄRVMAN, C. Static and Dynamic Image-Based Applications using Relief Texture Mapping. Linköping University, LITH-ITN-MT-0-SE, May 00.
[4] KANADE, T., NARAYAN, P., AND RANDER, P. W. 1997. Virtualized reality: Constructing virtual worlds from real scenes. IEEE Multimedia 4, 1, 34-47.
[5] MATUSIK, W., BUEHLER, C., RASKAR, R., GORTLER, S. J., AND MCMILLAN, L. 2000. Image-based visual hulls. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co., 369-374.
[6] OLIVEIRA, M. M., BISHOP, G., AND MCALLISTER, D. 2000. Relief texture mapping. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co., 359-368.
[7] SONKA, M., HLAVAC, V., AND BOYLE, R. 1996. Image Processing, Analysis, and Machine Vision, second ed. Brooks/Cole Publishing Company.