Direct Methods for Visual Scene Reconstruction


To appear at the IEEE Workshop on Representations of Visual Scenes, June 24, 1995, Cambridge, MA

Richard Szeliski and Sing Bing Kang
Digital Equipment Corporation, Cambridge Research Lab
One Kendall Square, Bldg. 700, Cambridge, MA

Abstract

There has been a lot of activity recently surrounding the reconstruction of photorealistic 3-D scenes and high-resolution images from video sequences. In this paper, we present some of our recent work in this area, which is based on the registration of multiple images (views) in a projective framework. Unlike most other techniques, we do not rely on special features to form a projective basis. Instead, we directly solve a least-squares estimation problem in the unknown structure and motion parameters, which leads to statistically optimal estimates. We discuss algorithms both for constructing planar and panoramic mosaics and for projective depth recovery. We also speculate about the ultimate usefulness of projective approaches to visual scene reconstruction.

1 Introduction

The recovery of 3-D scene information from multiple views has long been one of the central problems in computer vision. Over the last decade, many researchers observed that such a full reconstruction may not be necessary for many vision-based tasks, e.g., face or object recognition. More recently, however, there has been a resurgence of interest in 3-D scene reconstruction, motivated both by improvements in algorithms and processing speeds, and by the emergence of interesting applications such as virtual reality and model-based video compression.

Traditionally, 3-D scene reconstruction has been the focus of both stereo and structure from motion, two subfields with complementary sets of assumptions and techniques. In this paper, we present some of our recent techniques in this area, which blend aspects of both stereo and structure from motion [14, 15, 13].
We call our techniques direct both because they directly minimize an image-based misregistration measure (without special algebraic or geometric transformations), and because they are (usually) based on the direct minimization of intensity errors.

Our techniques share a number of characteristics which distinguish them from traditional approaches to structure from motion and stereo. Whenever possible, we use many views instead of just two views, since this leads to more reliable estimates. We formulate our reconstruction algorithms using projective geometry, which allows them to work with uncalibrated cameras as well as cameras with time-varying parameters (e.g., zoom). We also formulate our problems as the direct (iterative) minimization of image-based measures of misregistration, instead of using algebraic manipulations which can result in marked sensitivity to noise. Under small Gaussian noise in either feature position or intensity samples, such techniques are statistically optimal. Finally, our projective depth recovery algorithm yields a dense estimate of scene depth, unlike most structure from motion algorithms.

The current focus for our work has been the creation of realistic high-resolution imagery and 3-D environments from low-resolution, uncalibrated video. Our applications range from automatically creating 360° panoramas from video or photographs (e.g., of an office or a whiteboard), to reconstructing the 3-D shape of individual objects. Our long term goal is to automatically construct 3-D indoor and outdoor environments for applications such as home sales, virtual supermarket shopping, and tele-travel (Section 6).

We begin the paper with the construction of high-resolution image mosaics from low-resolution video (Section 2). We then present our algorithm for projective depth recovery, and discuss its application to view interpolation and extrapolation (Section 3). For those cases when it is necessary to bootstrap the intensity-based shape from motion algorithm with a feature-based algorithm, we present our affine and projective structure from motion algorithms (Section 4).
Finally, we discuss possible visual scene representations based on our techniques, and some potential applications.

2 Video Mosaics

The first technique we describe automatically aligns and composites multiple images into high-resolution mosaics [13]. Building aerial photomosaics has long been a staple of photogrammetry, but only recently have fully automated techniques for building mosaics been developed. Most techniques still only estimate pure translations or affine transformations [4], but some recent work has dealt with the full projective case [8]. Our approach is, to our knowledge, the first to combine full projective warping with near real-time performance.

Our techniques for automatically aligning images into photomosaics exploit the particularly simple form of the motion field resulting from two specific imaging situations. The first case is when the images cover a portion of a planar scene, e.g., a whiteboard, a desktop, or a wall. The second case is when the camera rotates around an axis through its focal point (or when all scene objects are very far from the camera). Under either of these two conditions, the inter-frame motion can be represented by a homography, i.e., a linear function of projective image coordinates $u' = M u$ (see [13] for a simple proof). In the subsections below, we describe our algorithm in more detail, and give examples of its application to the specific cases of planar scenes, panoramas, and multiresolution mosaics.

Figure 1: Whiteboard image mosaic example. The central square shows the size of one input image (tile).

2.1 Planar Scenes

The compositing of multiple images into larger mosaics requires two basic steps: an image-to-image alignment (preferably to sub-pixel precision), and a method for seamlessly blending images. Many different solutions are possible for the first problem, including matching four or more feature points and then solving for the homography, or manually adjusting image positions using a blink comparator. The approach we have taken is to directly minimize the discrepancy in intensities between pairs of images after applying the transformation we are recovering. Our technique does not require the location and correspondence of feature points, and is statistically optimal in the vicinity of the true solution [14].

Let us write the 2-D homography as

$$ x' = \frac{m_0 x + m_1 y + m_2}{m_6 x + m_7 y + 1}, \qquad y' = \frac{m_3 x + m_4 y + m_5}{m_6 x + m_7 y + 1}. \tag{1} $$

Our technique minimizes the sum of the squared intensity errors

$$ E = \sum_i \left[ I'(x'_i, y'_i) - I(x_i, y_i) \right]^2 = \sum_i e_i^2 \tag{2} $$

over all corresponding pairs of pixels $i$ which are inside both images. Once we have found the best transformation $M$, we can warp image $I'$ into the reference frame of $I$ using $M$ and then blend the two images together. To reduce visible artifacts, we weight images being blended together more heavily towards the center, using a bilinear weighting function.

To perform the minimization, we use the Levenberg-Marquardt iterative non-linear minimization algorithm (see [14, 13] for details). The advantage of using Levenberg-Marquardt over straightforward gradient descent is that it converges in fewer iterations.
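As a concrete illustration of Equations (1) and (2), the following is a minimal NumPy sketch, not the authors' implementation. The function names `apply_homography`, `bilinear_sample`, and `intensity_error` are hypothetical, images are assumed to be 2-D grayscale arrays of at least 2 × 2 pixels, and bilinear interpolation is an assumed choice for resampling $I'$. It only evaluates the error $E$ for a given parameter vector; the Levenberg-Marquardt update itself is not shown.

```python
import numpy as np

def apply_homography(m, x, y):
    """Map coordinates (x, y) through the 8-parameter homography of Eq. (1)."""
    m0, m1, m2, m3, m4, m5, m6, m7 = m
    denom = m6 * x + m7 * y + 1.0
    return (m0 * x + m1 * y + m2) / denom, (m3 * x + m4 * y + m5) / denom

def bilinear_sample(img, x, y):
    """Bilinearly interpolate img at float coordinates that lie inside the image."""
    x0 = np.clip(np.floor(x).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, img.shape[0] - 2)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x0 + 1]
    bottom = (1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom

def intensity_error(I, I_prime, m):
    """Sum of squared intensity errors, Eq. (2), over pixels inside both images."""
    h, w = I.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xp, yp = apply_homography(m, xs, ys)
    hp, wp = I_prime.shape
    # Keep only pixels of I whose warped position falls inside I_prime.
    inside = (xp >= 0) & (xp <= wp - 1) & (yp >= 0) & (yp <= hp - 1)
    e = bilinear_sample(I_prime, xp[inside], yp[inside]) - I[inside]
    return float(np.sum(e ** 2))
```

With the identity parameters `(1, 0, 0, 0, 1, 0, 0, 0)` and two identical images the error is zero; an iterative minimizer would adjust the eight parameters to drive this value down for a real image pair.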
Unfortunately, both gradient descent and Levenberg-Marquardt only find locally optimal solutions. If the motion between successive frames is large, we use hierarchical matching, which first registers smaller, subsampled versions of the images where the apparent motion is smaller. For even larger displacements, we use phase correlation, which is a technique based on 2-D Fourier transforms [6].

To demonstrate the performance of our algorithm, we digitized an image sequence with a camera panning over a whiteboard. Figure 1 shows the final mosaic of the whiteboard, with the location of a constituent image shown as a white outline. This mosaic is based on compositing 39 NTSC (640 × 480) resolution images.

2.2 Panoramic Mosaics

In order to build a panoramic image mosaic, we rotate a camera around its optical center. Images taken in this manner are related by 2-D projective transformations, just as in the planar case [13]. Intuitively, we cannot tell the relative depth of points in the scene as we rotate (there is no motion parallax), so they could be located on a plane.
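The phase correlation step mentioned in Section 2.1 for large displacements can be sketched as follows. This is a minimal illustration of the general technique described in [6], assuming two equally sized grayscale images related by a pure (cyclic) integer translation; the function name `phase_correlation` is hypothetical, and the paper does not specify how its own implementation handles windowing or sub-pixel peaks.

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the integer shift (dy, dx) such that b ~ np.roll(a, (dy, dx), axis=(0, 1))."""
    A = np.fft.fft2(a)
    B = np.fft.fft2(b)
    # Normalized cross-power spectrum: keeps only the phase difference.
    cross = B * np.conj(A)
    cross /= np.maximum(np.abs(cross), 1e-12)
    # Its inverse transform peaks at the displacement between the images.
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = a.shape
    # Indices past the midpoint correspond to negative shifts.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)
```

The recovered shift would then serve as the starting translation ($m_2$, $m_5$) for the iterative minimization, which refines it to a full homography.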


Figure 2: A portion of the Bryce Canyon mosaic. Because of the large motions involved, we cannot use a single plane for the whole mosaic. Instead, we can select different tiles as base images.

Figure 3: Circular panoramic image mosaic example (office interior). A total of 36 images were pasted onto a cylindrical viewing surface.


Figure 4: Zoom sequence. The outlines show the extents of the four constituent images.

More formally, the 2-D transformation denoted by $M$ is related to the 3 × 3 viewing matrices $V$ and $V'$ and the inter-view rotation matrix $R$ by [13]

$$ M = V' R V^{-1} \tag{3} $$

(see Section 3 for definitions of $V$ and $R$). In the case of a completely calibrated camera, $M$ is a pure rotation matrix (only three unknowns). If the focal lengths in the two images are unknown, then these two parameters must also be estimated. In either case, we can register any two overlapping images using the same technique as for the planar mosaic case.

How do we represent a panoramic scene composited using our techniques? One approach is to divide the viewing sphere into several large, potentially overlapping regions, and to represent each region with a plane onto which we paste the images. Examples of such mosaics are given in [13]. Another approach is to compute the relative position of each frame relative to some base frame, and to periodically choose a new base frame for doing the alignment. We can then re-compute an arbitrary view on the fly from all visible pieces, given a particular rotation matrix $R$ and zoom factor $f$. This is the approach used to composite a large wide-angle mosaic of Bryce Canyon, as shown in Figure 2.

A third approach is to use a cylindrical viewing surface to represent the image mosaic [8]. In this approach, we map world coordinates $p = (x, y, z)$ onto 2-D cylindrical screen locations $u = (\theta, v)$, with $\theta = \tan^{-1}(x/z)$ and $v = y / \sqrt{x^2 + z^2}$. Figure 3 shows a complete circular panorama of an office unrolled onto a cylindrical surface.

2.3 Multiresolution Mosaics

The techniques described so far have used a single-resolution compositing surface to blend all of the images together. In many applications, we may wish to have spatially-varying amounts of resolution, e.g., for zooming in on areas of interest.
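Looking back at Section 2.2, Equation (3) and the cylindrical mapping can be checked numerically with a short NumPy sketch. Everything named here is an assumption for illustration: `viewing_matrix` uses a simple pinhole form with focal length `f` and principal point `(cx, cy)`, which is only one instance of the upper triangular viewing matrix, and `rotation_y` rotates about the vertical axis.

```python
import numpy as np

def viewing_matrix(f, cx=0.0, cy=0.0):
    """Assumed pinhole viewing matrix V (upper triangular)."""
    return np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

def rotation_y(angle):
    """Rotation by `angle` radians about the y (vertical) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rotation_homography(V, V_prime, R):
    """Inter-frame homography for a purely rotating camera, Eq. (3)."""
    return V_prime @ R @ np.linalg.inv(V)

def to_cylinder(p):
    """Map a 3-D point p = (x, y, z) to cylindrical coordinates (theta, v)."""
    x, y, z = p
    return np.arctan2(x, z), y / np.hypot(x, z)
```

Under these assumptions, rotating a point about the vertical axis shifts its $\theta$ by the rotation angle and leaves $v$ unchanged, which is why a rotating camera's images can be unrolled side by side on the cylinder.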
The modifications to the basic planar mosaic building algorithm are relatively straightforward, and affect only the image blending portion of the algorithm. To create the new composite mosaic, we weight each image by an amount proportional to the difference in scale from the desired view.

Figure 4 shows the result of compositing four images of a city scene taken from an office tower. These images were taken with a hand-held 35mm camera equipped with a zoom lens, and the resulting 4 × 6 photographs were scanned in at 300 dpi. The multiresolution mosaic has a 7:1 variation in original image scales. The video sequence seen by a user zooming in on the central feature of interest (the State House) shows an even wider range of scales. To zoom from an NTSC resolution wide-angle shot encompassing all four images down to a slightly magnified (4:1) version of the most detailed image involves a scaling of over 100:1.

3 Projective Depth Recovery

While mosaics of flat or panoramic scenes can be useful for some applications, other applications require the recovery of dense depth maps. When the camera motion is known, the problem of depth map recovery is called stereo reconstruction (or multi-frame stereo if more than two views are used). When the camera motion is unknown, we have the more difficult structure from motion problem [2, 15]. In this section, we present our solution to this latter problem based on recovering projective depth, which is particularly simple and robust and fits in well with the methods already developed in this paper.

To formulate the projective structure from motion recovery problem, we first write the perspective projection from world coordinates $p = (X, Y, Z, W)$ to screen coordinates $u = (x, y, w)$ as

$$ u = V [R \mid t] \, p, \tag{4} $$

where $V$ is the upper triangular viewing matrix, and $R$ and $t$ are the usual rotational and translational components of the camera motion [2]. Without loss of generality, we can set $R = I$ and $t = 0$ in the first frame. The world coordinates corresponding to an optical ray (in the first image) passing through $u$ are therefore

$$ p = \begin{bmatrix} V^{-1} u \\ d \end{bmatrix}, $$

where $d$ is the projective depth of the world point [15]. The coordinates corresponding to a pixel $u$ with projective depth $d$ in some other frame can therefore be written as

$$ u' = V' R V^{-1} u + d \, V' t = M u + d \, \tilde{t}, \tag{5} $$

i.e., as the summation of a planar projective transformation (homography) and a depth-dependent parallax motion (in the direction of the epipole). This formulation has formed the basis of both our projective structure from motion algorithms [15] and our projective dense depth estimation algorithm [14]. More recently, it has been used by other researchers under the names of affine depth [12] and planar parallax [10, 7] (see Section 4.2 for a more detailed discussion of projective depth).

The above formulation extends naturally to multiframe depth recovery by simply associating a separate $M_j$ and $\tilde{t}_j$ with each frame and minimizing the summed intensity error

$$ E = \sum_{j \neq 0} \sum_i \left[ I_j(x'_{ij}, y'_{ij}) - I_0(x_i, y_i) \right]^2 = \sum_{j \neq 0} \sum_i e_{ij}^2. \tag{6} $$

To recover the parameters in $M_j$ and $\tilde{t}_j$ for each frame along with the depth values $d_i$ (which are the same for all frames), we use the same Levenberg-Marquardt algorithm as before. Once the projective depth values are recovered, they can be used directly in viewpoint interpolation (using a new $M$ and $\tilde{t}$), or they can be converted to true Euclidean depth using at least 4 known depth measurements [2].
In more detail, we can write the projection equation (5) as

$$ x'_{ij} = \frac{m_0^{(j)} x_i + m_1^{(j)} y_i + t_0^{(j)} d_i + m_2^{(j)}}{m_6^{(j)} x_i + m_7^{(j)} y_i + t_2^{(j)} d_i + 1}, \qquad y'_{ij} = \frac{m_3^{(j)} x_i + m_4^{(j)} y_i + t_1^{(j)} d_i + m_5^{(j)}}{m_6^{(j)} x_i + m_7^{(j)} y_i + t_2^{(j)} d_i + 1}. \tag{7} $$

To estimate the unknown parameters, we alternate iterations of the Levenberg-Marquardt algorithm over the motion parameters $\{m_0^{(j)}, \ldots, t_2^{(j)}\}$ and the depth parameters $\{d_i\}$. In our current implementation, in order to reduce the total number of parameters being estimated, we represent the depth map using a tensor-product spline, and only recover the depth estimates at the spline control vertices (the complete depth map is available by interpolation) [14].

Figure 5 shows an example of using our projective depth recovery algorithm. The image sequence was taken by moving the camera up and over the scene of a table with stacks of papers (Figure 5a). The resulting depth map is shown in Figure 5b as intensity-coded range values.

3.1 View Interpolation

Once a dense depth map has been recovered for the scene, we can use this information to synthesize (interpolate or extrapolate) novel views [1, 11, 13]. When a Euclidean depth map is available, regular 3-D graphics can be used for the view synthesis [1]. In other situations, corresponding points must be found between the original views and the novel view in order to compute the required transformations [11], or the projective depth description must be converted to a Euclidean one [2]. A simpler approach, which often produces results of acceptable quality, is to simply re-scale the projective depths by an amount which yields a sensible 3-D scene when viewed from moderate viewing angles. This is the approach we used to generate the pictures in Figure 5. Figure 5c shows the original intensity image texture mapped onto the surface seen from a side viewpoint which is not part of the original sequence (an example of view extrapolation). Figure 5d shows a set of grid lines overlaid on the recovered surface to better judge its shape.
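The homography-plus-parallax form of Equations (5) and (7) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: `project_with_depth` and `motion_from_cameras` are hypothetical names, `M` is taken as a full 3 × 3 matrix (Equation (7) corresponds to normalizing its bottom-right entry to 1), and the same function could be used for view interpolation by supplying a new $M$ and $\tilde{t}$.

```python
import numpy as np

def project_with_depth(M, t_tilde, x, y, d):
    """Predict (x', y') in another frame from pixels (x, y) with projective depth d.

    Implements u' = M u + d * t_tilde followed by the perspective division.
    x, y, d are 1-D float arrays of equal length.
    """
    u = np.stack([x, y, np.ones_like(x)])      # homogeneous coordinates, shape (3, N)
    up = M @ u + np.outer(t_tilde, d)          # homography term plus parallax term
    return up[0] / up[2], up[1] / up[2]

def motion_from_cameras(V, V_prime, R, t):
    """Form M = V' R V^{-1} and t_tilde = V' t from (assumed known) camera parameters."""
    return V_prime @ R @ np.linalg.inv(V), V_prime @ t
```

Setting `d` to zero reduces the prediction to the pure homography of Section 2, which is one way to see that the projective depth measures displacement relative to a reference plane.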
4 Affine and Projective Structure from Motion

In the preceding section, we ignored the problem of local minima in the search space. Our experience has been that our direct intensity-based projective depth recovery algorithm converges to a good solution with only a small hint as to the camera translation direction (e.g., vertical for Figure 5). In some situations, however, it may be necessary to bootstrap the dense depth recovery algorithm by first estimating the camera motion using a feature-based structure from motion algorithm.

Traditional structure from motion algorithms attempt to recover a Euclidean reconstruction of the world [2]. More recent algorithms, motivated by the difficulty of obtaining metrically accurate 3-D reconstructions, have attacked the problem of recovering an affine [5, 16] or projective [3, 9, 11] description. The advantage of this approach is that it does not require camera calibration and can lead to more reliable estimates [3]. It may also be sufficient for many vision-based tasks such as re-projection and object recognition [11].

Our structure from motion algorithm [15] directly minimizes (using Levenberg-Marquardt) the squared difference between predicted and measured screen coordinates

$$ E = \sum_i \sum_j \sigma_{ij}^{-2} \left[ (u_{ij} - x'_{ij})^2 + (v_{ij} - y'_{ij})^2 \right], \tag{8} $$

where $(u_{ij}, v_{ij})$ is the screen location of the $i$th feature in the $j$th frame, and $(x'_{ij}, y'_{ij})$ are given by (7). Each measurement can be weighted by its inverse variance $\sigma_{ij}^{-2}$, which can be set to zero for missing measurements. Such a weighting leads to a statistically optimal (maximum likelihood) estimate of the unknown parameters.

Figure 5: Depth recovery example (table with stacks of papers): (a) input image, (b) intensity-coded depth map (dark is farther back), (c) texture-mapped surface seen from novel viewpoint, (d) gridded surface.

For our feature-based algorithm, we optimize over all frames (including frame zero), and the $x$ and $y$ coordinates of the 3-D points in (7) are also treated as unknowns. However, since we set $M_0 = I$ and $\tilde{t}_0 = 0$ as before, the $(x_i, y_i)$ values remain close to the screen positions measured in frame zero.

Unlike most other projective reconstruction algorithms, we do not choose a set of feature points as a projective basis. This allows the algorithm to work with tracks where features may disappear at any time, and avoids the sensitivity of the results to the choice of basis points.

4.1 Algorithm initialization

To initialize our non-linear least-squares algorithm, we have tried two approaches (a third approach to bootstrapping the algorithm, which we have not investigated, is to use fundamental matrices). The first is to simply set $(x_i, y_i, d_i) = (u_{i0}, v_{i0}, 0)$, $M_j = I$, and $\tilde{t}_j = 0$, i.e., to set the 3-D points to lie on a null plane, and to assume no motion. In our experiments, the algorithm usually converges in under a dozen iterations.

Our second approach, which yields much quicker results, is to first solve for the $M_j$ by computing a planar projective transformation, i.e., to fix $(x_i, y_i, d_i) = (u_{i0}, v_{i0}, 0)$ and to optimize (8). Then, a guess for the focus of expansion for each frame, which corresponds to $\tilde{t}_j$, can be computed by finding the dominant eigenvalue of the moment matrix of the residual vectors $(u_{ij} - x'_{ij}, v_{ij} - y'_{ij})$.
It turns out that in the orthographic case, i.e., for affine structure from motion (where the denominators in (7) are unity), this two step approach results in an exact solution (in the noise-free case), and is equivalent to singular value decomposition [16] but at a lower computational cost. For perspective projection, the planar motion computed by the first step may not correspond to the motion of an actual plane, but this will be corrected during the iterative minimization, which often converges in just a single step.
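The weighted error of Equation (8) and the first (null plane) initialization of Section 4.1 can be sketched as below. This is a minimal illustration, not the authors' implementation: the names `predict`, `reprojection_error`, and `null_plane_init` are hypothetical, observations are assumed to be stored as an array of shape (frames, features, 2), a zero inverse variance marks a missing measurement, and all features are assumed visible in frame zero. The Levenberg-Marquardt iterations that would follow are not shown.

```python
import numpy as np

def predict(M, t_tilde, pts):
    """Predicted screen positions from Eq. (7); pts is (N, 3) with rows (x_i, y_i, d_i)."""
    x, y, d = pts.T
    u = np.stack([x, y, np.ones_like(x)])
    up = M @ u + np.outer(t_tilde, d)
    return (up[:2] / up[2]).T                   # shape (N, 2)

def reprojection_error(Ms, ts, pts, obs, inv_var):
    """Weighted squared reprojection error of Eq. (8).

    obs: (J, N, 2) measured positions; inv_var: (J, N) weights, 0 = missing.
    """
    E = 0.0
    for M, t_tilde, uv, w in zip(Ms, ts, obs, inv_var):
        valid = w > 0                           # skip missing measurements entirely
        r = uv[valid] - predict(M, t_tilde, pts)[valid]
        E += np.sum(w[valid] * np.sum(r ** 2, axis=1))
    return float(E)

def null_plane_init(obs):
    """First initialization: points on the null plane (d_i = 0) and no motion."""
    J, N, _ = obs.shape
    pts = np.column_stack([obs[0], np.zeros(N)])
    Ms = [np.eye(3) for _ in range(J)]
    ts = [np.zeros(3) for _ in range(J)]
    return Ms, ts, pts
```

Because missing measurements are masked out rather than multiplied by zero, placeholder values such as NaN in `obs` do not contaminate the sum.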

4.2 Ambiguities in solution

The projective depths $d_i$ and motion descriptors $(M_j, \tilde{t}_j)$ have a four-parameter ambiguity associated with them, even after we have set $M_0 = I$ and $\tilde{t}_0 = 0$. One of these ambiguities corresponds to the scale ambiguity present in Euclidean structure from motion, i.e., we can scale $d_i$ and $\tilde{t}_j$ by reciprocal factors and still obtain the same solution (predicted feature positions). In a similar manner, we can add a multiple of $\tilde{t}_j$ to any column of $M_j$ and modify $d_i$ appropriately, which corresponds to adding a plane equation to the $d_i$. This three-parameter ambiguity corresponds to choosing the plane relative to which the projective depths $d_i$ are defined. Planar parallax techniques [10, 7] assume that this plane is the one with the dominant motion. Our structure from motion technique finds a plane which is close to a least-squares plane fit to the depths (see footnote 1).

5 Representations for complex scenes

The reconstruction of 3-D scenes using a projective framework raises some interesting questions about the representation of the scene. At the most primitive level, the output of a structure from motion algorithm may just be a collection of points and camera matrices. While this may be adequate for certain tasks such as navigation, it is not that useful for tasks such as view-based recognition or virtual reality.

The dense range maps available from multiframe stereo techniques are more interesting. They can be used to synthesize novel views using view interpolation [1], even in the absence of full metric information [11]. For true virtual environments, however, multiple depth maps must be combined into a richer structure, which may require segmentation.

Several alternatives exist for the representation of such environments. One possibility would be to reduce the world to a collection of (hopefully continuous) planar surfaces [17], which could then be texture mapped.
Another possibility is to have a collection of contiguous depth maps and images, which could then be rendered using either conventional graphics or multi-frame view interpolation [1]. The question of how to merge such multiple depth maps is an active research area. Such systems would also have to include multiresolution representations, at least if a large range of viewing positions or virtual camera settings were permitted. For true 3-D objects, however, volumetric or parametric 3-D models may be the best choice.

Footnote 1: We can enforce this constraint, if desired, by modifying the $d_i$ and $M_j$ after each iteration to maintain a zero bias in the $d_i$.

6 Applications

The reconstruction of visual scenes has many potential applications, including object recognition, model-based video compression, and the construction of highly detailed virtual environments. Our research has concentrated on this last class of applications.

In the simplest case, planar mosaics can be used for scanning whiteboards as an aid to videoconferencing or as an easy way to capture ideas. Scanning can produce images of much greater resolution than single wide-angle lens shots; the techniques developed in this paper enable any video camera attached to a computer to be used. Piecewise planar mosaics could also be used to model certain virtual environments, e.g., the aisles at your local supermarket.

Panoramic mosaics can have many applications, including tele-tourism (e.g., looking at the views from the Eiffel Tower or the rim of the Grand Canyon), education (tours of museums), and home sales (views of room interiors).

True walkthroughs of existing building or outdoor environments require the solution of a much more difficult problem, i.e., full 3-D reconstruction. They also require the rapid display of very complex scenes, for which view interpolation may be useful.

The ultimate in virtual reality systems is true telepresence, which composites video from multiple sources in real-time to create the illusion of being in a dynamic (and perhaps reactive) 3-D environment.
An example of such an application might be to view a 3-D version of a concert or sporting event with control over the camera shots, even being able to see the event from the players' point of view. Other examples might be to participate or consult in a surgery from a remote location (telemedicine), or to remotely participate in a virtual classroom.

7 Discussion and Open Questions

The recent interest in projective approaches to visual scene reconstruction and representation appears to be motivated by two main concerns. The first is a desire to avoid camera calibration. The second is a disappointment with the (metric) quality of the results available with Euclidean techniques.

The need for accurate camera calibration depends very much on the task at hand. For example, robot systems that handle or inspect parts benefit greatly from accurate calibration. On the other hand, visual servoing, which does not require precise calibration, is sufficient for robot systems that are capable of hybrid force and position control. Accurate calibration is necessary for veridical scene reconstruction, e.g., for virtual reality environments and games. Of course, computing a projective description first and then converting it to a Euclidean representation later through control points may be a reasonable approach.

We believe that the quality of Euclidean reconstructions must be examined in more detail, since its underlying problems can also plague projective reconstruction techniques. We see four main reasons why reconstruction techniques may not produce reliable results: a poor choice of technique, using an inappropriate representation, using too little data, and fundamental limitations on the achievable accuracies.

Traditionally, structure from motion algorithms have been developed using geometric arguments about points, lines, and planes, followed by a reduction to an algebraic formulation or series of estimation steps. The problem with this approach is that while geometric or algebraic constructs are correct in the noise-free case, there is no guarantee that they will produce reasonable estimates for noisy data.
Our approach has been to estimate the unknown structure and motion parameters using a non-linear least-squares minimization of the image plane measurement errors, which is statistically optimal for small Gaussian noise, and can be made robust against gross errors using robust statistics. Furthermore, this approach provides explicit measures of uncertainty in the estimates, which can be used to great advantage when processing sequences of data.

Carefully choosing the coordinate frame for the structure reconstruction, i.e., using an object-centered representation, can dramatically improve the quality of Euclidean reconstruction [16, 15]. This advantage is shared by many projective reconstruction techniques, which often choose the reconstruction plane to be located near the interesting structure.

Many structure from motion algorithms are also restricted to using only a few points or frames. Our estimation-theoretic approach encourages the use of as much redundant data as possible, and can easily accommodate missing or noisy estimates.

Finally, it is important to understand that structure from motion, and scene reconstruction in general, are fundamentally limited in accuracy by the quality of the feature tracks, regardless of the choice of algorithm and representation. Therefore, the importance of feature tracker accuracy cannot be overemphasized. Discarding unreliable feature tracks using robust statistics, as is the case in our structure from motion algorithm, will greatly improve the quality of the reconstructions.

A large number of open questions remain in this domain. In terms of efficiency, there is the question of the relative accuracy of recursive vs. batch estimation algorithms. Within this context, a better understanding of the structure of the uncertainty (covariance) in the estimates should improve the quality of recursive algorithms.

Another interesting question is whether the recovery of a projective scene description is a useful intermediate step in the process of recovering Euclidean structure. How reliable is such an approach compared with direct Euclidean estimation? Does it offer significant improvements in terms of speed?
What are the limitations on the accuracy of Euclidean reconstructions, and what kind of metric information is most useful when constructing such estimates?

To summarize, we have described our philosophy and our algorithms in the area of scene reconstruction from multiple views. In particular, we believe that the approach to scene reconstruction should be dictated by the task requirements, which is consistent with the notion of task-oriented vision. For example, for scene interpretation tasks where relative depths are used to qualitatively describe the spatial ordering of objects in the scene, recovery of projective depth is adequate. For applications such as virtual reality environment construction, a Euclidean (true or scaled) reconstruction is required. In either case, it is important to understand the nature of structure recovery errors both to optimize the algorithms we use and to understand the fundamental limitations of these techniques.

References

[1] S. Chen and L. Williams. View interpolation for image synthesis. Computer Graphics (SIGGRAPH 93), August.
[2] O. Faugeras. Three-dimensional computer vision: A geometric viewpoint. MIT Press, Cambridge, Massachusetts.
[3] O. D. Faugeras. What can be seen in three dimensions with an uncalibrated stereo rig? In Second European Conference on Computer Vision (ECCV 92), Santa Margherita Liguere, Italy, May. Springer-Verlag.
[4] M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. Burt. Real-time scene stabilization and mosaic construction. In IEEE Workshop on Applications of Computer Vision (WACV 94), pages 54-62, Sarasota, Florida, December.
[5] J. J. Koenderink and A. J. van Doorn. Affine structure from motion. Journal of the Optical Society of America A, 8.
[6] C. D. Kuglin and D. C. Hines. The phase correlation image alignment method. In IEEE 1975 Conference on Cybernetics and Society, New York, September.
[7] R. Kumar, P. Anandan, and K. Hanna. Direct recovery of shape from multiple views: A parallax based approach. In Twelfth International Conference on Pattern Recognition (ICPR 94), volume A, Jerusalem, Israel, October. IEEE Computer Society Press.
[8] S. Mann and R. W. Picard. Virtual bellows: Constructing high-quality images from video. In First IEEE International Conference on Image Processing (ICIP-94), volume I, Austin, Texas, November.
[9] R. Mohr, L. Veillon, and L. Quan. Relative 3D reconstruction using multiple uncalibrated images. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 93), New York, New York, June.
[10] H. S. Sawhney. 3D geometry from planar parallax. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 94), Seattle, Washington, June. IEEE Computer Society.
[11] A. Shashua. Projective depth: A geometric invariant for 3D reconstruction from two perspective/orthographic views and for visual recognition. In Fourth International Conference on Computer Vision (ICCV 93), Berlin, Germany, May. IEEE Computer Society Press.
[12] A. Shashua and N. Navab. Relative affine structure: Theory and applications to 3D reconstruction from perspective views. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 94), Seattle, Washington, June. IEEE Computer Society.
[13] R. Szeliski. Image mosaicing for tele-reality applications. In IEEE Workshop on Applications of Computer Vision (WACV 94), pages 44-53, Sarasota, Florida, December. IEEE Computer Society.
[14] R. Szeliski and J. Coughlan. Hierarchical spline-based image registration. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 94), Seattle, Washington, June. IEEE Computer Society.
[15] R. Szeliski and S. B. Kang. Recovering 3D shape and motion from image streams using nonlinear least squares. Journal of Visual Communication and Image Representation, 5(1):10-28, March.
[16] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9(2), November.
[17] J. Y. A. Wang and E. H. Adelson. Layered representation for motion analysis. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 93), New York, New York, June 1993.
