A Computer Vision Approach for 3D Reconstruction of Urban Environments


P. Mordohai 1, J.-M. Frahm 1, A. Akbarzadeh 2, B. Clipp 1, C. Engels 2, D. Gallup 1, P. Merrell 1, C. Salmi 1, S. Sinha 1, B. Talton 1, L. Wang 2, Q. Yang 2, H. Stewenius 2, H. Towles 1, G. Welch 1, R. Yang 2, M. Pollefeys 1 and D. Nistér 2

1 University of North Carolina, Chapel Hill, NC, USA
{mordohai, jmf, bclipp, gallup, pmerrell, salmi, taltonb, herman, welch, marc}@cs.unc.edu
2 University of Kentucky, Lexington, KY, USA
{amir, engels, qingxiong.yang, stewe}@vis.uky.edu, {lwangd, ryang, dnister}@cs.uky.edu

Abstract

We present an approach for the automatic reconstruction of detailed 3D models of outdoor scenes using computer vision techniques. Our system collects video, GPS and INS data, which are processed in real time to produce geo-registered 3D models that represent the geometry and appearance of the world. These models are generated without manual measurements or markers in the scene and can be used for visualization from arbitrary viewpoints and for the documentation and archiving of large areas. Furthermore, virtual objects and buildings can be inserted into them to generate previews of potential alterations to the environment. The GPS/INS measurements allow us to geo-register the pose of the camera at the time each frame was captured. Several frames showing the same part of the scene serve as input to the processing pipeline, which estimates the positions of points in the world from pixel correspondences and generates the 3D models. These models take the form of a triangular mesh and a set of images that provide texture for the mesh.

1. INTRODUCTION

Detailed 3D models of cities have typically been made from aerial data, in the form of range or passive images combined with other modalities, such as measurements from a Global Positioning System (GPS). While these models can be useful for navigation, they provide little additional information compared to maps in terms of visualization. Buildings and other landmarks cannot be easily recognized, since facades are poorly reconstructed from aerial images due to the highly oblique viewing angles. To achieve high-quality ground-level visualization, one needs to capture data from the ground. Recently, the acquisition of ground-level videos, primarily of cities, using cameras mounted on vehicles has been the focus of several research and commercial initiatives, such as Google Earth and Microsoft Virtual Earth, that aim at generating high-quality visualizations. Such visualizations can be in the form of panoramic mosaics [1,2], simple geometric models [3] or accurate and detailed 3D models such as the ones produced by our method. The construction of mosaics and simple geometric models is not very computationally intensive, but the resulting visualizations are either as simple as the employed models or limited to rotations around pre-specified viewpoints. For instance, the system of Cornelis et al. [3] reconstructs three planes, the ground and two facades parallel to the driving direction, for each frame. While it can provide effective visualization from the perspectives from which the videos were captured, the visual quality from other viewpoints is limited.

In this paper, we investigate the reconstruction of accurate 3D models from a large set of images, with emphasis on urban environments. These models offer considerably more flexible visualization in the form of walk-throughs and fly-overs. Moreover, due to the use of GPS information, the models are accurately geo-located and measurements with a precision of a few cm can be made on them. Examples of ground-level reconstructions from our system can be seen in Figure 1. The massive amounts of data needed to generate 3D models of complete cities pose significant challenges for both the data collection and processing systems, which we address in this paper.

Figure 1: Screenshots of reconstructed models

The video acquisition system consists of eight cameras mounted on a vehicle, with a quadruple of cameras looking to each side. The cameras have a resolution of 1024×768 pixels and a frame rate of 30 Hz. Additionally, the acquisition system employs an Inertial Navigation System (INS) and a GPS, which are synchronized with the cameras, to enable geo-registration of the cameras. As the vehicle is driven through urban environments, the captured video is stored on disk drives in the vehicle. After a capture session, the drives are moved from the vehicle to a computer cluster for processing. Processing entails estimating the vehicle's trajectory, recovering the camera poses at the time instances when frames were captured, reconstructing 3D points based on stereo correspondence and finally creating a model in the form of a polygonal mesh with associated texture maps. We show results on several video sequences that consist of up to hundreds of thousands of frames and also present a quantitative evaluation of the accuracy and completeness of our reconstructions by comparing them with an accurately surveyed model of a building.

This paper is organized as follows: the next section is an overview of related work; Section 3 presents the acquisition system; Section 4 discusses different pose estimation alternatives; Section 5 describes our 3D reconstruction pipeline; Section 6 shows experimental results; and Section 7 concludes the paper.

2. RELATED WORK

The research community has devoted a lot of effort to the modeling of man-made environments using a combination of sensors and modalities. Here, we briefly review work relying on ground-based imaging, since it is more closely related to our project. An equal, if not larger, volume of work exists for aerial imaging. The goal is the accurate reconstruction of urban or archaeological sites, including both geometry and texture, in order to obtain models useful for visualization, for quantitative analysis in the form of measurements at large or small scales, and potentially for studying their evolution through time. A natural choice to satisfy the requirement of modeling both geometry and appearance is the combined use of active range scanners and digital cameras. Stamos and Allen [4] used such a combination, while also addressing the problems of registering the two modalities, segmenting the data and fitting planes to the point cloud. El-Hakim et al. [5] propose a methodology for selecting the most appropriate modality among range scanners, ground and aerial images and CAD models. Früh and Zakhor [6] developed a system that is similar to ours, since it is also mounted on a vehicle and captures large amounts of data in continuous mode, in contrast to the previous approaches, which captured data from a few distinct viewpoints. Their system consists of two laser scanners, one for map construction and registration and one for geometry reconstruction, and a digital camera for texture acquisition. A system with a similar configuration, but smaller size, that also operates in continuous mode was presented by Biber et al. [7]. Laser scanners have the advantage of providing accurate 3D measurements directly. On the other hand, they can be cumbersome and expensive. Researchers in photogrammetry and computer vision address the problem of reconstruction relying solely on passive sensors (cameras) in order to increase the flexibility of the system while decreasing its size, weight and cost. The challenges are due mostly to the well-documented inaccuracies of 3D reconstruction from 2D measurements, a problem that has received a lot of attention in both computer vision and photogrammetry. Due to the size of the relevant literature, we refer interested readers to the books by Faugeras [8] and Hartley and Zisserman [9] for computer vision based treatments and to the Manual of Photogrammetry [10] for the photogrammetric approach. It should be noted that, while close-range photogrammetry solutions would be effective for 3D modeling of individual buildings, they are often interactive and would not efficiently scale to a complete city.

Among the first attempts at 3D reconstruction of large-scale environments using computer vision techniques was the MIT City Scanning project [11]. The goal of the project was the construction of panoramas from a large collection of images of the MIT campus. A semi-automatic approach, under which simple geometric primitives are fitted to the data, was proposed by Debevec et al. [12]. Compelling models can be reconstructed even though fine details are not modeled but treated as texture instead. Rother and Carlsson [13] show that multiple-view reconstruction can be formulated as a linear estimation problem given a known fixed plane that is visible in all images. Werner and Zisserman [14] presented an automatic method, inspired by [12], that fits planes and polyhedra to sparse reconstructed primitives by examining the support they receive via a modified version of the space sweep algorithm [15].

Research on large-scale urban modeling includes the 4D Atlanta project described by Schindler and Dellaert in [16], which also examines the evolution of the model through time. Recently, Cornelis et al. [3] presented a system for real-time modeling of cities that employs a very simple model for the geometry of the world. Specifically, they assume that the scenes consist of three planes: the ground and two facades orthogonal to it. A stereo algorithm that matches vertical lines across two calibrated cameras is used to reconstruct the facades, while objects that are not consistent with the facades or the ground are suppressed.

We approach the problem using passive sensors only, building upon the experience from the intensive study of structure from motion and shape reconstruction within the computer vision community [17,18,19,20,21]. The emphasis in our project is on the development of a fully automatic system that is able to operate in continuous mode without user intervention or the luxury of precise viewpoint selection, since capturing is performed from a moving vehicle and is thus constrained to the vantage points of the streets. Our system design is also driven by the performance goal of being able to process the large video datasets in a time at most equal to that of acquisition.

3. ACQUISITION SYSTEM AND CALIBRATION

In this section, we describe the hardware used in our data acquisition system, which records video, GPS and INS data to a set of hard drives. Each video frame and sensor measurement is associated with a timestamp, which serves to synchronize all data for further processing. We then give an overview of the calibration procedure that is employed to estimate the intrinsic parameters of the cameras as well as their position and orientation relative to the vehicle.

3.1 Data acquisition system

The on-vehicle video acquisition system consists of two main sub-systems: an eight-camera digital video recorder (DVR) and an Applanix GPS/INS (model POS LV) navigation system. The DVR streams the raw images to disk, and the Applanix system tracks the vehicle's position and orientation. The DVR is built with eight Point Grey Research (PGR) Flea color cameras, with one quadruple of cameras for each side of the vehicle. Each camera has a field of view of approximately 40°×30°, and within a quadruple the cameras are arranged with minimal overlap in field of view.

As shown in Figure 2, three cameras are mounted in the same plane to create a horizontal field of view of approximately 120 degrees. The fourth camera is tilted upward to create an overall vertical field of view of approximately 60 degrees together with the side-looking camera.

Figure 2: Left: A camera quadruple such as the ones used in our system. Right: A frame of each camera arranged according to the quadruple configuration.

The cameras are interfaced via IEEE-1394 connectors to eight Windows-Intel VME-based processor modules from Concurrent Technologies, each with two high-capacity SATA data drives. The eight-camera DVR is capable of streaming Bayer-pattern images to disk at 30 Hz. The CCD exposure on all cameras is synchronized using IEEE-1394 synchronization units from PGR. With each video frame recorded, one of the VME processor modules also records a GPS-timestamp message from the Applanix navigation system. These messages are synchronized to the CCD exposure by means of an external TTL signal output by one of the cameras. The GPS timestamps are used to recover the position and orientation of the vehicle at the corresponding time from the post-processed trajectory. Recovering camera positions and orientations, however, requires knowledge of the hand-eye calibration between the cameras and the GPS/INS system, which is established during system calibration.

3.2 Calibration

The first stage of calibration is the estimation of the intrinsic parameters of each camera, which determine the mapping of rays emanating from the optical center to pixels in the image. These parameters include the focal length, aspect ratio and principal point, as well as nonlinear parameters such as lens distortion.

All parameters are jointly estimated using a planar calibration object and applying Zhang's algorithm [22] as implemented by Ilie and Welch [23]. During video collection, we set the camera gains to the automatic setting to adjust for lighting variations. The changes in gain are handled by the reconstruction module described in Section 5.

We do not attempt to estimate the external relationships between the cameras, since there is little or no overlap in their fields of view. Instead, we estimate the calibration of the cameras relative to the GPS/INS coordinate system. This process is termed hand-eye or lever-arm calibration. The hand-eye transformation, which consists of a rotation and a translation between the camera coordinate system and the vehicle coordinate system (given in UTM coordinates by the GPS sensor), is computed by detecting the positions of a few surveyed points in a few frames taken from different viewpoints. The pose of the camera can be determined for each frame from the projections of the known 3D points, and the pose of the vehicle is given by the sensor measurements. The optimal linear transformation can be computed using this information. After the hand-eye transformation for each camera has been computed, all 3D positions and orientations of points, surfaces or cameras can be transformed to UTM coordinates. This is accomplished using the timestamps, which serve as an index into the vehicle's trajectory, to retrieve the vehicle's pose at the time instance an image was captured. We can then compute the pose of a particular camera by applying the appropriate hand-eye transformation to the vehicle pose.
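To make this composition concrete, the following sketch shows how a camera pose in UTM coordinates could be assembled from the two transforms described above. It is an illustration only: the function name and the use of 4×4 homogeneous matrices are our own choices, and the trajectory lookup by timestamp is assumed to happen elsewhere.

```python
import numpy as np

def camera_pose_utm(T_vehicle_to_utm: np.ndarray,
                    T_camera_to_vehicle: np.ndarray) -> np.ndarray:
    """Compose the timestamped vehicle pose with the hand-eye transform.

    T_vehicle_to_utm   : 4x4 rigid transform taking vehicle-frame points
                         to UTM coordinates (looked up in the post-processed
                         trajectory via the frame's GPS timestamp).
    T_camera_to_vehicle: 4x4 hand-eye transform estimated during
                         calibration for this particular camera.
    Returns the 4x4 camera-to-UTM pose.
    """
    return T_vehicle_to_utm @ T_camera_to_vehicle

# A reconstructed point X_cam (in camera coordinates) is then geo-registered
# with a single matrix-vector product:
#   X_utm_h = camera_pose_utm(T_v, T_he) @ np.append(X_cam, 1.0)
```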

4. POSE ESTIMATION

Accurate camera poses are critical for 3D reconstruction, during which the depth of each pixel is estimated. The reconstructed 3D points are placed on the ray going through the optical center and the corresponding pixel, at a distance from the optical center equal to the estimated depth. If the camera pose is not estimated correctly, these points will be misplaced and the resulting 3D model will be distorted or completely incomprehensible. In this section, we present three alternatives for estimating camera pose, with varying degrees of reliance on the GPS/INS measurements and vision-based tracking. It should be pointed out that, since the cameras have very little or no overlap, our system can be viewed as eight single-camera systems, unless a sophisticated multi-camera tracking module is employed.

The first option is to estimate camera poses relying solely on image data. This process is called structure from motion and has received a lot of attention in the computer vision literature [24,25,26,17,18,19,20]. Processing begins by automatically detecting 2D features in a frame and locating the corresponding features in subsequent frames. For this purpose, we use an implementation of the KLT tracking algorithm [27] on the Graphics Processing Unit (GPU) [28], which can achieve throughputs faster than 30 Hz. Since the intrinsic parameters of the cameras are known, we can use Nistér's five-point algorithm [19], which is very fast and robust to degenerate scene configurations. This option produces a trajectory for each camera which is Euclidean in the sense that angles and length ratios are preserved between the actual and estimated poses. A global rotation, translation and scaling transformation is necessary to align the camera trajectory and the models with the world. The inability to geo-register the model and determine its absolute size, along with the drift that typically occurs in vision-based tracking, are the limitations of this option. Its advantage is that it does not require the expensive INS sensor and thus reduces the cost of the system significantly.

The second option is to combine vision-based pose estimation with a low-end GPS sensor. Unlike vision-based tracking, GPS sensors are globally accurate, typically within 2-3m, but susceptible to short-term drift. The advantage of this option is that it combines two measurements with complementary characteristics: it can benefit from the small-scale consistency of the vision measurements and the large-scale consistency of the GPS data. This option produces geo-registered models of known size while keeping the cost of the system in a reasonable range.

The third option, which is the one we employ for most of the results shown in this paper, is to rely on a very accurate GPS/INS system to compute the trajectory. Our experiments have shown that the Applanix system can provide pose estimates that are accurate enough to be used as inputs for 3D reconstruction. Even though these estimates could be improved to some extent using vision-based measurements, we use this alternative because it enables our system to achieve real-time performance by bypassing vision-based pose estimation.
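As an illustration of the first, purely vision-based option, the sketch below chains KLT tracking with essential-matrix estimation using OpenCV. OpenCV's RANSAC-based findEssentialMat is built on a five-point minimal solver in the spirit of [19]; this is of course not the paper's GPU pipeline, and the feature-detection and RANSAC parameters are illustrative choices.

```python
import cv2
import numpy as np

def relative_pose(img0, img1, K):
    """Estimate relative camera motion between two calibrated frames.

    img0, img1 : consecutive BGR frames; K : 3x3 intrinsic matrix.
    Features are tracked from img0 to img1 with pyramidal KLT, then the
    essential matrix is estimated with RANSAC and decomposed into R, t.
    """
    g0 = cv2.cvtColor(img0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    pts0 = cv2.goodFeaturesToTrack(g0, maxCorners=1000,
                                   qualityLevel=0.01, minDistance=8)
    pts1, status, _ = cv2.calcOpticalFlowPyrLK(g0, g1, pts0, None)
    ok = status.ravel() == 1                      # keep successful tracks
    pts0, pts1 = pts0[ok], pts1[ok]
    E, inliers = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts0, pts1, K, mask=inliers)
    return R, t
```

Note that t is recovered only up to scale, which is exactly why the global rotation, translation and scaling transformation (or the GPS data) mentioned above is needed to geo-register the trajectory.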

5. 3D RECONSTRUCTION

The 3D reconstruction module consists of three parts: multiple-view stereo, which computes depthmaps given images and the corresponding camera poses; depthmap fusion, which merges the depthmaps; and model construction, which generates the 3D mesh and defines the mapping of texture onto the faces of the mesh. Note that the video sequence of each camera is processed separately here.

5.1 Multiple-view stereo

The multiple-view stereo module takes as input the camera poses and images from a single video stream and produces a depthmap for each frame. It uses a modified version of the plane-sweep algorithm of Collins [15], which is an efficient multi-image matching technique. Conceptually, a plane is swept through space in steps along a predefined direction, typically parallel to the optical axis. At each position of the plane, all views are projected onto it. If a point on the plane is at the correct depth, then, according to the brightness constancy assumption, all pixels that project onto it should have consistent intensities. We measure this consistency by summing the absolute intensity differences in square aggregation windows defined in the reference image, for which the depthmap is computed. The hypothesis with the minimum cost (sum of absolute differences) is selected as the depth estimate for each pixel. We opted for the plane-sweep algorithm because it can be easily parallelized and is thus amenable to a very fast GPU implementation, as shown by Yang and Pollefeys [29].

We have made several augmentations to the original algorithm to improve the quality of the reconstructions and to take advantage of the structure exhibited by man-made environments, which typically contain planar surfaces that are orthogonal to each other. We begin by identifying the scene's principal plane orientations, combining information from the trajectory of the vehicle as well as the distribution of the tracked features, if vision-based tracking is performed. Once these principal plane orientations are known, we can sweep a plane along each of them. Three orientations are typically sufficient to overcome the bias of a single sweep towards planes orthogonal to the direction of the sweep and to reconstruct slanted surfaces. As mentioned in Section 3, a key feature of our stereo module is that it compensates for global camera gain changes without impacting the real-time performance [30]. This is very important for outdoor applications, where the brightness range often far exceeds the dynamic range of the camera.
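The following sketch illustrates the core of a single fronto-parallel sweep with SAD costs and window aggregation. It is a minimal CPU version, not the paper's multi-direction GPU implementation; the pose convention (R, t mapping reference-camera coordinates into each other view) and the parameter values are assumptions made for the example.

```python
import cv2
import numpy as np

def plane_sweep_depthmap(ref, others, K, poses, depths, win=9):
    """Minimal fronto-parallel plane-sweep stereo (winner-take-all).

    ref    : float32 grayscale reference image.
    others : list of float32 grayscale images from nearby frames.
    K      : 3x3 intrinsic matrix; poses : list of (R, t) taking
             reference-camera coordinates into each other camera.
    depths : depth hypotheses (plane positions along the optical axis).
    """
    Kinv = np.linalg.inv(K)
    n = np.array([0.0, 0.0, 1.0])          # plane normal = optical axis
    h, w = ref.shape
    best_cost = np.full((h, w), np.inf, np.float32)
    depthmap = np.zeros((h, w), np.float32)
    for d in depths:
        cost = np.zeros((h, w), np.float32)
        for img, (R, t) in zip(others, poses):
            # Homography induced by the plane n.X = d (reference frame):
            # maps reference pixels to pixels of the other view.
            H = K @ (R + np.outer(t, n) / d) @ Kinv
            warped = cv2.warpPerspective(
                img, H, (w, h),
                flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
            cost += np.abs(ref - warped)   # absolute intensity differences
        cost = cv2.boxFilter(cost, -1, (win, win))  # window aggregation
        better = cost < best_cost                   # keep cheapest plane
        depthmap[better] = d
        best_cost[better] = cost[better]
    return depthmap
```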

The stereo module is invoked to produce a depthmap for, typically, every frame, using between 4 and 10 other images to evaluate the cost of each depth estimate. Because it is called this frequently, stereo consumes the largest portion of our computational resources. Nevertheless, it can achieve a throughput of more than 100 frames per second using a single high-end commodity GPU.

5.2 Depthmap fusion

Multiple-view stereo provides a depthmap for every frame. Since we do not enforce consistency between depthmaps during stereo, we need to enforce it in a separate depthmap fusion step. Fusion serves two purposes: it improves the quality of the depth estimates by ensuring that estimates for the same point are consistent across multiple depthmaps, and it produces more economical representations by merging multiple depth estimates into one. Related work includes volumetric [31] and viewpoint-based [32,33] approaches. We select a viewpoint-based approach, since a volumetric one would be impractical due to memory and processing requirements. Given depthmaps from a set of consecutive frames, the depthmap fusion step resolves conflicts between computed depths and produces a depthmap in which most of the noise has been removed. By dividing the processing into two stages we are able to achieve very high throughput, mostly because we are able to use a computationally cheap stereo algorithm and because both stages are suitable for hardware-accelerated GPU implementations.

Fusion begins by rendering all depthmaps onto the reference depthmap, which is typically the central one. Visibility reasoning is used to detect conflicts and determine which depth estimates are the most consistent. Our approach considers three types of visibility relations for the hypothesized depths in the reference view with respect to depth estimates originating from other views. These are illustrated in Figure 3. In view i, the observed surface passes behind A, since A′ is the depth observed along that ray in view i. There is a conflict between these two measurements, since view i would not be able to observe A′ if there were truly a surface at A. We say that A violates the free space of A′. The second relation occurs between B and B′, which are in close proximity of each other. We say that these depth estimates support each other.

The third relationship is illustrated by C and C′. C′, which is rendered from view i into the reference view, is in front of C. There is a conflict between these two measurements, since it would not be possible to observe C if there were truly a surface at C′. We say that C′ occludes C.

Figure 3: Illustration of visibility reasoning. A, B and C are depth estimates of the reference depthmap and A′, B′ and C′ are depth estimates of depthmap i rendered onto the reference depthmap.

We have developed a rigorous formulation based on the notion of stability of a depth estimate, which aims to determine the validity of a depth estimate by rendering multiple depthmaps into the reference view, as well as the reference depthmap into the other views, in order to detect occlusions and free-space violations. We have also developed an approximate alternative formulation that selects and validates only one hypothesis, based on the accumulated confidence of the most likely depth candidate. Both algorithms enable us to perform video-based reconstruction at tens of frames per second. Details have been omitted due to lack of space.
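A minimal sketch of the support/conflict reasoning described above is given below. It assumes the neighboring depthmaps have already been rendered into the reference view, and it uses a simple vote between supports and conflicts rather than the paper's stability or confidence formulations; the tolerance eps is an assumed parameter.

```python
import numpy as np

def fuse_depths(ref_depth, rendered, eps=0.02):
    """Approximate visibility-based depthmap fusion (a sketch).

    ref_depth : (H, W) depth hypotheses of the reference view.
    rendered  : (N, H, W) depthmaps of N nearby views rendered into the
                reference view (the rendering step is assumed done).
    A rendered estimate within a relative tolerance eps supports the
    hypothesis; one in front of it occludes it; one behind it means the
    hypothesis violates that view's free space. Pixels whose supports do
    not outnumber conflicts are discarded (set to 0); accepted depths are
    replaced by the mean of the hypothesis and its supporting estimates.
    """
    rel = (rendered - ref_depth) / np.maximum(ref_depth, 1e-6)
    valid = rendered > 0
    support = valid & (np.abs(rel) <= eps)
    occlusion = valid & (rel < -eps)   # rendered surface in front of hypothesis
    free_space = valid & (rel > eps)   # hypothesis blocks the other view's ray
    n_sup = support.sum(axis=0)
    n_con = occlusion.sum(axis=0) + free_space.sum(axis=0)
    keep = n_sup > n_con
    sup_sum = np.where(support, rendered, 0.0).sum(axis=0)
    return np.where(keep, (sup_sum + ref_depth) / (n_sup + 1), 0.0)
```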

5.3 Model construction

The depthmap fusion step generates a depth estimate for each pixel of the reference view, which is passed to the model construction module along with the image taken from that viewpoint. The output is a model in the form of a set of triangles in 3D whose faces are piecewise planar approximations of the scene surfaces. Each triangle is associated with a part of the image to add appearance to the geometry. Exploiting the fact that the image plane can serve as a reference for both geometry and appearance, we can construct the triangular mesh rapidly. We employ a multi-resolution quad-tree algorithm in order to minimize the number of triangles while maintaining geometric accuracy, as in [34]. The vertices of the model are constructed by placing points in 3D along the rays of the reference camera at distances equal to the fused depth estimates. We then split the image into fairly large quads and attempt to fit triangles to the vertices corresponding to the corners of the quads. If the triangles pass a planarity test they are kept; otherwise the quad is subdivided and the process is repeated at a finer resolution. A part of a multi-resolution mesh can be seen in Figure 4. The known correspondence between the vertices of the model and pixel positions in the images allows us to determine where each triangle projects, and thus which part of the image to use for its appearance. The model construction stage also suppresses the generation of duplicate representations of the same surface. This is necessary since one cannot predict a priori whether a surface is visible in more than one fused depthmap.

Figure 4: Left: Input image. Right: Illustration of multi-resolution triangular mesh
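The sketch below illustrates the quad-tree subdivision idea: a quad is kept when the fused depths inside it are well approximated by interpolating its corner depths, and is split otherwise. The specific planarity test (relative deviation from bilinearly interpolated corner depths) and the thresholds are our assumptions, not the system's exact criteria; each accepted quad would then be emitted as two textured triangles whose vertices are back-projected along the camera rays.

```python
import numpy as np

def quadtree_quads(depth, x0, y0, size, tol=0.05, min_size=4, out=None):
    """Recursive multi-resolution quad-tree meshing sketch (cf. [34]).

    depth : (H, W) fused depthmap; (x0, y0) top-left corner and size of
    the current quad (size assumed a power of two, quad within bounds).
    Returns a list of (x0, y0, size) quads that passed the planarity test.
    """
    if out is None:
        out = []
    block = depth[y0:y0 + size + 1, x0:x0 + size + 1]
    c = block[0, 0], block[0, -1], block[-1, 0], block[-1, -1]  # corners
    u = np.linspace(0.0, 1.0, block.shape[1])
    v = np.linspace(0.0, 1.0, block.shape[0])
    # Bilinear interpolation of the corner depths over the quad.
    plane = (c[0] * np.outer(1 - v, 1 - u) + c[1] * np.outer(1 - v, u)
             + c[2] * np.outer(v, 1 - u) + c[3] * np.outer(v, u))
    err = np.max(np.abs(block - plane) / np.maximum(block, 1e-6))
    if err <= tol or size <= min_size:
        out.append((x0, y0, size))     # planar enough: keep this quad
    else:
        half = size // 2               # otherwise recurse into 4 children
        for dy in (0, half):
            for dx in (0, half):
                quadtree_quads(depth, x0 + dx, y0 + dy, half,
                               tol, min_size, out)
    return out
```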

6. EXPERIMENTAL RESULTS

Our processing pipeline achieves very high frame rates by leveraging both the CPU and the GPU of a high-end personal computer. An implementation of our software targeted towards speed can generate 3D models at rates faster than those of video collection. This can be achieved with the following settings: GPS/INS-only pose estimation, 7 color images as input to the multiple-view stereo module, which evaluates 48 plane positions, and 11 depthmaps as input to the fusion stage. This configuration achieves a frame rate of more than 50 Hz on a PC with a dual-core AMD Opteron processor at 2.4 GHz and an NVidia GeForce 8800 series GPU. Screenshots of 3D models generated by our system can be seen in Figure 5.

Figure 5: Screenshots of 3D models from urban and sub-urban environments

6.1 Quantitative evaluation

To evaluate the ability of our method to reconstruct urban environments, a 3,000-frame video of the exterior of a Firestone store was captured using two cameras. One of the cameras was pointed horizontally towards the middle and bottom of the building, and the other was pointed up 30 degrees towards the building. The videos from each camera were processed separately. For the evaluation, the Firestone building was surveyed to an accuracy of 6mm and the reconstructed model was directly compared with the surveyed model (Figure 6). Several objects, such as parked cars, are visible in the video but were not surveyed. Several of the doors of the building were left open, causing some of the interior of the building to be reconstructed. Since accurate measurements of these objects were not available in the ground truth, the reconstructed objects were not considered in the evaluation.

For all of the remaining parts, the distance from each point to the nearest point on the ground truth model was calculated to measure the accuracy of the reconstruction. A visualization showing these distances is provided in Figure 6, in which blue, green and red denote errors of 0cm, 30cm and 60cm or above, respectively. A histogram of these distances is also shown in Figure 6. 83% of the points were reconstructed within 5cm of the ground truth model. (The building is 80m by 40m.) We also evaluated the completeness of our model by reversing the roles of the model and the ground truth. In this case, we measure the distance from each point in the ground truth model to the nearest point in the reconstructed model. Points in the ground truth that have been reconstructed poorly, or have not been reconstructed at all, result in large values for this distance. Figure 6 also shows a visualization of the completeness evaluation, using the same color scheme as the accuracy evaluation. Note that some parts of the building remain invisible during the entire sequence.
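The accuracy and completeness measures described above are symmetric nearest-neighbor distances, which can be sketched as follows for point sets (the actual evaluation compares against a surveyed mesh rather than a sampled point cloud).

```python
import numpy as np
from scipy.spatial import cKDTree

def accuracy_completeness(recon_pts, truth_pts, thresh=0.05):
    """Point-to-nearest-point evaluation of a reconstruction.

    recon_pts, truth_pts : (N, 3) and (M, 3) arrays of 3D points.
    Accuracy: distance from each reconstructed point to its nearest
    ground-truth point. Completeness: the same with the roles reversed.
    Also returns the fraction of reconstructed points within thresh
    (e.g. 0.05 m, matching the 5 cm figure quoted above).
    """
    acc = cKDTree(truth_pts).query(recon_pts)[0]   # accuracy distances
    comp = cKDTree(recon_pts).query(truth_pts)[0]  # completeness distances
    return acc, comp, float(np.mean(acc <= thresh))
```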

Figure 6: Reconstruction and evaluation results for the Firestone building. Top left: reconstructed model. Top right: surveyed model. Bottom left: accuracy evaluation (blue, green and red denote errors of 0cm, 30cm and 60cm or above, respectively). Bottom right: completeness evaluation (color coding same as bottom left).

7. CONCLUSIONS

We have described a system aimed at real-time, detailed, geo-registered 3D urban reconstruction from video captured by a multi-camera system and GPS/INS measurements. The quality of the reconstructions, with respect to both accuracy and completeness, is very satisfactory, especially considering the very high processing speeds we achieve.

The system can scale to very large environments, since it does not require markers or manual measurements, and it is capable of capturing and processing large amounts of data in real time by utilizing both the CPU and GPU. The models can be viewed from arbitrary viewpoints, thus providing more effective visualization than mosaics or panoramas. One can virtually walk around the model at ground level, fly over it in arbitrary directions, and zoom in or out. Since the models are metric, they enable the user to make accurate measurements by selecting points or surfaces on them. Parts of a model can be removed, or other models can be inserted to augment the existing model. The additions can be either reconstructed models from a different location or synthetic; effective visualizations of potential modifications to an area can be generated this way. Finally, our system can be used for archiving and change detection.

Future challenges include a scheme for better utilization of the multiple video streams, to improve the quality of pose estimation and to reconstruct each surface under the optimal viewing conditions while reducing the redundancy of the overall model. Along a different axis, we intend to investigate ways of decreasing the cost and size of our system, by further exploring the implementation that uses a low-end INS and GPS (or even only GPS) instead of the Applanix system, and by compressing the videos on the data acquisition platform.

Acknowledgements

This work has been supported in part by the DARPA UrbanScape program and the DTO VACE program.

References

1. Teller, S., Antone, M., Bodnar, Z., Bosse, M., Coorg, S., Jethwa, M. and Master, N., Calibrated, registered images of an extended urban area, Int. Journ. of Computer Vision, 2003, 53(1).
2. Román, A., Garg, G. and Levoy, M., Interactive design of multiperspective images for visualizing urban landscapes, in IEEE Visualization, 2004.
3. Cornelis, N., Cornelis, K. and Van Gool, L., Fast compact city modeling for navigation pre-visualization, in Int. Conf. on Computer Vision and Pattern Recognition, 2006.

4. Stamos, I. and Allen, P.K., Geometry and texture recovery of scenes of large scale, Computer Vision and Image Understanding, 2002, 88(2).
5. El-Hakim, S.F., Beraldin, J.-A., Picard, M. and Vettore, A., Effective 3D modeling of heritage sites, in 4th International Conference on 3D Imaging and Modeling, 2003.
6. Früh, C. and Zakhor, A., An automated method for large-scale, ground-based city model acquisition, Int. J. of Computer Vision, 2004, 60(1).
7. Biber, P., Fleck, S., Staneker, D., Wand, M. and Strasser, W., First experiences with a mobile platform for flexible 3D model acquisition in indoor and outdoor environments - the Waegele, in ISPRS Working Group V/4: 3D-ARCH.
8. Faugeras, O.D., Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press.
9. Hartley, R. and Zisserman, A., Multiple View Geometry in Computer Vision, Cambridge University Press.
10. American Society of Photogrammetry, Manual of Photogrammetry (5th edition), ASPRS Publications.
11. Teller, S., Automated urban model acquisition: Project rationale and status, in Image Understanding Workshop, 1998.
12. Debevec, P., Taylor, C.J. and Malik, J., Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach, in SIGGRAPH, 1996.
13. Rother, C. and Carlsson, S., Linear multi view reconstruction and camera recovery using a reference plane, Int. J. of Computer Vision, 2002, 49(2-3).
14. Werner, T. and Zisserman, A., New techniques for automated architectural reconstruction from photographs, in European Conf. on Computer Vision, 2002.

15. Collins, R.T., A space-sweep approach to true multi-image matching, in Int. Conf. on Computer Vision and Pattern Recognition, 1996.
16. Schindler, G. and Dellaert, F., Atlanta world: an expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments, in Int. Conf. on Computer Vision and Pattern Recognition, 2004.
17. Pollefeys, M., Koch, R. and Van Gool, L., Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters, Int. Journ. of Computer Vision, 1999, 32(1).
18. Nistér, D., Automatic dense reconstruction from uncalibrated video sequences, PhD Thesis, Royal Institute of Technology KTH, Stockholm, Sweden.
19. Nistér, D., An efficient solution to the five-point relative pose problem, IEEE Trans. on Pattern Analysis and Machine Intelligence, 2004, 26(6).
20. Pollefeys, M., Van Gool, L., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J. and Koch, R., Visual modeling with a hand-held camera, Int. Journ. of Computer Vision, 2004, 59(3).
21. Pollefeys, M., Van Gool, L., Vergauwen, M., Cornelis, K., Verbiest, F. and Tops, J., Image-based 3D recording for archaeological field work, Computer Graphics and Applications, 2003, 23(3).
22. Zhang, Z., A flexible new technique for camera calibration, IEEE Trans. on Pattern Analysis and Machine Intelligence, 2000, 22(11).
23. Ilie, A. and Welch, G., Ensuring color consistency across multiple cameras, in Int. Conf. on Computer Vision, 2005.
24. Faugeras, O., Luong, Q.-T. and Maybank, S., Camera self-calibration: theory and experiments, in European Conf. on Computer Vision, 1992.

25. Hartley, R., Euclidean reconstruction from uncalibrated views, in Applications of Invariance in Computer Vision, 1994.
26. Koch, R., Pollefeys, M., Heigl, B., Van Gool, L.J. and Niemann, H., Calibration of hand-held camera sequences for plenoptic modeling, in Int. Conf. on Computer Vision, 1999.
27. Lucas, B.D. and Kanade, T., An iterative image registration technique with an application to stereo vision, in Int. Joint Conf. on Artificial Intelligence, 1981.
28. Sinha, S., Frahm, J.-M., Pollefeys, M. and Genc, Y., Feature tracking and matching in video using programmable graphics hardware, Machine Vision and Applications.
29. Yang, R. and Pollefeys, M., Multi-resolution real-time stereo on commodity graphics hardware, in Int. Conf. on Computer Vision and Pattern Recognition, 2003.
30. Kim, S.J., Gallup, D., Frahm, J.-M., Akbarzadeh, A., Yang, Q., Yang, R., Nistér, D. and Pollefeys, M., Gain adaptive real-time stereo streaming, in Int. Conf. on Vision Systems.
31. Curless, B. and Levoy, M., A volumetric method for building complex models from range images, in SIGGRAPH, 1996.
32. Narayanan, P.J., Rander, P.W. and Kanade, T., Constructing virtual worlds using dense stereo, in Int. Conf. on Computer Vision, 1998.
33. Koch, R., Pollefeys, M. and Van Gool, L.J., Multi viewpoint stereo from uncalibrated video sequences, in European Conf. on Computer Vision, 1998.
34. Pajarola, R., Overview of quadtree-based terrain triangulation and visualization, Tech. Rep. UCI-ICS-02-01, Information & Computer Science, University of California, Irvine, 2002.


More information

Real-Time Video-Based Rendering from Multiple Cameras

Real-Time Video-Based Rendering from Multiple Cameras Real-Time Video-Based Rendering from Multiple Cameras Vincent Nozick Hideo Saito Graduate School of Science and Technology, Keio University, Japan E-mail: {nozick,saito}@ozawa.ics.keio.ac.jp Abstract In

More information

Using temporal seeding to constrain the disparity search range in stereo matching

Using temporal seeding to constrain the disparity search range in stereo matching Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department

More information

Lecture 10: Multi-view geometry

Lecture 10: Multi-view geometry Lecture 10: Multi-view geometry Professor Stanford Vision Lab 1 What we will learn today? Review for stereo vision Correspondence problem (Problem Set 2 (Q3)) Active stereo vision systems Structure from

More information

Morphable 3D-Mosaics: a Hybrid Framework for Photorealistic Walkthroughs of Large Natural Environments

Morphable 3D-Mosaics: a Hybrid Framework for Photorealistic Walkthroughs of Large Natural Environments Morphable 3D-Mosaics: a Hybrid Framework for Photorealistic Walkthroughs of Large Natural Environments Nikos Komodakis and Georgios Tziritas Computer Science Department, University of Crete E-mails: {komod,

More information

3-D Shape Reconstruction from Light Fields Using Voxel Back-Projection

3-D Shape Reconstruction from Light Fields Using Voxel Back-Projection 3-D Shape Reconstruction from Light Fields Using Voxel Back-Projection Peter Eisert, Eckehard Steinbach, and Bernd Girod Telecommunications Laboratory, University of Erlangen-Nuremberg Cauerstrasse 7,

More information

Lecture 10: Multi view geometry

Lecture 10: Multi view geometry Lecture 10: Multi view geometry Professor Fei Fei Li Stanford Vision Lab 1 What we will learn today? Stereo vision Correspondence problem (Problem Set 2 (Q3)) Active stereo vision systems Structure from

More information

A Statistical Consistency Check for the Space Carving Algorithm.

A Statistical Consistency Check for the Space Carving Algorithm. A Statistical Consistency Check for the Space Carving Algorithm. A. Broadhurst and R. Cipolla Dept. of Engineering, Univ. of Cambridge, Cambridge, CB2 1PZ aeb29 cipolla @eng.cam.ac.uk Abstract This paper

More information

Video Georegistration: Key Challenges. Steve Blask Harris Corporation GCSD Melbourne, FL 32934

Video Georegistration: Key Challenges. Steve Blask Harris Corporation GCSD Melbourne, FL 32934 Video Georegistration: Key Challenges Steve Blask sblask@harris.com Harris Corporation GCSD Melbourne, FL 32934 Definitions Registration: image to image alignment Find pixel-to-pixel correspondences between

More information

Data Acquisition, Leica Scan Station 2, Park Avenue and 70 th Street, NY

Data Acquisition, Leica Scan Station 2, Park Avenue and 70 th Street, NY Automated registration of 3D-range with 2D-color images: an overview 44 th Annual Conference on Information Sciences and Systems Invited Session: 3D Data Acquisition and Analysis March 19 th 2010 Ioannis

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Large Scale 3D Reconstruction by Structure from Motion

Large Scale 3D Reconstruction by Structure from Motion Large Scale 3D Reconstruction by Structure from Motion Devin Guillory Ziang Xie CS 331B 7 October 2013 Overview Rome wasn t built in a day Overview of SfM Building Rome in a Day Building Rome on a Cloudless

More information

Comparing Aerial Photogrammetry and 3D Laser Scanning Methods for Creating 3D Models of Complex Objects

Comparing Aerial Photogrammetry and 3D Laser Scanning Methods for Creating 3D Models of Complex Objects www.bentley.com Comparing Aerial Photogrammetry and 3D Laser Scanning Methods for Creating 3D Models of Complex Objects A Bentley White Paper Cyril Novel Senior Software Engineer, Bentley Systems Renaud

More information

Model Refinement from Planar Parallax

Model Refinement from Planar Parallax Model Refinement from Planar Parallax A. R. Dick R. Cipolla Department of Engineering, University of Cambridge, Cambridge, UK {ard28,cipolla}@eng.cam.ac.uk Abstract This paper presents a system for refining

More information

Stable Vision-Aided Navigation for Large-Area Augmented Reality

Stable Vision-Aided Navigation for Large-Area Augmented Reality Stable Vision-Aided Navigation for Large-Area Augmented Reality Taragay Oskiper, Han-Pang Chiu, Zhiwei Zhu Supun Samarasekera, Rakesh Teddy Kumar Vision and Robotics Laboratory SRI-International Sarnoff,

More information

DENSE 3D POINT CLOUD GENERATION FROM UAV IMAGES FROM IMAGE MATCHING AND GLOBAL OPTIMAZATION

DENSE 3D POINT CLOUD GENERATION FROM UAV IMAGES FROM IMAGE MATCHING AND GLOBAL OPTIMAZATION DENSE 3D POINT CLOUD GENERATION FROM UAV IMAGES FROM IMAGE MATCHING AND GLOBAL OPTIMAZATION S. Rhee a, T. Kim b * a 3DLabs Co. Ltd., 100 Inharo, Namgu, Incheon, Korea ahmkun@3dlabs.co.kr b Dept. of Geoinformatic

More information

Long-term motion estimation from images

Long-term motion estimation from images Long-term motion estimation from images Dennis Strelow 1 and Sanjiv Singh 2 1 Google, Mountain View, CA, strelow@google.com 2 Carnegie Mellon University, Pittsburgh, PA, ssingh@cmu.edu Summary. Cameras

More information

Mobile Point Fusion. Real-time 3d surface reconstruction out of depth images on a mobile platform

Mobile Point Fusion. Real-time 3d surface reconstruction out of depth images on a mobile platform Mobile Point Fusion Real-time 3d surface reconstruction out of depth images on a mobile platform Aaron Wetzler Presenting: Daniel Ben-Hoda Supervisors: Prof. Ron Kimmel Gal Kamar Yaron Honen Supported

More information

DERIVING PEDESTRIAN POSITIONS FROM UNCALIBRATED VIDEOS

DERIVING PEDESTRIAN POSITIONS FROM UNCALIBRATED VIDEOS DERIVING PEDESTRIAN POSITIONS FROM UNCALIBRATED VIDEOS Zoltan Koppanyi, Post-Doctoral Researcher Charles K. Toth, Research Professor The Ohio State University 2046 Neil Ave Mall, Bolz Hall Columbus, OH,

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

3D Modeling of Complex environments

3D Modeling of Complex environments SPIE Proceedings Vol 4309, Videometrics VII, San Jose, Jan 21-26, 2001 3D Modeling of Complex environments Sabry F. El-Hakim Visual Information Technology (VIT) Group Institute For Information Technology,

More information

Dense 3D Reconstruction from Autonomous Quadrocopters

Dense 3D Reconstruction from Autonomous Quadrocopters Dense 3D Reconstruction from Autonomous Quadrocopters Computer Science & Mathematics TU Munich Martin Oswald, Jakob Engel, Christian Kerl, Frank Steinbrücker, Jan Stühmer & Jürgen Sturm Autonomous Quadrocopters

More information

Inexpensive Construction of a 3D Face Model from Stereo Images

Inexpensive Construction of a 3D Face Model from Stereo Images Inexpensive Construction of a 3D Face Model from Stereo Images M. Shahriar Hossain, Monika Akbar and J. Denbigh Starkey Department of Computer Science, Montana State University, Bozeman, MT 59717, USA.

More information

3D Reconstruction from Scene Knowledge

3D Reconstruction from Scene Knowledge Multiple-View Reconstruction from Scene Knowledge 3D Reconstruction from Scene Knowledge SYMMETRY & MULTIPLE-VIEW GEOMETRY Fundamental types of symmetry Equivalent views Symmetry based reconstruction MUTIPLE-VIEW

More information

Capturing and View-Dependent Rendering of Billboard Models

Capturing and View-Dependent Rendering of Billboard Models Capturing and View-Dependent Rendering of Billboard Models Oliver Le, Anusheel Bhushan, Pablo Diaz-Gutierrez and M. Gopi Computer Graphics Lab University of California, Irvine Abstract. In this paper,

More information

Vehicle Dimensions Estimation Scheme Using AAM on Stereoscopic Video

Vehicle Dimensions Estimation Scheme Using AAM on Stereoscopic Video Workshop on Vehicle Retrieval in Surveillance (VRS) in conjunction with 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance Vehicle Dimensions Estimation Scheme Using

More information

DETECTION OF 3D POINTS ON MOVING OBJECTS FROM POINT CLOUD DATA FOR 3D MODELING OF OUTDOOR ENVIRONMENTS

DETECTION OF 3D POINTS ON MOVING OBJECTS FROM POINT CLOUD DATA FOR 3D MODELING OF OUTDOOR ENVIRONMENTS DETECTION OF 3D POINTS ON MOVING OBJECTS FROM POINT CLOUD DATA FOR 3D MODELING OF OUTDOOR ENVIRONMENTS Tsunetake Kanatani,, Hideyuki Kume, Takafumi Taketomi, Tomokazu Sato and Naokazu Yokoya Hyogo Prefectural

More information

Exterior Orientation Parameters

Exterior Orientation Parameters Exterior Orientation Parameters PERS 12/2001 pp 1321-1332 Karsten Jacobsen, Institute for Photogrammetry and GeoInformation, University of Hannover, Germany The georeference of any photogrammetric product

More information

3D Modeling of Outdoor Environments by Integrating Omnidirectional Range and Color Images

3D Modeling of Outdoor Environments by Integrating Omnidirectional Range and Color Images 3D Modeling of Outdoor Environments by Integrating Omnidirectional Range and Color Images Toshihiro ASAI, Masayuki KANBARA, Naokazu YOKOYA Nara Institute of Science and Technology 8916-5 Takayama Ikoma

More information

[Youn *, 5(11): November 2018] ISSN DOI /zenodo Impact Factor

[Youn *, 5(11): November 2018] ISSN DOI /zenodo Impact Factor GLOBAL JOURNAL OF ENGINEERING SCIENCE AND RESEARCHES AUTOMATIC EXTRACTING DEM FROM DSM WITH CONSECUTIVE MORPHOLOGICAL FILTERING Junhee Youn *1 & Tae-Hoon Kim 2 *1,2 Korea Institute of Civil Engineering

More information

Modeling, Combining, and Rendering Dynamic Real-World Events From Image Sequences

Modeling, Combining, and Rendering Dynamic Real-World Events From Image Sequences Modeling, Combining, and Rendering Dynamic Real-World Events From Image s Sundar Vedula, Peter Rander, Hideo Saito, and Takeo Kanade The Robotics Institute Carnegie Mellon University Abstract Virtualized

More information