Real-Time Video-Based Modeling and Rendering of 3D Scenes


Takeshi Naemura, Stanford University
Junji Tago and Hiroshi Harashima, University of Tokyo

Using densely arranged cameras, we generate a dynamic layered model of a 3D scene in real time, suppressing visible errors in the synthetic virtual views.

In research on 3D image communications and virtual reality, developing techniques for synthesizing arbitrary views has become an important technical issue. Given an object's structural model (such as a polygon or volume model), it's relatively easy to synthesize arbitrary views. Generating a structural model of an object, however, isn't necessarily easy. For this reason, research has been progressing on a technique called image-based modeling and rendering (IBMR) that avoids this problem. To date, researchers have performed studies on various IBMR techniques. (See the Related Work sidebar for more specific information.)

Our work targets 3D scenes in motion. In this article, we propose a method for view-dependent layered representation of 3D dynamic scenes. Using densely arranged cameras, we've developed a system that can perform processing in real time, from image pickup to interactive display, using video sequences instead of static images, at 10 frames per second (frames/sec). In our system, the images on the layers are view dependent, and we update both the shape and the image of each layer in real time. This lets us use the dynamic layers as the coarse structure of the dynamic 3D scenes, which improves the quality of the synthesized images. In this sense, our prototype system may be one of the first full real-time IBMR systems. Our experimental results show that this method is useful for interactive 3D rendering of real scenes.

System configuration

Our camera array system consists of 16 cameras and is versatile in arranging the cameras in several configurations. We designed our input system to capture 16 video sequences simultaneously and to suppress individual differences between the cameras. We use a real-time depth estimation system that can synthesize 30 depth maps per second from the input images.

Versatile camera array system

First, we constructed a versatile camera array system named Camera Array Pursuits for Plenoptic Acquisition (CAPPA). Figure 1a shows the modular camera head units we developed. By connecting several units, we can freely construct an array of NTSC color camera heads (Sony XC-333). For example, cameras can be densely arranged in a lattice (see Figure 1b) or connected in a row (see Figure 1c). Moreover, we can arrange cameras sparsely by inserting empty units between camera head units. We can also change lenses to capture light rays from several viewing angles.

The distance between neighboring cameras is about 31 mm. With an empty unit between two camera head units, the distance is 62 mm, approximately the distance between the left and right eyes, which is generally regarded as a good condition for stereoscopic imaging. In this article, we focus on the camera configuration in Figure 1b, by which we can capture light rays at 4 × 4 sampling points on a 93 mm × 93 mm square.

Figure 1. Versatile camera array system CAPPA, which consists of Sony XC-333 camera heads and (a) our original modular units. Cameras can be densely arranged in (b) a lattice or (c) a row.

Related Work

A key advantage of IBMR is that using actual images makes it possible to synthesize real images even when estimating object structure is difficult. A system that makes absolutely no use of an object's structural elements can therefore be thought of as the most illustrative example of IBMR [1]. It has a close relationship with 3D display technology as a method of reproducing light rays [2]. Nevertheless, researchers have proposed systems that use an object's structural elements in a partial or supplementary manner. These include techniques that interpolate between light-ray data using a rough structural model obtained from the object's silhouette [3] and techniques that synthesize images for virtual views by surrounding the object with cameras and generating structural models of even higher accuracy [4]. Techniques exist at various levels, from ones using no structural characteristics of the object at all to conventional techniques that depend on such structural characteristics.

An IBMR system can be roughly divided into two modules:

- One that performs data input and data generation, from image capture to data interpolation and structure estimation.
- One that synthesizes the desired images from the data obtained by the first module.

The first module requires more processing time as complexity increases. It also requires some technique to make transmitting and storing data more efficient as the number of input images increases. The second module, on the other hand, can synthesize images interactively in real time.

Most research has targeted static space. In other words, an image is displayed only after capturing an object from various viewpoints and processing the data obtained, which takes time. Even in research targeting dynamic space, in which scenes change with time, processing is applied only after the multiview video input has been recorded [4]. However, some research on data input and generation modules that takes real-time characteristics into account has appeared recently [5-8].

Naemura and Harashima [5] implemented a system that can simultaneously capture 4 × 4 multiview video sequences using five quad processors and a single video-capturing device (Onyx2 DIVO) at the cost of resolution. They realized a real-time method of interactively synthesizing arbitrary views of dynamic 3D scenes from 4 × 4 camera array input. The method is a kind of ray-based representation of visual cues [1] that we extend in this article.

Kunita et al. [6] developed a hardware system that adaptively extracts scan lines from videos captured by 12 cameras arranged in a row and composes them into an appropriate video sequence. While this approach avoids the resolution problem in Naemura and Harashima [5], it's limited to cameras arranged in a row.

Ooi et al. [7] designed a smart image sensor specialized for image-based rendering. Setting an array of lenses in front of the sensor, a multiview image is projected onto the sensor. Then, appropriate pixels are selected with the method Naemura and Harashima [5] describe, and a composed image is output directly from the sensor as a video sequence. Although this approach minimizes the processing delay, it still requires further improvements to enhance its resolution.

All these approaches use densely arranged cameras to capture numerous light rays. In contrast, Matusik et al. [8] proposed a method that uses sparsely arranged cameras and generates visual hulls, a better approximation of 3D dynamic scenes, in real time.
Alternatively, we use densely arranged cameras to generate a dynamic layered model of 3D scenes in real time, suppressing visible errors in the synthesized images.

References

1. M. Levoy and P. Hanrahan, Light Field Rendering, Computer Graphics (Proc. Siggraph 96), ACM Press, New York, 1996, pp. 31-42.
2. T. Naemura and H. Harashima, Ray-Based Approach to Integrated 3D Visual Communication, Three-Dimensional Video and Display: Devices and Systems, vol. CR76, SPIE Press, Bellingham, Wash., 2000, pp. 282-305.
3. S. Gortler et al., The Lumigraph, Computer Graphics (Proc. Siggraph 96), ACM Press, New York, 1996, pp. 43-54.
4. T. Kanade, P. Rander, and P.J. Narayanan, Virtualized Reality: Constructing Virtual Worlds from Real Scenes, IEEE MultiMedia, vol. 4, no. 1, Jan.-Mar. 1997, pp. 34-47.
5. T. Naemura and H. Harashima, Real-Time Video-Based Rendering for Augmented Spatial Communication, Proc. Visual Comm. and Image Processing (VCIP 99), vol. 3653, SPIE Press, Bellingham, Wash., 1999, pp. 620-631.
6. Y. Kunita et al., Real-Time Rendering System of Moving Objects, Proc. IEEE Workshop Multiview Modeling and Analysis of Visual Scenes (MVIEW 99), IEEE CS Press, Los Alamitos, Calif., 1999, pp. 81-88.
7. R. Ooi et al., Pixel Independent Random Access Image Sensor for Real-Time Image-Based Rendering System, Proc. IEEE Int'l Conf. Image Processing (ICIP 2001), vol. II, IEEE Signal Processing Soc., Piscataway, N.J., 2001, pp. 193-196.
8. W. Matusik et al., Image-Based Visual Hulls, Computer Graphics (Proc. Siggraph 2000), ACM Press, New York, 2000, pp. 369-374.

Simultaneous capture

Our system aims to process the multiview video input obtained from multiple cameras simultaneously. One method of achieving this is to install in a computer as many video-capturing boards as the number of cameras. However, this approach increases the load on the computer as the number of cameras increases, which isn't conducive to real-time processing.

To solve this problem, we adopted the system configuration in Figure 2. In Figure 2, the quad processor (Sony YS-Q430) is a piece of equipment that combines the video sequences from four cameras and outputs a video sequence divided into four screens (the blue lines in the figure). Video from the 16 cameras therefore consists of four four-screen sequences obtained through the use of four quad-processor units. A fifth quad processor combines these four sequences so that the video from the 16 cameras becomes a single 16-screen sequence. An SGI Onyx2 workstation (four 400-MHz R12000 processors) with a DIVO (digital video option) board captures and processes this single 16-screen sequence. When connecting quad-processor units in an L-level cascade in this way, one video board can accommodate 4^L cameras' worth of video in the computer. Here, however, the resolution of each video sequence drops, and the system design must consequently strike an appropriate balance between resolution and the number of cameras. In this study, L = 2, which means that the horizontal resolution of each contracted image becomes about 720/4 = 180 pixels.

Figure 2. System configuration. Blue lines indicate the simultaneous capturing path, and red lines indicate the real-time depth estimation path (camera array, quad processors, Onyx2 with DIVO, FZ930 depth estimation PC, TCP/IP link, and interactive display).

Real-time adjustment

Unless we can prepare extremely sophisticated hardware, there is always the possibility of a slight offset in each image's position. Typical reasons for this are differences between extrinsic camera parameters, such as camera position and direction; differences between intrinsic camera parameters, such as lens distortion and the relationship between the lens and the charge-coupled device (CCD) surface; and individual differences in the way that images are divided in the quad processors.

Taking all these effects into account would require nonlinear processing. However, considering our emphasis on real-time characteristics, we perform the correction through translation operations. This isn't enough in some cases, but it is quite effective within the limitation of real-time processing. We determine the amount of translation for each image. In particular, the amounts of translation in the horizontal and vertical directions for the image captured by camera C_k (k = 1, ..., 16) are denoted by o_x(k) and o_y(k), respectively. o_x(k) and o_y(k) are characteristic parameters of the capturing system; we determine them only once after constructing the system.

Figure 3a shows an example of a 16-screen combined video sequence. Here, we've selected corresponding points manually and connected them with white lines. We can make the points line up correctly by determining the appropriate amount of translation with the least squares method. Figure 3b shows the results of performing a translation on each image. The manual operations consist of setting the camera system's characteristic parameters, and we only need to perform them once after constructing the camera system.

Figure 3. Effect of real-time adjustment. (a) Input from the fifth quad processor. (b) Adjusted 16-screen image.
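The per-camera offsets reduce to a small least-squares problem. The sketch below (my own illustration, not the authors' code) estimates o_x(k) and o_y(k) from manually picked corresponding points; for a translation-only model the least-squares solution is simply the mean displacement toward a chosen reference camera. The data layout, function name, and reference-camera convention are assumptions made for illustration.

```python
import numpy as np

def estimate_translation_offsets(correspondences, ref_cam=0):
    """Estimate per-camera translation offsets (o_x, o_y).

    correspondences[k] is a list of point pairs for camera k; each pair
    holds (x, y) in the reference camera's sub-image and (x, y) in camera
    k's sub-image. For a translation-only model, the least-squares offset
    is the mean displacement that maps camera k onto the reference.
    """
    offsets = {}
    for k, pairs in correspondences.items():
        pairs = np.asarray(pairs, dtype=float)       # shape (N, 2, 2)
        ref_pts, cam_pts = pairs[:, 0, :], pairs[:, 1, :]
        o_x, o_y = (ref_pts - cam_pts).mean(axis=0)  # least-squares translation
        offsets[k] = (o_x, o_y)
    offsets[ref_cam] = (0.0, 0.0)                    # reference camera stays fixed
    return offsets

# Hypothetical usage: two hand-picked point pairs for camera 5.
corr = {5: [((90.0, 41.0), (87.5, 43.0)), ((120.0, 70.0), (117.0, 72.5))]}
print(estimate_translation_offsets(corr))
```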

Real-time depth estimation

We input nine of the 16 video sequences into the real-time depth estimation system Komatsu FZ930 (http://www7.airnet.ne.jp/komatsu/Stereo/stereoj/index.htm [in Japanese]; see the red lines in Figure 2). FZ930 is a commercially available peripheral component interconnect (PCI) board that estimates a depth map by applying stereo matching between the central image and the eight neighboring images. We use the full set of eight neighboring images to overcome occlusion problems that are difficult to handle with a single neighboring image. FZ930 can provide 30 depth maps (280 × 200 pixels, 8 bits) per second. We transfer each depth map to the Onyx2 workstation through 100-Mbps Ethernet. The Onyx2 workstation interactively synthesizes and displays views that correspond to the user's viewpoint from the 16 color video sequences and the estimated depth-map sequence.

Image synthesis basics

Now let's move on to how we synthesize arbitrary views from the light field captured by the camera array (see Figure 4). We want to synthesize a view I_v corresponding to the viewpoint P_v from the images I_i (i = 1, 2, 3) captured by cameras C_i (i = 1, 2, 3), respectively. We can assume that the light ray P_1 P_v that passes through the point P_1 will arrive at P_v as camera C_1 captures it. The problem is that we must interpolate the light rays that pass between cameras, such as P_1.5 P_v and P_2.5 P_v in Figure 4.

Blending images

For the purpose of interpolation, we approximate the shape of the object by a plane. For example, a point A_1.5 on the object plane is the cross point of the object plane and the light ray P_1.5 P_v. This point A_1.5 is captured by both cameras C_1 and C_2 as pixel data denoted by I_1(A_1.5) and I_2(A_1.5). Here, we consider the case where we can use the intrinsic parameters of the cameras C_1 and C_2 and the positions of P_1, P_2, and A_1.5 to read out the pixel value corresponding to A_1.5 from images I_1 and I_2. In this article, we use the average value of those pixels as the color data of the light ray P_1.5 P_v, that is, I_v(A_1.5). More generally, we calculate the color data of the light ray P_{1+a} P_v that passes through a point P_{1+a} (0 ≤ a ≤ 1) with this equation:

I_v(A_{1+a}) = (1 - a) I_1(A_{1+a}) + a I_2(A_{1+a})

In other words, to synthesize a view I_v at the viewpoint P_v, we blend the two images I_1 and I_2. This blending effect suppresses the discontinuities between stitched views.

Figure 4 shows how to blend more images. We use image I_2, captured by camera C_2, between I_v(A_1) and I_v(A_3). This image is more effective than the other images (I_1 and I_3) between I_v(A_1.5) and I_v(A_2.5). More generally, we use the image I_i between I_v(A_{i-1}) and I_v(A_{i+1}), but most effectively between I_v(A_{i-0.5}) and I_v(A_{i+0.5}). We can adopt the alpha-blending architecture to realize this bilinear interpolation in real time.

Figure 4. Blending several images to generate synthetic views. Light rays from the viewpoint P_v cross the camera plane (cameras C_1, C_2, C_3 at P_1, P_2, P_3) and the object plane (points A_1 through A_3); the blending weight for each camera falls linearly from 100 percent to 0 percent between neighboring cameras.

Results

Although our discussion so far has concentrated on the case where cameras are arranged in a row, we can easily extend this concept to cameras arranged in a lattice.

Figure 5 shows the experimental results of this method. By moving the object plane to the position of the object closer to the camera (a toy pig), we can synthesize a view I_v, as Figure 5b shows. We synthesized this image from portions of the 4 × 4 multiview images by using the alpha-blending technique. In Figure 5a, the red squares denote the areas that correspond to the ranges between I_v(A_{i-0.5}) and I_v(A_{i+0.5}) in our earlier discussion, and the white squares the ranges between I_v(A_{i-1}) and I_v(A_{i+1}). We can see visible errors in the background object (a clock). Figures 5c and 5d show the result when we adjust the object plane to the clock, which blurs the toy pig. These results demonstrate a limitation of the plane approximation method.

Our system takes into account that users can interactively move their viewpoint as well as the location and direction of the object plane. In other words, the focal plane is variable [1]. Processing for this interactive view synthesis of dynamic scenes was performed at 47 frames/sec.
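As a rough sketch of this blending step (my own illustration, not the authors' implementation), the function below computes the ray color I_v(A_{1+a}) by linearly weighting the two neighboring camera images that see the object-plane point; the image arrays, pixel-coordinate arguments, and parameter names are assumed for illustration, and the projection of A_{1+a} into each image is taken as already computed.

```python
import numpy as np

def blend_ray_color(img_left, img_right, px_left, px_right, a):
    """Color of the interpolated light ray P_{1+a} -> P_v  (0 <= a <= 1).

    img_left/img_right: images from the two neighboring cameras, shape (H, W, 3).
    px_left/px_right: integer pixel coordinates (x, y) where the object-plane
    point A_{1+a} projects in each image (assumed precomputed from the camera
    intrinsics and the object-plane depth).
    Implements I_v(A_{1+a}) = (1 - a) * I_1(A_{1+a}) + a * I_2(A_{1+a}).
    """
    c1 = img_left[px_left[1], px_left[0]].astype(float)
    c2 = img_right[px_right[1], px_right[0]].astype(float)
    return (1.0 - a) * c1 + a * c2

# Hypothetical usage with two dummy 180x120 camera images.
I1 = np.zeros((120, 180, 3), dtype=np.uint8)
I2 = np.full((120, 180, 3), 200, dtype=np.uint8)
print(blend_ray_color(I1, I2, (90, 60), (88, 60), a=0.5))   # -> [100. 100. 100.]
```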

Figure 5. Examples of the plane approximation method. (a) Input views used for synthesizing (b). (b) The toy pig in focus. (c) Input views used for synthesizing (d). (d) The clock in focus.

This rate is greater than the 30 frames/sec of the video input, which suggests the possibility of applying even more complex processing. Obviously, the single-plane approximation of scene objects isn't good enough to render higher quality synthetic images.

Dynamic layered representation

To improve the quality of view synthesis, we adopt the depth map provided by the FZ930 system to better approximate the shape of the scene objects. Figure 6 illustrates the flowchart of this method. FZ930 estimates a depth map corresponding to the view indicated by the white square, which is the center of the nine views indicated by the white dashed square in the figure's input view. In the depth map, brighter points indicate larger depth values, which here means the object is closer to the camera array system. Because 8-bit depth values represent the depth map, it would be possible to approximate the shape by 256 planes (layers). From the viewpoint of real-time processing, however, we limit the number of layers to three in this article. Each layer is illustrated in red, blue, or green in Figure 6.

View-dependent video layers

Let's represent the red, blue, and green layers as L_1, L_2, and L_3, respectively. Each layer has its own depth value. D_i (i = 1, 2, 3) denotes the depth value of layer L_i, and D_1 < D_2 < D_3. (L_2 is farther from the camera array than L_3 but closer than L_1.) For each layer L_i, we can synthesize an appropriate layer image T_i by using the method illustrated in Figure 5. It's important that the layer image T_i is view dependent in the sense that it varies according to the user's viewpoint [2]. Moreover, T_i is a kind of video sequence that is updated periodically by the video-based rendering method.

Figures 7a, 7b, and 7c show the results of synthesizing T_i. In the layer image T_1 synthesized for the background layer L_1, the background objects are in focus, but the others (a toy cat and a toy pig) are blurry. In the middle layer L_2, the toy cat is in focus, but the toy pig is still blurry and the background has visible errors in T_2. In T_3 for L_3, all the objects except the toy pig have visible errors. Every focused part of T_1, T_2, and T_3 should be stitched together to synthesize a better quality view of the scene. For this purpose, we need a kind of mask for each layer to superimpose or blend the layer images T_i.
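To make the layering step concrete, here is a small sketch (my own, under assumed names) that quantizes an 8-bit depth map into three layer masks around chosen layer depths D_1 < D_2 < D_3 by assigning each pixel to the nearest layer. This hard assignment is only a simplification; the system itself derives smoother alpha maps from a transfer function, described next.

```python
import numpy as np

def nearest_layer_masks(depth_map, layer_depths):
    """Assign each depth-map pixel to the nearest of the layer depths.

    depth_map: (H, W) uint8 array from the depth estimator (larger = closer).
    layer_depths: sequence such as (D1, D2, D3) with D1 < D2 < D3.
    Returns a list of boolean masks, one per layer. Hard assignment is a
    simplification of the article's smoother alpha-map blending.
    """
    d = depth_map.astype(float)[..., None]            # (H, W, 1)
    layers = np.asarray(layer_depths, dtype=float)    # (3,)
    nearest = np.abs(d - layers).argmin(axis=-1)      # (H, W) index of nearest layer
    return [nearest == i for i in range(len(layer_depths))]

# Hypothetical usage with a random 200x280 depth map and three layer depths.
depth = np.random.randint(0, 256, size=(200, 280), dtype=np.uint8)
masks = nearest_layer_masks(depth, (60, 140, 220))
print([m.mean() for m in masks])   # fraction of pixels assigned to each layer
```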

Figure 6. Flowchart of layered video-based rendering. The 16-view input and the FZ930 depth map feed, via a transfer function (Graph 1) and the user's viewpoint, into the view-dependent layer images T_1, T_2, T_3 (layers L_1, L_2, L_3) and the layered alpha maps α_1, α_2, α_3; Graph 2 plots each layer's blending ratio against depth (D_1, D_2, D_3), and the synthesized view is α_3 T_3 + (1 - α_3){α_2 T_2 + (1 - α_2) α_1 T_1}.

Dynamic layer integration

To synthesize arbitrary views of 3D scenes, a three-layer representation is obviously better than a single-plane approximation. However, it might not be enough to describe complex scenes. To suppress the effect of this simplification, we adopt the alpha-blending technique. This concept is illustrated in Graph 2 of Figure 6. It is straightforward to render an object whose depth is close to D_i simply by using the layer image T_i. The problem is how to render an object located between D_1 and D_2. For example, let's consider the case where the object is located at the depth value

D = a(D_1 + b) + (1 - a)(D_2 - b), where 0 ≤ a ≤ 1 and 0 ≤ b < (D_2 - D_1)/2.

We can render the view I_v of this object by blending T_1 and T_2 according to this equation:

I_v = a T_1 + (1 - a) T_2

This approach isn't accurate, but it is a useful way of suppressing visible errors. Graph 2 illustrates the relationship between the depth of an object and the blending ratio of each layer. We can set the parameter b interactively while viewing the synthesized images. In this article, b = (D_2 - D_1)/3.

Figure 7. Result of dynamic layered representation. (a) Background layer image T_1, (b) the toy cat layer image T_2, (c) the toy pig layer image T_3, and (d) the result of blending the layers.
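The blending ratio in Graph 2 can be read as a piecewise-linear function of depth. The sketch below is my own interpretation of that transfer function for an object lying between two layer depths, with the margin parameter b; the function name and the clamping behavior outside [D_1 + b, D_2 - b] are assumptions.

```python
def layer_blend_weight(depth, d1, d2, b):
    """Weight a of layer image T_1 for an object between layer depths d1 < d2.

    Follows D = a*(d1 + b) + (1 - a)*(d2 - b): objects at or before d1 + b use
    T_1 alone (a = 1), objects at or beyond d2 - b use T_2 alone (a = 0), and
    the weight falls off linearly in between. The rendered view is then
    I_v = a*T_1 + (1 - a)*T_2.
    """
    lo, hi = d1 + b, d2 - b
    if depth <= lo:
        return 1.0
    if depth >= hi:
        return 0.0
    return (hi - depth) / (hi - lo)

# Hypothetical usage with D1 = 60, D2 = 140 and b = (D2 - D1) / 3.
d1, d2 = 60.0, 140.0
b = (d2 - d1) / 3.0
print([round(layer_blend_weight(d, d1, d2, b), 2) for d in (60, 90, 100, 113, 140)])
```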

To implement this concept, we use the depth map generated by FZ930. We first synthesize filtered depth maps using the transfer function (Graph 1 in Figure 6). We regard these maps as alpha maps for blending the layers. Then, we translate and scale them so that they fit the layer images T_i according to the relationship between the user's viewpoint and each layer's position. The translated alpha maps are denoted as α_i in Figure 6. Note that α_i isn't a static mask for each layer but is dynamically updated. Finally, we render the synthetic view I_v according to this blending equation:

I_v = α_3 T_3 + (1 - α_3){α_2 T_2 + (1 - α_2) α_1 T_1}

Results

Figure 7d shows the final result. Most of the scene objects have better quality in Figure 7d than in Figures 7a, 7b, or 7c. Figure 8 shows the real scene captured and rendered in the experiment. We can see the relationship between the scene objects and find that their size in Figure 7d is appropriate. The reason the toy pig seems larger than the toy cat in the 16-view input in Figure 6 is that the camera array is close to the toy pig. The toy pig in Figure 7 isn't as large as in the 16-view input because we set the viewpoint behind the camera array, as shown in Figure 4, during the rendering process. This indicates that the perspective produced by our system is consistent with the virtual viewpoint's position.

Figure 8. The real scene captured for use in Figure 7. By comparing the images, we can see the relationship between the scene objects and the camera array.

Processing for this interactive view synthesis of dynamic scenes was performed at 10 frames/sec. This means that we achieved all of these processes in real time:

- capturing 16 video sequences;
- estimating the depth map from nine of the 16 video sequences;
- synthesizing three view-dependent layer images T_1, T_2, and T_3 from the 16 video sequences;
- synthesizing layered alpha maps α_1, α_2, and α_3 from the estimated depth map; and
- blending the three layers.

Users can move their viewpoint interactively to observe a 3D dynamic scene. Figure 9 illustrates the relationship between the number of layers and the processing speed (frames/sec). Even the 10-layer representation lets us observe the 3D dynamic scene interactively. However, we selected the three-layer representation for the experiment because we believe a real-time application should perform at more than 10 frames/sec.

Figure 9. A plot of the frame rate (frames per second) as a function of the number of layers (2 through 10).

Future work

One of the most important tasks for future work is to enhance the resolution of the synthetic images. For this purpose, it's possible to control the quad processors according to the user's viewpoint. This means that we should be able to adaptively input portions of the 16 video sequences into the computer. On the other hand, PC clusters might also be a practical solution for processing several views simultaneously. Moreover, we can utilize an optical system of integral photography and an HDTV camera to capture light fields [3]. We plan to keep investigating these approaches to find the most efficient way to capture and process multiple video sequences.

Another important area for future work is to change the camera array configuration by using our versatile camera head units. The 4 × 4 camera array in this article is just one possible configuration. We plan to utilize the units to realize a panoramic stereo video system [4].
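As a final illustration (again a sketch under assumed array shapes, not the system's code), the back-to-front compositing of the three view-dependent layers with their alpha maps can be written directly from the blending equation above.

```python
import numpy as np

def composite_layers(t1, t2, t3, a1, a2, a3):
    """Blend three view-dependent layer images into the synthetic view.

    t1, t2, t3: layer images of shape (H, W, 3), back (t1) to front (t3).
    a1, a2, a3: per-pixel alpha maps of shape (H, W, 1) in [0, 1].
    Implements I_v = a3*T3 + (1 - a3) * (a2*T2 + (1 - a2) * a1*T1).
    """
    back = a1 * t1                        # background contribution
    mid = a2 * t2 + (1.0 - a2) * back     # blend middle layer over it
    return a3 * t3 + (1.0 - a3) * mid     # blend front layer last

# Hypothetical usage with constant-color layers and uniform alpha maps.
h, w = 120, 180
t1 = np.full((h, w, 3), 50.0)
t2 = np.full((h, w, 3), 128.0)
t3 = np.full((h, w, 3), 255.0)
a1 = np.ones((h, w, 1))
a2 = np.full((h, w, 1), 0.5)
a3 = np.full((h, w, 1), 0.25)
print(composite_layers(t1, t2, t3, a1, a2, a3)[0, 0])   # blended RGB at one pixel
```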

Acknowledgments

We thank Takuya Nitta for his support in system development.

References

1. A. Isaksen, L. McMillan, and S.J. Gortler, Dynamically Reparameterized Light Fields, Computer Graphics (Proc. Siggraph 2000), ACM Press, New York, 2000, pp. 297-306.
2. P.E. Debevec, C.J. Taylor, and J. Malik, Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-Based Approach, Computer Graphics (Proc. Siggraph 96), ACM Press, New York, 1996, pp. 11-20.
3. T. Naemura, T. Yoshida, and H. Harashima, 3D Computer Graphics Based on Integral Photography, Optics Express, vol. 8, Feb. 2001, pp. 255-262, http://www.opticsexpress.org/oearchive/source/30085.htm.
4. T. Naemura, M. Kaneko, and H. Harashima, Multi-User Immersive Stereo, Proc. IEEE Int'l Conf. Image Processing (ICIP 98), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 903-907.

Takeshi Naemura is a visiting assistant professor in computer graphics at Stanford University, supported by the Japan Society for the Promotion of Science (JSPS) Postdoctoral Fellowships for Research Abroad. His research interests include image-based rendering and augmented spatial communications. He has a BE, ME, and PhD in electronic engineering from the University of Tokyo. He is an IEEE member.

Junji Tago is a graduate student in the Department of Electronic Engineering at the University of Tokyo. His research interests include image-based rendering and multimodal dialog systems. He has a BE in information and communication engineering from the University of Tokyo.

Hiroshi Harashima is a professor in the Interfaculty Initiative in Information Studies at the University of Tokyo. His research interests include communication theory, image coding and processing, and human communication engineering. He has a BE, ME, and PhD in electrical engineering from the University of Tokyo. He is the president of the Virtual Reality Society of Japan. He is an IEEE member.

Readers may contact Takeshi Naemura at Stanford Univ., Gates Computer Science Building 3B-390, Stanford, CA 94305-9035, email naemura@graphics.stanford.edu.