Free-viewpoint video renderer


J. Starck, J. Kilner, and A. Hilton
Centre for Vision, Speech and Signal Processing, University of Surrey, UK

Abstract. Free-viewpoint video allows interactive control of the camera viewpoint in video playback. This paper describes a state-of-the-art technique to render free-viewpoint video on the GPU. The algorithm requires video streams from a set of fixed or dynamic real-world video cameras. Arbitrary viewpoints are synthesised from the video streams using a three-dimensional (3D) proxy for the world. The source code for the render system is made available as a resource to the computer graphics and computer vision community. This provides (i) the facility to simulate camera viewpoints from publicly available multiple-view video datasets and (ii) a baseline technique for free-viewpoint video synthesis in the development of interactive 3D video and 3DTV applications.

1. Introduction

In traditional video and film, events are recorded from a single fixed viewpoint. This confines the viewer to the fixed linear format dictated by the director and a flat two-dimensional (2D) viewing experience. Free-viewpoint video breaks this restriction by providing three-dimensional (3D) content with interactive control of the viewpoint during visualisation. Application areas range from on-line visualisation for mixed-reality environments [Allard et al. 06] and communications [Gross et al. 03] to production and pre-visualisation in television [Grau et al. 03], games [Starck and Hilton 07] and 3DTV [Matusik and Pfister 04].

Free-viewpoint video is synthesised using video streams from a set of real-world cameras that record a scene from different viewpoints.

A novel view is rendered using a 3D proxy for the scene geometry. The proxy is rendered to a virtual viewpoint and surface texture is sampled from adjacent camera images, as illustrated in Figure 1. This approach, termed view-dependent rendering [Debevec et al. 98, Buehler et al. 01], can provide highly realistic digital images simply by resampling real-world content.

A technique is presented to synthesise free-viewpoint video from multiple-view video streams together with a time-varying geometric proxy for a scene. The technique is implemented on the GPU for video-rate view synthesis. The render system is based on developments in free-viewpoint video production of people [Starck and Hilton 07, Starck et al. 07] and incorporates state-of-the-art techniques in view-dependent rendering [Debevec et al. 96, Pulli et al. 97, Buehler et al. 01, Raskar and Low 02, Starck and Hilton 05, Eisemann et al. 08]. The source code for the renderer is released as an open source project. This provides a complete application for free-viewpoint video synthesis as a resource to the computer vision and computer graphics community. The software provides the following specific contributions.

1. A tool to synthesise camera viewpoints from publicly available multiple-view video data for the development of 3D video production technology.

2. The source for a state-of-the-art render technique as a baseline for the development of interactive 3D video techniques.

2. Free-viewpoint video

The synthesis of visually realistic digital images is a central goal in computer graphics. Research to date has seen a convergence of computer vision and computer graphics techniques to synthesise highly realistic digital content directly from video images. Research has focussed on the multiple-camera acquisition systems and the computer vision algorithms required to recover 3D scene geometry and perform virtual view synthesis, either in real time or as an off-line post-process [Starck et al. 07]. Recent advances have exploited image-based reconstruction and image-based rendering to produce free-viewpoint video at a quality comparable to captured video [Zitnick et al. 04], a process termed image-based modelling and rendering (IBMR).

Image-based reconstruction deals with the problem of deriving scene geometry from the appearance sampled in camera images. The Virtualized Reality system [Kanade et al. 97] first used 51 cameras distributed over a 5m dome to capture and visualise the performance of an actor in a studio. Real-time systems for mixed-reality applications have since been developed using geometry from image silhouettes [Grau et al. 03, Allard et al. 06]. Off-line systems [Vedula et al. 05, Starck and Hilton 07] combine multiple image cues to recover accurate scene representations for view synthesis.

Figure 1. Image-based rendering: free-viewpoint visualisation is achieved by rendering a 3D scene model to a virtual viewpoint with the appearance sampled from adjacent camera images.

Image-based rendering is the process of synthesising novel views from camera images. Light-field techniques [Levoy and Hanrahan 96] perform view synthesis by directly resampling camera images, independent of scene geometry. This approach requires dense camera samples to avoid artefacts when interpolating between views and has been applied in 3DTV applications [Matusik and Pfister 04] where the viewpoint is restricted. Scene geometry is used to provide the correspondence in sparse camera sets for image-based rendering [Debevec et al. 96]. Buehler et al. [Buehler et al. 01] provide a unified framework that extends light-field rendering to incorporate a geometric representation for virtual view synthesis.

In free-viewpoint video a 3D proxy for a scene is rendered with a view-dependent appearance derived from real-world images. The underlying problem in rendering is to composite a virtual viewpoint by blending the appearance sampled from different camera viewpoints.

Debevec et al. [Debevec et al. 96] introduced the concept of view-dependent texturing, in which camera images are used as a set of view-dependent textures in rendering. Pulli et al. [Pulli et al. 97] applied view-dependent rendering to composite appearance and geometry from multiple viewpoints at the point of view synthesis. Buehler et al. [Buehler et al. 01] define a camera blend field to composite texture from camera images. Raskar and Low [Raskar and Low 02] compute global visibility constraints to feather image blending at depth discontinuities. Starck and Hilton [Starck and Hilton 05] pre-compute view-dependent shape and visibility for free-viewpoint video synthesis. Eisemann et al. [Eisemann et al. 08] synthesise free-viewpoint video in real time on the GPU. These approaches are combined here to present a single state-of-the-art render system.

3. Rendering Algorithm

3.1. Overview

The input to the render technique is a set of real-world video images and a 3D proxy for the scene geometry, together with the camera calibration defining the projective transformation from the scene coordinate system to each camera. Camera images are denoted I_c, c = 1, ..., N, where N is the total number of cameras, and the surface of the scene is denoted S. For simplicity we consider only a single frame of the time-varying data. Points in the scene x ∈ R^3 project to homogeneous image coordinates u = P_c x using the camera projection matrix P_c. View synthesis entails rendering the surface S to a virtual viewpoint with a projective transformation P̂. The centres of projection of the real-world cameras are denoted o_c and that of the virtual viewpoint ô.

A virtual view is synthesised by compositing the appearance sampled from the cameras closest to the virtual camera. At the point of view synthesis, the algorithm first selects a subset of cameras to use in rendering. Surface visibility is then computed in each camera to prevent sampling appearance in the presence of occlusion. A camera blend field is then derived to define the relative contribution of each camera at each pixel in the virtual view. The scene is finally rendered using the camera images as view-dependent textures, compositing appearance using the camera blend fields. Successive stages in rendering are outlined in Figure 2.
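
As a concrete illustration of the notation above, the following minimal C++ sketch projects a scene point x into a camera using a 3x4 projection matrix P_c and converts the homogeneous result to pixel coordinates. The types and function names are illustrative only, not taken from the released renderer.

    #include <array>

    struct Vec2 { double x, y; };
    struct Vec3 { double x, y, z; };

    // 3x4 camera projection matrix P_c (row-major), mapping scene points to
    // homogeneous image coordinates: u = P_c * [x y z 1]^T.
    using ProjectionMatrix = std::array<std::array<double, 4>, 3>;

    // Project a scene point into camera c and dehomogenise to pixel coordinates.
    Vec2 project(const ProjectionMatrix& P, const Vec3& x) {
        const double u = P[0][0]*x.x + P[0][1]*x.y + P[0][2]*x.z + P[0][3];
        const double v = P[1][0]*x.x + P[1][1]*x.y + P[1][2]*x.z + P[1][3];
        const double w = P[2][0]*x.x + P[2][1]*x.y + P[2][2]*x.z + P[2][3];
        return { u / w, v / w };  // assumes w != 0 (point in front of the camera)
    }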

Figure 2. Overview of the free-viewpoint video rendering technique.

3.2. Camera selection

A subset of M < N cameras is selected for rendering according to proximity to the virtual viewpoint ô. Typically only two or three cameras are used [Eisemann et al. 08], although with complex self-occlusions in a scene and irregular sampling of the scene appearance across camera viewpoints a larger number of cameras can be required. Cameras are selected according to proximity in viewing direction. Given the centroid of the scene x_0, cameras are selected to minimise the angular difference between the virtual viewing direction (x_0 - ô) and the camera viewing direction (x_0 - o_c), as depicted in Figure 3.
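
The camera selection step can be sketched as follows. This is a minimal C++ illustration assuming simple vector helpers; names and data layout are assumptions rather than the released source.

    #include <algorithm>
    #include <cmath>
    #include <utility>
    #include <vector>

    struct Vec3 { double x, y, z; };

    static Vec3 sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
    static double dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
    static double norm(const Vec3& a) { return std::sqrt(dot(a, a)); }

    // Select the M cameras whose viewing direction towards the scene centroid x0
    // is closest in angle to the virtual viewing direction (x0 - o_hat).
    std::vector<int> selectCameras(const std::vector<Vec3>& cameraCentres,  // o_c
                                   const Vec3& virtualCentre,               // o_hat
                                   const Vec3& sceneCentroid,               // x0
                                   std::size_t M) {
        const Vec3 dv = sub(sceneCentroid, virtualCentre);
        std::vector<std::pair<double, int>> angles;
        for (std::size_t c = 0; c < cameraCentres.size(); ++c) {
            const Vec3 dc = sub(sceneCentroid, cameraCentres[c]);
            const double cosAngle = dot(dv, dc) / (norm(dv) * norm(dc));
            angles.push_back({ std::acos(std::clamp(cosAngle, -1.0, 1.0)),
                               static_cast<int>(c) });
        }
        std::sort(angles.begin(), angles.end());  // smallest angular difference first
        std::vector<int> selected;
        for (std::size_t i = 0; i < std::min(M, angles.size()); ++i)
            selected.push_back(angles[i].second);
        return selected;
    }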

Figure 3. Camera selection according to proximity in viewing direction.

3.3. Camera visibility

Surface visibility is computed using a soft z-buffer technique [Pulli et al. 97], also termed an ε-z-buffer [Pajarola et al. 04]. The surface S is rendered to a depth buffer D_c for each selected camera c ∈ M. A fixed offset is applied in rendering to prevent z-fighting in subsequent depth tests. A conservative visibility test is required, as the scene proxy S is often inexact and errors in visibility at occlusion boundaries cause incorrect sampling of appearance across the scene surface [Starck and Hilton 05, Eisemann et al. 08]. The surface is therefore extended at occlusion boundaries, as proposed by Carranza et al. [Carranza et al. 03] and Eisemann et al. [Eisemann et al. 08]: the surface S is rendered multiple times to the depth buffer D_c with a fixed offset such that back-faces of the mesh extend the occlusion boundaries. An expected reprojection error e_c is defined and the surface is rendered up to the expected error e_c in each camera image.
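
In effect, the visibility test applied later when sampling a camera reduces to a depth comparison with a tolerance. The C++ sketch below shows that per-sample logic; the parameterisation and tolerance value are illustrative assumptions, not the paper's exact implementation.

    // Soft (epsilon) z-buffer visibility test for one surface sample in camera c.
    // 'bufferDepth' is the depth stored in D_c at the sample's image position
    // (rendered with a fixed offset), 'sampleDepth' is the depth of the surface
    // point as seen from camera c, and 'epsilon' absorbs proxy and calibration error.
    bool visibleInCamera(double sampleDepth, double bufferDepth, double epsilon) {
        // The sample is treated as visible if it is not significantly further
        // from the camera than the closest rendered surface at that pixel.
        return sampleDepth <= bufferDepth + epsilon;
    }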

Figure 4. Camera visibility: a conservative depth test is provided by rendering an ε-z-buffer and displacing back-face occluders to extend occlusion boundaries.

3.4. Camera blend fields

A blend field W_c is constructed for each camera in the virtual view to define its relative contribution in view-dependent texturing. The blend field consists of three components, as proposed by Pulli et al. [Pulli et al. 97]. The first component defines the surface visibility in the camera viewpoint; a binary weight is derived using a depth test against D_c. The second component defines the surface sampling density, so that the surface appearance is derived from the camera images with the greatest sampling rate. The third component defines the view-dependent contribution of the camera according to its proximity to the virtual viewpoint. An angular weighting scheme is adopted [Pulli et al. 97, Buehler et al. 01] in which the weights are derived from the cosine of the angle subtended at the surface, as illustrated in Figure 5.

The weight maps W_c, c ∈ M, define the blend fields used to composite the appearance from the camera images in the virtual view. With inexact surface geometry S for the scene, appearance will not necessarily be sampled consistently across camera viewpoints. Only a subset of B < M appearance samples is therefore blended at each output pixel.
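
A per-pixel weight combining the three components might be computed as in the C++ sketch below. The text does not give a formula for the sampling-density term, so it is passed in here as a precomputed factor; the binary visibility and cosine terms follow the description above. Function and parameter names are illustrative assumptions.

    #include <algorithm>
    #include <cmath>

    struct Vec3 { double x, y, z; };

    static Vec3 sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
    static double dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
    static double norm(const Vec3& a) { return std::sqrt(dot(a, a)); }

    // Blend weight for camera c at one surface point seen in the virtual view.
    //   visible         : result of the epsilon depth test against D_c (component 1)
    //   samplingDensity : relative sampling rate of the surface in camera c (component 2)
    //   surfacePoint    : 3D position of the rendered surface sample
    //   cameraCentre    : centre of projection o_c
    //   virtualCentre   : centre of projection of the virtual camera, o_hat
    double blendWeight(bool visible, double samplingDensity,
                       const Vec3& surfacePoint, const Vec3& cameraCentre,
                       const Vec3& virtualCentre) {
        if (!visible) return 0.0;
        const Vec3 toCamera  = sub(cameraCentre,  surfacePoint);
        const Vec3 toVirtual = sub(virtualCentre, surfacePoint);
        // Component 3: cosine of the angle subtended at the surface between the
        // directions to the real camera and to the virtual viewpoint.
        const double cosAngle = dot(toCamera, toVirtual) / (norm(toCamera) * norm(toVirtual));
        return samplingDensity * std::max(0.0, cosAngle);
    }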

Figure 5. View-dependent weighting: camera blend fields are constructed from three components: (i) visibility, (ii) surface sampling and (iii) view proximity.

The weight maps W_c are thresholded against the Bth largest weight at each pixel. This enables a larger number of cameras M to be used in rendering, ensuring complete surface coverage in view synthesis with complex self-occlusions in a scene, while minimising blending artefacts by limiting the blend to B cameras at each output pixel. The thresholded weight maps W_c are finally feathered to smooth transitions in blending appearance towards boundaries [Pulli et al. 97, Raskar and Low 02, Eisemann et al. 08]. The feathered blend fields are derived using a distance filter on each weight map [Eisemann et al. 08]. Figure 6 illustrates the resulting blend fields.
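
The thresholding step keeps, at each output pixel, only the B largest camera weights and zeroes the rest. A minimal C++ sketch of that per-pixel operation is given below (the feathering distance filter is not shown); the data layout is an assumption.

    #include <algorithm>
    #include <functional>
    #include <vector>

    // Keep only the B largest of the per-camera weights at one output pixel,
    // setting the remaining weights to zero. Ties at the threshold may keep
    // slightly more than B samples.
    void keepLargestB(std::vector<double>& weights, std::size_t B) {
        if (B == 0) { std::fill(weights.begin(), weights.end(), 0.0); return; }
        if (weights.size() <= B) return;
        std::vector<double> sorted = weights;
        // Place the Bth largest weight at index B-1.
        std::nth_element(sorted.begin(), sorted.begin() + (B - 1), sorted.end(),
                         std::greater<double>());
        const double threshold = sorted[B - 1];
        for (double& w : weights)
            if (w < threshold) w = 0.0;
    }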

Figure 6. Camera blend fields: camera weighting in the virtual view is thresholded to blend a maximum of B cameras per pixel and feathered for smooth transitions at boundaries.

3.5. Final composite

Once the blend fields are derived for the camera set M, the final view is composited. The camera images are treated as a set of projective texture maps [Debevec et al. 98] using the projective transformation P_c for each camera. The surface S is rendered to the virtual camera with texture modulated by the weight map W_c. The weights at each output pixel are normalised such that they sum to one. Surface visibility is not guaranteed in the subset M, and holes can result in the rendered surface appearance. A fill operation is performed by rendering to texture and applying a Gaussian filter to propagate appearance into undefined regions. Holes in the final render pass are then filled from the composited texture. Figure 7 shows the resulting composite.
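
Per pixel, the composite is a normalised weighted sum of the camera samples. The C++ sketch below mirrors what the fragment shader computes; types and names are illustrative, and the two input vectors are assumed to have the same length.

    #include <vector>

    struct Rgb { double r, g, b; };

    // Composite one output pixel from the colour sampled in each selected camera
    // and the corresponding (thresholded, feathered) blend weights. Weights are
    // normalised to sum to one; a zero total weight marks a hole to be filled later.
    Rgb compositePixel(const std::vector<Rgb>& cameraSamples,
                       const std::vector<double>& weights) {
        Rgb out{ 0.0, 0.0, 0.0 };
        double total = 0.0;
        for (std::size_t c = 0; c < cameraSamples.size(); ++c) {
            out.r += weights[c] * cameraSamples[c].r;
            out.g += weights[c] * cameraSamples[c].g;
            out.b += weights[c] * cameraSamples[c].b;
            total += weights[c];
        }
        if (total > 0.0) { out.r /= total; out.g /= total; out.b /= total; }
        return out;
    }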

Figure 7. Final composite: camera images are used as projective texture maps and combined according to the camera blend fields, with missing surface appearance filled from surrounding regions.

4. GPU Implementation

The free-viewpoint video rendering algorithm requires the camera images I_c, c = 1, ..., N, the camera calibration P_c, and a geometric proxy for the scene S as a triangulated surface mesh at each time frame. The algorithm has several parameters: the number M < N of cameras to use in view-dependent rendering, the maximum number B < M of samples to blend at each output pixel, and the reprojection error e_c at occlusion boundaries in the camera images. The sizes of the depth buffers D_c and blend fields W_c are predefined.

The rendering technique is implemented in OpenGL/GLSL for real-time rendering. Initially, a frame buffer object is constructed to render depth, and a frame buffer plus render buffer is constructed to render to texture. Textures are also constructed for the fixed-size render targets D_c and W_c. At each time frame, a display list is built for the surface S to speed up multiple render passes. When the scene is drawn, the set of cameras M is selected according to the virtual viewpoint. The render technique then proceeds as follows.

Build Textures. Texture maps for the camera images c = 1, ..., N are built on demand. A camera image I_c is cropped to the bounding box of the scene S and copied to texture to ensure no resampling.

Camera Visibility. The depth buffer D_c for each camera is rendered on demand. The texture for the buffer D_c is attached to the depth buffer object and a vertex shader is used to render the scene. A soft z-buffer is achieved using GL_POLYGON_OFFSET_FILL to displace the surface, and a conservative visibility test is achieved using four render passes with front-face culling, displacing the surface in the camera image plane by ±e_c.
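
A minimal C++/OpenGL sketch of the depth-pass state setup is given below. It shows only the polygon offset and the front-face culling used to extend back-face occluders; frame buffer setup, shaders and the ±e_c image-plane displacement (applied in the vertex shader) are omitted, and the offset values are placeholders rather than the paper's tuned settings.

    #include <GL/gl.h>

    // Render the scene proxy into the depth buffer D_c of one camera with a soft
    // (epsilon) offset, then extend occlusion boundaries by rendering back-faces
    // only (front-face culling). drawSceneProxy() is a hypothetical helper that
    // issues the mesh geometry for the current camera's view.
    void renderCameraDepth(void (*drawSceneProxy)()) {
        glEnable(GL_DEPTH_TEST);

        // Soft z-buffer: push the stored depth slightly away from the viewer so
        // that the later visibility test does not reject the surface itself.
        glEnable(GL_POLYGON_OFFSET_FILL);
        glPolygonOffset(1.0f, 4.0f);   // placeholder offset values
        drawSceneProxy();

        // Conservative test: additional passes with front-face culling so that
        // back-facing geometry extends the occluders at depth discontinuities.
        glEnable(GL_CULL_FACE);
        glCullFace(GL_FRONT);
        drawSceneProxy();              // repeated with +/- e_c displacement
                                       // applied in the vertex shader
        glDisable(GL_CULL_FACE);
        glDisable(GL_POLYGON_OFFSET_FILL);
    }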

Camera Blend Field. The view-dependent blend field W_c for each camera is recomputed for each virtual view. A fragment shader is used to test visibility against the depth buffer D_c and to compute the view-dependent weighting at each pixel. The blend weight is thresholded to ensure only B cameras are used in blending. The weight map W_c is then feathered using ping-pong texture processing with a two-pass distance filter.

Render to Texture. A texture target is bound to the render buffer and the scene is rendered using a fragment shader that combines the camera textures using the blend fields W_c. Blend weights are normalised to one at each pixel. A texture fill operation is then performed using ping-pong texture processing with a two-pass Gaussian filter sampling missing texels (the ping-pong pass structure is sketched after the render steps below).

Final Render. The surface is finally rendered using a fragment shader that combines the camera textures with the blend fields W_c and fills missing fragments using the rendered texture.
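
The ping-pong passes alternate between two texture targets, reading from one while writing into the other. A compressed C++/OpenGL sketch of that structure is shown below; an extension loader such as GLEW is assumed, fullScreenPass() and the frame buffer and texture handles are hypothetical helpers, and shader binding is omitted.

    #include <GL/glew.h>
    #include <utility>

    // Ping-pong texture processing: run 'passes' filter passes (e.g. the two-pass
    // distance or Gaussian filter), reading from one texture while rendering into
    // the other attached to a frame buffer object, then swapping the roles.
    // fullScreenPass() is a hypothetical helper that draws a screen-aligned quad
    // with the current filter shader bound, sampling the given source texture.
    void pingPongFilter(GLuint fbo, GLuint tex[2], int passes,
                        void (*fullScreenPass)(GLuint sourceTexture)) {
        int src = 0, dst = 1;
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        for (int i = 0; i < passes; ++i) {
            glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                                   GL_TEXTURE_2D, tex[dst], 0);  // write target
            fullScreenPass(tex[src]);                            // read source
            std::swap(src, dst);                                 // swap for next pass
        }
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }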

5. Application

The open source project for the free-viewpoint video renderer provides a complete code base for scene graph management and scene and image input/output, an OpenGL/GLSL implementation of the render technique, and tools and applications to use the renderer. Example usage is presented for publicly available multiple-view video datasets provided courtesy of the University of Surrey [Starck and Hilton 07, Starck et al. 07]. The data consists of 8 HD-resolution (1920 × 1080) video streams recorded in a blue-screen studio, together with the reconstructed scene geometry and camera calibration.

6. View synthesis

Rendering is shown in Figure 8 for a variety of different motion sequences. The technique was tested at 1024 × 768 rendering resolution using an NVIDIA Quadro FX 1700 graphics card. Rendering achieved an average frame rate of 19 fps for a static frame with free-viewpoint interaction and 1 fps while streaming image and geometry data from disk.

Figure 8. The open-source free-viewpoint video renderer, demonstrating interactive visualisation of multiple-view video data.

7. Camera configuration synthesis

Rendering is shown in Figure 9 to simulate a novel camera configuration for the multiple-view video data. The synthesised dataset provides both camera calibration parameters and ground-truth geometry, in terms of the underlying 3D geometry used to synthesise the new viewpoints.

Figure 9. Application of the render technique to simulate a novel camera configuration from fixed multiple-view video data.

Acknowledgments. This work was supported by the DTI Technology Programme under "Free-viewpoint video for interactive entertainment production" TP/3/DSM/6/I/15515 and EPSRC Grant EP/D033926, the EU Framework 7 ICT Project i3dpost, and the UK EPSRC Visual Media Platform Grant. For further details visit the iview project (http://www.bbc.co.uk/rd/iview) and the i3dpost project (http://www.i3dpost.eu).

Web Information: For more information on the project please visit http://sourceforge.net/projects/fvvr. For further details on the iview project please visit http://www.bbc.co.uk/rd/iview.

J. Starck, Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK. (J.Starck@surrey.ac.uk)

J. Kilner, Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK. (J.Kilner@surrey.ac.uk)

A. Hilton, Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK. (A.Hilton@surrey.ac.uk)

Received [DATE]; accepted [DATE].

References

[Allard et al. 06] J. Allard, J.-S. Franco, C. Menier, E. Boyer, and B. Raffin. The GrImage Platform: A Mixed Reality Environment for Interactions. IEEE International Conference on Computer Vision Systems (ICVS) (2006), p. 46.

[Buehler et al. 01] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen. Unstructured Lumigraph Rendering. ACM Transactions on Graphics (SIGGRAPH) (2001), pp. 425–432.

[Carranza et al. 03] J. Carranza, C. Theobalt, M. Magnor, and H.-P. Seidel. Free-Viewpoint Video of Human Actors. ACM Transactions on Graphics (SIGGRAPH) 22(3) (2003), 569–577.

[Debevec et al. 96] P. Debevec, C. Taylor, and J. Malik. Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-Based Approach. ACM Transactions on Graphics (SIGGRAPH) (1996), pp. 11–20.

[Debevec et al. 98] P. Debevec, Y. Yu, and G. Borshukov. Efficient View-Dependent Image-Based Rendering with Projective Texture-Mapping. Proceedings of the Eurographics Workshop on Rendering (1998), pp. 105–116.

[Eisemann et al. 08] M. Eisemann, B. Decker, M. Magnor, P. Bekaert, E. Aguiar, N. Ahmed, C. Theobalt, and A. Sellent. Floating Textures. Computer Graphics Forum (Eurographics) 27(2) (2008), 409–418.

[Grau et al. 03] O. Grau, T. Pullen, and G. Thomas. A Combined Studio Production System for 3D Capturing of Live Action and Immersive Actor Feedback. IEEE Transactions on Circuits and Systems for Video Technology 14(3) (2003), 370–380.

[Gross et al. 03] M. Gross, S. Würmlin, M. Naef, E. Lamboray, C. Spagno, A. Kunz, E. Koller-Meier, T. Svoboda, L. Van Gool, S. Lang, K. Strehlke, A. Vande Moere, and O. Staadt. blue-c: A Spatially Immersive Display and 3D Video Portal for Telepresence. ACM Transactions on Graphics (SIGGRAPH) 22(3) (2003), 819–827.

[Kanade et al. 97] T. Kanade, P.W. Rander, and P.J. Narayanan. Virtualized Reality: Constructing Virtual Worlds from Real Scenes. IEEE Multimedia 4(1) (1997), 34–47.

[Levoy and Hanrahan 96] M. Levoy and P. Hanrahan. Light Field Rendering. ACM Transactions on Graphics (SIGGRAPH) 30 (1996), 31–42.

[Matusik and Pfister 04] W. Matusik and H. Pfister. 3D TV: A Scalable System for Real-Time Acquisition, Transmission, and Autostereoscopic Display of Dynamic Scenes. ACM Transactions on Graphics (SIGGRAPH) (2004), pp. 814–824.

[Pajarola et al. 04] R. Pajarola, M. Sainz, and Y. Meng. DMesh: Fast Depth Image Meshing and Warping. International Journal of Image and Graphics 4(4) (2004), 1–29.

[Pulli et al. 97] K. Pulli, M. Cohen, T. Duchamp, H. Hoppe, L. G. Shapiro, and W. Stuetzle. View-Based Rendering: Visualizing Real Objects from Scanned Range and Color Data. Eurographics Workshop on Rendering (EGWR) (1997), pp. 23–34.

[Raskar and Low 02] R. Raskar and K.-L. Low. Blending Multiple Views. Pacific Conference on Computer Graphics and Applications (2002), pp. 145–153.

[Starck and Hilton 05] J. Starck and A. Hilton. Virtual View Synthesis of People from Multiple View Video Sequences. Graphical Models 67(6) (2005), 600–620.

[Starck and Hilton 07] J. Starck and A. Hilton. Surface Capture for Performance-Based Animation. IEEE Computer Graphics and Applications 27(3) (2007), 21–31.

[Starck et al. 07] J. Starck, A. Maki, S. Nobuhara, A. Hilton, and T. Matsuyama. The 3D Production Studio. Technical Report VSSP-TR-4/2007.

[Vedula et al. 05] S. Vedula, S. Baker, and T. Kanade. Image-Based Spatio-Temporal Modeling and View Interpolation of Dynamic Events. ACM Transactions on Graphics 24(2) (2005), 240–261.

[Zitnick et al. 04] C.L. Zitnick, S. B. Kang, M. Uyttendaele, S. A. J. Winder, and R. Szeliski. High-Quality Video View Interpolation Using a Layered Representation. ACM Transactions on Graphics (SIGGRAPH) 23(3) (2004), 600–608.