INTERACTIVE VIRTUAL VIEW VIDEO FOR IMMERSIVE TV APPLICATIONS
C. Fehn, P. Kauff, O. Schreer and R. Schäfer
Heinrich-Hertz-Institut, Germany

ABSTRACT

This paper presents an evolutionary approach to Immersive TV, a new concept of future TV entertainment in which TV watchers get the impression of being present at a real event. Starting from a short-term solution that can be introduced with available hardware at reasonable cost, the paper discusses possible extensions towards a long-term solution, called IVVV (Interactive Virtual View Video), targeting a highly interactive TV system. The idea is based on an implicit 3D representation of the video scene supporting head-motion-parallax viewing as well as user-guided navigation with a virtual camera. The concept distinguishes between accurate offline 3D analysis of static scene parts and online real-time processing of dynamic video objects. Static and dynamic 3D data are merged at the receiver and used, together with the transmitted video sequences, to create the desired intermediate views by means of image-based rendering.

INTRODUCTION

Being present at a live event is undeniably the most exciting way to experience any entertainment. It is therefore the mission of immersive media to bring this experience to those users who are not able to participate. The wide field of related immersive media applications can be classified into two main branches. One branch is based on the natural audio-visual presentation well known from conventional cinema. The best example is the success story of the IMAX technology, which uses advanced audio-visual features of cinema, such as wide-screen presentation, panoramic views and surround sound, to provide the impression of being immersed. The second main branch is located in the domain of Virtual Reality (VR).
In the past, most of the research on immersion has focused on this branch because it is well suited to producing experiences for training, education, computer-aided design and game applications. The main advantage of VR is that it supports a high degree of interactivity while users are immersed in the virtual scene. As a natural-looking distributed entertainment medium, however, VR has severe limitations. The quality of computer-generated imagery will for many years remain inadequate to convince a user that he is actually viewing real-world scenes. In addition, there is no mechanism for full 3D capturing of large-scale live events in real-time.
Against this background, the idea of Immersive TV is to design a new broadcast medium which combines the natural experience of the first application branch with the interactive functionality of the second. The ultimate goal is a future experiential and sensory home entertainment that bridges the gap between the audio-visual sensation of state-of-the-art consumer electronics and the real sensation of live events (see Fig. 1).

Figure 1 - Objective of Immersive TV (1)

In this sense, the next section presents a short-term solution which is achievable with available hardware at reasonable cost. The paper then discusses possible extensions towards a long-term solution targeting a highly interactive TV system. This so-called IVVV (Interactive Virtual View Video) is based on an implicit 3D representation of the video scene enabling user-guided navigation with a virtual camera.

Figure 2 - Short-term solution of Immersive TV
SHORT-TERM SOLUTION

A very attractive short-term concept of such an Immersive TV (ImTV) approach has recently been proposed in (1). Following this approach, a first hardware system of an ImTV receiver based on a set-top box architecture using available DVB and MPEG technology is currently under development at Heinrich-Hertz-Institut (HHI) (2). In this concept it is envisaged to capture, encode and broadcast wide-angle, high-resolution views of live events combined with multi-channel audio, and to display these events in a way which provides an individual immersive viewing experience, using a head-mounted system in combination with a head tracker as well as other multi-sensory devices like motion seats (see Fig. 2). The most significant feature, however, is that it targets a one-way distribution service. This means that, in contrast to usual VR applications, the same signal can be sold to millions of viewers without the broadcasters having to handle costly interactive networks and servers.

Figure 3 - Outline of ImTV receiver hardware (H2U-BOX: High to Ultra Definition Merger Box)

Fig. 3 outlines the receiver system that is currently under development at HHI and which will be demonstrated for the first time at the IFA 2001 fair in Berlin. It consists of a number of so-called hyperboxes. Each hyperbox can decode, record and/or replay a DVB MPEG-2 HD stream. A particular feature of the hyperbox is the ability to cascade several of these devices (e.g. three in Fig. 3) such that multiple MPEG-2 HD streams taken from different sources (DVB transmission, local storage, etc.) can be synchronised and decoded simultaneously. Subsequently, a special UHDTV (Ultra High Definition TV) merger unit stitches the synchronised HDTV frames together into a high-resolution view in which an ImTV user can look around while using a head-mounted display (see Fig. 2).
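The merger stage can be pictured as a two-step pipeline: stitch the synchronised frames into one wide panorama, then let the head-tracker yaw select the visible window. The sketch below illustrates this principle only; the tile sizes, per-camera field of view and viewport width are illustrative assumptions, not figures taken from the HHI hardware.

```python
import numpy as np

TILE_W, TILE_H = 640, 360            # stand-ins for HD tile dimensions
VIEW_W = 480                         # viewport width shown on the HMD
FOV_PER_TILE_DEG = 60                # assumed horizontal FOV per camera

def merge_tiles(tiles):
    """Stitch synchronised frames side by side into one wide panorama
    (no blending or geometric correction in this toy version)."""
    return np.hstack(tiles)

def viewport(panorama, yaw_deg):
    """Map head-tracker yaw to a horizontal pixel offset and crop the
    window that the head-mounted display would show."""
    pano_w = panorama.shape[1]
    total_fov = FOV_PER_TILE_DEG * 3
    centre = (yaw_deg / total_fov + 0.5) * pano_w   # yaw=0 -> panorama centre
    left = int(np.clip(centre - VIEW_W / 2, 0, pano_w - VIEW_W))
    return panorama[:, left:left + VIEW_W]
```

Looking straight ahead crops the centre of the panorama; turning the head slides the window towards the corresponding outer tile, clamped at the panorama edges.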
Optionally, for other applications like electronic cinema, the stitched HDTV frames can also be watched jointly by a small group of viewers as a wide-screen UHDTV panorama projection. For this purpose a given number of HDTV projectors can be plugged into the UHDTV merger unit. The main restriction of this short-term solution is the key assumption that the viewer always remains at a fixed location. He can certainly look all around and even incline his head, but he is viewing as a stationary spectator, as he would from a fixed seat at a live event. However, to enjoy a large-scale event in future Immersive TV in the most exciting manner, it would be
desirable to be able to look at the scene from several viewpoints of interest. In such an extended system the functionality of interacting with the scene content would no longer be limited to looking around from a fixed default position. It would also allow the viewer to move his head, resulting in a different perspective, to look behind occluded areas or, in more futuristic scenarios, to walk through a natural video scene as known from virtual reality. This particular extension of Immersive TV fits very well into the current trend of digital TV to offer added value in terms of interactivity (e.g. the possibility to hop between several channels showing the same event from different camera positions).

LONG-TERM SOLUTION

In this context HHI has recently started a new research activity on such a long-term scenario, called IVVV (Interactive Virtual View Video). The objective of IVVV is to develop new computer-vision-based techniques that extend the above-mentioned short-term scenario and allow the viewer to enjoy a large-scale live event, such as a soccer game or a theatre play, from virtually every desired viewpoint with the quality of natural video. For this purpose the information is captured by a multi-baseline array of strongly convergent cameras arranged around the particular location of interest. The multi-view images are analysed and all data are merged into one compact image-based 3D representation of the scene which can then be transmitted via a broadcast channel. At the receiver the viewer has the opportunity to control a "virtual camera" providing intermediate virtual views interpolated from the real camera images.
Figure 4 - Operator-assisted offline analysis of static scene parts

Basically, the concept of IVVV consists of two different phases, taking into account that the 3D geometry of the scene can be decomposed into static parts, such as a sports arena or a theatre stage, and dynamic objects, such as the soccer players or the actors. As sketched in Fig. 4, the static parts of the scene can be analysed during an offline phase without any real-time constraints and with the possibility of supervising the process and correcting it manually whenever necessary. The output of this initial offline process carries information about the static 3D structure of the scene (static disparity maps, segmentation masks and further information about occlusions and depth layers). The texture images of the video sequences, however, are still transmitted online. Furthermore, the dynamic objects have to be analysed and processed in real-time during this online phase. Then, a 3D representation of the dynamic objects, including segmentation, disparities and depth layering, is transmitted jointly with the video sequences.
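At the receiver, this two-phase split amounts to compositing the downloaded static representation with the per-frame dynamic layer. A minimal sketch of that merge step follows; the array-based representation and the function name are my own illustration, not the paper's actual data format.

```python
import numpy as np

def merge_representation(static_disp, dynamic_disp, dynamic_mask):
    """Overlay the online dynamic-object disparities onto the downloaded
    static disparity map wherever the segmentation mask marks a dynamic
    object; everywhere else the offline result is kept unchanged."""
    merged = static_disp.copy()
    merged[dynamic_mask] = dynamic_disp[dynamic_mask]
    return merged
```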
Figure 5 - Selection of the desired viewpoint with a remote-controlled virtual camera

At the receiver, the 3D representation data of the dynamic objects are merged with the data from the offline phase, previously downloaded and stored on a local disk (see Fig. 5). The result is a permanently updated 3D representation of the transmitted video scene which can then be used, together with the video images, to render the desired perspective view of the scene.

Figure 6 - Three-camera test environment (virtual view calculated by 3D warping from the real cameras)
First investigations on the algorithmic feasibility of the IVVV approach have been carried out on the basis of computer simulations, using the experimental environment sketched in Fig. 6. A toy racing scene is first captured by three real cameras representing a part of the convergent multi-baseline camera array used in IVVV. Then, after 3D analysis of the static and dynamic parts of the scene, a novel view is synthesised depending on the 3D position of the virtual camera. The following sub-sections present some details of the algorithms required for both offline and online processing, i.e. 3D analysis and novel view synthesis.

Weak Camera Calibration

The first step during the offline phase is the weak calibration of the multi-baseline camera array. An automatic algorithm, based on a trifocal matching procedure, has been implemented for this purpose. At first, a robust feature extraction and matching algorithm, a modified version of the algorithm proposed by Zhang (3), selects a given number of reliable point correspondences between all three cameras (see Fig. 7). Then, a robust estimation scheme proposed by Torr is applied to these pre-selected point correspondences to estimate the trifocal tensor (4). The estimation makes use of a RANSAC algorithm to eliminate outliers within the given set of point correspondences. To reduce the influence of Gaussian noise, the coefficients of the trifocal tensor found by RANSAC are subsequently refined by a suitable non-linear minimisation. Further weak calibration parameters, like the fundamental matrices and the epipoles, are finally extracted from the trifocal tensor. In its current stage, the algorithm is limited to a three-camera test environment (see Fig. 6).
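The paper estimates the trifocal tensor itself; as a simpler, self-contained stand-in, the sketch below shows the normalised 8-point algorithm for a fundamental matrix, one of the weak-calibration quantities the paper extracts from the tensor. This is a textbook two-view method, not HHI's three-view implementation, and the function name is my own.

```python
import numpy as np

def eight_point(x1, x2):
    """Normalised 8-point algorithm: estimate the fundamental matrix F
    satisfying x2_i^T F x1_i = 0 from >= 8 point correspondences
    (x1, x2 are Nx2 arrays of matched image points)."""
    def normalise(pts):
        # translate the centroid to the origin and scale the mean
        # distance to sqrt(2), for numerical conditioning
        c = pts.mean(axis=0)
        s = np.sqrt(2) / np.linalg.norm(pts - c, axis=1).mean()
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
        return np.c_[pts, np.ones(len(pts))] @ T.T, T

    p1, T1 = normalise(x1)
    p2, T2 = normalise(x2)
    # each correspondence gives one linear constraint on vec(F):
    # row = kron(x2, x1) since x2^T F x1 = kron(x2, x1) . vec(F)
    A = np.stack([np.kron(b, a) for a, b in zip(p1, p2)])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # enforce the rank-2 constraint by dropping the smallest singular value
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    return T2.T @ F @ T1          # undo the normalisation
```

In a robust pipeline this linear estimate would sit inside a RANSAC loop and be refined by non-linear minimisation, mirroring the tensor-based scheme described above.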
However, it can easily be extended to configurations with more than three cameras, using registration of multiple image triplets and bundle adjustment, as proposed by Fitzgibbon (5).

Figure 7 - Feature point correspondences for weak calibration of the camera array

Analysis of Multi-View Images

Analysing the 3D structure of the given multi-view images is obviously the most difficult task. During the offline analysis, however, some a-priori knowledge about the structure of the envisioned scenarios can be utilised to ease it. For example, it can be assumed that some parts of the scene consist of planar regions, as is typical for architectural structures. This is an important model assumption which significantly simplifies the solution of the correspondence problem between different views. Following this philosophy and starting with the robust point correspondences from the weak calibration process, a region-growing technique guided by epipolar and trifocal constraints is used to estimate dense disparity fields of high precision. This approach is inspired by a proposal from Lhuillier (6) and has been extended towards implicit consistency checks with respect to the epipolar and trifocal constraints.

Figure 8 - Generation of a multi-label mask for disparity regularisation by homography fitting

Then, to meet the above model of planar patches, the disparity maps are regularised by so-called homography fitting. For this purpose the images are first segmented as shown in Fig. 8. Depending on the complexity of the scene, this segmentation process is automatic, semi-automatic or even manual. The result of segmentation is a mask labelling different depth layers and planes. Subsequently, the regions which represent a planar patch in the scene are identified by a RANSAC algorithm estimating and testing the corresponding homography parameters. As soon as a region has been linked to a valid homography, the disparities within this region are fitted to this supplementary condition. Thus, the related homography parameters are exploited to further improve the robustness and accuracy of the dense disparity fields.

For the online process, the disparities of the dynamic foreground objects are estimated by an object-oriented algorithm which is more suitable for real-time applications. It is based on a hybrid pixel- and block-recursive matching technique that has been developed at HHI in the context of immersive tele-conferencing (7). Due to its object-oriented structure it presumes that dynamic objects are reliably detected by means of segmentation and tracking.

Figure 9 - Trajectory samples of the virtual camera moving through the test scene

Synthesis of Virtual Views

Exploiting the joint disparity maps, merged from both offline and online estimation, new virtual views can be generated from the available reference views. To meet real-time constraints, only techniques based on trilinear warping and image-based rendering are used for this purpose (8)(9). Self-occlusions appearing during the movement of the virtual camera are handled by an occlusion-compatible warp order, guided by the weak calibration data.
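The homography test for a candidate planar region can be sketched as a standard 4-point RANSAC loop: repeatedly fit a homography from a minimal sample via the direct linear transform, count the correspondences whose transfer error falls below a threshold, and keep the largest consensus set. The following is a generic sketch of that principle, not HHI's implementation; all names and thresholds are illustrative.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct Linear Transform: 3x3 H with dst ~ H @ src, from >= 4 pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    h = np.linalg.svd(np.array(rows))[2][-1]   # null vector of the system
    return h.reshape(3, 3)

def ransac_homography(src, dst, iters=200, thresh=1.0, seed=0):
    """Identify the planar consensus set among point pairs: sample four
    pairs, fit H, and count pairs whose transfer error is below thresh."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        p = np.c_[src, np.ones(len(src))] @ H.T
        if np.any(np.abs(p[:, 2]) < 1e-12):    # degenerate sample, skip
            continue
        err = np.linalg.norm(p[:, :2] / p[:, 2:] - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit on the full consensus set for accuracy
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

In the pipeline above, `src`/`dst` would be the disparities' point correspondences inside one segmented region; a region is accepted as planar when the consensus set covers most of it, and the fitted H then regularises the disparities there.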
The composition of the virtual views from the reference frames is controlled by coplanarity checks, image segmentation and confidence measures, in order to use the reference image with the best spatial resolution and to cope with exposed areas in the most efficient manner. Finally, the texture in those exposed areas for which no information is available in any of the reference views is extrapolated by suitable hole-filling algorithms. Fig. 9 shows results for various positions of the virtual camera in the test environment from Fig. 6.
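In one dimension, the occlusion-compatible warp order and the subsequent hole filling can be illustrated as follows. This is a deliberately simplified sketch under stated assumptions: a single scanline, purely horizontal disparity, a virtual view at fraction alpha of the baseline, and nearest-neighbour extrapolation for holes, rather than the trilinear warping of (8)(9).

```python
import numpy as np

def warp_scanline(colors, disparity, alpha):
    """Forward-warp one scanline to a virtual view via x' = x - alpha*d.
    Painting left-to-right is occlusion-compatible in this setting: if two
    source pixels collide on the same target, the one with the larger
    disparity (i.e. the nearer one) is painted later and wins."""
    out = np.full(len(colors), -1)            # -1 marks disocclusion holes
    for x in range(len(colors)):
        xv = int(round(x - alpha * disparity[x]))
        if 0 <= xv < len(colors):
            out[xv] = colors[x]
    return out

def fill_holes(line):
    """Extrapolate exposed areas from the nearest valid pixel on the left,
    falling back to the right neighbour for leading holes."""
    line = line.copy()
    for i in range(len(line)):
        if line[i] == -1 and i > 0:
            line[i] = line[i - 1]
    for i in range(len(line) - 2, -1, -1):
        if line[i] == -1:
            line[i] = line[i + 1]
    return line
```

Warping a scanline containing a high-disparity foreground object shifts the object further than the background, letting it correctly occlude background pixels at its new position while leaving a disocclusion hole at its old one, which the hole filler then extrapolates.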
CONCLUSION & OUTLOOK

Starting from a discussion of new concepts of immersive entertainment in general, this paper has introduced an evolutionary approach to immersive television. This approach is based on a short-term solution (ImTV), already suitable for implementation in first prototypes with available technology, and a long-term solution, called IVVV (Interactive Virtual View Video), that allows the user to watch a large-scale live event from virtually every desired viewpoint. First key components of the IVVV concept, such as weak camera calibration, offline analysis of the 3D scene geometry and image-based rendering for view synthesis, have already been developed on the basis of computer simulations and show promising results. Further investigations will mainly focus on two issues. The first concerns the development of a suitable navigation model which assists the user's movement through the scene. This is mainly a matter of calibration, focusing on a suitable trade-off between expensive full calibration, providing a true Euclidean world for navigation, and a pragmatic approximation based on the less complex techniques of weak calibration. The second issue is the exploitation of redundancies for transmission purposes. At the moment, it is assumed that the video data from all camera views are transmitted simultaneously, resulting in an overwhelming need for bandwidth. Here, the objective is to limit the visual data that have to be encoded such that each area of the scene that is visible in more than one camera view is encoded only once, at the highest possible resolution.

REFERENCES

1. Lodge, N. and Harrison, D., Being Part of the Action - Immersive Television! Proceedings of the International Broadcasting Convention, September.
2. Kauff, P., Höfker, U. and Gölz, U., Immersive TV - The TV Experience of the Future? Proceedings of the 19th Annual Conference of the FKTG, May.
3. Zhang, Z., Deriche, R., Faugeras, O. D. and Luong, Q.-T., A Robust Technique for Matching Two Uncalibrated Images Through the Recovery of the Unknown Epipolar Geometry. Research Report 2273, INRIA, May.
4. Torr, P. H. S. and Zisserman, A., Robust Parameterization and Computation of the Trifocal Tensor. Image and Vision Computing, 15(8), August, pp. 591 ff.
5. Fitzgibbon, A. W. and Zisserman, A., Automatic Camera Recovery for Closed or Open Image Sequences. Proceedings of the European Conference on Computer Vision, June, pp. 311 ff.
6. Lhuillier, M., Towards Automatic Interpolation for Real and Distant Image Pairs. Research Report 3619, INRIA, February.
7. Kauff, P., Brandenburg, N., Karl, M. and Schreer, O., Fast Hybrid Block- and Pixel-Recursive Disparity Analysis for Real-Time Applications in Immersive Tele-Conference Scenarios. Proceedings of the 9th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, February, pp. 198 ff.
8. Avidan, S. and Shashua, A., Novel View Synthesis by Cascading Trilinear Tensors. IEEE Transactions on Visualization and Computer Graphics, 4(4), October-December 1998, pp. 293 ff.
9. Oliveira, M., Bishop, G. and McAllister, D., Relief Texture Mapping. Proceedings of SIGGRAPH, July, pp. 259 to 268.
More informationRENDERING AND ANALYSIS OF FACES USING MULTIPLE IMAGES WITH 3D GEOMETRY. Peter Eisert and Jürgen Rurainsky
RENDERING AND ANALYSIS OF FACES USING MULTIPLE IMAGES WITH 3D GEOMETRY Peter Eisert and Jürgen Rurainsky Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institute Image Processing Department
More informationThe Virtual Meeting Room
Contact Details of Presenting Authors Stefan Rauthenberg (rauthenberg@hhi.de), Peter Kauff (kauff@hhi.de) Tel: +49-30-31002 266, +49-30-31002 615 Fax: +49-30-3927200 Summation Brief explaination of the
More informationcalibrated coordinates Linear transformation pixel coordinates
1 calibrated coordinates Linear transformation pixel coordinates 2 Calibration with a rig Uncalibrated epipolar geometry Ambiguities in image formation Stratified reconstruction Autocalibration with partial
More informationChapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications:
Chapter 11.3 MPEG-2 MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2,
More informationImage-Based Rendering Methods
Image-Based Rendering Methods for 3D Video-communications Andrea Fusiello http://profs.sci.univr.it/~fusiello Brixen, July 5 2007 c Copyright by Andrea Fusiello. This work is licensed under the Creative
More informationCOMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION
COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION Mr.V.SRINIVASA RAO 1 Prof.A.SATYA KALYAN 2 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PRASAD V POTLURI SIDDHARTHA
More informationEpipolar Geometry and Stereo Vision
Epipolar Geometry and Stereo Vision Computer Vision Jia-Bin Huang, Virginia Tech Many slides from S. Seitz and D. Hoiem Last class: Image Stitching Two images with rotation/zoom but no translation. X x
More informationGeorgios Tziritas Computer Science Department
New Video Coding standards MPEG-4, HEVC Georgios Tziritas Computer Science Department http://www.csd.uoc.gr/~tziritas 1 MPEG-4 : introduction Motion Picture Expert Group Publication 1998 (Intern. Standardization
More informationA New Data Format for Multiview Video
A New Data Format for Multiview Video MEHRDAD PANAHPOUR TEHRANI 1 AKIO ISHIKAWA 1 MASASHIRO KAWAKITA 1 NAOMI INOUE 1 TOSHIAKI FUJII 2 This paper proposes a new data forma that can be used for multiview
More informationCS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching
Stereo Matching Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix
More informationMiniature faking. In close-up photo, the depth of field is limited.
Miniature faking In close-up photo, the depth of field is limited. http://en.wikipedia.org/wiki/file:jodhpur_tilt_shift.jpg Miniature faking Miniature faking http://en.wikipedia.org/wiki/file:oregon_state_beavers_tilt-shift_miniature_greg_keene.jpg
More informationAn Overview of Matchmoving using Structure from Motion Methods
An Overview of Matchmoving using Structure from Motion Methods Kamyar Haji Allahverdi Pour Department of Computer Engineering Sharif University of Technology Tehran, Iran Email: allahverdi@ce.sharif.edu
More informationThe Projective Vision Toolkit
The Projective Vision Toolkit ANTHONY WHITEHEAD School of Computer Science Carleton University, Ottawa, Canada awhitehe@scs.carleton.ca Abstract: Projective vision research has recently received a lot
More informationCombining Appearance and Topology for Wide
Combining Appearance and Topology for Wide Baseline Matching Dennis Tell and Stefan Carlsson Presented by: Josh Wills Image Point Correspondences Critical foundation for many vision applications 3-D reconstruction,
More informationSynthesis of Novel Views of Moving Objects in Airborne Video
Synthesis of Novel Views of Moving Objects in Airborne Video Zhanfeng Yue and Rama Chellappa Center for Automation Research University of Maryland College Park, MD, USA, 20742 {zyue,rama}@cfar.umd.edu
More informationPerception and Action using Multilinear Forms
Perception and Action using Multilinear Forms Anders Heyden, Gunnar Sparr, Kalle Åström Dept of Mathematics, Lund University Box 118, S-221 00 Lund, Sweden email: {heyden,gunnar,kalle}@maths.lth.se Abstract
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 12 130228 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Panoramas, Mosaics, Stitching Two View Geometry
More informationStereo pair 1 H23 I. Stereo pair 2 R M. l r. e C C (A) (B) (B) (C) Match planes between stereo pairs and compute homographies H 23 and U 23
Feature Transfer and Matching in Disparate Stereo Views through the Use of Plane Homographies Manolis I.A. Lourakis, Stavros V. Tzurbakis, Antonis A. Argyros, Stelios C. Orphanoudakis Abstract Many vision
More information55:148 Digital Image Processing Chapter 11 3D Vision, Geometry
55:148 Digital Image Processing Chapter 11 3D Vision, Geometry Topics: Basics of projective geometry Points and hyperplanes in projective space Homography Estimating homography from point correspondence
More informationarxiv: v1 [cs.cv] 28 Sep 2018
Camera Pose Estimation from Sequence of Calibrated Images arxiv:1809.11066v1 [cs.cv] 28 Sep 2018 Jacek Komorowski 1 and Przemyslaw Rokita 2 1 Maria Curie-Sklodowska University, Institute of Computer Science,
More information55:148 Digital Image Processing Chapter 11 3D Vision, Geometry
55:148 Digital Image Processing Chapter 11 3D Vision, Geometry Topics: Basics of projective geometry Points and hyperplanes in projective space Homography Estimating homography from point correspondence
More informationThe Video Z-buffer: A Concept for Facilitating Monoscopic Image Compression by exploiting the 3-D Stereoscopic Depth map
The Video Z-buffer: A Concept for Facilitating Monoscopic Image Compression by exploiting the 3-D Stereoscopic Depth map Sriram Sethuraman 1 and M. W. Siegel 2 1 David Sarnoff Research Center, Princeton,
More informationModel-Based Stereo. Chapter Motivation. The modeling system described in Chapter 5 allows the user to create a basic model of a
96 Chapter 7 Model-Based Stereo 7.1 Motivation The modeling system described in Chapter 5 allows the user to create a basic model of a scene, but in general the scene will have additional geometric detail
More informationNatural Viewing 3D Display
We will introduce a new category of Collaboration Projects, which will highlight DoCoMo s joint research activities with universities and other companies. DoCoMo carries out R&D to build up mobile communication,
More informationFlexible Calibration of a Portable Structured Light System through Surface Plane
Vol. 34, No. 11 ACTA AUTOMATICA SINICA November, 2008 Flexible Calibration of a Portable Structured Light System through Surface Plane GAO Wei 1 WANG Liang 1 HU Zhan-Yi 1 Abstract For a portable structured
More informationEpipolar Geometry and Stereo Vision
Epipolar Geometry and Stereo Vision Computer Vision Shiv Ram Dubey, IIIT Sri City Many slides from S. Seitz and D. Hoiem Last class: Image Stitching Two images with rotation/zoom but no translation. X
More informationChaining Planar Homographies for Fast and Reliable 3D Plane Tracking
Chaining Planar Homographies for Fast and Reliable 3D Plane Tracking Manolis I.A. Lourakis and Antonis A. Argyros Institute of Computer Science, Foundation for Research and Technology - Hellas Vassilika
More informationRemote Reality Demonstration
Remote Reality Demonstration Terrance E. Boult EECS Dept., 19 Memorial Drive West Lehigh Univ., Bethlehem, PA 18015 tboult@eecs.lehigh.edu Fax: 610 758 6279 Contact Author: T.Boult Submission category:
More informationLecture 19: Depth Cameras. Visual Computing Systems CMU , Fall 2013
Lecture 19: Depth Cameras Visual Computing Systems Continuing theme: computational photography Cameras capture light, then extensive processing produces the desired image Today: - Capturing scene depth
More informationINTERNATIONAL ORGANISATION FOR STANDARISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND AUDIO
INTERNATIONAL ORGANISATION FOR STANDARISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND AUDIO ISO/IEC JTC1/SC29/WG11 MPEG/M15672 July 2008, Hannover,
More informationMultiple Views Geometry
Multiple Views Geometry Subhashis Banerjee Dept. Computer Science and Engineering IIT Delhi email: suban@cse.iitd.ac.in January 2, 28 Epipolar geometry Fundamental geometric relationship between two perspective
More informationChaining Planar Homographies for Fast and Reliable 3D Plane Tracking
Chaining Planar Homographies for Fast and Reliable 3D Plane Tracking Manolis I.A. Lourakis and Antonis A. Argyros Institute of Computer Science, Foundation for Research and Technology - Hellas Vassilika
More informationMultiple View Geometry. Frank Dellaert
Multiple View Geometry Frank Dellaert Outline Intro Camera Review Stereo triangulation Geometry of 2 views Essential Matrix Fundamental Matrix Estimating E/F from point-matches Why Consider Multiple Views?
More informationLive Video Integration for High Presence Virtual World
Live Video Integration for High Presence Virtual World Tetsuro OGI, Toshio YAMADA Gifu MVL Research Center, TAO IML, The University of Tokyo 2-11-16, Yayoi, Bunkyo-ku, Tokyo 113-8656, Japan Michitaka HIROSE
More informationSurface Normal Aided Dense Reconstruction from Images
Computer Vision Winter Workshop 26, Ondřej Chum, Vojtěch Franc (eds.) Telč, Czech Republic, February 6 8 Czech Pattern Recognition Society Surface Normal Aided Dense Reconstruction from Images Zoltán Megyesi,
More informationIMAGE-BASED RENDERING TECHNIQUES FOR APPLICATION IN VIRTUAL ENVIRONMENTS
IMAGE-BASED RENDERING TECHNIQUES FOR APPLICATION IN VIRTUAL ENVIRONMENTS Xiaoyong Sun A Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for
More informationRate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations
Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations Prashant Ramanathan and Bernd Girod Department of Electrical Engineering Stanford University Stanford CA 945
More informationShape as a Perturbation to Projective Mapping
Leonard McMillan and Gary Bishop Department of Computer Science University of North Carolina, Sitterson Hall, Chapel Hill, NC 27599 email: mcmillan@cs.unc.edu gb@cs.unc.edu 1.0 Introduction In the classical
More informationObject Shape Reconstruction and Pose Estimation by a Camera Mounted on a Mobile Robot
Object Shape Reconstruction and Pose Estimation by a Camera Mounted on a Mobile Robot Kimitoshi Yamazaki, Masahiro Tomono, Takashi Tsubouchi and Shin ichi Yuta Intelligent Robot Laboratry, Institute of
More informationThe Geometry of Dynamic Scenes On Coplanar and Convergent Linear Motions Embedded in 3D Static Scenes
EXTENDED VERSION SHORT VERSION APPEARED IN THE 13TH BMVC, CARDIFF, SEPTEMBER 2002. The Geometry of Dynamic Scenes On Coplanar and Convergent Linear Motions Embedded in 3D Static Scenes Adrien Bartoli,
More informationBluray (
Bluray (http://www.blu-ray.com/faq) MPEG-2 - enhanced for HD, also used for playback of DVDs and HDTV recordings MPEG-4 AVC - part of the MPEG-4 standard also known as H.264 (High Profile and Main Profile)
More information3DPRESENCE A SYSTEM CONCEPT FOR MULTI-USER AND MULTI-PARTY IMMERSIVE 3D VIDEOCONFERENCING
3DPRESENCE A SYSTEM CONCEPT FOR MULTI-USER AND MULTI-PARTY IMMERSIVE 3D VIDEOCONFERENCING O. Schreer, I. Feldmann, N. Atzpadin, P. Eisert, P. Kauff, H.J.W. Belt* Fraunhofer Institute for Telecommunications/Heinrich-Hertz
More informationMultiview Depth-Image Compression Using an Extended H.264 Encoder Morvan, Y.; Farin, D.S.; de With, P.H.N.
Multiview Depth-Image Compression Using an Extended H.264 Encoder Morvan, Y.; Farin, D.S.; de With, P.H.N. Published in: Proceedings of the 9th international conference on Advanced Concepts for Intelligent
More informationCS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching
Stereo Matching Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix
More informationA Subspace Approach to Layer Extraction, Patch-Based SFM, and Video Compression
A Subspace Approach to Layer Extraction, Patch-Based SFM, and Video Compression Qifa Ke and Takeo Kanade December 2001 CMU-CS-01-168 School of Computer Science Carnegie Mellon University Pittsburgh, PA
More informationObject and Motion Recognition using Plane Plus Parallax Displacement of Conics
Object and Motion Recognition using Plane Plus Parallax Displacement of Conics Douglas R. Heisterkamp University of South Alabama Mobile, AL 6688-0002, USA dheister@jaguar1.usouthal.edu Prabir Bhattacharya
More information