3D-TV Content Creation: Automatic 2D-to-3D Video Conversion
Liang Zhang, Senior Member, IEEE, Carlos Vázquez, Member, IEEE, and Sebastian Knorr


IEEE Transactions on Broadcasting, Vol. 57, No. 2, June 2011.

Abstract: Three-dimensional television (3D-TV) is the next major revolution in television. A successful rollout of 3D-TV will require a backward-compatible transmission/distribution system, inexpensive 3D displays, and an adequate supply of high-quality 3D program material. With respect to the last factor, the conversion of 2D images/videos to 3D will play an important role. This paper provides an overview of automatic 2D-to-3D video conversion with a specific look at a number of approaches for both the extraction of depth information from monoscopic images and the generation of stereoscopic images. Some challenging issues for the success of automatic 2D-to-3D video conversion are pointed out as possible research topics for the future.

Index Terms: Broadcasting, data conversion, image generation, stereo displays, stereo synthesis, stereo vision, three-dimensional displays, 3D-TV.

Manuscript received October 12, 2010; revised February 08, 2011; accepted February 23, 2011. Date of publication March 28, 2011; date of current version May 25, 2011. L. Zhang and C. Vázquez are with the Communications Research Centre Canada, Ottawa, ON, K2H 8S2, Canada (e-mail: liang.zhang@crc.gc.ca; carlos.vazquez@crc.gc.ca). S. Knorr is with the Communication Systems Lab of Technische Universität Berlin, Berlin D-10587, Germany (e-mail: knorr@imcube.de). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TBC.2011.2122930

I. INTRODUCTION

Three-dimensional television (3D-TV) is anticipated to be the next step in the advancement of television. The term 3D in this context denotes stereoscopic, meaning that a two-view system is used for visualization. Stereoscopic images displayed on 3D displays can increase the visual impact and heighten the sense of presence for viewers [1]. The successful adoption of 3D-TV by the general public will depend not only on technological advances in 3D displays [2] and 3D-TV broadcasting systems [3], [4] but also on the availability of a wide variety of program content in stereoscopic 3D (S3D) format for 3D-TV services [5]. An adequate supply of S3D content will be especially critical in the early stages of the 3D-TV rollout to ensure that the public is willing to spend money on 3D displays and 3D-TV services. However, a certain length of time will be required for content providers to capture and create enough S3D material with stereoscopic cameras. We consider the conversion of 2D images/videos to S3D to be one way to alleviate this problem: converting the vast collection of existing 2D material, in the form of television programs and theatrical films, into stereoscopic imagery should ease any apprehension by the general public about purchasing 3D displays and 3D-TV services for fear that there are not enough 3D programs to watch.

2D-to-3D conversion techniques can be profitable for content providers, who are always looking for new sources of revenue for their vast libraries of program material. This potential market is attracting many companies to invest manpower and money in developing 2D-to-3D conversion techniques.
The list of companies providing 2D-to-3D conversion includes, but is not limited to, Dynamic Digital Depth (DDD), HDlogix, Himax Technologies, In-Three, Industrial Light and Magic, Legend 3D, PassmoreLab, Prime Focus, Samsung, Sassoon Film Design, Sony Image Works, Stereo Pictures, Stereo D, imcube and Victor Company of Japan (JVC).

The fundamental principle of 2D-to-3D conversion techniques rests on the fact that stereoscopic viewing involves binocular processing of two slightly dissimilar images [6]. The human visual system transforms the slight differences (horizontal pixel shifts) between the left-eye and right-eye images into distance information, such that objects are perceived at different depths and outside of the 2D display plane. Converting 2D images to stereoscopic 3D images therefore relies on the underlying principle of horizontally shifting pixels to create a new image, so that there are horizontal disparities between the original image and the new version of it. The extent of the horizontal pixel shift depends not only on the distance of an object to the stereoscopic camera, but also on the inter-lens separation that determines the new image viewpoint.

Various approaches for 2D-to-3D conversion have been proposed. These approaches can be classified into three schemes, namely: manual, human-assisted and automatic conversion. The manual scheme shifts the pixels horizontally by an artistically chosen depth value for different regions/objects in the image to generate a new image [5]; hand drawing produces high-quality depth, but is very time consuming and expensive. The human-assisted scheme converts 2D images to stereoscopic 3D with some corrections made manually by an operator [7]. Even though this scheme reduces the time consumed in comparison to the manual conversion scheme, a significant amount of human engagement is still required to complete the conversion. To convert the vast collection of available 2D material into 3D in an economic manner, an automatic conversion scheme is desired.

The automatic conversion scheme exploits depth information originating from a single image or from a stream of images to generate a new projection of the scene with a virtual camera at a slightly different (horizontally shifted) viewpoint. It may be done in real time or in a more time-consuming off-line process. The quality of the resulting product is related to the level of processing involved, so real-time solutions usually produce the lowest-quality conversion. There are two key issues to consider for automatic 2D-to-3D conversion techniques: how to retrieve depth information from a monoscopic image or video [8], and how to generate high-quality stereoscopic images at new virtual viewpoints [9].

In addition to the 2D-to-3D conversion of standard video content, there are also works that have addressed other types of conversion, for example [10], where a model of an actor is extracted and incorporated into a virtual environment. This type of application, which imposes conditions on the type of content that is converted, is outside the scope of this contribution.

This paper provides an overview of automatic 2D-to-3D video conversion. After the introduction, Section II describes the framework for automatic 2D-to-3D video conversion. The extraction of scene depth information is reviewed in Section III. Section IV is devoted to the generation of stereoscopic images. Some challenging issues are discussed in Section V as future research topics. Section VI concludes this paper.

II. FRAMEWORK FOR AUTOMATIC 2D-TO-3D VIDEO CONVERSION

2D-to-3D video conversion involves the generation of new images from a single 2D image or a single stream of 2D images (video sequence). From this perspective, 2D-to-3D video conversion can be seen, at least conceptually, as a special case of image-based modeling and rendering techniques developed to provide new virtual views (view synthesis) from a given set of images [11]. Image-based modeling and rendering techniques can be classified into three main categories, according to the amount of geometry information explicitly used in the process [11]:

1) Methods that use a full 3D model of the scene: This category requires the complete and accurate reconstruction of a geometric model for the captured scene. Such a model contains all the information required for rendering a new virtual view from a given viewpoint. Structure from silhouette [12], for example, is a commonly used technique to construct 3D models of objects. Given the 3D model and the lighting conditions in the scene, a new virtual view can easily be rendered from a desired viewpoint using conventional computer graphics techniques [13]. In the 2D-to-3D video conversion context, it is usually extremely difficult and error-prone to recover the complete 3D scene structure from a single image or video sequence unless the video is captured under stringent conditions. It is thus impractical to use a full 3D model approach for automatic 2D-to-3D video conversion.

2) Methods that use only images and no explicit geometric information: This category directly renders new virtual views from an over-captured set of images (usually hundreds to thousands of images are required) with no or very little geometric information, e.g., Lightfields [14] and the Lumigraph [15]. In 2D-to-3D video conversion, the number of images available for rendering is normally small, making this approach impractical for automatic 2D-to-3D video conversion.

3) Hybrid methods that explicitly use some geometric information: This category uses a hybrid geometry- and image-based approach. New virtual views are rendered from a limited number of images with the help of incomplete geometrical information about the scene. Methods in this category include depth-image-based rendering (DIBR) [16], layered depth images (LDI) [17], and intermediate view reconstruction (IVR) [18], [19].
Most of the proposed 2D-to-3D video conversion algorithms use a framework that falls into this last category, i.e., a hybrid geometry- and image-based approach. The framework commonly used for automatic 2D-to-3D video conversion basically consists of two elements (Fig. 1), namely: the extraction of depth information and the generation of stereoscopic images in accordance with both the estimated depth information and the expected viewing conditions.

Fig. 1. Diagram for automatic 2D-to-3D video conversion.

The extraction of depth information aims to exploit pictorial cues and motion parallax, contained in a single 2D image or video, to recover the depth structure of the scene. The retrieved depth information is then converted into a representation suitable for use in the 2D-to-3D video conversion process. A sparse 3D scene structure and a depth map are two commonly used representations of the incomplete geometry of a captured scene. A sparse 3D scene structure usually consists of a number of 3D real-world coordinates, while a depth map is essentially a two-dimensional (2D) function that provides the depth, with respect to the camera position, as a function of the image coordinates [5], [20]. (We do not distinguish between depth maps containing relative depth information, i.e., disparity maps, and those containing absolute depth information, i.e., range maps, in this paper; theoretically, absolute depth information can be derived from relative depth information if sufficient camera and capture information is available.) For simplicity, we use the term depth information to include both representations of incomplete scene geometry.

The generation of stereoscopic images is the step that involves warping textures from the original images in accordance with the retrieved depth information to create a new image or video sequence for the second eye. The view synthesis algorithms developed for this purpose can also be used to generate multiple new images to meet the format requirements of auto-stereoscopic 3D displays. In this latter case, the specificities of multi-view image content should be taken into account; for example, the possibly wider baseline between multi-view images will generate more disoccluded regions that have to be filled with an advanced hole-filling algorithm such as inpainting.
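To fix ideas, the two-stage framework of Fig. 1 can be summarized as a pair of function stubs and a conversion driver. This is only a minimal sketch; the function and parameter names are illustrative and not taken from the paper.

def extract_depth_information(frames):
    """Stage 1: recover depth information (a per-pixel depth map or a sparse
    3D structure) from one or more monoscopic frames, using any of the
    pictorial or motion cues reviewed in Section III."""
    raise NotImplementedError

def generate_stereoscopic_view(frame, depth_info, viewing_conditions):
    """Stage 2: warp the original texture according to the depth information
    and the expected viewing conditions (Section IV)."""
    raise NotImplementedError

def convert_2d_to_3d(frames, viewing_conditions):
    """Automatic 2D-to-3D conversion: one stereoscopic pair per input frame."""
    pairs = []
    for frame in frames:
        depth_info = extract_depth_information([frame])
        second_view = generate_stereoscopic_view(frame, depth_info, viewing_conditions)
        pairs.append((frame, second_view))
    return pairs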

III. EXTRACTION OF SCENE DEPTH INFORMATION

A variety of depth cues are exploited by human beings to perceive the world in three dimensions. These cues are typically classified into binocular and monocular depth cues. Binocular cues provide depth information when viewing a scene with both eyes, through exploitation of the differences between the perceived images, while monocular cues provide depth information when viewing a scene with one eye. An incomplete list of depth cues is presented in Fig. 2.

Fig. 2. Classification of depth cues.

The extraction of scene depth information aims to convert monocular depth cues contained in video sequences into quantitative depth values for the captured scene. An overview of methods employed for the extraction of depth from monocular depth cues for 2D-to-3D conversion was compiled in [21], which presents a brief description of the methods and a simple comparison of their main characteristics. Monocular depth cues can be subdivided into pictorial and motion cues. In the following, we focus on methods for extracting depth from both pictorial and motion cues.

A. Depths From Pictorial Cues

Pictorial depth cues are the elements in an image that allow us to perceive depth in a 2D representation of the scene. This has been known for centuries and has been extensively applied in the visual arts to enhance the perception of depth. Depth perception can be related to the physical characteristics of the Human Visual System (HVS), such as the perception of depth by accommodation, or can be learned from experience, like the perception acquired from the relative height of objects in the picture, perspective, shadows, and other pictorial cues [22], [23]. An example is shown in Fig. 3, which presents an image from which a clear depth order can be extracted by using pictorial depth cues.

Fig. 3. Pictorial depth cues in a 2D image. Visible depth cues: linear perspective, relative and known size, texture gradient, atmospheric scattering, relative height in picture, and interposition.

The generation of depth information from pictorial cues embedded in an image can be subdivided into two approaches. The first one relates to the extraction of real depth information from the available pictorial cues in an image. By real we mean relative depths between objects in the scene; it is impossible to obtain absolute depths without knowledge of the position and optical characteristics of the capturing device. The second approach creates artificial or non-veridical depth information by exploiting pictorial cues that are commonly found in all scenes of a given category, such as landscapes or indoor scenes. We discuss three categories of pictorial cues commonly used to extract depth information in the following subsections.

1) Depth From Focus/Defocus: Accommodation is the mechanism of the human eye used to focus on a given plane in depth. Real aperture cameras do similarly by focusing on a given plane. In practice, this makes the rest of the scene blurred to a degree that depends on the distance to the focusing plane of the optics. This mechanism can be exploited for the generation of depth information from captured images that contain a focused plane and objects out of the focused plane. This topic is known in the literature as depth-from-focus/defocus and is one of the first mechanisms employed to recover depth from single images [24]-[26].

In practice, there are two main approaches used to implement this mechanism. The first one employs several images with different focus characteristics in order to extract the variation of blur for a given image feature across the available images. This variation can be translated into depth by finding the point where the particular feature would be in focus [27]. Although this approach is reliable and provides good depth estimation, the requirement of having several images of the same scene captured simultaneously with different optical systems is too restrictive to be of any practical use in the 2D-to-3D conversion problem.

The second approach tries to extract the blur information from a single image by measuring the amount of blur associated with each pixel and then mapping the blur measures to the depth of that pixel. A deconvolution process in the frequency domain using inverse filtering was introduced in [28] to recover the amount of blur in an image. To solve instabilities related to inverse filtering in the frequency domain, a regularization approach was proposed in [24]. A local scale control method was alternatively proposed in [29] to detect edges at different levels of blur and to compute the blur associated with those edges; a Gaussian blurring kernel was used to model the blur of edges, and its second derivative was used to measure the spread of the edge in order to extract the blur level. More recently, a wavelet-based approach was proposed in [30], where a wavelet decomposition of macro-blocks within an image was performed to recover the high-frequency content of each macro-block, and the number of high-value wavelet coefficients was counted as a measure of blur. A similar approach was presented in [26], but extended to use 2D wavelet analysis for the detection and analysis of edges and to use color-based segmentation to add consistency to the depth map. High-order statistics have also been used to estimate the amount of blur in images for 2D-to-3D conversion [25].

Although the approach of recovering depth from focus/defocus is relatively simple, it suffers from a major drawback: how to distinguish the foreground from the background when the amount of blur is similar. In many cases the foreground corresponds to the focusing plane, but when this is not the case, it is impossible to distinguish an out-of-focus region in the foreground from an out-of-focus region in the background.
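To illustrate the single-image blur-measurement idea in code, the following is a generic sketch that uses local high-frequency energy per block as an inverse blur measure. It is not the specific wavelet method of [30]; the block size, the Laplacian-based blur measure, and the assumption that the sharpest regions are nearest to the camera are all illustrative choices.

import cv2
import numpy as np

def depth_from_defocus(image_bgr, block=16):
    """Assign a relative depth per block from a simple blur measure:
    low high-frequency energy -> more blur -> assumed farther from the
    focused plane (the foreground/background ambiguity noted above remains)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    highpass = cv2.Laplacian(gray, cv2.CV_32F)           # high-frequency content
    h, w = gray.shape
    depth = np.zeros((h, w), np.float32)
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            sharpness = np.abs(highpass[y:y + block, x:x + block]).mean()
            depth[y:y + block, x:x + block] = sharpness
    # normalize so the sharpest (assumed in-focus, nearest) blocks map to 255
    depth = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX)
    return depth.astype(np.uint8)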

2) Depth From Geometric Cues: Depth from geometric cues is an interesting approach to obtaining depths from a 2D image. Geometry-related pictorial depth cues are linear perspective, known size, relative size, height in picture, interposition, and texture gradient. Some of these cues are stronger than others. Interposition, for example, tells us the depth order of objects, but not the distance in depth between them. Some cues may be hard to use in an application for the estimation of depth. For example, information related to the size of objects is hard to use, since it requires the identification of objects and knowledge of the normal sizes of those objects. The most commonly used geometric cues are linear perspective and height in picture.

Linear perspective refers to the property of parallel lines converging at infinite distance, or equivalently, that a fixed-size object produces a smaller visual angle when more distant from the eye. This characteristic is used for depth estimation by detecting parallel lines in the images and identifying the point where these lines converge (the vanishing point). A suitable assignment of depth can then be derived based on the position of the lines and the vanishing point [31], [32].

Height in picture denotes that objects closer to the bottom of the image are generally closer to the viewer than objects at the top of the picture. Outdoor and landscape scenes mainly contain this pictorial depth cue. To extract this cue, horizontal lines usually have to be identified so that the image can be divided into stripes that go from the left border to the right border. For this purpose, a line-tracing algorithm is applied to recover the optimal dividing lines subject to some geometric constraints [33]. A depth-refining step is further applied to improve the quality of the final depth map. For assigning depths to a 2D image, a predefined depth model, which can be adjusted according to the image structure, is used [34].

Aside from linear perspective and height in picture, it is also possible to recover depths from texture (shape-from-texture), which aims to estimate the shape of a surface based on cues from markings on the surface or its texture [35]-[37]. Those methods, however, are normally restricted to specific types of images and cannot be applied to 2D-to-3D conversion of general video content.
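A minimal sketch of the height-in-picture cue, mentioned above as one of the two most commonly used geometric cues: depth is assigned as a ramp from the bottom of the image (near) to an assumed horizon row (far). The fixed horizon position and the linear ramp are illustrative assumptions; they stand in for the line-tracing and depth-refinement steps of [33] and the adjustable depth models of [34].

import numpy as np

def depth_from_height_in_picture(height, width, horizon_row=None):
    """Depth ramp for the height-in-picture cue: rows near the bottom of the
    image are treated as near (255), rows at and above the horizon as far (0)."""
    if horizon_row is None:
        horizon_row = height // 3          # assumed horizon one third from the top
    rows = np.arange(height, dtype=np.float32)
    ramp = np.clip((rows - horizon_row) / (height - 1 - horizon_row), 0.0, 1.0) * 255.0
    depth = np.zeros((height, width), np.float32)
    depth[:] = ramp[:, None]               # same depth value along each row
    return depth.astype(np.uint8)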
3) Depth From Color and Intensity Cues: Variations in the amount of light arriving at the eye can also provide information about the depth of objects. This type of variation is reflected in captured images as variations of intensity or changes in color. Depth cues based on this mechanism include atmospheric scattering, light and shadow distribution, figure-ground perception, and local contrast.

Atmospheric scattering refers to the scattering of light rays by the atmosphere, producing a bluish tint and less contrast for objects in the far distance and better contrast for objects in close range [38]. Based on color rules, which are learned heuristically from a large number of landscape/outdoor images, semantic region detection is performed to divide landscape/outdoor images into six regions: sky, farthest mountain, far mountain, near mountain, land, and other [31]. A depth value is then assigned to each region so that the sky region is the farthest one and the "other" region the closest one. The warm/cool color theory is used to generate depths from images in [34], where a pixel with a warm color is assigned a depth closer to the camera than a pixel with a cool color. This color-based depth value is further added to blended pre-defined depth data, selected according to the content of the scene, to provide the final virtual depth data for virtual stereoscopic view generation.

Color-dependent information is also used in the depth creation algorithm of [39] and in the 2D-to-3D conversion system of [40]. In the proposed method, the Cr component of the YCbCr color space is used as a substitute for the actual depth in a scene (called a surrogate depth). The Cr component is first mapped to a given range of depth, and an adjustment is then made for very bright red regions in order to prevent them from coming too close to the viewer. The selection of the bright red regions takes skin tones into account to avoid negatively affecting the depth of human characters. Experimental evidence shown in [39] suggests that this simple approach can provide a significant enhancement of the 3D effect with respect to the perceived depth in 2D images.

Light and shadow distribution refers to the information provided by shadows with respect to the position and shape of objects relative to other objects and the background. The problem of acquiring depths from light and shadow distribution is known as shape-from-shading in computer vision. Shape-from-shading aims to extract the shape of objects by identifying the position of the light source and recovering the shape of the reflecting surface through an analysis of the variations in intensity. A good survey of these techniques is found in [41], where six algorithms are reviewed and compared.

Figure-ground perception [42] is another mechanism that helps in the perception of depth. Edges and regions in the image are the depth cues providing this information. For two contiguous regions in the visual field, the common perceptual outcome is that the edge between them appears to be a boundary for only one of them. The region claiming the edge appears to have a definite shape and is called the figure. The contiguous region appears shapeless near the edge and is perceived to continue behind it; it is called the ground. Thus, in addition to being shaped, the figure appears nearer than the ground. A method based on this principle is proposed in [43], where detected edges/object boundaries are used as substitutes for the depth. Although only edges/object boundaries are assigned depth values, the HVS seems to be able to fill objects with the appropriate depths and further separate them from the local background. Experimental evidence suggests that this approach provides enhanced 3D effects with respect to the perceived depth in a 2D image.
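A minimal sketch of the color-based surrogate-depth idea described earlier in this subsection: one chrominance channel is linearly mapped to a depth range, with a simple attenuation for very bright, strongly red regions so that they are not pushed too close to the viewer. The depth range, the thresholds, and the red-region test (which ignores skin tone) are illustrative assumptions, not the exact rules of [39], [40].

import cv2
import numpy as np

def surrogate_depth_from_color(image_bgr, near=220, far=30):
    """Map a chrominance component to a depth range as a surrogate depth."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    luma, cr, _ = cv2.split(ycrcb)
    # linear mapping of the chrominance component to [far, near]
    depth = far + (cr.astype(np.float32) / 255.0) * (near - far)
    # pull back very bright, strongly red pixels (illustrative threshold)
    too_close = (cr > 200) & (luma > 200)
    depth[too_close] = 0.5 * (near + far)
    return depth.astype(np.uint8)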

More recently, a depth creation method based on the analysis of visual attention was proposed in [44], where a saliency map acts directly as a depth map. Local features such as color contrast, luminance, orientation, texture, and motion are used to provide corresponding individual saliency maps. These individual saliency maps are then fused in a linear or non-linear manner to obtain a final saliency map that is a good estimator of a visual attention map. Experimental evidence demonstrates that the resulting depth map provides good-quality depth [44].

It is worth noting that most of the methods based on these intensity- and color-related depth cues do not attempt to recover the true depth of the scene, but to provide an S3D image pair that is perceptually acceptable, with an increased sensation of depth with respect to the original 2D images.

B. Depths From Motion Cues

In contrast to the pictorial depth cues described in the previous subsection, video sequences provide motion parallax as an additional cue to depth [45]. Motion parallax refers to the relative motions of objects across the retina. For a moving observer, near objects move faster across the retina than far objects, so relative motion provides an important depth cue. This is usually called the principle of the depth-from-motion-parallax approach.

Note that not all video sequences provide motion parallax as a cue to depth. In principle, only video sequences captured by a freely moving camera have motion parallax that is closely related to the captured scene structure. If the camera does not move, the captured video sequence has no motion parallax. Even if there are some Independently Moving Objects (IMOs) in the scene, their motions will provide some cue to depth only under certain circumstances, and this cue can be error-prone. In addition, different camera motions lead to different strengths of depth perception. A freely moving camera can provide more information about the depth in the scene than a camera with a zooming/translating motion along the optical axis. A camera that only rotates around the optical axis does not provide any information about depth.

For video sequences, more than one image is available for the extraction of depth information. Algorithms such as structure-from-stereo (SfS) [46]-[48] and structure-from-motion (SfM) [49]-[51] can be applied for this purpose. These algorithms consist of two parts: the determination of motion parallax from the sequence, and the mapping of motion parallax into depth information.

1) Motion Parallax Between Images: Motion parallax may be seen as a form of disparity over time and allows depth to be perceived from the differences between two frames in a video sequence. These differences are observed in the video as image motion; by extracting this image motion, the motion parallax can be recovered. Image motion may relate to the whole image (global motion estimation) or to specific parts, such as rectangular blocks, arbitrarily shaped patches, or even individual pixels. A motion model that approximates the motion of a moving object or of a real video camera, such as rotation and translation in all three dimensions and zoom, can describe the motion in the image. Motion model parameters are determined by motion estimation methods.
Motion estimation is an ill-posed problem, as the motion is in three dimensions but the images are a projection of the 3D scene onto a 2D plane. Motion estimation methods can generally be classified into direct and indirect methods. Direct methods estimate the motion model parameters directly from the images. They include, but are not limited to, block-matching algorithms [52], [53], pixel-recursive algorithms [54], MAP/MRF-based Bayesian estimators [55]-[57], and optical flow estimation [58]. Indirect methods estimate the motion model parameters through image features: they detect image features such as corners [59], [60] and track them throughout the sequence [61]. From these tracked image features, the motion model parameters are then estimated by a statistical method such as RANSAC [62], which removes the tracked image features (outliers) that do not correspond to the actual motion.
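As a concrete illustration of the indirect approach just described, a minimal sketch using OpenCV: corners are detected in one frame, tracked into the next frame with pyramidal Lucas-Kanade optical flow, and outliers are removed by fitting the epipolar geometry with RANSAC. All parameter values are illustrative.

import cv2
import numpy as np

def track_features_ransac(frame0_gray, frame1_gray):
    """Detect corners, track them into the next frame, and keep only the
    correspondences consistent with a RANSAC-estimated fundamental matrix."""
    pts0 = cv2.goodFeaturesToTrack(frame0_gray, maxCorners=1000,
                                   qualityLevel=0.01, minDistance=7)
    pts1, status, _ = cv2.calcOpticalFlowPyrLK(frame0_gray, frame1_gray, pts0, None)
    ok = status.ravel() == 1                        # keep successfully tracked points
    pts0 = pts0[ok].reshape(-1, 2)
    pts1 = pts1[ok].reshape(-1, 2)
    # RANSAC fit of the epipolar geometry removes outlier tracks
    F, mask = cv2.findFundamentalMat(pts0, pts1, cv2.FM_RANSAC, 1.0, 0.99)
    inliers = mask.ravel().astype(bool)
    return pts0[inliers], pts1[inliers], F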

For a static scene, image motion is caused only by camera motion. This camera-induced image motion depends on the camera motion parameters and the scene structure. In scenes with independently moving objects (IMOs), the IMOs have a relative motion with respect to the camera that is different from that of the background. The existence of IMOs violates the principle of depth from motion parallax and makes the motion-to-depth conversion ambiguous. Therefore, IMOs are either removed from the image for the background reconstruction process [63] or detected and segmented for an individual 3D reconstruction [64], [65]. In [64], the motion vectors of IMOs are directly converted into depths; compared to their real depths in the captured scene, these IMOs visually pop up and catch more attention when they have a motion direction similar to that of the camera. In [65], the segmented IMOs are rendered into the generated background sequence.

For depth from motion parallax, the actual camera motion trajectories have to be estimated from the video sequence. Due to occlusion and noise in captured video sequences, the estimates of the motion model parameters are usually inaccurate and temporally inconsistent. To improve the accuracy of the estimates, it is common to impose geometric coherence constraints such as trifocal or epipolar constraints [58]. Once the initial estimates for each frame are obtained, bundle adjustment [66] or bundle optimization [54] can be applied to refine the estimated parameters by taking temporal and spatial geometric constraints into account.

2) Conversion of Motion Parallax Into Depth Information: Depending on the depth representation, the motion parallax estimated from a video sequence is converted into depth information in terms of either a 2D depth map or a 3D scene structure.

a) 2D depth map reconstruction: A 2D depth map can be reconstructed from image motion. The magnitudes of the motion vectors within each video frame are directly treated as depth values when consecutive video frames are taken in almost parallel viewing or when they are acquired with a small baseline distance between two consecutive viewpoints. Such a motion-to-depth mapping might not generate correct depth magnitudes, but it yields a correct depth order in the absence of IMOs in the scene.

Several motion-to-depth mappings have been proposed in the literature. A linear mapping is presented in [67], which divides each video frame into blocks, directly calculates a motion vector per block, and then uses its magnitude as the depth value of that block. A scaled linear mapping is presented in [52], [68], which further scales the magnitudes of the motion vectors to keep the maximal depth value constant across all video frames. A nonlinear mapping is proposed in [53] to enhance the perceptual depth, where the whole scene is divided into several depth layers and each depth layer is assigned a different scaling factor. Besides the motion vector magnitude, additional factors such as camera movement and scene complexity are also taken into account for mapping the motion information of each video frame into depth values [69].

A captured video sequence contains so-called stimulus-shearing motion in the image plane [45] when the camera moves around a particular object in the scene. In such a case, objects that lie between that particular object and the moving camera move in the image plane in the direction opposite to that of the camera. These objects are closer to the camera than other objects that move in the same direction as the camera. To take this into account, the dominant camera motion direction is included in the motion-to-depth mapping [68]. For this kind of motion-to-depth mapping, feature-based motion estimation results in more accurate and stable depth values than block-matching-based motion estimation, as shown by preliminary experiments on video sequences with only camera panning [68].

The path taken by the moving camera (the camera track) and the camera parameters are required for the mapping of image motion into a depth value when the camera motion is complicated or a more realistic depth value is desired [54], [70]. For this, the camera parameters, including intrinsic and extrinsic parameters, are estimated by structure-from-motion techniques [71]-[73]. Once the camera parameters are obtained, the depth of each pixel is calculated by triangulation of the corresponding pixels between two images. Such a mapping generates a depth map that closely approximates the correct relative depths of the scene.
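A minimal sketch of the block-based motion-to-depth mapping described above: a motion field is estimated between consecutive frames, and the per-block motion magnitude, scaled so that the maximum depth stays constant across frames, is used as the depth of that block. It assumes near-parallel viewing and no IMOs, and dense Farneback optical flow is used here in place of an explicit block matcher; block size and scaling are illustrative.

import cv2
import numpy as np

def depth_from_motion(prev_gray, cur_gray, block=16, max_depth=255.0):
    """Treat the per-block motion magnitude as depth: larger apparent
    motion -> closer to the camera (valid in the absence of IMOs)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    h, w = mag.shape
    depth = np.zeros_like(mag)
    for y in range(0, h, block):
        for x in range(0, w, block):
            depth[y:y + block, x:x + block] = mag[y:y + block, x:x + block].mean()
    # scaled linear mapping: keep the maximum depth constant across frames
    depth = depth / (depth.max() + 1e-6) * max_depth
    return depth.astype(np.uint8)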
b) Sparse 3D scene structure reconstruction: A 3D scene structure is a scene representation that consists of a number of 3D real-world coordinates. As stated in Section II, a complete scene structure reconstruction is normally impracticable for automatic 2D-to-3D video conversion. Therefore, we briefly describe the steps for the reconstruction of a sparse 3D scene structure [51], [63]. A sparse 3D scene structure is represented by a set of 3D feature points in a reconstructed 3D world.

The first step of the process is the initial 3D scene structure reconstruction and camera track estimation from the first pair of images, which uses the tracked feature points to estimate the fundamental matrix. Then, the projection matrices for these two video frames are determined using singular value decomposition (SVD). If the internal calibration parameters are unknown, which is generally the case for TV broadcasts, home videos, or cinema movies, a self-calibration procedure has to be carried out. Once the projection matrices are known, the 3D points of the tracked feature points can be found via optimal triangulation as described in [74].

The next step is to update the initial 3D scene structure and camera tracks using consecutive video frames. First, the camera projection matrix for the next selected video frame is determined using the already existing correspondences between the previously reconstructed 3D points and the tracked 2D points in the current frame. Then, the 3D scene structure and the camera tracks are refined with all tracked feature points between the current frame and the previous ones. This refinement is done via global nonlinear minimization techniques, also known as bundle adjustment [66]. This procedure is repeated for all selected frames, until the refined estimates of the sparse 3D scene structure and camera tracks reach the desired accuracy (Fig. 4).

Fig. 4. Sparse 3D scene structure and camera track determined by SfM and positioning of a virtual stereo camera.

The sparse 3D scene structure reconstruction is a very computationally intensive process if it is performed with all the frames of a video sequence. Such a reconstruction is also heavily dependent on the initial structure computation. Camera motion degeneracy can make the estimation of the fundamental matrix fail [75], and consecutive frame pairs may have baselines that are too short for accurate triangulation. To address these problems, the reconstruction of the sparse 3D scene structure and camera tracks is usually performed only on selected video frames, called key frames, chosen in order to reduce computational complexity, to improve triangulation accuracy, and to avoid camera motion degeneracy. An alternative approach to these problems is to use a prioritized sequential 3D reconstruction as proposed in [51] for fast and reliable structure and motion computation. The main element of this approach is that key frames are selected according to a priority metric, and the frame pair with the highest priority metric is used for the initial reconstruction.
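A minimal sketch of the two-frame initialization, under the simplifying assumption of known intrinsics (so the essential matrix replaces the fundamental-matrix plus self-calibration route described above). pts0 and pts1 are tracked inlier correspondences such as those produced by the RANSAC step in Section III-B1; the bundle-adjustment refinement over further frames is omitted.

import cv2
import numpy as np

def initial_sparse_structure(pts0, pts1, K):
    """Two-view initialization: relative camera pose + sparse 3D points.
    pts0, pts1: Nx2 float32 inlier correspondences, K: 3x3 camera intrinsics."""
    E, mask = cv2.findEssentialMat(pts0, pts1, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, mask = cv2.recoverPose(E, pts0, pts1, K, mask=mask)
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])     # first camera at the origin
    P1 = K @ np.hstack([R, t])                             # second camera pose
    X_h = cv2.triangulatePoints(P0, P1, pts0.T, pts1.T)    # homogeneous 4xN points
    X = (X_h[:3] / X_h[3]).T                               # Nx3 sparse 3D structure
    return R, t, X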

IV. GENERATION OF STEREOSCOPIC IMAGES

The procedures for the generation of stereoscopic images vary with the representation of the depth information. In the following, we address methods that rely on 2D depth maps and methods that rely on a sparse 3D scene structure.

A. Approaches Based on 2D Depth Maps

Depth-image-based rendering (DIBR) permits the creation of novel images, using information from depth maps, as if they were captured with a camera from different viewpoints [76]. A DIBR system usually consists of three steps: (i) pre-processing of the depth map, (ii) 3D image warping, and (iii) hole-filling. The pre-processing step includes choosing the zero-parallax setting (ZPS) [77], converting depth maps from 8-bit gray values to metric depth values [16], and optionally smoothing the depths to reduce block artifacts and/or sharp transitions around object boundaries. The 3D image warping step renders a virtual image of any nearby viewpoint by projecting the pixels of the original image to their proper 3D locations and re-projecting them onto the virtual image plane (Fig. 5). The virtual camera is usually positioned to form a parallel stereoscopic setup with the original one; it has the same intrinsic parameters and the same optical direction. The last step is to fill in the newly exposed areas resulting from the difference in viewpoints.

Fig. 5. Parallel camera configuration used for the generation of virtual stereoscopic images. f is the focal length of the camera. c is the viewpoint of the original center image; c_l and c_r are the viewpoints of the virtual left- and right-eye images to be generated. t_c is the distance between these two virtual cameras. Under this configuration, a point X with depth Z in the world is projected onto the image planes of the three cameras at pixels x = (x, y), x_l = (x_l, y), and x_r = (x_r, y), respectively.

The most significant problem in DIBR is how to deal with the newly exposed areas (holes) that appear in the virtual images. Holes appear along object boundaries where the depth changes (Fig. 6). They are due to the accretion (disocclusion) of portions/regions of objects or background that would have been visible only from the new viewpoint, but not from the original location used to capture the original image [78], [79]. There is no information in the original color image and its associated depth map about these disoccluded regions and, therefore, they appear empty, like holes, in the new virtual image.

Fig. 6. Stereoscopic image synthesis with DIBR. (a) Original color image "Interview", (b) corresponding depth image, and (c) rendered image without hole-filling. Holes are marked in green.

One solution is to replace the missing image areas (holes) during the view synthesis with useful color information. A simple method is to horizontally interpolate the information in the holes; no depth information is taken into account, so foreground and background objects are fused together to fill in the holes [16], [80]. Another simple method is to horizontally extrapolate the values of pixels on the boundary that belongs to the local background object [16], [80]. This method assumes that the hole is part of the local background object, and depth information is taken into account to avoid using local foreground objects, which are visible and not occluded, in the hole-filling process. A more sophisticated method is to diffuse pixel values at the boundaries by solving the Laplace equation on and near the hole boundaries and propagating the values inside the hole [81]. Another sophisticated method is the so-called depth-aided inpainting method [82], which uses inpainting techniques [83] to fill in the holes by incorporating a depth cue into the texture estimation. A temporal inpainting approach is also proposed in [84], which looks for the occluded information in other frames. In addition, a wavelet-transform-based hole filling is presented in [85].
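A minimal sketch of DIBR-style warping for the parallel setup of Fig. 5: each pixel of the original view is shifted horizontally by a disparity derived from its depth value, nearer pixels overwrite farther ones, and remaining holes are filled by copying the horizontally nearest valid pixel on the background side. The depth-to-disparity mapping (maximum disparity, zero-parallax depth) and the crude fill rule are illustrative choices, not the specific methods of the references above.

import numpy as np

def dibr_right_view(image, depth, max_disp=24, zps=128):
    """Render a virtual right-eye view by horizontal pixel shifting.
    image: HxWx3 uint8, depth: HxW uint8 (255 = nearest),
    zps: depth value mapped onto the screen plane (zero parallax)."""
    h, w = depth.shape
    disp = np.round((depth.astype(np.float32) - zps) / 255.0 * max_disp).astype(int)
    right = np.zeros_like(image)
    zbuf = np.full((h, w), -1, dtype=np.int32)      # warped depth; -1 marks a hole
    for y in range(h):
        for x in range(w):
            xt = x - disp[y, x]                      # horizontal shift only
            if 0 <= xt < w and depth[y, x] > zbuf[y, xt]:
                right[y, xt] = image[y, x]           # nearer pixels win (z-buffer)
                zbuf[y, xt] = depth[y, x]
    # crude hole filling: copy the nearest valid pixel to the right, which for a
    # rightward camera shift normally lies on the background side of the hole
    for y in range(h):
        for x in range(w - 2, -1, -1):
            if zbuf[y, x] < 0 and zbuf[y, x + 1] >= 0:
                right[y, x] = right[y, x + 1]
                zbuf[y, x] = zbuf[y, x + 1]
    return right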
However, these solutions result in so-called texture artifacts due to the duplication of neighboring spatial color information as well as inaccuracies in the depth boundaries of objects. To reduce texture artifacts, another solution is to pre-process the depth maps [9], [16], [86]. The disocclusion problem arises where there are sharp edges in the depth map; smoothing edge transitions in the depth map helps avoid the appearance of disocclusions. Consequently, it improves the image quality of the synthesized view and reduces texture artifacts in the disoccluded regions.

One undesirable effect of smoothing edges is geometrical distortion; for example, vertical straight lines in the left image can be rendered as curves in the right image if they are close to a sharp depth edge. To reduce this geometric distortion, asymmetric smoothing of the depth map, with more strength in the vertical direction than in the horizontal direction, is proposed in [87]; this is consistent with known characteristics of the binocular system of the human eyes. In contrast to the constant smoothing strength applied to the whole depth map in [87], more sophisticated methods for adaptive smoothing of depth maps have been proposed in the literature, for example, edge-dependent depth filtering [88], distance-dependent depth filtering [89], discontinuity-adaptive smoothing filtering [90], and layered depth filtering [91]. Nevertheless, smoothing of depth maps, while attenuating some artifacts, leads to a reduction of the depth contrast in the rendered stereoscopic views. Future studies will be required to examine this trade-off more closely, although based on the studies conducted so far the benefits appear to outweigh this disadvantage [86], [92].
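A minimal sketch of the asymmetric pre-smoothing idea: a Gaussian filter with a larger standard deviation in the vertical direction than in the horizontal direction is applied to the depth map before warping. The sigma values are illustrative, not the settings used in [87].

import cv2

def smooth_depth_asymmetric(depth, sigma_x=5, sigma_y=20):
    """Asymmetric Gaussian smoothing of a depth map: stronger vertically than
    horizontally, to reduce disocclusions while limiting the bending of
    vertical structures in the rendered view."""
    # ksize=(0, 0) lets OpenCV derive the kernel size from the sigma values
    return cv2.GaussianBlur(depth, (0, 0), sigmaX=sigma_x, sigmaY=sigma_y)

The smoothed map would typically replace the raw depth map passed to the warping step above.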

B. Approaches Based on Sparse 3D Scene Structure

This subsection reviews stereoscopic view generation based on a sparse 3D scene structure, for which a procedure is proposed in [93]. The basic idea in [93] is to determine the transformation between the original and virtual views, based on the sparse 3D scene structure, to enable the generation of virtual views. The procedure consists of three steps, described in the following paragraphs.

1) Setup of Virtual Stereo Rig: The first step of the virtual stereo rig setup is to determine the projection matrix of the virtual camera. A virtual stereoscopic camera can be defined for each original video frame to simulate a stereo camera rig, as indicated in Fig. 4 [94]. In the case of a 2D to multi-view conversion, several virtual cameras have to be defined for each video frame. The virtual camera is usually positioned in such a way that both cameras form a stereoscopic parallel configuration. The rotation matrix of the virtual camera is identical to that of the corresponding original view, which is estimated during the reconstruction of the sparse 3D scene structure. The internal parameters are set to be identical as well. The translation vector of the virtual camera is defined by the desired stereoscopic view setup to adjust the baseline of the virtual stereo rig, i.e., to control the desired 3D effect. With this virtual camera setup, the projection matrix of the virtual camera can be determined [50].

2) Determination of Planar Homographies for Image Warping: The second step is to determine the relation between the virtual and neighboring original views to enable image warping. In the case of a small baseline, the relation between the virtual and neighboring original views can be represented by a planar homography (Fig. 7).

Fig. 7. Stereoscopic view synthesis using planar homographies.

The planar homography H is a perspective transformation that transfers a pixel point x_j of an original view into the corresponding pixel point x'_j of the virtual view, namely:

x'_j = H x_j.   (1)

To enable this transformation, the planar homography H has to be determined. This can be done by using the depth information represented by the reconstructed sparse 3D scene points X_j, which are projected into the virtual image plane according to the corresponding projection matrix P':

x'_j = P' X_j.   (2)

These points in the virtual image plane correspond to feature points x_j in the original view. With these corresponding feature pairs, the planar homography H between the original view and the virtual view can be estimated. H is a 3x3 matrix and has 8 independent parameters [50], [57]. A minimum of four corresponding point pairs is required to provide a solution up to a scale factor. To obtain a reliable solution, more than four pairs are required. In this case, the planar homography is estimated by a regression method that minimizes the error

E = Σ_j C(x'_j, H x_j),   (3)

i.e.,

H = argmin_H Σ_j C(x'_j, H x_j),   (4)

where C is a cost function, such as the mean square error (MSE) or a likelihood cost, that measures how well the planar homography matches the corresponding feature pairs.

3) Virtual View Generation: The final step is to render the virtual view from the original view to build a stereoscopic image pair. Usually, the original view closest to the virtual view is selected. Once the planar homography between the virtual view and the closest original view is determined, all pixel values of the original view are projected to their corresponding locations in the virtual view by applying (1). Normally, these positions do not exactly coincide with the pixel grid of the virtual view; the values of the pixels at the grid locations are then obtained by bilinear or bicubic interpolation [93].

In general, the closest original view does not cover the whole scene that should be visible from the virtual stereo camera, as depicted on the left side of Fig. 8(b), due to the difference in viewpoint. This is particularly the case when the orientations of the two cameras differ significantly. The missing parts mainly appear at the image boundaries of the virtual stereo image. In order to fill the missing parts, additional surrounding views are taken into account: pixels from temporally neighboring views can be projected to their corresponding grid locations in the virtual view using interpolation techniques (see Fig. 8(c) and (d)).

Fig. 8. Virtual view generation from several temporally neighboring original views: (a) original left view of the sequence Dome, (b) virtual right view, rendered only with the closest neighboring view from the original camera track, (c) virtual right view using 30 additional neighboring frames from the camera track, and (d) final virtual right view using 62 additional neighboring frames.

Due to the homography warping involved in the view generation, the view rendering approach based on a sparse 3D scene structure does not produce holes within the images, in contrast to the DIBR approach [94]. In addition, super-resolution techniques can be exploited to enhance the image resolution of the reconstructed stereoscopic views, since additional information from consecutive frames of the monocular video sequence is available [94]. However, such an approach can become error-prone due to the assumption of a global planar homography, which is basically valid only for rotating cameras (virtual and original views sharing the same camera center) or for a scene consisting of a single plane. One possible solution to this restriction is to apply local homographies on segmented image regions.
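A minimal sketch of steps 2) and 3): the reconstructed sparse 3D points are projected into the virtual view with its projection matrix, a planar homography is fitted to the resulting 2D correspondences by robust regression, and the closest original view is warped into the virtual view with bilinear interpolation. cv2.findHomography with RANSAC stands in for the regression of (3)-(4); the reprojection threshold is an arbitrary choice.

import cv2
import numpy as np

def render_virtual_view(original_image, pts2d_orig, pts3d, P_virtual):
    """Warp the closest original view into a virtual view via a planar homography
    estimated from sparse 3D scene points.
    pts2d_orig: Nx2 feature points in the original view,
    pts3d: Nx3 reconstructed 3D points for those features,
    P_virtual: 3x4 projection matrix of the virtual camera."""
    # project the sparse 3D points into the virtual image plane, cf. eq. (2)
    X_h = np.hstack([pts3d, np.ones((len(pts3d), 1))])
    x_virt = (P_virtual @ X_h.T).T
    x_virt = x_virt[:, :2] / x_virt[:, 2:3]
    # robust regression for the planar homography, cf. eqs. (3)-(4)
    H, _ = cv2.findHomography(pts2d_orig.astype(np.float32),
                              x_virt.astype(np.float32), cv2.RANSAC, 2.0)
    h, w = original_image.shape[:2]
    # warp all pixels of the original view into the virtual view, cf. eq. (1),
    # with bilinear interpolation onto the virtual pixel grid
    return cv2.warpPerspective(original_image, H, (w, h), flags=cv2.INTER_LINEAR)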

C. Summary of Characteristics of Depth Representation

In the previous subsections we presented two different approaches for the generation of stereoscopic images, associated with the two representations of depth information, namely: a 2D depth map representation and a sparse 3D point representation. Both representations have characteristics that make them preferable for particular applications or types of content. These characteristics are summarized in Table I.

TABLE I. SUMMARY OF CHARACTERISTICS OF DEPTH REPRESENTATION.

V. CHALLENGING ISSUES

Even though much research has been done to enable automatic 2D-to-3D conversion, the techniques are still far from mature. Most available products and methods are successful only in certain circumstances. In addition to the limitations of each approach, the following are some key challenging issues to be solved.

1) One issue that directly affects image quality is the occlusion/disocclusion problem during the generation of the stereoscopic images. As long as the camera is horizontally translated, the captured sequence in principle contains enough scene information to generate a new view. When a new version of an image is generated from a virtual viewpoint that is parallel to the current viewpoint, the texture information that is missing from the current viewpoint can be found in its temporally neighboring frames. The question, however, is how to detect this texture information, missing in the current frame, in the other video frames.

2) Depth ambiguity from monocular depth cues is an issue that impacts depth quality. Depth ambiguity originates from violations of the principles of depth generation. One example is retrieving depths from motion parallax: motion parallax can provide reasonable scene depths only when the camera is translated and the scene is static. When the scene contains moving objects, the motion parallax may ambiguously reflect the depth of a moving object. Imagine two objects located at the same distance from the camera but with different motions; according to the principle of depth from motion, the object with the faster motion will be assigned a depth closer to the camera than the other object. One solution to the depth ambiguity of moving objects is to use additional depth cues, such as depth from geometric information, to generate the depth information for each video frame.

3) The integration of various depth cues is another issue affecting the success of automatic 2D-to-3D video conversion, and an extremely important one, since video sequences unavoidably contain different motion types, e.g., camera motion, object motion, or no motion. To retrieve depth from such video sequences, different strategies of depth generation are required. The challenge is how to integrate all the depths extracted from different cues into depths that are not only spatially but also temporally stable and reasonable. In the literature, only some research on the integration of various depth cues in the spatial domain has been reported. A linear integration of depths was proposed in [95], where the weighting factors are determined from the perceived importance of the depth cues (a minimal sketch of such a linear fusion is given after this list). Another depth integration method is proposed in [32], in which two depth maps, one from motion and another from geometry, are integrated into a final depth map according to the binarization results of moving object detection. However, more investigation is still required to provide spatially and temporally consistent depths.

4) The real-time implementation of 2D-to-3D conversion is also a critical issue for the adoption of the proposed techniques by the general public. The availability of real-time 2D-to-3D conversion will allow broadcasters and consumer electronics (CE) manufacturers to remove the natural fear of users of not having enough content for their new 3D-enabled TV sets. Real-time operation, however, adds a new hard constraint that is difficult to meet while maintaining a high quality standard for the converted material. There are several real-time implementations incorporated into TV sets (e.g., Samsung's 3D-TVs), sold as stand-alone equipment (e.g., JVC's IF-2D3D1 Stereoscopic Image Processor), or incorporated into software packages (e.g., DDD's TriDef Player). However, the quality of the resulting stereoscopic images, with respect to the depth sensation, is still an outstanding issue that requires more research.
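As referenced in issue 3, a minimal sketch of a spatial linear integration of per-cue depth maps in the spirit of [95]: each cue's depth map is normalized and combined with a weight reflecting its assumed importance. The equal default weights and the min-max normalization are illustrative assumptions, and the temporal-consistency problem noted above is not addressed.

import numpy as np

def fuse_depth_cues(depth_maps, weights=None):
    """Linear integration of several per-cue depth maps (all HxW, same size)
    into a single depth map; weights encode the assumed importance of each cue."""
    if weights is None:
        weights = [1.0 / len(depth_maps)] * len(depth_maps)
    fused = np.zeros_like(depth_maps[0], dtype=np.float32)
    for cue, wgt in zip(depth_maps, weights):
        cue = cue.astype(np.float32)
        cue = (cue - cue.min()) / (cue.max() - cue.min() + 1e-6)   # normalize to [0, 1]
        fused += wgt * cue
    fused /= (sum(weights) + 1e-6)
    return (fused * 255.0).astype(np.uint8)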
To retrieve the depth from such video sequences, different strategies of depth generation are required. The challenge is how to integrate all the depths extracted from different cues into depth maps that are not only spatially but also temporally stable and plausible. In the literature, only a few methods for integrating various depth cues, and only in the spatial domain, have been proposed. A linear integration of depths was proposed in [95], where the weighting factors are determined from the perceived importance of the depth cues. Another depth integration method is proposed in [32], in which two depth maps, one from motion and another from geometry, are fused into a final depth map according to the binarization results of moving object detection. However, further investigation is still required to provide spatially and temporally consistent depths.

4) The real-time implementation of 2D-to-3D conversion is also a critical issue for the adoption of the proposed techniques by the general public. The availability of real-time 2D-to-3D conversion will allow broadcasters and consumer electronics (CE) manufacturers to remove users' natural fear of not having enough content for their new 3D-enabled TV sets. Real-time operation, however, adds a hard constraint that is difficult to meet while maintaining a high quality standard for the converted material. There are several real-time implementations incorporated into TV sets (e.g., Samsung's 3D-TVs), sold as stand-alone equipment (e.g., JVC's IF-2D3D1 Stereoscopic Image Processor), or incorporated into software packages (e.g., DDD's TriDef Player). However, the quality of the resulting stereoscopic images, with respect to the depth sensation, is still an outstanding issue that requires more research.

VI. CONCLUSION

This paper summarized current technical advances related to the development of automatic 2D-to-3D video conversion. The underlying principle of the conversion is to horizontally shift the pixels of an original image to generate a new version of it. To enable this conversion, different approaches for the extraction of depth information from monoscopic images and for the generation of stereoscopic images were reviewed. A number of challenging issues that have to be solved for the success of automatic 2D-to-3D video conversion were pointed out as possible research topics. With the development of more advanced techniques for 2D-to-3D video conversion, the vast collection of 2D material currently available can be converted into stereoscopic 3D, boosting the general public's interest in purchasing 3D displays and 3D-TV services.

ACKNOWLEDGMENT

The authors would like to express their sincere thanks to Dr. W. J. Tam and R. Klepko for many fruitful discussions and constructive suggestions during the preparation of this manuscript. Thanks are also due to HHI for providing the Interview sequence and to NHK for providing the Balloons sequence.

REFERENCES

[1] S. Yano and I. Yuyama, Stereoscopic HDTV: Experimental system and psychological effects, J. SMPTE, vol. 100, pp. 14–18, 1991.
[2] N. S. Holliman, N. A. Dodgson, and G. Favalora, Three-dimensional display technologies: An analysis of technical performance characteristics, IEEE Trans. Broadcast., 2011, to be published.
[3] H. Tilton, Broadcast standards for 3D HDTV, in Int. Symp. 3-D Image Technol. Arts, Tokyo, Feb. 1992, pp. 187–191.
[4] P. V. Harman, An architecture for digital 3D broadcasting, in SPIE Conf. Stereoscopic Displays Virtual Reality Syst. VI, San Jose, CA, Jan. 1999, vol. 3639, pp. 254–259.
[5] P. V. Harman, Home-based 3D entertainment: An overview, in IEEE Int. Conf. Image Process., Vancouver, 2000, pp. 1–4.
[6] W. J. Tam and L. Zhang, 3D-TV content generation: 2D-to-3D conversion, in IEEE Int. Conf. Multimedia Expo, Toronto, ON, Jul. 9–12, 2006, pp. 1869–1872.
[7] P. Harman, J. Flack, S. Fox, and M. Dowley, Rapid 2D-to-3D conversion, in SPIE Conf. Stereoscopic Displays Virtual Reality Syst. IX, 2002, vol. 4660, pp. 78–86.
[8] G. Iddan and G. Yahav, 3D imaging in the studio, in SPIE Conf. 3-D Image Capture Appl. IV, 2001, vol. 4298, pp. 48–55.
[9] L. Zhang, J. Tam, and D. Wang, Stereoscopic image generation based on depth images, in IEEE Conf. Image Process., Singapore, Oct. 2004, pp. 2993–2996.
[10] O. Grau, M. Price, and G. A. Thomas, Use of 3-D techniques for virtual production, in Proc. SPIE, 2001, vol. 4309, pp. 40–50, invited paper.
[11] S. C. Chan, H.-Y. Shum, and K.-T. Ng, Image-based rendering and synthesis, IEEE Signal Process. Mag., pp. 22–33, Nov. 2007.
[12] C. Liang and K. K. Wong, 3D reconstruction using silhouettes from unordered viewpoints, Image Vis. Comput., vol. 28, no. 4, pp. 579–589, 2010.
[13] J. F. Blinn and M. E. Newell, Texture and reflection in computer generated images, Commun. ACM, vol. 19, no. 10, pp. 542–547, Oct. 1976.
[14] M. Levoy and P. Hanrahan, Light field rendering, in Proc. SIGGRAPH, Aug. 1996, pp. 31–42.
[15] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, The lumigraph, in Proc. SIGGRAPH, Aug. 1996, pp. 43–54.
[16] C. Fehn, Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV, in SPIE Conf. Stereoscopic Displays Virtual Reality Syst. XI, CA, Jan. 2004, vol. 5291, pp. 93–104.
[17] J. Shade, S. Gortler, L. He, and R. Szeliski, Layered depth images, in Proc. SIGGRAPH 98, Jul. 1998, pp. 231–242.
[18] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, High-quality video view interpolation using a layered representation, ACM Trans. Graphics, vol. 23, no. 3, pp. 600–608, 2004.
[19] L. Zhang, D. Wang, and A. Vincent, Adaptive reconstruction of intermediate views from stereoscopic images, IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 1, pp. 102–113, Jan. 2006.
[20] A. Redert, M. Op de Beeck, C. Fehn, W. IJsselsteijn, M. Pollefeys, L. Van Gool, E. Ofek, I. Sexton, and P. Surman, ATTEST: Advanced three-dimensional television system techniques, in Proc. 3DPVT, Padova, Italy, Jun. 2002, pp. 313–319.
[21] Q. Wei, Converting 2D to 3D: A survey, Project Report, Delft University of Technology, The Netherlands, Dec. 2005.
[22] Tutorial: Radar and stereoscopy, visual ability in 3-D depth perception, Canada Centre for Remote Sensing [Online]. Available: http://cct.rncan.gc.ca/resource/tutor/stereo/chap2/chapter2_5_e.php
[23] R. Ostnes, V. Abbott, and S. Lavender, Visualisation techniques: An overview, Part 1, Hydrographic J., no. 113, pp. 4–7, Jul. 2004.
[24] J. Ens and P. Lawrence, An investigation of methods of determining depth from focus, IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 2, pp. 523–531, 1993.
[25] J. Ko, M. Kim, and C. Kim, 2D-to-3D stereoscopic conversion: Depth-map estimation in a 2D single-view image, in Proc. SPIE, 2007, vol. 6696.
[26] G. Guo, N. Zhang, L. Huo, and W. Gao, 2D to 3D conversion based on edge defocus and segmentation, in IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 31–Apr. 4, 2008, pp. 2181–2184.
[27] P. Favaro, Shape from focus/defocus, Washington University, Dept. of Electrical Engineering, Electronic Signals and Systems Research Lab, Jun. 25, 2002.
[28] A. P. Pentland, A new sense for depth of field, IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, pp. 523–531, 1987.
[29] J. H. Elder and S. W. Zucker, Local scale control for edge detection and blur estimation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 7, pp. 699–716, 1998.
[30] S. A. Valencia and R. M. Rodriguez-Dagnino, Synthesizing stereo 3D views from focus cues in monoscopic 2D images, in Proc. SPIE, 2003, vol. 5006, pp. 377–388.
[31] S. Battiato, S. Curti, M. La Cascia, M. Tortora, and E. Scordato, Depth map generation by image classification, in Proc. SPIE, Apr. 2004, vol. 5302, pp. 95–104.
[32] X. Huang, L. Wang, J. Huang, D. Li, and M. Zhang, A depth extraction method based on motion and geometry for 2D to 3D conversion, in 3rd Int. Symp. Intell. Inf. Technol. Appl., 2009, pp. 294–298.
[33] Y.-J. Jung, A. Baik, and D. Park, A novel 2D-to-3D conversion technique based on relative height depth cue, in SPIE Conf. Stereoscopic Displays Appl. XX, San José, CA, 2009, vol. 7237, 72371U.
[34] K. Yamada, K. Suehiro, and H. Nakamura, Pseudo 3D image generation with simple depth models, in Int. Conf. Consum. Electron., Dig. Tech. Papers, Jan. 8–12, 2005, pp. 277–278.
[35] D. A. Forsyth, Shape from texture without boundaries, in Proc. ECCV, 2002, pp. 225–239.
[36] A. Torralba and A. Oliva, Depth estimation from image structure, IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, Sep. 2002.
[37] A. M. Loh and R. Hartley, Shape from non-homogeneous, non-stationary, anisotropic, perspective texture, in Proc. British Mach. Vis. Conf., 2005.
[38] F. Cozman and E. Krotkov, Depth from scattering, in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 1997, pp. 801–806.
[39] W. J. Tam, C. Vázquez, and F. Speranza, 3D-TV: A novel method for generating surrogate depth maps using colour information, in SPIE Conf. Stereoscopic Displays Appl. XX, San José, CA, 2009, vol. 7237.
[40] C. Vázquez and W. J. Tam, CRC-CSDM: 2D to 3D conversion using colour-based surrogate depth maps, in Int. Conf. 3D Syst. Appl. (3DSA), Tokyo, Japan, May 2010.
[41] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah, Shape-from-shading: A survey, IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 8, pp. 690–706, 1999.
[42] E. Rubin, Figure and ground, in Visual Perception, S. Yantis, Ed. Philadelphia, PA: Psychology Press, 2001, pp. 225–229.
[43] W. J. Tam, A. S. Yee, J. Ferreira, S. Tariq, and F. Speranza, Stereoscopic image rendering based on depth maps created from blur and edge information, in Proc. Stereoscopic Displays Appl., Jan. 2005, vol. 5664, pp. 104–115.
[44] J. Kim, A. Baik, Y. J. Jung, and D. Park, 2D-to-3D conversion by using visual attention analysis, in Proc. SPIE, vol. 7524, Feb. 2010, 752412.
[45] B. J. Rogers and M. E. Graham, Motion parallax as an independent cue for depth perception, Perception, vol. 8, pp. 125–134, 1979.
[46] E. Arce and J. Marroquin, High-precision stereo disparity estimation using HMMF models, Image Vis. Comput., vol. 25, no. 5, pp. 623–636, May 2007.
[47] N. Atzpadin, P. Kauff, and O. Schreer, Stereo analysis by hybrid recursive matching for real-time immersive video conferencing, IEEE Trans. Circuits Syst. Video Technol., vol. 14, pp. 321–334, 2004.
[48] D. Scharstein and R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. Comput. Vis., vol. 47, pp. 7–42, 2002.
[49] T. Jebara, A. Azarbayejani, and A. Pentland, 3D structure from 2D motion, IEEE Signal Process. Mag., vol. 16, no. 3, pp. 66–84, May 1999.
[50] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[51] E. Imre, S. Knorr, A. A. Alatan, and T. Sikora, Prioritized sequential 3D reconstruction in video sequences of dynamic scenes, in IEEE Int. Conf. Image Process. (ICIP), Atlanta, GA, 2006.
[52] I. Ideses, L. P. Yaroslavsky, and B. Fishbain, Real-time 2D to 3D video conversion, J. Real-Time Image Process., vol. 2, no. 1, pp. 3–7, 2007.
[53] M. T. Pourazad, P. Nasiopoulos, and R. K. Ward, Generating the depth map from the motion information of H.264-encoded 2D video sequence, EURASIP J. Image Video Process., Article ID 108584, 2010.
[54] G. Zhang, J. Jia, T.-T. Wong, and H. Bao, Consistent depth maps recovery from a video sequence, IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 6, pp. 974–988, 2009.
[55] K. Moustakas, D. Tzovaras, and M. G. Strintzis, Stereoscopic video generation based on efficient layered structure and motion estimation from a monoscopic image sequence, IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 8, pp. 1065–1073, Aug. 2005.
[56] P. Gargallo and P. F. Sturm, Bayesian 3D modeling from images using multiple depth maps, in IEEE Conf. Comput. Vis. Pattern Recog., 2005, vol. 2, pp. 885–891.
[57] P. H. S. Torr, Bayesian model estimation and selection for epipolar geometry and generic manifold fitting, Microsoft Research, Tech. Rep. MSR-TR-2002-29, Apr. 2002.
[58] L. Alvarez, R. Deriche, T. Papadopoulo, and J. Sánchez, Symmetrical dense optical flow estimation with occlusions detection, Int. J. Comput. Vis., vol. 75, no. 3, pp. 371–385, 2007.
[59] J. Shi and C. Tomasi, Good features to track, in IEEE Int. Conf. Comput. Vis. Pattern Recog., Seattle, WA, Jun. 1994.
[60] D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[61] C. Tomasi and T. Kanade, Detection and tracking of point features, Carnegie Mellon University, Tech. Rep. CMU-CS-91-132, 1991.
[62] M. Fischler and R. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[63] S. Knorr, E. Imre, A. A. Alatan, and T. Sikora, A geometric segmentation approach for the 3D reconstruction of dynamic scenes in 2D video sequences, in EUSIPCO, Florence, Italy, Sep. 2006.
[64] C.-C. Cheng, C.-T. Li, Y.-M. Tsai, and L.-G. Chen, Hybrid depth cueing for 2D-to-3D conversion system, in SPIE Conf. Stereoscopic Displays Appl. XX, San José, CA, 2009, vol. 7237.
[65] M. Kunter, S. Knorr, A. Krutz, and T. Sikora, Unsupervised object segmentation for 2D to 3D conversion, in SPIE Conf. Stereoscopic Displays Appl. XX, San José, CA, 2009, vol. 7237.
[66] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon, Bundle adjustment: A modern synthesis, in Vision Algorithms: Theory and Practice, LNCS vol. 1883. Berlin: Springer-Verlag, 2000, pp. 298–372.
[67] M. B. Kim and M. S. Song, Stereoscopic conversion of monoscopic video by the transformation of vertical to horizontal disparity, Proc. SPIE, vol. 3295, pp. 65–75, 1998.
[68] L. Zhang, B. Lawrence, D. Wang, and A. Vincent, Comparison study on feature matching and block matching for automatic 2D to 3D video conversion, in 2nd IEE Eur. Conf. Visual Media Prod., London, U.K., 2005, pp. 122–129.
[69] D. Kim, D. Min, and K. Sohn, A stereoscopic video generation method using stereoscopic display characterization and motion analysis, IEEE Trans. Broadcast., vol. 54, no. 2, pp. 188–197, Jun. 2008.
[70] Y. Matsumoto, H. Terasaki, K. Sugimoto, and T. Arakawa, Conversion system of monocular image sequence to stereo using motion parallax, in SPIE Conf. Stereoscopic Displays Virtual Reality Syst. IV, May 1997, vol. 3012, pp. 108–112.
[71] Q.-T. Luong and O. Faugeras, Self calibration of a moving camera from point correspondences and fundamental matrices, Int. J. Comput. Vis., vol. 22, no. 3, 1997.
[72] M. Pollefeys, Self-calibration and metric 3D reconstruction from uncalibrated image sequences, Ph.D. thesis, K.U. Leuven, Belgium, 1999.
[73] P. R. S. Mendonca and R. Cipolla, A simple technique for self-calibration, in IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog. (CVPR), 1999.
[74] R. Hartley and P. Sturm, Triangulation, Comput. Vis. Image Understanding, vol. 68, no. 2, pp. 146–157, 1997.
[75] P. H. S. Torr, A. W. Fitzgibbon, and A. Zisserman, The problem of degeneracy in structure and motion recovery from uncalibrated image sequences, Int. J. Comput. Vis., vol. 32, no. 1, pp. 27–44, Aug. 1999.
[76] C. Fehn, A 3D-TV approach using depth-image-based rendering (DIBR), in Proc. Vis., Imaging, Image Process., Benalmadena, Spain, Sep. 2003, pp. 482–487.
[77] A. Woods, T. Docherty, and R. Koch, Image distortions in stereoscopic video systems, in SPIE Conf. Stereoscopic Displays Appl., San Jose, CA, Feb. 1993, pp. 36–48.
[78] W. R. Mark, Post-rendering 3D image warping: Visibility, reconstruction, and performance for depth-image warping, Ph.D. thesis, University of North Carolina at Chapel Hill, Chapel Hill, NC, Apr. 1999.
[79] L. McMillan, An image-based approach to three-dimensional computer graphics, Ph.D. dissertation, University of North Carolina, 1997.
[80] C. Vázquez, W. J. Tam, and F. Speranza, Stereoscopic imaging: Filling disoccluded areas in depth image-based rendering, in SPIE Conf. 3-D TV, Video, Display V, 2006, vol. 6392, 63920D.
[81] A. Colombari, A. Fusiello, and V. Murino, Continuous parallax adjustment for 3D-TV, in IEE Eur. Conf. Vis. Media Prod., London, U.K., Nov. 2005, pp. 194–200.
[82] K. Luo, D.-X. Li, Y.-M. Feng, and M. Zhang, Depth-aided inpainting for disocclusion restoration of multi-view images using depth-image-based rendering, J. Zhejiang Univ. Sci. A, vol. 10, no. 12, pp. 1738–1749, Dec. 2009.
[83] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, Image inpainting, in ACM SIGGRAPH, Int. Conf. Comput. Graphics Interactive Tech., New Orleans, LA, Jul. 2000, vol. 174, pp. 417–424.
[84] R. Klein Gunnewiek, R.-P. M. Berretty, B. Barenbrug, and J. P. Magalhães, Coherent spatial and temporal occlusion generation, in Proc. SPIE, vol. 7237, 2009, p. 723713.
[85] L. Jovanov, A. Pizurica, and W. Philips, Wavelet-based stereo images reconstruction using depth images, Proc. SPIE, vol. 6701, p. 67012A, 2007.
[86] W. J. Tam, G. Alain, L. Zhang, T. Martin, and R. Renaud, Smoothing depth maps for improved stereoscopic image quality, in SPIE Conf. 3-D TV, Video, Display III, Philadelphia, PA, Oct. 2004, vol. 5599, pp. 162–172.
[87] L. Zhang and W. J. Tam, Stereoscopic image generation based on depth images for 3D TV, IEEE Trans. Broadcast., vol. 51, no. 2, pp. 191–199, Jun. 2005.
[88] W.-Y. Chen, Y.-L. Chang, S.-F. Lin, L.-F. Ding, and L.-G. Chen, Efficient depth image based rendering with edge dependent depth filter and interpolation, in IEEE Int. Conf. Multimedia Expo (ICME), Amsterdam, The Netherlands, Jul. 2005, pp. 1314–1317.
[89] I. Daribo, C. Tillier, and B. Pesquet-Popescu, Distance dependent depth filtering in 3D warping for 3DTV, in IEEE 9th Workshop Multimedia Signal Process., Crete, Greece, Oct. 2007, pp. 312–315.
[90] S.-B. Lee and Y.-S. Ho, Discontinuity-adaptive depth map filtering for 3D view generation, in 2nd Int. Conf. Immersive Telecommun., Berkeley, CA, 2009.
[91] W. Wang et al., Depth image segmentation for improved virtual view image quality in 3-D TV, in 2007 Int. Symp. Intell. Signal Process. Commun. Syst., Xiamen, China, Nov. 2007, pp. 784–787.
[92] W. J. Tam and L. Zhang, Non-uniform smoothing of depth maps before image-based rendering, in SPIE Conf. 3-D TV, Video, Display III (ITCOM), 2004, vol. 5599, pp. 173–183.
[93] S. Knorr and T. Sikora, An image-based rendering (IBR) approach for realistic stereo view synthesis of TV broadcast based on structure from motion, in IEEE Int. Conf. Image Process. (ICIP), San Antonio, TX, 2007.
[94] S. Knorr, M. Kunter, and T. Sikora, Stereoscopic 3D from 2D video with super-resolution capability, Signal Process.: Image Commun., vol. 23, no. 9, pp. 665–676, Oct. 2008.
[95] Y. L. Chang, J. Y. Chang, Y. M. Tsai, C. L. Lee, and L. G. Chen, Priority depth fusion for 2D-to-3D conversion systems, in SPIE Proc. 3-D Image Capture Appl., 2008, vol. 6805.

Liang Zhang (SM'07) received the B.S. degree from the Chengdu Institute of Radio Engineering in 1982, the M.S. degree from Shanghai Jiaotong University in 1986, and the Dr.-Ing. (Ph.D.) degree in electrical engineering from the University of Hannover, Germany, in 2000. He was a Teaching Assistant from 1987 to 1988 and a Lecturer from 1989 to 1992 with the Department of Electrical Engineering, Shanghai Jiaotong University. From 1992 to 2000, he was a research assistant with the Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, University of Hannover, Germany.
Since 2000, he has been with the Communications Research Centre, Canada, where he is currently a team leader and a Senior Research Scientist. His research interests include image and video coding, image and video processing, stereo image processing, and 3D-TV. Dr. Zhang is a Senior Member of the IEEE BTS and CAS Societies. He was a recipient of a German Academic Exchange Service (DAAD) scholarship. He served as publication co-chair of the IEEE International Conference on Multimedia and Expo, Toronto, Canada, 2006, and was chair and organizer of the special session 3D TV: Primed for Success? at the same conference. Currently, he serves as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and the Journal of Real-Time Imaging.

Carlos Vázquez (M'02) received the B.Eng. degree in electrical engineering with great distinction and the M.Sc. degree in applied computer sciences from the Polytechnic University José Antonio Echeverría (ISPJAE), Havana, Cuba, in 1992 and 1997, respectively. He received the Ph.D. degree in telecommunications from the Institut National de la Recherche Scientifique (INRS), Montréal, Canada, in 2003. He worked at the Telecommunications Department of the Electrical Engineering Faculty of ISPJAE from September 1992 to August 1997, first as a Lecturer and then as an Assistant Professor from 1996. In September 1997 he joined INRS as a Ph.D. student and recipient of an excellence scholarship awarded by the Ministry of Education of Québec. From 2002 to 2004 he was a post-doctoral fellow at INRS-EMT, and in 2004-2005 he worked as a Research Associate in the ECE Department of Concordia University, Montréal, Canada. In August 2005 he joined the Advanced Video Systems Group of the Communications Research Centre Canada (CRC) in Ottawa, Canada, as a Research Scientist. His research interests are in the areas of 3D-TV; stereo and multi-view vision systems; image/video representation, sampling, and interpolation; image and video coding; and motion/disparity/depth estimation and compensation.

Sebastian Knorr received the Dipl.-Ing. and Dr.-Ing. (Ph.D.) degrees in electrical engineering from the Technische Universität Berlin, Germany, in 2002 and 2008, respectively. From 2002 to 2009 he worked as a Research Assistant and Senior Researcher in the field of 3D image processing and computer vision in the Communication Systems Lab at the Technische Universität Berlin, Germany. During this time he was involved in several European Networks of Excellence, e.g., VISNET and 3DTV. Currently, he is chief research officer and managing director of imcube media GmbH. Dr. Knorr is a member of the German Society of Television and Cinema Technology (FKTG e.V.). In 2007, he developed a process for automatic 2D-to-3D image conversion based on advanced computer vision techniques. He received the German Multimedia Business Award of the Federal Ministry of Economics and Technology in 2008, and in 2009 he was recognized by the initiative Germany - Land of Ideas, which is sponsored by the German government, commerce, and industry.