3D-TV Content Creation: Automatic 2D-to-3D Video Conversion Liang Zhang, Senior Member, IEEE, Carlos Vázquez, Member, IEEE, and Sebastian Knorr


Abstract: Three-dimensional television (3D-TV) is the next major revolution in television. A successful rollout of 3D-TV will require a backward-compatible transmission/distribution system, inexpensive 3D displays, and an adequate supply of high-quality 3D program material. With respect to the last factor, the conversion of 2D images/videos to 3D will play an important role. This paper provides an overview of automatic 2D-to-3D video conversion with a specific look at a number of approaches for both the extraction of depth information from monoscopic images and the generation of stereoscopic images. Some challenging issues for the success of automatic 2D-to-3D video conversion are pointed out as possible research topics for the future.

Index Terms: Broadcasting, data conversion, image generation, stereo displays, stereo synthesis, stereo vision, three-dimensional displays, 3D-TV.

Manuscript received October 12, 2010; revised February 08, 2011; accepted February 23, 2011. Date of publication March 28, 2011; date of current version May 25, 2011. L. Zhang and C. Vázquez are with the Communications Research Centre Canada, Ottawa, ON, K2H 8S2, Canada (e-mail: liang.zhang@crc.gc.ca; carlos.vazquez@crc.gc.ca). S. Knorr is with the Communication Systems Lab of Technische Universität Berlin, Berlin D-10587, Germany (e-mail: knorr@imcube.de). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

I. INTRODUCTION

THREE-dimensional television (3D-TV) is anticipated to be the next step in the advancement of television. The term 3D in this context denotes stereoscopic, meaning that a two-view system is used for visualization. Stereoscopic images that are displayed on 3D displays can increase the visual impact and heighten the sense of presence for viewers [1]. The successful adoption of 3D-TV by the general public will depend not only on technological advances in 3D displays [2] and 3D-TV broadcasting systems [3], [4] but also on the availability of a wide variety of program content in stereoscopic 3D (S3D) format for 3D-TV services [5]. The supply of adequate S3D content will be especially critical in the early stages of the 3D-TV rollout to ensure that the public is willing to spend money on 3D displays and 3D-TV services. However, a certain length of time will be required for content providers to capture and create enough S3D material with stereoscopic cameras. We consider the conversion of 2D images/videos to S3D to be one way to alleviate this difficult problem. In this way, the conversion of the vast collection of existing 2D material, in the form of television programs and theatrical films, into stereoscopic imagery should ease any apprehension the general public might have about purchasing 3D displays and 3D-TV services for fear that there are not enough 3D programs to watch.

2D-to-3D conversion techniques can be profitable for content providers, who are always looking for new sources of revenue for their vast libraries of program material. This potential market is attracting many companies to invest manpower and money in developing 2D-to-3D conversion techniques.
The list of companies providing 2D-to-3D conversion includes, but is not limited to, Dynamic Digital Depth (DDD), HDlogix, Himax Technologies, In-Three, Industrial Light and Magic, Legend 3D, Passmorelab, Prime Focus, Samsung, Sassoon Film Design, Sony Image Works, Stereo Pictures, Stereo D, imcube and Victor Company of Japan (JVC).

The fundamental principle of 2D-to-3D conversion techniques rests on the fact that stereoscopic viewing involves binocular processing of two slightly dissimilar images [6]. The human visual system transforms the slight differences (horizontal pixel shifts) between the left-eye and right-eye images into distance information, such that objects are perceived at different depths and outside of the 2D display plane. Thus, converting 2D images to stereoscopic 3D images involves the underlying principle of horizontally shifting pixels to create a new image, so that there are horizontal disparities between the original image and the new version of it. The extent of the horizontal pixel shift depends not only on the distance of an object to the stereoscopic camera, but also on the inter-lens separation that determines the new image viewpoint.

Various approaches for 2D-to-3D conversion have been proposed. These approaches can be classified into three schemes, namely: manual, human-assisted and automatic conversion. The manual scheme shifts the pixels horizontally with an artistically chosen depth value for different regions/objects in the image to generate a new image [5]; hand drawing produces high-quality depth, but is very time consuming and expensive. The human-assisted scheme converts 2D images to stereoscopic 3D with some corrections made manually by an operator [7]. Even though this scheme reduces the time consumed in comparison to the manual conversion scheme, a significant amount of human engagement is still required to complete the conversion. To convert the vast collection of available 2D material into 3D in an economic manner, an automatic conversion scheme is desired.

The automatic conversion scheme exploits depth information originating from a single image or from a stream of images to generate a new projection of the scene with a virtual camera at a slightly different (horizontally shifted) viewpoint. It may be done in real time or in a more time-consuming off-line process. The quality of the resulting product is related to the level of processing involved, so real-time solutions usually produce the lowest quality conversion. There are two key issues to consider for automatic 2D-to-3D conversion techniques: how to retrieve depth information from a monoscopic image or video [8], as well as how to generate high-quality stereoscopic images at new virtual viewpoints [9].
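To make the pixel-shifting principle concrete, the relation below is the standard pinhole result for a parallel-camera (shifted-viewpoint) setup; the symbols are generic and are not taken from this paper. It shows how the required horizontal shift depends on scene depth and on the separation between the two viewpoints.

```latex
% Generic parallel-camera relation (illustrative notation, not from the paper):
%   f   -- focal length of the cameras (in pixels)
%   t_c -- horizontal separation (baseline) between the two viewpoints
%   Z   -- depth of a scene point with respect to the cameras
% The horizontal disparity d between the two projections of the point is
\[
  d \;=\; x_{l} - x_{r} \;=\; \frac{f \, t_{c}}{Z},
\]
% i.e., nearer points (small Z) require larger horizontal pixel shifts,
% and a wider baseline t_c increases all shifts proportionally.
```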

In addition to the 2D-to-3D conversion of standard video content, there are also works that have addressed other types of conversion, for example [10], where a model of an actor is extracted and incorporated into a virtual environment. This type of application, which imposes conditions on the type of content that is converted, is out of the scope of this contribution.

This paper provides an overview of automatic 2D-to-3D video conversion. After the introduction, Section II describes the framework for automatic 2D-to-3D video conversion. The extraction of scene depth information is reviewed in Section III. Section IV is devoted to the generation of stereoscopic images. Some challenging issues are discussed in Section V as future research topics. Section VI concludes this paper.

II. FRAMEWORK FOR AUTOMATIC 2D-TO-3D VIDEO CONVERSION

2D-to-3D video conversion involves the generation of new images from a single 2D image or a single stream of 2D images (video sequence). From this perspective, 2D-to-3D video conversion can be seen, at least conceptually, as a special case of image-based modeling and rendering techniques developed to provide new virtual views (view synthesis) from a given set of images [11]. Image-based modeling and rendering techniques can be classified into three main categories, according to the amount of geometry information explicitly used in the process [11]:

1) Methods that use a full 3D model of the scene: This category requires the complete and accurate reconstruction of a geometric model for the captured scene. Such a model will contain all the required information for the rendering of a new virtual view from a given viewpoint. Structure from silhouette [12], for example, is a commonly used technique to construct 3D models of objects. Given the 3D model and the lighting conditions in the scene, a new virtual view can be easily rendered from a desired viewpoint using conventional computer graphics techniques [13]. In the 2D-to-3D video conversion context, it is usually extremely difficult and error-prone to recover the complete 3D scene structure from a single image or video sequence unless the video is captured under stringent conditions. It is thus impractical to use a full 3D model approach for automatic 2D-to-3D video conversion.

2) Methods that use only images and no explicit geometric information: This category directly renders new virtual views from an over-captured set of images, usually hundreds to thousands of images, with no or very little geometric information, e.g., Lightfields [14] and Lumigraph [15]. In 2D-to-3D video conversion, the number of images available for rendering is normally small, making this approach impractical for automatic 2D-to-3D video conversion.

3) Hybrid methods that explicitly use some geometric information: This category uses a hybrid geometry- and image-based approach. New virtual views are rendered from a limited number of images with the help of incomplete geometrical information about the scene. Methods in this category include depth-image-based rendering (DIBR) [16], layered depth images (LDI) [17], and intermediate view reconstruction (IVR) [18], [19].

Fig. 1. Diagram for automatic 2D-to-3D video conversion.
Most of the proposed 2D-to-3D video conversion algorithms use a framework that falls into this category, i.e., a hybrid geometry- and image-based approach. The framework commonly used for automatic 2D-to-3D video conversion basically consists of two elements (Fig. 1), namely: the extraction of depth information and the generation of stereoscopic images in accordance with both the estimated depth information and the expected viewing conditions.

The extraction of depth information aims to exploit pictorial cues and motion parallax, contained in a single 2D image or video, to recover the depth structure of the scene. The retrieved depth information is then converted into a suitable representation for usage in the 2D-to-3D video conversion process. A sparse 3D scene structure and a depth map are two commonly used representations of the incomplete geometry of a captured scene. A sparse 3D scene structure usually consists of a number of 3D real-world coordinates, while a depth map¹ is essentially a two-dimensional (2D) function that provides the depth, with respect to the camera position, as a function of the image coordinates [5], [20]. For simplicity, we use the term depth information to include both representations of incomplete scene geometry.

The generation of stereoscopic images is the step that involves warping textures from the original images in accordance with the retrieved depth information to create a new image or video sequence for the second eye. The developed view synthesis algorithms can also be used to generate multiple new images to meet the format requirements of auto-stereoscopic 3D displays. In this latter case, the specificities of multi-view image content should be taken into account; e.g., the possibly wider baseline between multi-view images will generate more disoccluded regions that have to be filled with an advanced hole-filling algorithm, such as an inpainting algorithm.

III. EXTRACTION OF SCENE DEPTH INFORMATION

A variety of depth cues are exploited by human beings to perceive the world in three dimensions. These are typically classified into binocular and monocular depth cues.

¹ We do not distinguish between depth maps containing relative depth information ("disparity maps") and those containing absolute depth information ("range maps") in this paper. Theoretically, absolute depth information can be derived from relative depth information if sufficient camera and capture information is provided.

Fig. 2. Classification of depth cues.

Fig. 3. Pictorial depth cues in a 2D image. Visible depth cues: linear perspective, relative and known size, texture gradient, atmospheric scattering, relative height in picture, and interposition.

Binocular cues provide depth information when viewing a scene with both eyes through the exploitation of differences between the perceived images, while monocular cues provide depth information when viewing a scene with one eye. An incomplete list of the depth cues is presented in Fig. 2.

The extraction of scene depth information aims to convert monocular depth cues contained in video sequences into quantitative depth values of a captured scene. An overview of methods employed for the extraction of depth from monocular depth cues for 2D-to-3D conversion was compiled in [21], which presents a brief description of the methods and a simple comparison of their main characteristics. Monocular depth cues can be subdivided into pictorial and motion cues. In the following, we focus on methods for extracting depth from both pictorial and motion cues.

A. Depths From Pictorial Cues

Pictorial depth cues are the elements in an image that allow us to perceive depth in a 2D representation of the scene. This has been known for centuries and has been extensively applied in the visual arts to enhance the perception of depth. Depth perception could be related to the physical characteristics of the Human Visual System (HVS), such as the perception of depth by accommodation, or could be learned from experience, like the perception acquired from the relative height of objects in the picture, perspective, shadows, and other pictorial cues [22], [23]. An example is shown in Fig. 3, which presents an image where a clear depth order can be extracted by using pictorial depth cues.

The generation of depth information from pictorial cues embedded in an image can be subdivided into two approaches. The first one relates to the extraction of real depth information from available pictorial cues in an image. By real we mean relative depths between objects in the scene; it is impossible to obtain absolute depths without knowledge of the position and optical characteristics of the capturing device. The second approach creates artificial or non-veridical depth information by exploiting pictorial cues that are commonly found in all scenes of a given category, like landscapes or indoor scenes. We will discuss three categories of pictorial cues commonly used to extract depth information in the following subsections.

1) Depth From Focus/Defocus: Accommodation is the mechanism of the human eye used to focus on a given plane in depth. Real aperture cameras do similarly by focusing on a given plane. In practice, this makes the rest of the scene blurred by an amount that depends on the distance to the focusing plane of the optics. This mechanism can be exploited for the generation of depth information from captured images, which contain a focused plane and objects out of the focused plane. This topic is known in the literature as depth-from-focus/defocus, which is one of the first mechanisms to be employed to recover depth from single images [24]-[26]. In practice, there are two main approaches used to implement this mechanism. The first one employs several images with different focus characteristics in order to extract the variation of blur for a given image feature across the available images.
This variation can be translated into depth by finding the point where the particular feature should be in focus [27]. Although this approach is reliable and provides good depth estimation, the requirement of having several images of the same scene captured with different optical systems simultaneously is too restrictive to be of any practical application in the 2D-to-3D conversion problem. The second approach tries to extract the blur information from a single image by measuring the amount of blur associated with each pixel and then mapping the blur measures to the depth of that pixel. A deconvolution process in the frequency domain using inverse filtering was introduced in [28] to recover the amount of blur in an image. To solve instabilities related to inverse filtering in the frequency domain a regularization approach was proposed in [24]. A local scale control method was alternatively proposed in [29] to detect edges at different levels of blur and to compute the blur associated to those edges. A Gaussian blurring kernel was used to model the blur of edges and its second derivative was used to measure the spread of the edge in order to extract the blur level. More recently, a wavelet-based approach was proposed in [30], where a wavelet

4 ZHANG et al.: 3D-TV CONTENT CREATION: AUTOMATIC 2D-TO-3D VIDEO CONVERSION 375 decomposition of macro-blocks within an image was performed to recover the high frequency content of that macro-block and the number of high-value wavelet coefficients was counted to be used as a measure of blur. A similar approach was presented in [26], but extended to use 2D wavelet analysis for the detection and analysis of edges and to use color-based segmentation for adding consistency to the depth map. High order statistics has also been used to estimate the amount of blur in images for 2D-to-3D conversion [25]. Although the approach of recovering depth from focus/defocus is relatively simple, it suffers from a major drawback, how to distinguish the foreground from the background when the amount of blur is similar. In many cases the foreground corresponds to the focusing plane, but when this is not the case, then it is impossible to distinguish an out-of-focus region in the foreground from an out-of-focus region in the background. 2) Depth From Geometric Cues: Depth from geometric cues is an interesting approach to obtaining depths from a 2D image. Geometric related pictorial depth cues are linear perspective, known size, relative size, height in picture, interposition, and texture gradient. Some of these cues are stronger than others. The interposition, for example, can teach us the order of objects in depth, but not the distance in depth between them. Some cues might be hard to use in an application for the estimation of depth. For example, the information related to the size of objects is hard to use since it requires the identification of objects and the knowledge of normal sizes for those objects. The most commonly used geometric cues are the linear perspective and the height in picture. Linear perspective refers to the property of parallel lines of converging at infinite distance, or equivalently, a fixed size object will produce a smaller visual angle when more distant from the eye. This characteristic is used for depth estimation by detecting parallel lines in the images and identifying the point where these lines converge (vanishing point). Then a suitable assignment of depth can be derived based on the position of the lines and the vanishing point [31], [32]. The height in picture denotes that objects that are closer to the bottom of the images are generally closer than objects at the top of the picture. Outdoor and landscape scenes mainly contain this pictorial depth cue. To extract this depth cue, horizontal lines usually have to be identified so that the image can be divided into stripes that go from the left border to the right border. For this purpose, a line-tracing algorithm is applied to recover the optimal dividing lines subject to some geometric constraints [33]. A depth-refining step is further applied to improve the quality of the final depth map. For assigning depths to a 2D image, a predefined depth model, which can be adjusted according to image structure, is used [34]. Aside from linear perspective and height in picture it is also possible to recover depths from texture (called shape-from-texture), which aims to estimate the shape of a surface based on cues from markings on the surface or its texture [35] [37]. Those methods, however, are normally restricted to specific types of images and cannot be applied to 2D-to-3D conversion of general video content. 
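As an illustration of how the height-in-picture cue can be turned into a dense depth map, the sketch below assigns depth as a simple function of image row, with everything above an assumed horizon treated as far background. The fixed horizon row and the linear depth ramp are assumptions chosen only for this example; they are not the line-tracing and depth-refinement steps of the cited methods [33], [34].

```python
import numpy as np

def height_in_picture_depth(height, width, horizon_row=None):
    """Create a dense depth map (0 = near, 255 = far) from the
    height-in-picture cue: rows closer to the bottom of the image are
    assumed to be closer to the camera.

    The linear ramp and the fixed horizon are illustrative assumptions.
    """
    if horizon_row is None:
        horizon_row = height // 3          # assume the horizon sits 1/3 from the top

    depth = np.empty((height, width), dtype=np.float32)

    # Below the horizon: depth grows linearly from near (bottom row)
    # to far (horizon row).
    rows_below = np.arange(horizon_row, height, dtype=np.float32)
    ramp = (height - 1 - rows_below) / max(height - 1 - horizon_row, 1)
    depth[horizon_row:, :] = 255.0 * ramp[:, None]

    # At and above the horizon (sky / distant background): maximum depth.
    depth[:horizon_row, :] = 255.0

    return depth.astype(np.uint8)

# Example: a 480 x 640 depth map with the default horizon assumption.
depth_map = height_in_picture_depth(480, 640)
```

In practice such a ramp would be combined with a detected horizon or vanishing point rather than a fixed row, and refined along object boundaries, as the methods discussed above do.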
3) Depth From Color and Intensity Cues: Variations in the amount of light arriving at the eye can also provide information about the depth of objects. This type of variation is reflected in captured images as variations of intensity or changes in color. Depth cues that are based on this mechanism include atmospheric scattering, light and shadow distribution, figure-ground perception, and local contrast.

Atmospheric scattering refers to the scattering of light rays by the atmosphere, which produces a bluish tint and reduced contrast for objects in the far distance and a better contrast for objects at close range [38]. Based on color rules, which are learned heuristically from a large number of landscape/outdoor images, semantic region detection is performed to divide landscape/outdoor images into six regions: sky, farthest mountain, far mountain, near mountain, land, and other [31]. A depth value is then assigned to each region so that the sky region is the farthest one and the other region the closest one. The warm/cool color theory is used to generate depths from images in [34], where a pixel with a warm color is assigned a depth closer to the camera than a pixel with a cool color. This color-based depth value is further added to blended pre-defined depth data, selected according to the content of the scene, to provide the final virtual depth data for virtual stereoscopic view generation. Color-dependent information is also used in the depth creation algorithm of [39] and in the 2D-to-3D conversion system of [40]. In the proposed method, the Cr component in the YCbCr color space is used as a substitute for the actual depth in a scene (called surrogate depth). The component is first mapped to a given range of depth, and an adjustment is then made for very bright red regions in order to prevent them from coming too close to the viewer. The selection of the bright red regions takes the skin tone into account to avoid negatively affecting the depth of human characters. Experimental evidence shown in [39] suggests that this simple approach can provide a significant enhancement of the 3D effect with respect to the perceived depth in 2D images.

Light and shadow distribution refers to the information provided by shadows with respect to the position and shape of objects relative to other objects and the background. The problem of acquiring depths from light and shadow distribution is known as shape-from-shading in computer vision. Shape-from-shading aims to extract the shape of objects by identifying the position of the light source and recovering the shape of the reflecting surface through an analysis of the variations in intensity. A good survey of such techniques is found in [41], where six algorithms are reviewed and compared.

Figure-ground perception [42] is another mechanism that helps in the perception of depth. Edges and regions in the image are the depth cues providing this information. For two contiguous regions in the visual field, the common perceptual outcome is that the edge between them appears to be a boundary for only one of them. The region claiming the edge appears to have a definite shape and is called the figure. The contiguous region appears shapeless near the edge and is perceived to continue behind it; it is called the ground. Thus, in addition to being shaped, the figure appears nearer than the ground. A method based on this principle is proposed in [43], where detected edges/object boundaries are used as substitutes for the depth.
Although only edges/object boundaries are assigned depth values, the HVS seems to be able to fill objects with the appropriate depths and further separate them from the local background. Experimental evidence suggests that this approach

5 376 IEEE TRANSACTIONS ON BROADCASTING, VOL. 57, NO. 2, JUNE 2011 provides enhanced 3D effects with respect to the perceived depth in a 2D image. More recently a depth creation method based on analysis of visual attention is proposed in [44], where a saliency map acts directly as a depth map. Local features such as color contrast, luminance, orientation, texture, and motion are used to provide their correspondent saliency maps, respectively. These individual saliency maps are finally fused in a linear or non-linear manner to achieve a final saliency map that is a good estimator for a visual attention map. Experimental evidence demonstrates that the resulting depth map provides good quality depth [44]. It is worth noting that most of the methods based on these intensity and color related depth cues do not attempt to recover the depth in the scene, but to provide a S3D image pair that is perceptually acceptable with an increased sensation of depth with respect to the original 2D images. B. Depths From Motion Cues Different from the pictorial depth cues as described in the previous subsection; video sequences provide motion parallax as an additional cue to depth [45]. Motion parallax refers to the relative motions of objects across the retina. For a moving observer, near objects move faster across the retina than far objects, and so relative motion provides an important depth cue. This is usually called the principle of the depth from motion parallax approach. Note that not all video sequences will provide motion parallax to depth. In principle, only video sequences that are captured by a freely moving camera have motion parallax that is closely related to the captured scene structure. If the camera has no motion, the captured video sequence does not have motion parallax. Even if there are some Independently Moving Objects (IMOs) in the scene, their motions will provide some cue to depth under certain circumstances, but this cue could be error-prone. In addition, different camera motions will lead to different strengths of depth perception. A freely moving camera can provide more information about the depth in the scene than a camera with a zooming/translating motion along the optical axis. A camera that only rotates around the optical axis does not provide any information about the depth. For video sequences, more than one image is available for the extraction of depth information. Algorithms such as structurefrom-stereo (SfS) [46] [48] and structure-from-motion (SfM) [49] [51] can be applied for this purpose. Those algorithms consist of two parts: determination of motion parallax from the sequence, and the mapping of motion parallax into depth information. 1) Motion Parallax Between Images: Motion parallax may be seen as a form of disparity over time and allows perceiving depth from the differences between two frames in a video sequence. These differences are observed in the video as image motion. By extracting this image motion, the motion parallax could be recovered. Image motion may relate to the whole image (global motion estimation) or specific parts, such as rectangular blocks, arbitrary shaped patches or even per pixel. A motion model that approximates the motion of a moving object or a real video camera, such as rotation and translation in all three dimensions and zoom, can describe the motion in the image. Motion model parameters are determined by motion estimation methods. 
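Below is a minimal sketch of one way such motion can be measured, namely exhaustive block matching on two consecutive grayscale frames; the direct and indirect estimation methods actually used in the literature are classified in the next paragraphs. The block size, search radius, and sum-of-absolute-differences criterion are illustrative assumptions, not parameters from any cited method.

```python
import numpy as np

def block_motion_vectors(prev, curr, block=16, search=8):
    """Estimate one motion vector per `block` x `block` region of `curr`
    with respect to `prev` by exhaustive block matching (sum of absolute
    differences).  Both inputs are 2D grayscale arrays of the same size.
    """
    h, w = curr.shape
    rows, cols = h // block, w // block
    mv = np.zeros((rows, cols, 2), dtype=np.int32)   # (dy, dx) per block

    for by in range(rows):
        for bx in range(cols):
            y0, x0 = by * block, bx * block
            ref = curr[y0:y0 + block, x0:x0 + block].astype(np.int32)
            best, best_dy, best_dx = None, 0, 0
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ys, xs = y0 + dy, x0 + dx
                    if ys < 0 or xs < 0 or ys + block > h or xs + block > w:
                        continue
                    cand = prev[ys:ys + block, xs:xs + block].astype(np.int32)
                    sad = np.abs(ref - cand).sum()
                    if best is None or sad < best:
                        best, best_dy, best_dx = sad, dy, dx
            mv[by, bx] = (best_dy, best_dx)
    return mv

# The per-block magnitudes of these vectors are what the motion-to-depth
# mappings discussed later in this section convert into depth values.
motion_magnitude = lambda mv: np.hypot(mv[..., 0], mv[..., 1])
```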
Motion estimation is an ill-posed problem as the motion is in three dimensions, but the images are a projection of the 3D scene onto a 2D plane. Motion estimation methods can be generally classified into direct and indirect. The direct methods estimate motion model parameters directly from the images. They include, but are not limited to, block-matching algorithm [52], [53], pixel recursive algorithm [54], MAP/MRFbased Bayesian estimator [55] [57], and optical flow estimation [58]. Indirect methods estimate motion model parameters through image features. They detect image features like corners [59], [60] and track image features throughout the sequence [61]. From those tracked image features, motion model parameters are then estimated by a statistical method such as RANSAC [62], which removes the tracked image features (outliers) that do not correspond to the actual motion. For a static scene, image motion is caused only by camera motion. This camera-induced image motion depends on the camera motion parameters and the scene structure. In the case of scenes with independently moving objects (IMOs), the IMOs have a relative motion with respect to the camera that is different from that of the background. The existence of IMOs violates the principle of depth from motion parallax and makes the motion-to-depth conversion ambiguous. Therefore, those IMOs are either removed from the image for the background reconstruction process [63] or detected and segmented for an individual 3D reconstruction [64], [65]. In [64], the motion vectors of IMOs are directly converted into depths. Compared to their real depths in the captured scene, those IMOs will visually pop-up and catch more attention when the IMOs have similar motion direction as the camera. In [65], the segmented IMOs are rendered into the generated background sequence. For depth from motion parallax, actual camera motion trajectories have to be estimated from a video sequence. Due to occlusion and noise when capturing video sequences, the estimates of motion model parameters are usually inaccurate and temporally inconsistent. To improve the accuracy of the estimates, it is common to impose geometric coherence constraints such as trifocal or epipolar constraints [58]. Once the initial estimates for each frame are obtained, bundle adjustment [66] or bundle optimization [54] can be applied to refine the estimated parameters by taking into account temporal and spatial geometric constraints. 2) Conversion of Motion Parallax Into Depth Information: Dependent on the depth representation, motion parallax estimated from a video sequence is then converted into depth information in terms of a 2D depth map or a 3D scene structure. a) 2D depth map reconstruction: A 2D depth map can be reconstructed from image motion. The magnitude of motion vectors within each video frame are directly treated as depth values, when consecutive video frames are taken in almost parallel viewing or when they are acquired in a small baseline distance of two consecutive viewpoints. Such a motion-to-depth mapping might not generate correct depth magnitudes, but would have a correct depth order in the absence of IMOs in the scene. Several motion-to-depth mappings have been proposed in the literature. A linear mapping is presented in [67], which divides each video frame into blocks, directly calculates a motion vector per block, and then uses the magnitude as the depth value of that block. A scaled linear mapping is also presented in [52],

6 ZHANG et al.: 3D-TV CONTENT CREATION: AUTOMATIC 2D-TO-3D VIDEO CONVERSION 377 [68], which further scales the magnitudes of motion vectors to keep the maximal depth value constant across all video frames. A nonlinear mapping is proposed to enhance the perceptual depth [53], where the whole scene is divided into several depth layers and each depth layer is assigned a different scaling factor. Besides motion vector magnitude, additional factors such as camera movement and scene complexity are also taken into account for mapping the motion information of each video frame into depth values [69]. A captured video sequence contains the so-called stimulus-shearing motion in the image plane [45] when the camera moves around a particular object in the scene. In such a case, objects that lie between the particular object and the moving camera will have a motion in the image plane in opposite direction to that of the camera. These objects are closer to the camera than other objects that move in the same direction as the camera. To take this into account, the dominant camera motion direction is included for the motion-to-depth mapping [68]. For this kind of motion-to-depth mapping, feature-based motion estimation results in more accurate and stable depth values than block-matching based motion estimation, as shown by the preliminary experiments on video sequences with only camera panning [68]. The path taken by a moving camera, called tracks, and the camera parameters are required for the mapping of image motion into a depth value when camera motion is complicated or a more realistic depth value is desired [54], [70]. For this, camera parameters, including intrinsic and extrinsic parameters, will be estimated by structure-from-motion techniques [71] [73]. Once camera parameters are obtained, the depth of each pixel is calculated by the triangulation of the correspondent pixels between two images. Such a mapping generates a depth map that closely approximates the correct relative depths of the scene. b) Sparse 3D scene structure reconstruction: A 3D scene structure is a scene representation that consists of a number of 3D real world coordinates. As stated in Section II, a complete scene structure reconstruction is normally impracticable for an automatic 2D-to-3D video conversion. Therefore, we briefly describe the steps for the reconstruction of a sparse 3D scene structure [51], [63]. A sparse 3D scene structure is represented by a set of 3D feature points in a reconstructed 3D world. The first step of the process is the initial 3D scene structure reconstruction and camera track estimation from the first pair of images, which uses the tracked feature points to estimate the fundamental matrix. Then, the projection matrices for these two video frames are determined with singular value decomposition (SVD). If the internal calibration parameters are unknown, which in general is the case for TV broadcast, home videos, or cinema movies, a self-calibration procedure has to be carried out. Once the projection matrices are known, the 3D points of the tracked feature points can be found via optimal triangulation as described in [74]. The next step is to update the initial 3D scene structure and camera tracks using consecutive video frames. First the camera projection matrix for the next selected video frame is determined using already existing correspondences between the previously reconstructed 3D points and the tracked 2D points in the current frame. 
Then, the 3D scene structure and the camera tracks are refined with all tracked feature points between the current frame and the previous ones. This refinement is done via global nonlinear minimization techniques, also known as bundle adjustment [66]. This procedure is repeated for all selected frames, until the refined estimates of the sparse 3D scene structure and camera tracks reach the desired accuracy (Fig. 4).

Fig. 4. Sparse 3D scene structure and camera track determined by SfM and positioning of a virtual stereo camera.

The sparse 3D scene structure reconstruction is a very computationally intensive process if it is performed with all the frames of a video sequence. Such a reconstruction is also heavily dependent on the initial structure computation. Camera motion degeneracy could make the estimation of the fundamental matrix fail [75], and consecutive frame pairs may have baselines that are too short for accurate triangulation. To address these problems, the reconstruction of the sparse 3D scene structure and camera tracks is usually performed only on selected video frames, called key frames, chosen in order to reduce computational complexity, to improve triangulation accuracy, and to avoid camera motion degeneracy. An alternative approach to deal with these problems is to use a prioritized sequential 3D reconstruction approach, as proposed in [51], for a fast and reliable structure and motion computation. The main element of such an approach is that key frames are selected according to a priority metric and the frame pair with the highest priority metric is then used for the initial reconstruction.
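The sketch below condenses the two-view initialization step of the sparse reconstruction described above into a few OpenCV calls. It assumes that tracked feature correspondences and the intrinsic matrix K are already available (for TV broadcast or home video, K would have to come from a self-calibration step, as noted above), and it takes the essential-matrix route to the relative pose rather than the fundamental-matrix/SVD route of the text; it is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
import cv2

def two_view_initialization(pts1, pts2, K):
    """Initial sparse 3D structure and relative camera pose from tracked
    feature correspondences between two key frames.

    pts1, pts2 : (N, 2) float32 arrays of corresponding image points.
    K          : (3, 3) camera intrinsic matrix (assumed known here).
    Returns the rotation R, translation t (up to scale) of the second
    camera and an (M, 3) array of triangulated 3D points.
    """
    # Robustly estimate the essential matrix; RANSAC plays the role of
    # the outlier rejection on tracked features mentioned in the text.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    mask = inliers.ravel().astype(bool)
    p1, p2 = pts1[mask], pts2[mask]

    # Decompose E into the relative pose of the second camera.
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K)

    # Projection matrices of the two views (first camera at the origin).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])

    # Triangulate the inlier correspondences to obtain sparse 3D points.
    X_h = cv2.triangulatePoints(P1, P2, p1.T, p2.T)   # 4 x M homogeneous
    X = (X_h[:3] / X_h[3]).T                          # M x 3 Euclidean
    return R, t, X
```

Subsequent key frames would then be registered against these 3D points and the whole estimate refined by bundle adjustment, as described above.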

IV. GENERATION OF STEREOSCOPIC IMAGES

The procedures for the generation of stereoscopic images vary with the representation of the depth information. In the following, we address the methods that rely on 2D depth maps and those that rely on a sparse 3D scene structure.

A. Approaches Based on 2D Depth Maps

Depth-image-based rendering (DIBR) permits the creation of novel images, using information from depth maps, as if they were captured with a camera from different viewpoints [76]. The DIBR system usually consists of three steps: (i) pre-processing of the depth map, (ii) 3D image warping, and (iii) hole-filling. The pre-processing step includes choosing the zero-parallax setting (ZPS) [77], converting depth maps from 8-bit gray values to metric depth values [16], and optionally smoothing the depths to reduce block artifacts and/or sharp transitions around object boundaries. The 3D image warping step renders a virtual image of any nearby viewpoint by projecting the pixels of the original image to their proper 3D locations and re-projecting them onto the virtual image plane (Fig. 5). The virtual camera is usually positioned to form a parallel stereoscopic setup with the original one; it has the same intrinsic parameters and camera optical direction. The last step is to fill in the newly exposed areas resulting from the difference in viewpoints.

Fig. 5. Parallel camera configuration used for the generation of virtual stereoscopic images. f is the focal length of the camera; c_c is the viewpoint of the original center image; c_l and c_r are the viewpoints of the virtual left- and right-eye images to be generated; t_c is the distance between these two virtual cameras. Under this camera configuration, a point X with depth Z in the world is projected onto the image planes of the three cameras at pixels x_c = (x_c, y), x_l = (x_l, y), and x_r = (x_r, y), respectively.

The most significant problem in DIBR is how to deal with the newly exposed areas (holes) appearing in the virtual images. Holes appear along object boundaries where the depth changes (Fig. 6). They are due to the accretion (disocclusion) of portions/regions of objects or background that would have been visible only from the new viewpoint but not from the original location that was used in capturing the original image [78], [79]. There is no information in the original color image and its associated depth map about these disoccluded regions and, therefore, they would appear empty, like holes, in the new virtual image.

One solution is to replace the missing image areas (holes) during the view synthesis with useful color information. A simple method is to horizontally interpolate the information in the holes. No depth information is taken into account in this method, so foreground and background objects are fused together to fill in the holes [16], [80]. Another simple method is to horizontally extrapolate the values of the boundary pixels that belong to the local background object [16], [80]. This method assumes that the hole is part of the local background object, and the depth information is taken into account to avoid using local foreground objects, which are visible and not occluded, in the hole-filling process. A more sophisticated method is to diffuse pixel values at the boundaries by solving the Laplacian equation on and near the hole boundaries and propagating the values inside the hole [81]. Another more sophisticated method is the so-called depth-aided inpainting method [82], which uses inpainting techniques [83] to fill in the holes by incorporating a depth cue into the texture estimation. A temporal inpainting approach is also proposed in [84], which looks for the occlusion information in other frames. In addition, a wavelet-transform based hole filling is presented in [85]. However, these solutions result in so-called texture artifacts due to the duplication of neighboring spatial color information as well as the inaccuracies in the depth boundaries of objects.

To reduce texture artifacts, another solution is to preprocess the depth maps [9], [16], [86]. The disocclusion problem arises where there are sharp edges in the depth map. Smoothing edge transitions in the depth map will help avoid the appearance of disocclusions. Consequently, it improves the image quality of the synthesized view and reduces texture artifacts in the disoccluded regions.
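A simplified sketch of the three DIBR steps just described is given below: depth-map smoothing as pre-processing, 3D warping reduced to a per-pixel horizontal shift for the parallel configuration of Fig. 5, and hole filling by extrapolating from the local background side of each hole. The linear depth-to-disparity mapping, the isotropic Gaussian smoothing, and the 255 = near depth convention are illustrative assumptions, not the parameter choices of the cited methods.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dibr_virtual_view(image, depth, max_disparity=16, smooth_sigma=3.0):
    """Render a virtual view from a colour image and its 8-bit depth map
    by horizontal pixel shifting (parallel camera setup).

    image : (H, W, 3) uint8 original view.
    depth : (H, W) uint8 depth map (here 255 = near, 0 = far).
    """
    h, w = depth.shape

    # (i) Pre-processing: smooth the depth map to soften sharp depth
    # edges and thereby reduce the size of disoccluded regions.
    d = gaussian_filter(depth.astype(np.float32), smooth_sigma)

    # (ii) 3D warping, reduced to a horizontal shift: nearer pixels are
    # shifted further; a z-buffer keeps the nearest pixel on collisions.
    shift = np.rint(d / 255.0 * max_disparity).astype(np.int32)
    virtual = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    z_buf = np.full((h, w), -1.0, dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    xt = xs - shift                                   # target columns
    valid = (xt >= 0) & (xt < w)
    for y, x, x_new in zip(ys[valid], xs[valid], xt[valid]):
        if d[y, x] > z_buf[y, x_new]:
            virtual[y, x_new] = image[y, x]
            z_buf[y, x_new] = d[y, x]
            filled[y, x_new] = True

    # (iii) Hole filling: copy from the horizontally nearest filled pixel
    # on the background side (the neighbour with the smaller stored depth
    # value, i.e. the farther one under the 255 = near convention).
    for y in range(h):
        for x in range(w):
            if not filled[y, x]:
                left = x - 1
                while left >= 0 and not filled[y, left]:
                    left -= 1
                right = x + 1
                while right < w and not filled[y, right]:
                    right += 1
                if left >= 0 and right < w:
                    src = left if z_buf[y, left] < z_buf[y, right] else right
                elif left >= 0:
                    src = left
                elif right < w:
                    src = right
                else:
                    continue
                virtual[y, x] = virtual[y, src]
    return virtual
```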
One undesirable effect of smoothing edges is the geometrical distortion; for example, vertical straight lines in the left image could be rendered as curves in the right image if they are close to a sharp depth edge. To reduce this geometry distortion, asymmetric smoothing of depth map with more strength in the vertical direction than in the horizontal direction is proposed in [87], which is consistent with known characteristics of the binocular system of the human eyes. Different from the constant smoothing strength for the whole depth map [87], some more sophisticated methods for adaptive smoothing of depth maps are further proposed in the literature, for example, edge dependent depth filtering [88], distance dependent depth filtering [89], discontinuity-adaptive smoothing filtering [90], and layered depth filtering [91]. Nevertheless, smoothing of depth maps, while attenuating some artifacts, will lead to a reduction of the depth contrast contained in the rendered stereoscopic views. Future studies will be required to examine this trade-off more closely, although based on the studies conducted so far the benefits appear to outweigh this disadvantage [86], [92]. B. Approaches Based on Sparse 3D Scene Structure This subsection reviews the stereoscopic view generation based on a sparse 3D scene structure, for which a procedure is proposed in [93]. The basic idea in [93] is to determine the transformation between the original and virtual views, based on the sparse 3D scene structure, to enable the generation of virtual views. This procedure consists of three steps and will be described in the following paragraphs. 1) Setup of Virtual Stereo Rig: The first step of virtual stereo rig setup is to determine the projection matrix of the virtual camera. A virtual stereoscopic camera can be defined for each original video frame to simulate a stereo camera rig as indicated in Fig. 4 [94]. In case of a 2D to multi-view conversion, several virtual cameras have to be defined for each video frame. The virtual camera is usually positioned in such a way that both cameras form a stereoscopic parallel configuration. The rotation matrix of the virtual camera is identical to the one of the corresponding original view, which is estimated during the reconstruction of the sparse 3D scene structure. The internal parameters are set to be identical as well. The translation vector of the virtual camera is defined by the desired stereoscopic view setup to adjust the baseline of the virtual stereo rig, i.e. to control the desired 3D effect. With this virtual camera setup, the projection matrix for a virtual camera can be determined [50]. 2) Determination of Planar Homographies for Image Warping: The second step is to determine the relation between the virtual and neighboring original views to enable image warping. In the case of a small baseline, the relation between the virtual and neighboring original views can be represented by a planar homography (Fig. 7).

Fig. 6. Stereoscopic image synthesis with DIBR: (a) original color image of the sequence Interview, (b) corresponding depth image, and (c) rendered image without hole-filling. Holes are marked in green.

Fig. 7. Stereoscopic view synthesis using planar homographies.

The planar homography H is a perspective transformation that transfers a pixel point x_o (in homogeneous coordinates) of an original view into the corresponding pixel point x_v of the virtual view, namely:

x_v ≅ H x_o. (1)

To enable this transformation, the planar homography H has to be determined. This can be done by using the depth information represented in terms of the reconstructed sparse 3D scene points X_i, which are projected into each virtual image plane according to the corresponding projection matrix P_v:

x_v,i ≅ P_v X_i. (2)

These points in the virtual image plane correspond to feature points x_o,i in the original view. With these corresponding feature pairs, the planar homography between the original view and the virtual view can be estimated. H is a 3 x 3 matrix and has 8 independent parameters [50], [57]. A minimum of four corresponding point pairs is required to provide a solution up to scale. To have a reliable solution, more than four pairs are required. In this case, the planar homography is estimated by a regression method that minimizes the error

E = Σ_i C(x_v,i, H x_o,i), (3)

i.e.,

Ĥ = arg min_H Σ_i C(x_v,i, H x_o,i), (4)

where C(·,·) is a cost function, such as the mean square error (MSE) or a likelihood cost, that measures how well the planar homography matches those corresponding feature pairs.

3) Virtual View Generation: The final step is to render the virtual view from the original view to build up a stereoscopic image pair. Usually, the original view closest to the virtual view is selected. Once the planar homography between the virtual view and the closest original view is determined, all pixel values of the original view are projected to their corresponding locations in the virtual view by applying (1). Normally, these positions do not exactly correspond to the pixel grid of the virtual view. The values of the pixels at grid locations are then obtained by performing a bilinear or bicubic interpolation [93].

In general, the closest original view does not cover the whole scene that should be visible with the virtual stereo camera, as depicted on the left side of Fig. 8(b), due to the difference in viewpoint. This is particularly the case when the orientations of both cameras differ significantly. The missing parts mainly appear at the image boundaries of the virtual stereo image. In order to fill in the missing parts, additional surrounding views are taken into account. Therefore, pixels from temporally neighboring views can be projected to their corresponding grid locations in the virtual view using interpolation techniques (see Fig. 8(c) and (d)).

Due to the homography warping involved during the view generation, the view rendering approach based on a sparse 3D scene structure does not produce holes within the images, which is different from the DIBR approach [94]. In addition, super-resolution techniques can be exploited to enhance the image resolution of the reconstructed stereoscopic views, since additional information from consecutive frames of the monocular video sequence is available [94]. However, such an approach could become error-prone due to the assumption of a global planar homography, which is basically valid only for rotating cameras (virtual and original view share the same camera center) or for a scene consisting of a single plane. One possible solution to this restriction is to apply local homographies on segmented image regions.
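The sketch below condenses the three steps of this subsection: setting up a virtual parallel camera, fitting the homography to the projected sparse scene points (a robust least-squares fit standing in for the regression of (3)-(4)), and warping the closest original view. OpenCV routines are used for the estimation and warping; the baseline value, the variable names, and the use of a single global homography are illustrative assumptions consistent with, but not taken from, the cited procedure [93].

```python
import numpy as np
import cv2

def render_virtual_view(image, K, R, t, X, baseline=0.05):
    """Generate one virtual stereo view from an original view using a
    global planar homography.

    image    : (H, W, 3) original view.
    K, R, t  : intrinsics and pose of the original camera (t is 3 x 1).
    X        : (N, 3) sparse 3D scene points reconstructed by SfM.
    baseline : horizontal offset of the virtual camera (illustrative).
    """
    # 1) Virtual stereo rig: same K and R, camera centre shifted along
    #    the camera's x-axis by the desired baseline (t_v = t - R*[b,0,0]
    #    reduces to a simple offset because the rotation is unchanged).
    t_v = t - np.array([[baseline], [0.0], [0.0]])
    P_o = K @ np.hstack([R, t])            # original projection matrix
    P_v = K @ np.hstack([R, t_v])          # virtual projection matrix

    # Project the sparse 3D points into both image planes.
    X_h = np.hstack([X, np.ones((X.shape[0], 1))])
    x_o = P_o @ X_h.T; x_o = (x_o[:2] / x_o[2]).T
    x_v = P_v @ X_h.T; x_v = (x_v[:2] / x_v[2]).T

    # 2) Planar homography between original and virtual view, fitted
    #    robustly to the corresponding point pairs (cf. (3)-(4)).
    H, _ = cv2.findHomography(x_o.astype(np.float32),
                              x_v.astype(np.float32), cv2.RANSAC, 3.0)

    # 3) Virtual view generation: warp the original view; bilinear
    #    interpolation resamples the non-integer target positions.
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h), flags=cv2.INTER_LINEAR)
```

Note that t_v = t - [baseline, 0, 0]^T only shifts the camera centre along the camera's own x-axis because R is shared between the two views; with the local-homography refinement mentioned above, step 2 would be repeated per segmented region.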

Fig. 8. Virtual view generation from several temporally neighboring original views: (a) original left view of the sequence Dome, (b) virtual right view, rendered only with the closest neighboring view from the original camera track, (c) virtual right view using 30 additional neighboring frames from the camera track, and (d) final virtual right view using 62 additional neighboring frames.

TABLE I. SUMMARY OF CHARACTERISTICS OF DEPTH REPRESENTATION.

C. Summary of Characteristics of Depth Representation

In the previous subsections we presented two different approaches for the generation of stereoscopic images, which are associated with the two representations of depth information, namely: a 2D depth map representation and a sparse 3D point representation. Both of these representations have characteristics that make them preferable for particular applications or types of content. These characteristics are summarized in Table I.

V. CHALLENGING ISSUES

Even though much research has been done to enable automatic 2D-to-3D conversion, the techniques are still far from mature. Most available products and methods are only successful in certain circumstances. In addition to the limitations of each approach, the following are some key challenging issues to be solved.

1) One issue that directly affects the image quality is the occlusion/disocclusion problem during the generation of the stereoscopic images. As long as the camera is horizontally translated, the captured sequence in principle contains enough scene information to generate a new view. When a new version of an image is generated from a virtual viewpoint that is parallel to the current viewpoint, the texture information that is missing from the current viewpoint can be found in its temporally neighboring frames. However, the question is how to detect this missing texture information for the current frame in the other video frames.

2) The depth ambiguity of monocular depth cues is an issue that impacts the depth quality. The depth ambiguity originates from violations of the principles of depth generation. One example is the retrieval of depths from motion parallax. Motion parallax can provide reasonable scene depths only when the camera is translated and the scene is static. When the scene contains moving objects, the motion parallax might ambiguously reflect the depth of the moving objects. Imagine that two objects are located at the same distance from the camera but have different motions. According to the principle of depth from motion, the object with fast motion will be assigned a depth that is closer to the camera than the other object. One solution to the depth ambiguity of moving objects is to use additional depth cues, such as depth from geometrical information, to generate the depth information for each video frame.

3) The integration of various depth cues is another issue affecting the success of automatic 2D-to-3D video conversion.

The integration of various depth cues is an extremely important issue since video sequences unavoidably contain different motion types, e.g., camera motion, object motion, or no motion. To retrieve the depth from such video sequences, different depth generation strategies are required. The challenge is how to integrate all the depths extracted from different cues to form not only spatially but also temporally stable and reasonable depths. In the literature, only some research on the integration of various depth cues in the spatial domain has been proposed. A linear integration of depths was proposed in [95], where the weighting factors are determined from the perceived importance of the depth cues. Another depth integration method is proposed in [32], in which two depth maps, one from motion and another from geometry, are integrated into a final depth map according to the binarization results of moving object detection. However, more investigation is still required to provide spatially and temporally consistent depths.

4) The real-time implementation of 2D-to-3D conversion is also a critical issue for the adoption of the proposed techniques by the general public. The availability of real-time 2D-to-3D conversion will allow broadcasters and consumer electronics (CE) manufacturers to remove the natural fear of users of not having enough content for their new 3D-enabled TV sets. Real time, however, adds a new hard constraint that is difficult to meet while maintaining a high quality standard for the converted material. There are several real-time implementations incorporated into TV sets (e.g., Samsung's 3D-TVs), sold as stand-alone equipment (e.g., JVC's IF-2D3D1 Stereoscopic Image Processor), or incorporated into software packages (e.g., DDD's TriDef Player). However, the quality of the resulting stereoscopic images, with respect to the depth sensation, is still an outstanding issue that requires more research.

VI. CONCLUSION

This paper summarized current technical advances related to the development of automatic 2D-to-3D video conversion. The underlying principle of the conversion is to horizontally shift the pixels of an original image to generate a new version of it. To enable this conversion, different approaches for the extraction of depth information from monoscopic images and for the generation of stereoscopic images were reviewed. A number of challenging issues that have to be solved for the success of automatic 2D-to-3D video conversion were pointed out as possible research topics. With the development of more advanced techniques for 2D-to-3D video conversion, the vast collection of 2D material currently available will be converted into stereoscopic 3D to boost the general public's interest in purchasing 3D displays and 3D-TV services.

ACKNOWLEDGMENT

The authors would like to express their sincere thanks to Dr. W. J. Tam and R. Klepko for many fruitful discussions and constructive suggestions during the preparation of this manuscript. Thanks are also due to HHI for providing the Interview sequence and to NHK for providing the Balloons sequence.

REFERENCES

[1] S. Yano and I. Yuyama, "Stereoscopic HDTV: Experimental system and psychological effects," J. SMPTE, vol. 100, pp. -.
[2] N. S. Holliman, N. A. Dodgson, and G. Favalora, "Three-dimensional display technologies: An analysis of technical performance characteristics," IEEE Trans. Broadcast., 2011, to be published.
[3] H. Tilton, "Broadcast standards for 3D HDTV," in Int. Symp. 3 D Image Technol.
Arts, Tokyo, Feb. 1992, pp [4] P. V. Harman, An architecture for digital 3D broadcasting, in SPIE Conf. Stereoscopic Displays Virtual Reality Syst. VI, San Jose, California, Jan. 1999, vol. 3639, pp [5] P. V. Harman, Home-based 3D entertainment An overview, in IEEE Int. Conf. Image Process., Vancouver, 2000, pp [6] W. J. Tam and L. Zhang, 3D-TV Content Generation: 2D-To-3D Conversion, in IEEE Int. Conf. Multimedia Expo, Toronto, ON, Jul. 9 12, 2006, pp [7] P. Harman, J. Flack, S. Fox, and M. Dowley, Rapid 2D to3d conversion, in SPIE Conf. Stereoscopic Displays and Virtual Reality Systems IX, 2002, vol. 4660, pp [8] G. Iddan and G. Yahav, 3D imaging in the studio, in SPIE Conf. 3 D Image Capture Appl. IV, 2001, vol. 4298, pp [9] L. Zhang, J. Tam, and D. Wang, Stereoscopic image generation based on depth images, in IEEE Conference on Image Processing, Singapore, Oct. 2004, pp [10] O. Grau, M. Price, and G. A. Thomas, Use of 3-D techniques for virtual production, in Proc. SPIE, 2001, vol. 4309, pp , Invited Paper. [11] S. C. Chan, H.-Y. Shum, and K.-T. Ng, Image-based rendering and synthesis, IEEE Signal Process. Mag., pp , Nov [12] C. Liang and K. K. Wong, 3D reconstruction using silhouettes from unordered viewpoints, Image Vis. Comput., vol. 28, no. 4, pp , [13] J. F. Blinn and M. E. Newell, Texture and reflection in computer generated images, Commun. ACM, vol. 19, no. 10, pp , Oct [14] M. Levoy and P. Hanrahan, Light field rendering, in Proc. SIG- GRAPH, Aug. 1996, pp [15] S.J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, The lumigraph, in Proc. SIGGRAPH, Aug. 1996, pp [16] C. Fehn, Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV, in SPIE Conf. Stereoscopic Displays Virtual Reality Syst. XI, CA, Jan. 2004, vol. 5291, pp [17] J. Shade, S. Gortler, L. He, and R. Szeliski, Layered depth image, in Proc. SIGGRAPH 98, Jul. 1998, pp [18] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, High-quality video view interpolation using a layered representation, ACM Trans. Graphics, vol. 23, no. 3, pp , [19] L. Zhang, D. Wang, and A. Vincent, Adaptive reconstruction of intermediate views from stereoscopic images, IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 1, pp , Jan [20] A. Redert, M. Op de Beeck, C. Fehn, W. Ijsselsteijn, M. Pollefeys, L. Van Gool, E. Ofek, I. Sexton, and P. Surman, ATTEST advanced three-dimensional television system techniques, in Proc. 3DPVT, Padova, Italy, Jun. 2002, pp [21] Q. Wei, Converting 2D to 3D: A survey, Delft University of Technology, The Netherlands, Project Report, Dec [22] Tutorial: Radar and stereoscopy visual ability in 3-D depth perception, Canada Centre for Remote Sensing [Online]. Available: cct.rncan.gc.ca/resource/tutor/stereo/chap2/chapter2_5_e.php [23] R. Ostnes, V. Abbott, and S. Lavender, Visualisation techniques: An overview Part 1, Hydrographic J., no. 113, pp. 4 7, Jul [24] J. Ens and P. Lawrence, An investigation of methods of determining depth from focus, IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 2, pp , [25] J. Ko, M. Kim, and C. Kim, 2D-To-3D stereoscopic conversion: depth-map estimation in a 2D single-view image, in Proc. SPIE, 2007, vol [26] G. Guo, N. Zhang, L. Huo, and W. Gao, 2D to 3D conversion based on edge defocus and segmentation, in IEEE Int. Conf. Acoust., Speech Signal Process., Mar. 
31 April, , pp [27] Favaro, Shape from focus/defocus, Washington University, Department of Electrical Engineering Electronic Signals and Systems Research Lab, Jun. 25, [28] A. P. Pentland, A new sense for depth of field, IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, pp , 1987.

[29] J. H. Elder and S. W. Zucker, Local scale control for edge detection and blur estimation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 7, pp , [30] S. A. Valencia and R. M. Rodriguez-Dagnino, Synthesizing stereo 3D views from focus cues in monoscopic 2D images, in Proc. SPIE, 2003, vol. 5006, pp [31] S. Battiato, S. Curti, M. La Cascia, M. Tortora, and E. Scordato, Depth map generation by image classification, in Proc. SPIE, Apr. 2004, vol. 5302, pp [32] X. Huang, L. Wang, J. Huang, D. Li, and M. Zhang, A depth extraction method based on motion and geometry for 2D to 3D conversion, in 3rd Int. Symp. Intell. Inf. Technol. Appl., 2009, pp [33] Y.-J. Jung, A. Baik, and D. Park, A novel 2D-to-3D conversion technique based on relative height-depth-cue, in SPIE Conf. Stereoscopic Displays Appl. XX, San José, CA, 2009, vol. 7237, 72371U. [34] K. Yamada, K. Suehiro, and H. Nakamura, Pseudo 3D image generation with simple depth models, in Int. Conf. Consum. Electron., Dig. Techn. Papers, Jan. 8 12, 2005, pp [35] D. A. Forsyth, Shape from texture without boundaries, in ECCV02, 2002, pp [36] A. Torralba and A. Oliva, Depth estimation from image structure, IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, Sept [37] A. M. Loh and R. Hartley, Shape from non-homogeneous, non-stationary, anisotropic, perspective texture, in Proc. British Mach. Vis. Conf., [38] F. Cozman and E. Krotkov, Depth from scattering, in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 1997, pp [39] W. J. Tam, C. Vázquez, and F. Speranza, 3D-TV: A novel method for generating surrogate depth maps using colour information, in SPIE Conf. Stereoscopic Displays Appl. XX Conf., San José, CA, 2009, vol [40] C. Vázquez and W. J. Tam, CRC-CSDM: 2D to 3D conversion using colour-based surrogate depth maps, in Int. Conf. 3D Syst. Appl. (3DSA), Tokyo, Japan, May [41] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah, Shape-from-shading: A survey, IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 8, pp , [42] E. Rubin, Figure and ground, in Visual Perception Philadelphia, S. Yantis, Ed. London, U.K.: Psychology, 2001, pp [43] W. J. Tam, A. S. Yee, J. Ferreira, S. Tariq, and F. Speranza, Stereoscopic image rendering based on depth maps created from blur and edge information, in Proc. Stereoscopic Disp. Appl., Jan. 2005, vol. 5664, pp [44] J. Kim, A. Baik, Y. J. Jung, and D. Park, 2D-to-3D conversion by using visual attention analysis, in Proc. SPIE 7524, Feb. 2010, [45] B. J. Rogers and M. E. Graham, Motion parallax as an independent cue for depth perception, Perception, vol. 8, pp , [46] E. Arce and J. Marroquin, High-precision stereo disparity estimation using HMMF models, Image Vis. Comput., vol. 25, no. 5, pp , May [47] N. Atzpadin, P. Kauff, and O. Schreer, Stereo analysis by hybrid recursive matching for real-time immersive video conferencing, IEEE Trans. Circuits Syst. Video Technol., vol. 14, pp , [48] D. Scharstein and R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. Comput. Vis., no. 47, pp. 7 42, [49] T. Jebara, A. Azarbayejani, and A. Pentland, 3D structure from 2D motion, IEEE Signal Process. Magaz., vol. 16, no. 3, pp , May [50] R.-I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Second ed. Cambridge, U.K.: Cambridge Univ. Press, [51] E. Imre, S. Knorr, A. A. Alatan, and T. Sikora, Prioritized sequential 3D reconstruction in video sequences of dynamic scenes, in IEEE Int.
Conf. Image Process. (ICIP), Atlanta, GA, [52] I. Ideses, L. P. Yaroslavsky, and B. Fishbain, Real-time 2D to 3D video conversion, J. Real-Time Image Process., vol. 2, no. 1, pp. 3 7, [53] M. T. Pourazad, P. Nasiopoulos, and R. K. Ward, Generating the depth map from the motion information of H.264-encoded 2D video sequence, EURASIP J. Image Video Process., Article ID , [54] G. Zhang, J. Jia, T. T. Wong, and H. Bao, Consistent depth maps recovery from a video sequence, IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 6, pp , [55] K. Moustakas, D. Tzovaras, and M. G. Strintzis, Stereoscopic video generation based on efficient layered structure and motion estimation from a monoscopic image sequence, IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 8, pp , Aug [56] P. Gargallo and P. F. Sturm, Bayesian 3D modeling from images using multiple depth maps, in IEEE Conf. Comput. Vis. Pattern Recog., 2005, vol. 2, pp [57] P. H. S. Torr, Bayesian model estimation and selection for epipolar geometry and generic manifold fitting, TechReport MSR-TR , Apr. 2002, Microsoft Research. [58] L. Alvarez, R. Deriche, T. Papadopoulo, and J. Sánchez, Symmetrical dense optical flow estimation with occlusions detection, Int. J. Comput. Vis., vol. 75, no. 3, pp , [59] J. Shi and C. Tomasi, Good features to track, in IEEE Int. Conf. Comput. Vis. Pattern Recog., Seattle, WA, Jun [60] D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis., vol. 60, no. 2, pp , [61] C. Tomasi and T. Kanade, Detection and tracking of point features, Tech. Rep. CMU-CS , 1991, Carnegie Mellon University Technical. [62] M. Fischler and R. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM, vol. 1981, pp [63] S. Knorr, E. Ýmre, A. A. Alatan, and T. Sikora, A geometric segmentation approach for the 3D reconstruction of dynamic scenes in 2D video sequences, in EUSIPCO, Florence, Italy, Sep [64] C.-C. Cheng, C.-T. Li, Y.-M. Tsai, and L.-G. Chen, Hybrid depth cueing for 2D-to-3D conversion system, in SPIE Conf. Stereoscopic Displays Appl. XX, San José, CA, 2009, vol [65] M. Kunter, S. Knorr, A. Krutz, and T. Sikora, Unsupervised object segmentation for 2D to 3D conversion, in SPIE Conf. Stereoscopic Displays Appl. XX, San José, CA, 2009, vol [66] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon, Bundle adjustment A modern synthesis, in Vision Algorithms: Theory & Practice. Berlin: Springer-Verlag, 2000, vol. 1883, LNCS, pp [67] M. B. Kim and M. S. Song, Stereoscopic conversion of monoscopic video by the transformation of vertical to horizontal disparity, Proc. SPIE, vol. 3295, pp , [68] L. Zhang, B. Lawrence, D. Wang, and A. Vincent, Comparison study on feature matching and block matching for automatic 2D to 3D video conversion, in 2nd IEE Eur. Conf. Visual Media Prod., London, U.K., 2005, pp [69] D. Kim, D. Min, and K. Sohn, A stereoscopic video generation method using stereoscopic display characterization and motion analysis, IEEE Trans. Broadcast., vol. 54, no. 2, pp , Jun [70] Y. Matsumoto, H. Terasaki, K. Sugimoto, and T. Arakawa, Conversion system of monocular image sequence to stereo using motion parallax, in SPIE Conf. Stereoscopic Displays Virtual Reality Syst. IV, May 1997, vol. 3012, pp [71] Q.-T. Luong and O. Faugeras, Self calibration of a moving camera from point correspondences and fundamental matrices, Int. J. Comput. Vis., vol. 22, no. 3, [72] M. 
Pollefeys, Self-calibration and metric 3D reconstruction from uncalibrated image sequences, PhD thesis, K.U. Leuven, Belgium, [73] P. R. S. Mendonca and R. Cipolla, A simple technique for self-calibration, in IEEE Comput. Soc. Conf. Comput.Vis. Pattern Recog. (CVPR), [74] R. Hartley and P. Sturm, Triangulation, Comput. Vis. Image Understanding, vol. 68, no. 2, pp , [75] P. H. S. Torr, A. W. Fitzgibbon, and A. Zisserman, The problem of degeneracy in structure and motion recovery from uncalibrated image sequences, Int. J. Comput. Vis., vol. 32, no. 1, pp , Aug [76] C. Fehn, A 3D-TV approach using depth-image-based rendering (DIBR), in Proc. Vis., Imaging, Image Process., Benalmadena, Spain, Sep. 2003, pp [77] A. Woods, T. Docherty, and R. Koch, Image distortions in stereoscopic video systems, in SPIE Conf. Stereoscopic Displays Appl., San Jose, CA, Feb. 1993, pp [78] W. R. Mark, Post-Rendering 3D image warping: visibility, reconstruction, and performance for depth-image warping, PhD thesis, University of North Carolina at Chapel Hill, Chapel Hill, NC, Apr [79] L. MacMillan, An image based approach to three-dimensional computer graphics, Ph.D dissertation, University of North Carolina,, [80] C. Vázquez, W. J. Tam, and F. Speranza, Stereoscopic imaging: Filling disoccluded areas in depth image-based rendering, in SPIE Conf. 3 D TV, Video, Display V, 2006, vol. 6392, 63920D.

[81] A. Colombari, A. Fusiello, and V. Murino, Continuous parallax adjustment for 3D-TV, in IEE Eur. Conf. Vis. Media Prod., London, U.K., Nov. 2005, pp [82] K. Luo, D.-X. Li, Y.-M. Feng, and M. Zhang, Depth-aided inpainting for disocclusion restoration of multi-view images using depth-image-based rendering, J. Zhejiang Univ. Sci. A, vol. 10, no. 12, pp , Dec [83] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, Image inpainting, in ACM SIGGRAPH, Int. Conf. Comput. Graphics Interactive Tech., New Orleans, LA, Jul. 2000, vol. 174, pp [84] R. Klein Gunnewiek, R.-P. M. Berretty, B. Barenbrug, and J. P. Magalhães, Coherent spatial and temporal occlusion generation, in Proc. SPIE 7237, 2009, p [85] L. Jovanov, A. Pizurica, and W. Philips, Wavelet-based stereo images reconstruction using depth images, Proc. SPIE, vol. 6701, p A, [86] W. J. Tam, G. Alain, L. Zhang, T. Martin, and R. Renaud, Smoothing depth maps for improved stereoscopic image quality, in SPIE Conf. 3D TV, Video, Display III, Philadelphia, PA, Oct. 2004, vol. 5599, pp [87] L. Zhang and W. J. Tam, Stereoscopic image generation based on depth images for 3D TV, IEEE Trans. Broadcast., vol. 51, no. 2, pp , Jun [88] W.-Y. Chen, Y.-L. Chang, S.-F. Lin, L.-F. Ding, and L.-G. Chen, Efficient depth image based rendering with edge dependent depth filter and interpolation, in IEEE Int. Conf. Multimedia Expo (ICME), Amsterdam, The Netherlands, Jul. 2005, pp [89] I. Daribo, C. Tillier, and B. Pesquet-Popescu, Distance dependent depth filtering in 3D warping for 3DTV, in IEEE 9th Workshop Multimedia Signal Process., Crete, Greece, Oct. 2007, pp [90] S.-B. Lee and Y.-S. Ho, Discontinuity-adaptive depth map filtering for 3D view generation, in 2nd Int. Conf. Immersive Telecommun., Berkeley, CA, [91] W. Wang et al., Depth image segmentation for improved virtual view image quality in 3-D TV, in 2007 Int. Symp. Intell. Signal Process. Commun. Syst., Xiamen, China, Nov. 2007, pp [92] W. J. Tam and L. Zhang, Non-uniform smoothing of depth maps before image-based rendering, in SPIE Conf. 3D TV, Video Display III (ITCOM), 2004, vol. 5599, pp [93] S. Knorr and T. Sikora, An image-based rendering (IBR) approach for realistic stereo view synthesis of TV broadcast based on structure from motion, in IEEE Int. Conf. Image Process. (ICIP), San Antonio, TX, [94] S. Knorr, M. Kunter, and T. Sikora, Stereoscopic 3D from 2D video with super-resolution capability, Signal Process.: Image Commun., vol. 23, no. 9, pp , Oct [95] Y. L. Chang, J. Y. Chang, Y. M. Tsai, C. L. Lee, and L. G. Chen, Priority depth fusion for 2D-to-3D conversion systems, in SPIE Proc. 3D Image Capture Appl., 2008, vol

Liang Zhang (SM'07) received the B.S. degree from Chengdu Institute of Radio Engineering in 1982, the M.S. degree from Shanghai Jiaotong University in 1986, and the Dr.-Ing. (Ph.D.) degree in electrical engineering from the University of Hannover, Germany. He was a Teaching Assistant from 1987 to 1988 and a Lecturer from 1989 to 1992 with the Department of Electrical Engineering, Shanghai Jiaotong University. From 1992 to 2000, he was a research assistant with the Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, University of Hannover, Germany. Since 2000, he has been with the Communications Research Centre Canada, where he is currently a team leader and a Senior Research Scientist.
His research interests include image and video coding, image and video processing, stereo image processing, and 3D-TV. Dr. Zhang is a Senior Member of the IEEE BTS and CAS Societies. He was a recipient of the German Academic Exchange Service (DAAD) scholarship. He served as publication co-chair of the IEEE International Conference on Multimedia and Expo, Toronto, Canada. He was chair and organizer of the special session on 3D TV: Primed for Success? at the IEEE International Conference on Multimedia and Expo, Toronto, Canada. Currently, he serves as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and the Journal of Real-Time Imaging.

Carlos Vázquez (M'02) received the B.Eng. degree in electrical engineering with great distinction and the M.Sc. degree in applied computer sciences from the Polytechnic University José Antonio Echeverría (ISPJAE), Havana, Cuba, in 1992 and 1997, respectively. He received the Ph.D. degree in telecommunications from the Institut National de la Recherche Scientifique (INRS), Montréal, Canada. He worked at the Telecommunications Department of the Electrical Engineering Faculty of ISPJAE from September 1992 to August 1997, first as a Lecturer and later as an Assistant Professor. In September 1997 he joined the INRS as a Ph.D. student and recipient of an excellence scholarship awarded by the Ministry of Education of Québec. From 2002 to 2004 he was a post-doctoral fellow at INRS-EMT. He worked as a Research Associate in the ECE Department of Concordia University, Montréal, Canada. In August 2005 he joined the Advanced Video System Group of the Communications Research Centre Canada (CRC) in Ottawa, Canada, as a Research Scientist. His research interests are in the areas of 3D-TV; stereo and multi-view vision systems; image/video representation, sampling, and interpolation; image and video coding; and motion/disparity/depth estimation and compensation.

Sebastian Knorr received the Dipl.-Ing. and Dr.-Ing. (Ph.D.) degrees in electrical engineering from Technische Universität Berlin, Germany, in 2002 and 2008, respectively. He worked as a Research Assistant and Senior Researcher from 2002 to 2009 in the field of 3D image processing and computer vision in the Communication Systems Lab at Technische Universität Berlin, Germany. During this time he was involved in several European Networks of Excellence, e.g., VISNET and 3DTV. Currently, he is chief research officer and managing director of imcube media GmbH. Dr. Knorr is a member of the German Society of Television and Cinema Technology (FKTG e.V.). In 2007, he developed a process for automatic 2D-to-3D image conversion based on advanced computer vision technology. He received the German Multimedia Business Award of the Federal Ministry of Economics and Technology in 2008 and, in 2009, was recognized by the initiative Germany - Land of Ideas, which is sponsored by the German government, commerce, and industry.
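As a concrete illustration of two of the ideas discussed above, the weighted linear fusion of depth cues in the spirit of [95] and the depth-driven horizontal pixel shift that underlies stereoscopic view synthesis, the following minimal NumPy sketch combines two hypothetical depth-cue maps with a fixed weight and then warps a frame into a second view. This is not code from any of the surveyed systems: the function names, the weight, and the maximum disparity are illustrative assumptions, and disocclusion handling is reduced to a crude left-neighbor fill, whereas a practical system would rely on the inpainting or depth-map smoothing techniques cited in the references.

```python
"""Illustrative sketch only: weighted fusion of two depth cues followed by
depth-dependent horizontal pixel shifting to synthesize a second view.
All names and parameter values are hypothetical."""
import numpy as np


def fuse_depth_cues(depth_a: np.ndarray, depth_b: np.ndarray, w_a: float = 0.6) -> np.ndarray:
    """Linearly combine two depth maps with values in [0, 1]; w_a weights the first cue."""
    return np.clip(w_a * depth_a + (1.0 - w_a) * depth_b, 0.0, 1.0)


def shift_view(image: np.ndarray, depth: np.ndarray, max_disparity: int = 12) -> np.ndarray:
    """Generate one new view by shifting each pixel horizontally by a
    depth-dependent amount (nearer pixels shift more).  Disocclusions are
    filled crudely by copying the nearest pixel to the left."""
    h, w = depth.shape
    view = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    disparity = np.round(depth * max_disparity).astype(int)
    for y in range(h):
        for x in range(w):
            xs = x + disparity[y, x]
            if 0 <= xs < w:
                view[y, xs] = image[y, x]   # later writes may overwrite earlier ones
                filled[y, xs] = True
    # crude hole filling: propagate the nearest filled value from the left
    for y in range(h):
        for x in range(1, w):
            if not filled[y, x]:
                view[y, x] = view[y, x - 1]
    return view


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((120, 160, 3))        # stand-in for a video frame
    cue_motion = rng.random((120, 160))    # stand-in for a motion-based depth cue
    cue_geometry = rng.random((120, 160))  # stand-in for a geometry-based depth cue
    depth = fuse_depth_cues(cue_motion, cue_geometry)
    right_view = shift_view(img, depth)    # the original frame serves as the left view
    print(right_view.shape)
```

Note that this sketch writes pixels in raster order, so occlusion ordering is not resolved; a more faithful implementation would warp pixels in order of decreasing depth (a z-buffer) and fill the remaining holes with one of the inpainting or depth-smoothing methods discussed above.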
