Computer vision framework for adding CG simulations

Sashi Kumar Penta
sashi@cs.unc.edu
COMP 790-072 Final Project Report
Department of Computer Science, UNC at Chapel Hill
13 Dec, 2006

Figure 1: (i) Top row, left to right: left and right views of a stereo pair, the disparity map using only the data cost, and the disparity map using the graph-cut algorithm. (ii) Bottom row, left to right: the original image with foreground, the foreground separated from the background, a snapshot of the 3D model constructed from the disparity map, and a snapshot of a CG model inserted into the MSR dance sequence.

Abstract

Rendering synthetic simulations into real-world scenes is an important application of both Computer Graphics (CG) and Computer Vision (CV). CV techniques have been used to reconstruct 3D models of real-world scenes. As part of this course project, I implemented computer vision modules for calibrating cameras, computing stereo correspondences, computing depth maps, and constructing 3D models, together with an interactive graphics tool for adding simulations into the 3D model obtained from stereo video.

CR Categories: I.3.3 [Computer Graphics]: Picture/Image Generation - Display Algorithms; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism - Animation; I.4.8 [Image Processing and Computer Vision]: Scene Analysis - Stereo and Time-varying Imagery

Keywords: computer vision, dynamic scenes, computer animation, image-based rendering

1 Introduction

Augmentation of real-world scenes with synthetic objects is a problem of great significance for Computer Graphics (CG) as well as other research disciplines. Many special effects involve combining CG elements with real footage. Synthetic objects rendered onto real video have also been used to enhance visualization for engineering as well as medical purposes. Most research in this area has focused on inserting either static or rigidly moving objects as the CG elements.

Relevance to the Robotics course: Robots need to know where they are located in the world, and they need to understand the 3D structure of the world they move in. Camera calibration techniques are required to find the position and orientation of the camera, and it is quite common in robotics to place known patterns in the environment so that robots can localize themselves; this has direct application in Simultaneous Localization and Mapping (SLAM). Infrared sensors can be used to measure how far a robot is from its surroundings, which is one way of understanding the 3D world around it, but they are noisy and only give the distance in the frontal direction. It is comparatively easy to mount two cameras on a robot and use stereo to recover the 3D structure of the environment around it.

We used Zhang's technique [Zhang 2000] to calibrate the cameras and the graph-cut based method of [Boykov et al. 2001] to compute the stereo correspondences. Finally, we applied triangulation to the stereo correspondences to recover the 3D structure of the world, and we used this reconstructed 3D world as a height field for simple CG simulations.

2 Background

Computer vision techniques can be used to reconstruct 3D models of real-world scenes, whereas computer graphics can be used to render these reconstructed 3D models as well as synthetic models. Constructing 3D models has been a primary focus of computer vision. Camera calibration [Tsai 1987; Zhang 2000] is the first step: it computes the camera matrix, which determines where a 3D point projects in the image. Stereo correspondence [Scharstein et al. 2002; Boykov et al. 2001] can then be used to recover 3D models from 2D stereo images.

State of the art: Full geometry recovered through stereo techniques has been used to produce high-quality novel views [Zitnick et al. 2004; Buehler et al. 2001]. These techniques mainly focus on modeling and rendering real scenes; adding CG simulations into such novel views is relatively unexplored and would be very interesting. In this project, I built a framework towards achieving this goal.

Figure 2: Stereo rig used for capturing stereo images.

3 Modules

This project is divided into the following modules:

- Capturing: the imaging process involved in obtaining stereo videos.
- Camera calibration: obtaining camera parameters (both interior and exterior) through calibration.
- Stereo correspondence: computing correspondences from the stereo images/videos.
- Depth map: obtaining depth maps from the correspondences and the calibration matrices computed in the previous modules.
- Foreground: separating the foreground from the background.
- Tool for interaction: allowing the user to select regions of interest, change viewpoints, and choose where to place CG elements in the video.

Contributions and collaborations: This project is part of my research project, Fluids in Video, in collaboration with Vivek Kwatra and Philippos Mordohai. We worked closely to get all the modules working, although I spent most of my time on the stereo correspondence, foreground, and interaction-tool modules.

4 Capturing

We used a standard stereo rig, such as the one shown in figure 2, to capture our stereo videos; one such frame is shown in figure 3. In this section we first explain the notation used in the rest of the report and then describe the mathematics of the imaging process.

Figure 3: Stereo pair.

Notation: A point in 2D space is represented by a pair of coordinates $(x, y)$ in $\mathbb{R}^2$ and in homogeneous coordinates as a 3-vector. An arbitrary homogeneous vector $\mathbf{x} = (x_1, x_2, x_3)^T$ represents the point $(x_1/x_3, x_2/x_3)$ in $\mathbb{R}^2$. Similarly, a point in 3D space is represented by a triplet $(X, Y, Z)$ in $\mathbb{R}^3$ and in homogeneous coordinates as a 4-vector. Points in the 2D image plane are written in boldface lower case, $\mathbf{x}, \mathbf{y}, \mathbf{z}$, and in Cartesian coordinates as $\tilde{x}, \tilde{y}, \tilde{z}$. 3D points are written in boldface capitals, $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$, and in Cartesian coordinates as $\tilde{X}, \tilde{Y}, \tilde{Z}$. Matrices are written in boldface capitals, $\mathbf{M}, \mathbf{P}, \mathbf{V}, \mathbf{K}, \mathbf{R}$.
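As a concrete companion to this notation, here is a minimal sketch of the homogeneous/Cartesian conversions. The language (Python with NumPy) and the helper names are my choices for illustration, not something the report prescribes:

```python
import numpy as np

def to_homogeneous(p):
    """Append a 1 to a Cartesian point, e.g. (x, y) -> (x, y, 1)."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def from_homogeneous(x):
    """Divide by the last component, e.g. (x1, x2, x3) -> (x1/x3, x2/x3)."""
    x = np.asarray(x, dtype=float)
    return x[:-1] / x[-1]

# The homogeneous vector (4, 6, 2) represents the 2D point (2, 3).
assert np.allclose(from_homogeneous([4.0, 6.0, 2.0]), [2.0, 3.0])
```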
Imaging: A camera maps a 3D world point to a 2D point in the image. If the world and image points are represented by homogeneous vectors, the mapping between their homogeneous coordinates can be expressed as

$$\mathbf{x} = \mathbf{M}\mathbf{X} \qquad (1)$$

where $\mathbf{X}$ is the world point represented by the homogeneous 4-vector $(X, Y, Z, 1)^T$, $\mathbf{x}$ is the image point represented as a homogeneous 3-vector, and $\mathbf{M}$ is a $3 \times 4$ homogeneous camera projection matrix. The camera matrix can be written as

$$\mathbf{M} = \mathbf{K}\,[\mathbf{R} \mid \mathbf{t}], \qquad
\mathbf{K} = \begin{bmatrix} \alpha & s & p_x \\ 0 & \beta & p_y \\ 0 & 0 & 1 \end{bmatrix}$$

where $\mathbf{K}$ is the $3 \times 3$ intrinsic matrix consisting of the internal parameters $\alpha$, $\beta$, $s$, $p_x$ and $p_y$, $\mathbf{R}$ is a $3 \times 3$ rotation matrix, and $\mathbf{t}$ is a $3 \times 1$ translation vector. The augmented matrix $[\mathbf{R} \mid \mathbf{t}]$ is the extrinsic matrix. $\mathbf{M}$ has 11 degrees of freedom, of which 5 come from $\mathbf{K}$, 3 from $\mathbf{R}$ (rotations $\theta_x$, $\theta_y$ and $\theta_z$ about the three axes), and 3 from $\mathbf{t}$ ($t_x$, $t_y$ and $t_z$). $\mathbf{R}$ and $\mathbf{t}$ represent the rotation and translation of the camera with respect to the world coordinate system, as shown in figure 4.
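To make Equation 1 concrete, the following sketch builds $\mathbf{M} = \mathbf{K}[\mathbf{R} \mid \mathbf{t}]$ and projects a world point to pixel coordinates. The intrinsics and pose values below are made-up examples for demonstration, not the report's calibrated values:

```python
import numpy as np

# Hypothetical intrinsics: focal lengths alpha, beta; zero skew s;
# principal point (px, py).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Example extrinsics: identity rotation, world origin 5 units down the optical axis.
R = np.eye(3)
t = np.array([[0.0], [0.0], [5.0]])

M = K @ np.hstack([R, t])             # 3x4 projection matrix

X = np.array([0.5, -0.2, 2.0, 1.0])   # world point in homogeneous coordinates
x = M @ X                             # homogeneous image point
u, v = x[0] / x[2], x[1] / x[2]       # pixel coordinates
print(u, v)                           # -> approx. (377.14, 217.14)
```

Note the division by the third homogeneous component: this is what makes the mapping projective rather than affine.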

Figure 4: The rotation and translation between the world and camera coordinate frames.

5 Camera calibration

Camera calibration is the process of estimating the matrix $\mathbf{M}$ of Equation 1. Many methods [Tsai 1987; Zhang 2000; Wang and Tsai 1990; Strunz 1992; Maas 1999] have been proposed in the literature. Some methods [Mikhail and Mulawa 1985; Kruck 1984; Forket 1996; Heikkila 1990] use known geometric shapes in the scene, such as straight lines, circles, and known angles and lengths, to calibrate cameras. Most methods [Tsai 1987; Holt and Netravali 1991; Sutherland 1974] estimate the components of $\mathbf{M}$ with a linear or non-linear optimization technique, given sufficiently many matching points $\mathbf{x}$ in the image and $\mathbf{X}$ in the 3D world; $\mathbf{M}$ is then factored into $\mathbf{K}$, $\mathbf{R}$ and $\mathbf{t}$. We used the popular method proposed by Zhang [Zhang 2000], a flexible calibration technique which only requires the camera to observe a planar pattern at a few (at least two) different orientations. A stereo pair containing the pattern is shown in figure 5.

Figure 5: Stereo pair with the calibration pattern, a checkered board.
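Zhang's planar-pattern method is what OpenCV's `calibrateCamera` implements, so the calibration step might look like the following sketch. This is my illustration rather than the report's actual code, and the board size and image file names are placeholders:

```python
import glob
import cv2
import numpy as np

# Inner-corner count of the checkerboard (placeholder; must match the board used).
pattern_size = (9, 6)

# 3D template for each view: the board's corners in its own plane (Z = 0).
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("calib_*.png"):      # hypothetical image names
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the intrinsic matrix K, the lens distortion, and per-view R
# (as rotation vectors) and t, i.e. the factors of M = K [R | t].
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```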
6 Stereo correspondence

Stereo correspondence is the problem of finding matching points between the left and right views. A particular point in the left image can, a priori, match any point in the right image. Matching criteria can be closeness of average brightness, or special features at that pixel such as edges or corners. Since the matching point could be anywhere in the image, the search space is huge, which makes the problem very difficult. Epipolar geometry tells us that a point in the left view must lie on a line (the epipolar line) in the right image. In an arbitrary configuration this line can be at any angle, so we rectified the images such that the epipolar lines are horizontal. One such example is shown in figure 6.

Figure 6: Rectified images of the stereo pair. Every point on the white line in the left image has its corresponding point on the same line in the right image. Observe the matched blue points (corresponding points) on the nose and on the telephone wire, marked on the left and right images respectively.

We used the graph-cut based method of [Boykov et al. 2001] to compute the stereo correspondences. Graph-cut methods minimize an energy with two terms, a data cost and a smoothness cost:

$$E(f) = E_{\text{smooth}}(f) + E_{\text{data}}(f)$$

where the smoothness cost measures the extent to which the labeling $f$ is not piecewise smooth, and the data cost measures the disagreement between $f$ and the observed data. The left-most image in figure 7 is based solely on the data cost. When we applied the graph-cut algorithm with the above formulation we got staircasing, as shown in the middle image of figure 7. Applying Gaussian filtering produced the final, good result shown in the right image of figure 7.

Figure 7: Images from left to right: disparity map using only the data cost, disparity map using the graph-cut algorithm, and disparity map after applying the Gaussian filter.
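To make the energy concrete, here is a small sketch that evaluates $E(f)$ for a candidate disparity labeling. The specific cost forms are my assumptions (an absolute-difference data term and a truncated linear smoothness term); the report does not state which costs it used:

```python
import numpy as np

def stereo_energy(left, right, disp, lam=10.0, trunc=2.0):
    """E(f) = E_data(f) + E_smooth(f) for a disparity labeling `disp`.

    left, right: rectified grayscale images (H x W float arrays)
    disp:        integer disparity per left-image pixel (H x W)
    """
    h, w = left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Data cost: intensity disagreement with the matched right-image pixel.
    xr = np.clip(xs - disp, 0, w - 1)
    e_data = np.abs(left - right[ys, xr]).sum()
    # Smoothness cost: truncated linear penalty on neighboring disparity jumps.
    dx = np.minimum(np.abs(np.diff(disp, axis=1)), trunc)
    dy = np.minimum(np.abs(np.diff(disp, axis=0)), trunc)
    e_smooth = lam * (dx.sum() + dy.sum())
    return e_data + e_smooth
```

This sketch only evaluates the energy; minimizing it over all labelings is intractable in general, which is why [Boykov et al. 2001] use alpha-expansion graph cuts to find a strong local minimum.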

7 Depth map and 3D model from depth-map

Several methods exist for recovering depth from images, including depth from stereo, shape from focus, shape from defocus, structure from motion, shape from silhouettes, and shape from shading. Triangulation is the standard method for finding a 3D point from its corresponding 2D points in two images. A point $\mathbf{a}$ in one image corresponds to a ray in space passing through the camera center. The camera center $\mathbf{C}_1$ is clearly a point on this ray, and $\mathbf{M}_1^+ \mathbf{a}$ gives the ray's direction, where $\mathbf{M}_1^+$ is the pseudo-inverse of $\mathbf{M}_1$. The 3D points on this ray can therefore be written as $\mathbf{C}_1 + z_1 \mathbf{M}_1^+ \mathbf{a}$, where $z_1$ is the depth of the point. The ambiguity in choosing the 3D point on the ray is resolved using a second camera: if the point $\mathbf{b}$ in the second image corresponds to the same 3D point, the second ray, with points $\mathbf{C}_2 + z_2 \mathbf{M}_2^+ \mathbf{b}$, also contains it, and the two rays intersect at the point that projects to $\mathbf{a}$ and $\mathbf{b}$, as shown in figure 8. If $z_1$ or $z_2$ is known, the 3D point can be computed directly from $\mathbf{C}_1 + z_1 \mathbf{M}_1^+ \mathbf{a}$ or $\mathbf{C}_2 + z_2 \mathbf{M}_2^+ \mathbf{b}$. Otherwise the $z_i$ can be computed from

$$\mathbf{C}_1 + z_1 \mathbf{M}_1^+ \mathbf{a} = \mathbf{C}_2 + z_2 \mathbf{M}_2^+ \mathbf{b}. \qquad (2)$$

Equation 2 gives 3 equations in the 2 unknowns $z_1$ and $z_2$ when the calibration parameters $\mathbf{C}_1$, $\mathbf{C}_2$, $\mathbf{M}_1$ and $\mathbf{M}_2$ are known. One 3D model obtained this way is shown from an arbitrary angle in figure 9. As can be seen, the reconstruction in figure 9 is not very accurate; we are still investigating how to fix the problems with the reconstruction.

Figure 8: Triangulation to find the 3D world point X from the image points a and b and the camera centers C1 and C2.

Figure 9: Textured 3D model shown at an arbitrary angle.
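Equation 2 is an overdetermined 3x2 linear system in $z_1$ and $z_2$, so a least-squares solve handles it directly. A minimal sketch follows; it assumes 3x4 camera matrices and homogeneous image points as inputs, and the choice to return the midpoint of the two closest ray points is mine (with noisy correspondences the rays rarely intersect exactly):

```python
import numpy as np

def camera_center(M):
    """Camera center: the right null vector of the 3x4 matrix M."""
    C = np.linalg.svd(M)[2][-1]
    return C[:3] / C[3]

def triangulate(M1, M2, a, b):
    """Least-squares solve of Equation 2 for the depths z1, z2.

    M1, M2: 3x4 camera matrices. a, b: homogeneous image points (3-vectors).
    Returns the midpoint of the two closest ray points as the 3D estimate.
    """
    C1, C2 = camera_center(M1), camera_center(M2)
    # A point on each back-projected ray, via the pseudo-inverse M+.
    p1 = np.linalg.pinv(M1) @ a
    p2 = np.linalg.pinv(M2) @ b
    d1 = p1[:3] / p1[3] - C1          # ray direction for camera 1
    d2 = p2[:3] / p2[3] - C2          # ray direction for camera 2
    # C1 + z1*d1 = C2 + z2*d2  ->  3 equations, 2 unknowns: least squares.
    A = np.stack([d1, -d2], axis=1)   # 3x2 system matrix
    (z1, z2), *_ = np.linalg.lstsq(A, C2 - C1, rcond=None)
    return 0.5 * ((C1 + z1 * d1) + (C2 + z2 * d2))
```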

8 Foreground

When we started using the graph-cut method, we found it to be a very expensive operation; performing it on every frame would not keep our system interactive. We therefore decided to apply the graph-cut algorithm only to the foreground, i.e. the regions that differ from the background image. We used a simple HSV color-based segmentation of the difference image; the results are shown in figure 10.

Figure 10: From left to right: the background image, the composite (both background and foreground), and the extracted foreground. Noise in the foreground is due to varying lighting conditions in the input images.

9 Interactivity and CG in videos

I used my interactive tool to add simple CG models into the model obtained from the depth-map module described in section 7. Using this tool, one can change the viewpoint (with mouse and keyboard) and select a position on the screen (with the mouse) at which to place a 3D CG model in the scene. I used OpenGL for this: I read the depth buffer at the point where the user wants to place the CG model and recover the 3D location in the scene. Three such frames are shown in figure 11.

Figure 11: Three frames of the animation.

10 Conclusions and future work

We have developed a computer vision framework that lets us add CG simulations into dynamic real scenes using the interactive tool. In the future, we plan to implement crowd simulations in real videos and to change the viewpoint of the augmented scene using some form of image-based rendering.

Crowd simulations in real videos: We plan to implement crowd simulations in real videos as follows:

- Render the top view of the reconstructed world, facing the ground in the scene, using OpenGL.
- Read the Z-buffer, which acts as the height field for the crowd simulation.
- Define goal positions for the crowd in the field.
- Run some form of continuum crowds [Treuille et al. 2006] on top of these fields.

The challenges in this kind of simulation are (i) the dynamic height fields obtained by the above process and (ii) the inaccuracy of the height maps. We plan to develop simulation and rendering algorithms that work in such noisy, dynamic environments.

References

BOYKOV, Y., VEKSLER, O., AND ZABIH, R. 2001. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 11, 1222–1239.

BUEHLER, C., BOSSE, M., MCMILLAN, L., GORTLER, S. J., AND COHEN, M. F. 2001. Unstructured lumigraph rendering. In Proc. ACM SIGGRAPH, 425–432.

FORKET, G. 1996. Image orientation exclusively based on free-form tie curves. International Archives of Photogrammetry and Remote Sensing 31, B3, 196–201.

HEIKKILA, J. 1990. Update calibration of a photogrammetric station. International Archives of Photogrammetry and Remote Sensing 28, 5/2, 1234–1241.

HOLT, R. J., AND NETRAVALI, A. N. 1991. Camera calibration problem: some new results. Computer Vision, Graphics, and Image Processing 54, 3, 368–383.

KRUCK, E. 1984. A program for bundle adjustment for engineering applications - possibilities, facilities and practical results. International Archives of Photogrammetry and Remote Sensing 25, A5, 471–480.

MAAS, H. G. 1999. Image sequence based automatic multi-camera system calibration techniques. ISPRS Journal of Photogrammetry and Remote Sensing 54, 6, 352–359.

MIKHAIL, E. M., AND MULAWA, D. C. 1985. Geometric form fitting in industrial metrology using computer-assisted theodolites. ASP/ACSM Fall Meeting.

SCHARSTEIN, D., SZELISKI, R., AND ZABIH, R. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47, 1.

STRUNZ, G. 1992. Image orientation and quality assessment in feature based photogrammetry. Robust Computer Vision, 27–40.

SUTHERLAND, I. E. 1974. Three-dimensional data input by tablet. Proceedings of the IEEE 62, 4, 453–461.

TREUILLE, A., COOPER, S., AND POPOVIĆ, Z. 2006. Continuum crowds. ACM Transactions on Graphics 25, 3, 1160–1168.

TSAI, R. Y. 1987. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation 3, 4, 323–344.

WANG, L. L., AND TSAI, W. H. 1990. Computing camera parameters using vanishing-line information from a rectangular parallelepiped. Machine Vision and Applications 3, 129–141.

ZHANG, Z. 2000. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 11, 1330–1334.

ZITNICK, C. L., KANG, S. B., UYTTENDAELE, M., WINDER, S., AND SZELISKI, R. 2004. High-quality video view interpolation using a layered representation. In Proc. ACM SIGGRAPH.