An Overview of Matchmoving using Structure from Motion Methods



Kamyar Haji Allahverdi Pour
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Alimohammad Rabbani
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Abstract

Seeing supernatural events in movies is no longer rare. Here we discuss how extraterrestrial objects can be added to an ordinary video. This process of adding computer-generated objects to a natural video is called Matchmoving [1]. Its main part is Structure from Motion (SfM), the process that gives Matchmovers access to the 3D information of the scene. We mainly discuss this part: how it works and the different methods for extracting that information.

I. INTRODUCTION

A Matchmover who adds an artificial object to a real scene needs the 3D information of that scene. This information can be extracted from the sequence of images: the differences between images give a perception of the 3D structure. Because projecting 3D information to 2D inevitably loses data, this task is challenging and needs extra information for the reconstruction. That information can come from prior knowledge about the scene, e.g. looking for parallel lines in the images. Another way is to find corresponding points in different views and construct 3D points using triangulation [8].

II. STRUCTURE FROM MOTION

First, we discuss the process by which we can find information about the real scene. This is done with a method called Structure from Motion. The basic steps of this process are shown in Figure 1. The process starts with understanding the mapping between the 3D world and the 2D image plane. After that, we discuss how to find 3D points from 2D points.

A. From 3D to 2D

We want to model the process by which a camera maps a 3D point to a 2D point.
The most common model is the pinhole projection model. If there are no non-linear effects such as radial distortion, this model is a good approximation for most real cameras. It converts a 3D point to a 2D point in three components:

1) The first component transforms real-world coordinates to camera coordinates. This transform consists of a rotation and a translation, which together are called the camera extrinsic parameters.

2) The second component maps 3D points to 2D points. This mapping is a matrix like the one below:

\[
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
\simeq
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}
\]

where $f$ is the focal length of the camera and $(X_c, Y_c, Z_c)$ is the point in camera coordinates. We are using homogeneous coordinates, so the right-hand side is divided by its last element to keep the last element of the image point equal to 1. Because this scale factor is not available from the image, we only have an arbitrary perception of depth: from a single image we cannot tell how far an object is from the camera. As we do not know the exact number, we can set this scale factor to 1.

3) The final component converts 2D coordinates on the image plane to pixel coordinates. This transformation is done with an upper triangular matrix $K$ that holds the camera intrinsic parameters. It can be written as:

\[
K = \begin{bmatrix} \alpha_u & s & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\]

where $\alpha_u$ and $\alpha_v$ are the scale factors along the two pixel axes, $s$ is the skew, and $(u_0, v_0)$ is the principal point.

Fig. 1. Basic Steps of SfM [5]

By combining the aforementioned matrices, we can compute a projection matrix, $P$, which maps 3D points to 2D pixel coordinates:

\[
x \simeq P X
\]

Each 3D point with a known image point gives two linear equations in the elements of $P$. This matrix has 12 elements, of which 11 are unknown because the scale factor is arbitrary. So if we have at least 6 points in 3D space and their correspondents in the 2D image plane, we can write 12 equations to find the elements of $P$.

B. Finding Correspondent Points

As mentioned before, we need to find corresponding points to extract the 3D scene structure. One way to find them is to ask a user to tag these points in the images. This approach is time consuming, and the results depend on how accurately the user selects the points. Doing this automatically is, in general, a difficult task. Automatic methods describe important pixels, called interest points, using the information around them. The similarity between two such descriptors then indicates how likely the two points are to be a match. For example, we can use the Harris corner detector [3] to find interest points. The Harris corner detector finds points that are maxima of the image auto-correlation function.

C. Epipolar Geometry

Epipolar geometry, or two-view geometry, provides the relation between two views of a scene. If we have a point $x$ in the first view, epipolar geometry restricts the corresponding point $x'$ in the second view to lie on a line called the epipolar line (Figure 2). This constraint can be formulated using a matrix called the essential matrix.

Fig. 2. Epipolar Geometry

1) Essential Matrix: This matrix gives the relation from which epipolar lines can be found:

\[
x^T E x' = 0
\]

Written this way, $E^T x$ is the epipolar line in the second view corresponding to the point $x$ in the first view, and $E x'$ is the epipolar line in the first view corresponding to $x'$. The essential matrix has 5 parameters and can be found using 5 different point matches [2].

2) Fundamental Matrix: We can expand this relation further to obtain a formulation for two corresponding pixel points $u$ and $u'$ in the two images. The resulting matrix is called the fundamental matrix:

\[
u^T F u' = 0
\]

This relation can be derived by replacing each point in the essential matrix relation with the product of the inverse of the camera intrinsic parameters matrix and the pixel point, i.e. $x = K^{-1} u$ and $x' = K'^{-1} u'$:

\[
\begin{aligned}
(K^{-1} u)^T E \, (K'^{-1} u') &= 0 \\
u^T (K^{-T} E K'^{-1}) \, u' &= 0 \\
u^T F u' &= 0
\end{aligned}
\]

So the fundamental matrix is given by $F \simeq K^{-T} E K'^{-1}$. This matrix has 9 elements and can only be computed up to an arbitrary scale, so we can calculate it using 8 correspondent points.

3) Projection Matrices: After finding the fundamental matrix we can calculate the projection matrices. This is accomplished by decomposing the fundamental matrix [4].

D. Triangulation: From 2D to 3D

Once the camera projection matrices are known, we can find the 3D point corresponding to every pair of 2D points. This process is called triangulation. It is done by intersecting projection rays, i.e. rays passing through a camera center and the 2D point on its image plane. As there is always some noise in our measurements, the rays of the two 2D points will generally not intersect. So the main task of triangulation is to find the 3D point that minimizes this error:

\[
\hat{X} = \arg\min_{X} \sum_i \lVert u_i - \hat{u}_i \rVert
\]

where $u_i$ is the measured 2D point in image $i$ and $\hat{u}_i$ is the 2D point predicted from the 3D point $X$ and the projection matrix of that image. After triangulation we have the information we were looking for about the structure of the scene.

E. Multiple-View Structure from Motion

As discussed in the previous section, given two views we can find the corresponding 3D points and reconstruct the structure of the scene. But with a sequence of images, e.g. frames of a video, we have more than two views. The methods used to reconstruct a 3D scene from multiple views are called sequential methods. As new views are given to these methods, they extend the partially reconstructed scene with the new information.
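Returning to the two-view case of Sections II-A and II-D, the following is a minimal sketch of pinhole projection followed by triangulation. It makes several simplifying assumptions that are not in the text: the intrinsic matrix is reduced to K = diag(f, f, 1), the camera poses are known, and the 3D point is recovered with the simple midpoint method (the midpoint of the shortest segment between the two rays) rather than by minimizing the reprojection error above. All function names and numeric values are illustrative.

```python
# Two-view sketch: pinhole projection, back-projection to rays, midpoint triangulation.
# Assumptions: K = diag(f, f, 1), known poses, midpoint method instead of
# reprojection-error minimization. Names and values are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(f, R, t, X):
    """Project world point X to image coordinates with x ~ K [R | t] X."""
    # World -> camera coordinates (extrinsic parameters).
    Xc = [dot(R[i], X) + t[i] for i in range(3)]
    # Dividing by depth fixes the arbitrary scale of homogeneous coordinates.
    return (f * Xc[0] / Xc[2], f * Xc[1] / Xc[2])

def back_project(f, R, t, u):
    """Return (camera center, ray direction) in world coordinates for image point u."""
    d_cam = [u[0] / f, u[1] / f, 1.0]
    # R is a rotation, so its inverse is its transpose.
    direction = [sum(R[j][i] * d_cam[j] for j in range(3)) for i in range(3)]
    center = [-sum(R[j][i] * t[j] for j in range(3)) for i in range(3)]
    return center, direction

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + r*d2."""
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b  # zero only when the rays are parallel
    s = (b * e - c * d) / denom
    r = (a * e - b * d) / denom
    p1 = [c1[i] + s * d1[i] for i in range(3)]
    p2 = [c2[i] + r * d2[i] for i in range(3)]
    return [(p1[i] + p2[i]) / 2.0 for i in range(3)]

# Illustrative setup: two cameras one unit apart along the x axis.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
F_PIXELS = 800.0
T1, T2 = [0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]
X_TRUE = [0.5, 0.2, 4.0]

u1 = project(F_PIXELS, IDENTITY, T1, X_TRUE)
u2 = project(F_PIXELS, IDENTITY, T2, X_TRUE)
X_EST = triangulate_midpoint(*back_project(F_PIXELS, IDENTITY, T1, u1),
                             *back_project(F_PIXELS, IDENTITY, T2, u2))
```

With noise-free measurements the two rays intersect and the midpoint coincides with the original point; with noisy measurements it is only an approximation, which is why a reprojection-error criterion is usually preferred.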
One of these methods uses sequential Monte Carlo (SMC) methods to estimate the 3D scene [7]. SMC originated as a non-linear filter for smoothing state-space models. Another well-known method for refining 3D points when multiple views are available is Bundle Adjustment, which refines the 3D points and the projection matrices simultaneously.

F. Multibody Structure from Motion

Multibody structure from motion extends SfM to dynamic scenes, where objects can move and the scene is not static. This task is challenging because the number of moving objects can change over time. Moving objects are also often small, so only a few features can be tracked on them [6]. The main requirements for building a system that enables dynamic SfM are:


1) finding the number of moving objects at the beginning of the image sequence,
2) segmenting the feature tracks to find the different motions in the scene,
3) finding the 3D structure of each moving object.

The accuracy of the method depends strongly on how accurately the feature tracks are segmented.

III. 3D MATCH MOVING

Match moving is a visual-effects technique that allows us to insert computer-generated objects into scenes captured by a real camera. Technically, it is hard to insert these objects so that they maintain correct position, orientation, scale, and motion relative to the other objects in the real scene. The key process in match moving is to determine the camera movement in 3D space precisely. Once the camera movement is known, newly inserted 3D objects appear matched in perspective, position and scale. This means that if we have a camera-tracked blank scene, we can create a 3D object in it. After rendering the 2D video again with the new object inserted, we see the object moving in the 2D scene. These movements give the impression that the camera is really moving around the 3D object.

If we also have the 3D structure of the objects present in the footage, we can recreate the scene and add new objects with respect to the real objects. The final result must comply with basic rules and concepts of human vision such as motion parallax and occlusion. This is where SfM comes in handy: we create 3D structures of the real objects in the original footage so that the new objects comply with these concepts. Figure 3 depicts our sample match-moved footage.

With these notes in mind, we can break the process of match moving into the following steps:

1) Feature tracking
2) Calculating 3D information
3) Camera tracking
4) 3D modeling
5) Video composition

A. Feature Tracking

Feature tracking is a process very similar to motion estimation. It is the first step and must be done before calculating 3D information or tracking the camera.
Features are points of interest in the image that a tracking algorithm can follow over several frames. Depending on the tracking algorithm, features may be selected in different ways. The more precisely the algorithm tracks features, the more precisely we can calculate 3D information. As in motion estimation, each active feature has a motion vector from frame to frame. As a result, we can find the position of each feature throughout the entire footage from its motion vectors and its position in a single frame. Statistical information may be used to eliminate or correct mistakes in feature tracking. A sample visualization of feature tracking is shown in Figure 4.

Fig. 3. Sample Match Moved Footage

Fig. 4. Feature Tracking

B. Calculating 3D Information

Extracting 3D information from the original footage is an essential part of match moving. First, it helps solve the camera's 3D motion. Then, the same information is used in later steps to maintain the visual relations between added objects and real objects. As an example of why this information is needed, suppose we are placing a cylinder on top of a table. First, we need to know where in space the surface of the table is located. In addition, part of the cylinder's shadow may fall beyond the table's surface onto the ground, so the ground plane is also needed. Even this simple example needs at least two surfaces. More complex situations and scenes require more points and surfaces. At this point, SfM methods are used to build the 3D structure of the scene. The level of detail needed in the structure is determined by the complexity of the scene and the artist's requirements.

C. Camera Tracking

Another step where SfM methods appear is tracking the camera and solving its motion in 3D space. This is the key process in match moving.
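Looking back at the feature-tracking step of Section III-A, the bookkeeping it describes (a position in one frame plus per-frame motion vectors) can be sketched as follows. The function names are hypothetical, and the outlier test is only one possible reading of "statistical information": it assumes that most features between two consecutive frames share a dominant motion and flags vectors that deviate from the per-axis median by more than a chosen threshold.

```python
from statistics import median

def track_positions(start, motion_vectors):
    """Positions of one feature in every frame.

    start: (x, y) position in the first frame.
    motion_vectors: one (dx, dy) displacement per frame transition.
    """
    positions = [start]
    for dx, dy in motion_vectors:
        x, y = positions[-1]
        positions.append((x + dx, y + dy))
    return positions

def flag_outliers(vectors, max_deviation):
    """Flag motion vectors far from the per-axis median.

    vectors: the (dx, dy) of all tracked features between two consecutive frames.
    Assumption: the dominant motion is shared by most features, so a vector far
    from the median is a likely tracking mistake.
    """
    mx = median(v[0] for v in vectors)
    my = median(v[1] for v in vectors)
    return [abs(v[0] - mx) > max_deviation or abs(v[1] - my) > max_deviation
            for v in vectors]

# Illustrative values only.
path = track_positions((10, 20), [(1, 0), (2, -1), (0, 3)])
flags = flag_outliers([(1, 0), (1, 1), (30, -20), (1, 0)], max_deviation=5)
```

A flagged vector could be dropped or replaced by the median motion before the track is extended; which of the two is appropriate depends on the tracker and is left open here.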
Again, using SfM techniques, the characteristics of the camera are estimated throughout the footage. These characteristics include orientation, position, and focal length. In a nutshell, camera tracking is done by inverse projection of the 2D paths of rigid objects in the footage. During camera tracking, a vector specifying the camera's characteristics is created for each frame. In a perspective 3D view, the camera's path and orientation can be seen clearly using these vectors. Figure 5 shows a perspective view of the original footage. In this figure, the 3D structure of the scene is partially reconstructed and the camera is tracked as well.
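The per-frame camera vector described above can be sketched as a small record from which the camera path is recovered. This is a sketch under stated assumptions: the record name `CameraState` and its fields are hypothetical, and the extrinsic parameters are assumed to map world coordinates to camera coordinates as X_c = R X + t, in which case the camera position in world coordinates is C = -R^T t.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass
class CameraState:
    """Per-frame camera characteristics (hypothetical record layout)."""
    frame: int
    R: Sequence[Sequence[float]]  # orientation: rotation, world -> camera
    t: Sequence[float]            # translation, world -> camera
    focal_length: float

def camera_center(state: CameraState) -> Tuple[float, ...]:
    """Camera position in world coordinates, C = -R^T t."""
    R, t = state.R, state.t
    return tuple(-sum(R[j][i] * t[j] for j in range(3)) for i in range(3))

def camera_path(states: List[CameraState]) -> List[Tuple[float, ...]]:
    """Camera positions ordered by frame number, i.e. the path drawn in a 3D view."""
    return [camera_center(s) for s in sorted(states, key=lambda s: s.frame)]

# Illustrative values: frame 0 at the origin, frame 1 rotated 90 degrees about y.
s0 = CameraState(0, [[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0, 0, 0], 35.0)
s1 = CameraState(1, [[0, 0, 1], [0, 1, 0], [-1, 0, 0]], [0, 0, 5], 35.0)
path = camera_path([s1, s0])
```

Connecting consecutive entries of `path` gives the kind of camera trajectory that a tracking tool displays in its perspective view.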


Fig. 5. Camera Tracking

Fig. 7. Different Rendered Layers of Output

Fig. 8. Composited Footage

The red line in Figure 5 depicts the path along which the camera moves during the footage.

D. 3D Modeling

A 3D artist may now get involved to design new objects for the scene. The camera and the reconstructed points from the footage are imported into 3D modeling software, so the artist can see the important points in 3D space from the camera's view. Since the camera is moving in the scene, inserted objects move accordingly in the final 2D output. Figure 6 shows the interface of a 3D modeling program with a cylinder added on the surface of a table. In this step, the artist's talent plays a crucial role in the quality of the result. The artist puts the new objects in place and adds detail to them. The outputs of this step are image sequences of different layers. Each layer holds one set of characteristics of the objects. For example, a single layer may be used for shadows, meaning that only the shadows of the objects are visible in that layer. The image sequence of each layer consists of new frames that show the position of the objects in each frame. These frames are consistent with the frames of the original footage. In the next step, the image sequences are composited into the original footage to create the final result. Figure 7 shows different layers of rendered output.

E. Video Composition

This is the final step in match moving. The different layers of newly added objects need to be composited into the original footage. As a result, computer-generated objects appear matched into the sequence, as if they had actually been there when the footage was recorded. Many software packages, such as Adobe After Effects, are available for video compositing.
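Compositing one rendered layer into an original frame can be sketched with the standard "over" operation, out = a * layer + (1 - a) * frame. This is an illustrative sketch, not how any particular compositing package works: frames are assumed to be nested lists of (r, g, b) pixels, layers carry an extra alpha value in [0, 1], and the `opacity` parameter is a hypothetical per-layer control.

```python
def over(fg, alpha, bg, opacity=1.0):
    """Blend one foreground pixel over one background pixel.

    fg, bg: (r, g, b) tuples; alpha: coverage of the foreground in [0, 1];
    opacity: extra per-layer factor in [0, 1] (assumed control, see lead-in).
    """
    a = alpha * opacity
    return tuple(a * f + (1.0 - a) * b for f, b in zip(fg, bg))

def composite_frame(frame, layer, opacity=1.0):
    """Composite an RGBA layer over an RGB frame of the same size.

    frame: rows of (r, g, b) pixels from the original footage.
    layer: rows of (r, g, b, a) pixels from one rendered layer.
    """
    return [
        [over(px[:3], px[3], bg, opacity) for px, bg in zip(layer_row, frame_row)]
        for layer_row, frame_row in zip(layer, frame)
    ]

# Illustrative 1x2 frame: a black shadow pixel at half strength, then a
# fully transparent layer pixel that leaves the footage untouched.
frame = [[(200, 100, 50), (10, 20, 30)]]
shadow_layer = [[(0, 0, 0, 1.0), (255, 255, 255, 0.0)]]
result = composite_frame(frame, shadow_layer, opacity=0.5)
```

Keeping `opacity` as a per-layer argument means one layer can be weakened without touching any other layer of the same object.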
Using blending modes, alpha channel masking, and similar image composition techniques, the artist is able to reconstruct objects from different layers. The layering done in the previous step helps the artist deal with more complex compositions. Suppose a computer-generated object needs to cast its shadow on an object from the original footage. Depending on the situation, it may be necessary to reduce the intensity of that shadow. If the layers for color and shadow are not separated, the artist may not be able to reduce the intensity of the object's shadow without changing its color characteristics. This may result in a visual error, which is not acceptable in match moving. Figure 8 shows our sample footage being composited.

IV. CONCLUSION

In this overview we covered the process of Match Moving. We also discussed Structure from Motion, which is the main element of Match Moving. We saw that new methods of SfM enable us to create great visual effects.

Fig. 6. 3D Model of a Cylinder

REFERENCES

[1] T. Dobbert. Matchmoving: The Invisible Art of Camera Tracking. Wiley Desktop Editions. John Wiley & Sons,
[2] Friedrich Fraundorfer, Petri Tanskanen, and Marc Pollefeys. A minimal case solution to the calibrated relative pose problem for the case of two known orientation angles. In Proceedings of the 11th European conference on Computer vision: Part IV, ECCV '10, pages , Berlin, Heidelberg, Springer-Verlag.
[3] Chris Harris and Mike Stephens. A combined corner and edge detector, volume 15, pages Manchester, UK, 1988.

[4] R. I. Hartley and A. Zisserman. Retrieving the camera matrices, pages Cambridge University Press, ISBN: , second edition,
[5] Gang Liu, Reinhard Klette, and Bodo Rosenhahn. Structure from motion in the presence of noise. In Image and Vision Computing New Zealand, pages ,
[6] Kemal E. Ozden, Konrad Schindler, and Luc Van Gool. Multibody structure-from-motion in practice. IEEE Trans. Pattern Anal. Mach. Intell., 32(6): , June
[7] Gang Qian and Rama Chellappa. Structure from motion using sequential Monte Carlo methods. International Journal of Computer Vision, 59(1):5-31,
[8] D. P. Robertson and R. Cipolla. Structure from Motion. In Varga, M., editors, Practical Image Processing and Computer Vision. John Wiley, 2009.
