Motion

Sohaib A Khan

1 Introduction

So far, we have been dealing with single images of a static scene taken by a fixed camera. Here we will deal with sequences of images taken at different time instants. Motion of an object in 3D, or of the camera itself, induces 2D motion of pixels on the image plane. This motion is called optical flow. We will first present a method to compute local optical flow at each pixel, followed by a method to compute global flow, which fits an affine model.

2 Optical Flow

Optical flow may be defined as a flow field (u(x, y), v(x, y)), where u(x, y) is the velocity of pixel (x, y) in the x-direction and v(x, y) is its velocity in the y-direction. Example optical flow fields are shown in Figure 1. It can be seen from this figure that optical flow is a powerful feature for object segmentation. In addition, several applications stem from optical flow computation, such as object-based compression (MPEG-4), image stabilization and gesture recognition.

2.1 Brightness Constancy Equation

Let a 3D function f(x, y, t), where x and y are the spatial coordinates and t is the time coordinate, denote the image sequence. Then f(x_1, y_1, t_1) is the gray level at pixel (x_1, y_1) at time t_1. We assume that a small change dx, dy and dt in x, y and t produces no change in the gray level, that is:

    f(x, y, t) = f(x + dx, y + dy, t + dt)    (1)

This equation represents the brightness constancy assumption: over small time intervals, pixels can experience small displacements but no change in intensity. This assumption is not always true in real-world sequences, because of non-Lambertian object surfaces, changes in distance from light sources, and camera noise. However, it is a reasonable simplifying assumption, based on which we can easily derive equations for computing u and v. Recall that the Taylor series expansion of a function f(x) about a point x = a is given by:

    f(x) = f(a) + (x − a) f′(a) + (1/2!)(x − a)² f″(a) + …    (2)
Thus, the Taylor series expansion of the right-hand side of Equation 1 around (x, y, t) is:

    f(x + dx, y + dy, t + dt) = f(x, y, t) + dx ∂f/∂x + dy ∂f/∂y + dt ∂f/∂t + …    (3)

Adapted from Fundamentals of Computer Vision by Mubarak Shah © 1992, and other sources.
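The first-order expansion in Equation 3 is easy to verify numerically. The sketch below uses a made-up smooth Gaussian sequence standing in for f(x, y, t), and central differences for the partial derivatives; it compares the two sides of Equation 3 for a small (dx, dy, dt):

```python
import numpy as np

# A smooth synthetic "image sequence" standing in for f(x, y, t):
# a Gaussian blob translating over time (hypothetical example).
def f(x, y, t):
    return np.exp(-((x - t) ** 2 + (y - 0.5 * t) ** 2))

x, y, t = 0.3, 0.7, 0.0          # point about which we expand
dx, dy, dt = 1e-4, 1e-4, 1e-4    # small displacements
h = 1e-5                         # step for central-difference derivatives

fx = (f(x + h, y, t) - f(x - h, y, t)) / (2 * h)
fy = (f(x, y + h, t) - f(x, y - h, t)) / (2 * h)
ft = (f(x, y, t + h) - f(x, y, t - h)) / (2 * h)

lhs = f(x + dx, y + dy, t + dt)                 # left side of Eq. 3
rhs = f(x, y, t) + dx * fx + dy * fy + dt * ft  # first-order right side
print(abs(lhs - rhs))  # only higher-order terms remain: a tiny residual
```

The residual shrinks quadratically as the displacements shrink, which is exactly what dropping the higher-order terms predicts.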
Figure 1: Examples of optical flow fields: (a) translation, (b) rotation, (c) zoom, (d) unzoom are flow fields that fit a global model such as an affine or projective model. (e) shows an image from a sequence whose optical flow is shown in (f). Here different image regions are moving at different velocities that do not fit a global 2D displacement model.

Ignoring higher-order terms and substituting in Equation 1, we get:

    f(x, y, t) = f(x, y, t) + dx ∂f/∂x + dy ∂f/∂y + dt ∂f/∂t    (4)

The above equation can be simplified as:

    f_x dx + f_y dy + f_t dt = 0,    (5)

where f_x = ∂f/∂x, f_y = ∂f/∂y and f_t = ∂f/∂t are the x-, y- and t-derivatives respectively. These derivatives can be computed by convolving the sequence f(x, y, t) with the masks shown in Figure 2. Dividing each term in the above equation by dt, we get:

    f_x u + f_y v + f_t = 0,    (6)

where u = dx/dt and v = dy/dt form the optical flow. This equation is called the brightness constancy equation. It contains two unknowns, u and v, and therefore a unique solution for these two unknowns does not exist based on the information available at a single pixel. In fact, the equation places a linear constraint on the possible solutions of u and v, which can be seen if it is rewritten as:

    v = −(f_x / f_y) u − f_t / f_y    (7)

This is the equation of a straight line in uv-space. There are several possible solutions of (u, v), which lie along this line, as shown in Figure 3. Let (û, v̂) be the correct solution. This vector can
Figure 2: Derivative masks: the axis convention (left) and the derivative masks that conform to this convention. Note that in this convention, optical flow vectors go from I_t to I_{t+1}.

be divided into two components: one along the straight line, denoted by p, and the other along the perpendicular to it, denoted by d. Since cos α = d f_x / (−f_t) and cos(90° − α) = d f_y / (−f_t), it follows that

    d = −f_t / √(f_x² + f_y²).

Therefore, knowing the derivatives f_x, f_y and f_t, we can only compute the normal component, d, of the optical flow. The parallel component p cannot be computed directly from the derivatives.

2.1.1 Example

It is instructive to work through a simple example to clearly understand the concept of normal and parallel components of optical flow. Consider a foreground object which has translated by (û = 1, v̂ = 1) between the two frames of Figure 4. We are interested in finding the optical flow at the point marked x. Applying the masks of Figure 2 at x (taking the origin of the mask to be the bottom-right corner), we get f_x(x) = 0, f_y(x) = 2 and f_t(x) = −2. This means that the possible solutions of (u, v) lie along the line v = 1 in uv-space. This makes sense, because if we look at a localized neighborhood around point x, we can only determine that the edge has moved by one pixel in the horizontal direction. All variations in the vertical direction, for example (u = 0, v = 1), (u = −1, v = 1), (u = 3, v = 1), generate exactly the same local variations in the image, and hence the same derivatives.

This example illustrates an important problem with the brightness constancy equation, known as the aperture problem. This problem is illustrated in Figure 5 and can be stated as follows: the component of the motion field in the direction orthogonal to the spatial image gradient is not constrained by the brightness constancy equation.

3 Lucas-Kanade Method

The Lucas-Kanade method of finding optical flow relies on a least-squares solution for (u, v) over a small neighborhood.
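As a minimal sketch of this least-squares idea (the derivation is given in the equations that follow): stack one brightness constancy constraint per pixel of the neighborhood and solve the over-constrained system with NumPy. The synthetic derivative values below are made up so that the true flow is known in advance.

```python
import numpy as np

def lucas_kanade_flow(fx, fy, ft):
    """Least-squares flow for one neighborhood.
    fx, fy, ft: derivative values over the neighborhood, e.g. 3x3 arrays."""
    A = np.column_stack([fx.ravel(), fy.ravel()])  # one row per pixel: [fxi, fyi]
    B = -ft.ravel()                                # right-hand side: -fti
    # u = (A^T A)^{-1} A^T B -- the pseudo-inverse (least-squares) solution
    uv, *_ = np.linalg.lstsq(A, B, rcond=None)
    return uv

# Synthetic 3x3 neighborhood whose derivatives are consistent with a
# known flow (u, v) = (1, -2), i.e. fx*u + fy*v + ft = 0 at every pixel.
rng = np.random.default_rng(0)
fx = rng.standard_normal((3, 3))
fy = rng.standard_normal((3, 3))
ft = -(fx * 1.0 + fy * -2.0)
print(lucas_kanade_flow(fx, fy, ft))  # recovers approximately [1, -2]
```

With noisy real derivatives the constraints will not be exactly consistent, and the same call returns the flow minimizing the squared error instead.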
The idea is very simple. We know that a single point yields one equation, from which two unknowns cannot be recovered. However, if we assume that the brightness constancy assumption holds for a small neighborhood around the point (typically a 3x3
Figure 3: Optical flow constraint line in uv-space. d is the length of the perpendicular from the origin to the line; α is the angle the perpendicular makes with the u-axis. (û, v̂) is one possible solution, which can be divided into two components: p along the constraint line, and d perpendicular to the constraint line.

Figure 4: Example of an image sequence with a (1, 1) translation at all points between two frames. Derivative masks are applied at x (affecting the pixels enclosed in the blue square). If white pixels are 1 and black are 0, then f_x = 0, f_y = 2, f_t = −2. The possible solutions of (u, v) lie along the line v = 1 in uv-space (right).
Figure 5: The aperture problem: the black and gray lines represent two positions of the same image line in two consecutive frames. The image velocity perceived through a small aperture (left) is only the normal component d. The actual image velocity u is shown on the right.

or 5x5 neighborhood), then each point yields one equation, but this set of equations still has only two unknowns. This yields an over-constrained linear system, which can be solved by the least-squares method. Formally, consider a 3x3 neighborhood for which the brightness constancy assumption holds, i.e. we assume that the entire neighborhood has moved over the interval dt with velocity (u, v). Then:

    f_x1 u + f_y1 v = −f_t1
    f_x2 u + f_y2 v = −f_t2    (8)
      ⋮
    f_x9 u + f_y9 v = −f_t9

If we define

    A = [ f_x1  f_y1 ]        B = [ −f_t1 ]        u = [ u ]
        [ f_x2  f_y2 ]            [ −f_t2 ]            [ v ],    (9)
        [  ⋮     ⋮   ]            [   ⋮   ]
        [ f_x9  f_y9 ]            [ −f_t9 ]

then this gives us a linear system of the form

    A u = B,    (10)

which can be solved by taking the pseudo-inverse:

    Aᵀ A u = Aᵀ B    (11)
    u = (Aᵀ A)⁻¹ Aᵀ B    (12)

This is the least-squares solution: it finds the values of (u, v) for which the squared error is minimum. This can be seen by deriving Eq. 11 in an alternate
manner. Consider the error term

    e = Σ_i (f_xi u + f_yi v + f_ti)².    (13)

Ideally this error should be zero over all points in the neighborhood, so we can find optimal values of (u, v) by minimizing it:

    ∂e/∂u = 2 Σ_i (f_xi u + f_yi v + f_ti) f_xi = 0    (14)
    ∂e/∂v = 2 Σ_i (f_xi u + f_yi v + f_ti) f_yi = 0

We can simplify these two equations and write them as:

    [ Σ_i f_xi²       Σ_i f_xi f_yi ] [ u ]   =   [ −Σ_i f_xi f_ti ]
    [ Σ_i f_xi f_yi   Σ_i f_yi²     ] [ v ]       [ −Σ_i f_yi f_ti ],    (15)

which is simply an expanded form of Eq. 11.

3.1 Lucas-Kanade with Pyramids

Typically the derivative masks are small and therefore cannot capture fast-moving objects in the scene. There is thus a need either to use larger derivative masks or to use smaller images. One technique for the latter is to use pyramids. At the highest (coarsest) level of the pyramid, standard Lucas-Kanade is applied. The resulting flow vectors from each level are propagated to the next level, using interpolation for the intermediate values. They are then multiplied by two to compensate for the increased resolution at that level. A correction to the flow vectors is then computed by applying Lucas-Kanade again, with the additional step that f_t is computed after compensating for the known estimate of the flow. Finally the correction is added to the initial estimate to obtain the optical flow at the current level. The algorithm is illustrated in Figure 6.

4 Global Affine Flow

So far we have looked at computing optical flow at every pixel. Often, the sequence of images is such that the entire image is being deformed in a consistent manner. Such images are generated mostly by camera motion. For example, if the camera is translating or zooming, then the optical flow of an image in this sequence has global consistency. Assuming that the deformation between frames I and I′ is given by the affine transformation:

    x′ = a_1 x + a_2 y + b_1
    y′ = a_3 x + a_4 y + b_2,    (16)

the optical flow at every pixel is also related to the pixel coordinates.
    u = x′ − x = a′_1 x + a_2 y + b_1
    v = y′ − y = a_3 x + a′_4 y + b_2,    (17)

where a′_1 = a_1 − 1 and a′_4 = a_4 − 1. This equation gives a global model for optical flow, i.e. the (u, v) values over the entire image are related by this model. Given two images, the model parameters
Figure 6: Lucas-Kanade with pyramids algorithm.

a = (a′_1, a_2, b_1, a_3, a′_4, b_2)ᵀ can be recovered by finding the value of a which minimizes the error given by the brightness constancy equation. We define the error term over the image as:

    e = Σ_pixels (f_t + ∇fᵀ u)²,    (18)

where ∇f = [f_x, f_y]ᵀ and u = [u, v]ᵀ. Note that Eq. 17 can be rewritten in terms of the unknowns as follows:

    [ u ]   [ x  y  1  0  0  0 ]
    [ v ] = [ 0  0  0  x  y  1 ] a,   i.e.  u = X a    (19)

Substituting u in Eq. 18, we get

    e = Σ_pixels (f_t + ∇fᵀ X a)².    (20)

This equation represents the combined deviation of the whole image from the brightness constancy equation when an affine deformation a is assumed. The optimal value of a is the one that minimizes e, which can be obtained by solving ∂e/∂a = 0. This gives us the following equation:

    ( Σ_pixels Xᵀ ∇f ∇fᵀ X ) a = − Σ_pixels Xᵀ ∇f f_t    (21)
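As a sketch, the system of Eq. 21 can be assembled and solved directly in NumPy. The per-pixel loop, the use of np.gradient for the spatial derivatives, and the simple frame difference for f_t are illustrative choices, not part of the formulation itself:

```python
import numpy as np

def affine_flow_params(f1, f2):
    """Solve Eq. 21 for a = (a1', a2, b1, a3, a4', b2) between frames f1, f2."""
    fy, fx = np.gradient(f1)      # axis 0 is y, axis 1 is x
    ft = f2 - f1                  # simple temporal derivative (illustrative)
    h, w = f1.shape
    M = np.zeros((6, 6))          # sum of X^T grad grad^T X
    r = np.zeros(6)               # -sum of X^T grad f_t
    for y in range(h):
        for x in range(w):
            X = np.array([[x, y, 1, 0, 0, 0],
                          [0, 0, 0, x, y, 1]], dtype=float)
            g = np.array([fx[y, x], fy[y, x]])   # spatial gradient at (x, y)
            M += X.T @ np.outer(g, g) @ X
            r -= X.T @ g * ft[y, x]
    return np.linalg.solve(M, r)  # invert the 6x6 system

# Identical frames contain no motion, so all six parameters come out zero.
rng = np.random.default_rng(1)
f1 = rng.standard_normal((12, 12))
print(affine_flow_params(f1, f1))  # six zeros
```

In practice the loop would be vectorized and, as discussed next, the estimation embedded in a coarse-to-fine pyramid so that large motions can be handled.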
Figure 7: Iterative method for computing global flow. W denotes the warping module, M denotes the global motion estimation module (Eq. 21) and + denotes the process of combining two transformations. In practice, several iterations may be done at each level; these are not shown here for the sake of simplicity.

The term Σ_pixels Xᵀ ∇f ∇fᵀ X is a 6x6 matrix, which can be inverted to solve this linear system for the unknown parameters a. In practice this process is also done using pyramids. The 2x2 derivative masks cannot capture large motion, and therefore the process is done iteratively at multiple resolutions. We are given two images I_1 and I_2. At each level l of the pyramid, the transformation from the previous level, a^{l−1}, is used as the initial estimate. Image I_1 at this level is warped using this transformation.¹ The remaining transformation δa^l between the warped image I′_1 and I_2 is recovered using Eq. 21. The final transformation at this level is then the product of the homogeneous transformation matrices of a^{l−1} and δa^l. This process is illustrated in Figure 7. The initial estimate at the highest level is taken to be the identity transformation.

¹ Note that for warping, 1 will have to be added to a′_1 and a′_4.
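The "+" module of Figure 7, which combines the previous estimate with the per-level correction, amounts to multiplying the two homogeneous transformation matrices. A small sketch, with the parameter ordering of Eq. 19 (the helper names here are made up):

```python
import numpy as np

def to_matrix(a):
    """Parameters (a1', a2, b1, a3, a4', b2) -> 3x3 homogeneous matrix.
    Adding 1 to a1' and a4' recovers the full affine map (see footnote)."""
    a1p, a2, b1, a3, a4p, b2 = a
    return np.array([[a1p + 1.0, a2,        b1],
                     [a3,        a4p + 1.0, b2],
                     [0.0,       0.0,       1.0]])

def from_matrix(M):
    """Inverse of to_matrix: subtract 1 from the two diagonal terms."""
    return np.array([M[0, 0] - 1.0, M[0, 1], M[0, 2],
                     M[1, 0], M[1, 1] - 1.0, M[1, 2]])

def combine(a_prev, delta_a):
    """Apply a_prev first, then the correction delta_a estimated on the
    warped image: the matrix product of the two homogeneous transforms."""
    return from_matrix(to_matrix(delta_a) @ to_matrix(a_prev))

# Combining with a zero correction leaves the estimate unchanged.
a = np.array([0.05, 0.01, 2.0, -0.01, 0.05, 1.0])
print(combine(a, np.zeros(6)))  # same parameters back (up to rounding)
```

Representing each transformation as a 3x3 matrix makes the combination step a plain matrix product, which is why the parameters are converted back and forth rather than added component-wise.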