Towards Direct Recovery of Shape and Motion Parameters from Image Sequences


Stephen Benoit and Frank P. Ferrie

McGill University, Center for Intelligent Machines, 3480 University St., Montréal, Québec, CANADA H3A 2A7

Abstract

A novel procedure is presented to construct image-domain filters (receptive fields) that directly recover local motion and shape parameters. These receptive fields are derived by training on the image deformations that best discriminate between different shape and motion parameters. Beginning with the construction of 1-D receptive fields that detect local surface shape and motion parameters within cross sections, we show how the recovered model parameters are sufficient to produce local estimates of optical flow, focus of expansion, and time to collision. The theory is supported by a series of experiments on well-known image sequences for which ground truth is available. Comparisons against published results are quite competitive, which we believe to be significant given the local, feed-forward nature of the resulting algorithms.

Key words: structure from motion, template matching, optical flow, focus of expansion, time to collision

1 Introduction

Recovery of structure from motion has been examined from a variety of approaches, mainly feature point extraction and correspondence [4,8] or computation of dense optical flow [5,9,]. Typically, the Fundamental Matrix framework or a global motion model is used to solve for global motion, after which the relative 3-D positions of points of interest in the scene can be computed [,9].

Email address: {benoits, ferrie}@cim.mcgill.ca (Stephen Benoit and Frank P. Ferrie).

Preprint submitted to Elsevier Science, 7 September 2005

2 Direct recovery of parameters by recognising characteristic motions in a scene has been examined using point correspondences[] and local optical flow field deformations[9,3,4]. However, both methods require the extraction of motion information from the image sequence before parameter recovery. Many of these techniques require local regularization; insufficient information is available in each sample to reconstruct the surface shape[8]. Appearance-based methods have been mostly discarded for structure from motion because much of the shape and motion information are so confounded that they cannot be recovered separately or locally[7]. Soatto proved that perspective is non-linear, therefore no coordinate system will linearize perspective effects[3]. Part of the difficulty involves how regional image changes can be represented in an appearance-based framework. Most approaches involve solving a local flow field as a weighted sum of basis flow fields with some perceptual significance[,] or tracking specific image features[7] with wavelets[], typically training on image sequences of motions of interest[3]. Converting these representations into image domain operations[6,5] could allow direct recovery of significant model parameters without solving a local optimization problem by gradient descent. Two causes prevent the effective direct recovery of local structure from motion with appearance-based methods. First is representation: how to encode image deformation independently of image texture and without the need of feature points. Second is how to map the image deformations into some optimal image operators. This paper addresses both issues by describing a representation and a methodology for designing and using these optimal image domain operators for appearance-based local recovery of some shape and motion parameters. In particular, we show how the resulting feed-forward system leads to an efficient and robust algorithm for estimating dense fields of optical flow and time to collision, as well as accurate localization of the focus of expansion. The optimal operator synthesis will discover what spatial scales most appropriately describe the different deformations for the camera model. Instead of attempting to recover the shape and motion parameters of a scene over the entire parameter space, only those shapes and motions that cause distinctly different image deformations are considered. Even with aliasing of some parameters into similar appearances, we show that enough information can be extracted from the image deformations to recognise specific shapes or motions which are useful for the aforementioned tasks. We begin in Section by considering a simplified structure and motion model that relates cross-sections of a 3-D surface S to its -D image projections. The

reduction in dimensionality makes the analysis tractable, while application of the model in multiple orientations relates to 3-D via the geometry of normal sections [8]. Next, in Section 3, we present a solution to the inverse problem of recovering model parameters from image observations. Correspondence matrix theory is introduced as a general framework for casting model recovery as an appearance matching problem. The observability of shape and motion parameters is discussed next in Section 4. As might be expected, the confounding of shape and motion limits what can be recovered, but considerable information is still available locally. Section 5 looks at the specifics of what can be recovered, and shows how estimates of optical flow, focus of expansion, and time to collision are available directly from these parameters along with confidence measures. The theory is put to the test in Section 6, which evaluates the resulting algorithms in the recovery of the predicted scene features. Interestingly, despite their local feed-forward nature, very competitive results are obtained relative to the published literature. The paper concludes in Section 7 with a discussion of the results and pointers to future research.

2 Image Formation Model for Local Shape and Motion

A convenient representation for a local region S about a surface point P ∈ S is the augmented Darboux frame (Figure 1) [8], consisting of a vector frame comprised of the surface normal N_P at P and two orthogonal vectors in the tangent plane T_P, namely M_P and M̄_P, corresponding to the directions of maximum and minimum curvature respectively [8]. Of particular interest to this discussion is the intersection of the plane π_{N_P}, containing the normal and the point P, with the surface S. The intersection is a curve in 3-D, referred to as a normal section, and the curvature of the section at the intersection point P is called the normal curvature, κ_N. For a smooth region (C^2) about P, κ_N takes on maximum, κ_M, and minimum, κ_M̄, values as π_{N_P} is rotated. The directions associated with these extremal values are the principal directions referred to above. Given this parameterization it is straightforward to define local motion and shape change in terms of a transformation of the frame at P.

With little loss of generality, the complexity of this model can be reduced significantly by considering projections in 1-D as shown in Figure 1. We will use this approach to develop a local shape and motion model for a normal section that will subsequently be used to determine the components of motion and shape change in the corresponding image direction. This formulation requires an image tesselation of the form shown in Figure 2, with the model relating a 1-D patch (x_a, y_a, θ_i, t_1) to its correspondent at (x_a, y_a, θ_i, t_2). An oriented slit in this figure corresponds to the white rectangular region within image I_1 shown in Figure .

Fig. 1. Differential geometry of a local surface patch and the projection of a normal section.

Fig. 2. Oriented slits at image coordinates (x_a, y_a) at multiple orientations θ_i cover the image plane at instants t_1 and t_2.

The image formation process for each 1-D image slit also requires a camera model, and the one chosen for the experiments in this paper, shown in Figure 3, uses perspective projection. As shown, the fixation point O is the point on the surface in the center of the aperture's field of view, i.e. view angle γ = 0. γ ranges in value from −α/2 to +α/2, with positive γ to the right (positive x direction). The image plane is made up of N equally-sized, non-overlapping pixels that fit within the aperture. In a physical camera, the image plane is actually behind the focal point with an image inversion, but placing it in front of the focal point simplifies the mathematics without loss of generality. The actual distance between the focal point and the center of the image plane, h, or, proportionally, the actual dimensions of each pixel, are arbitrary; these absolute scales will be factored out and eliminated. For any view angle γ, formed between the V P O and V P P lines and lying within −α/2 and +α/2, there is a corresponding value for the image plane coordinate x that can be projected into one of the pixel coordinates between 0 and N − 1,

i = \frac{N-1}{2} + \frac{N-1}{2} \, \frac{\tan(\gamma)}{\tan(\alpha/2)}.  (1)
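To make the camera model concrete, here is a minimal Python/NumPy sketch of the pixel/angle mapping of Equation (1) together with its algebraic inverse (introduced next as Equation (2)). The function names and the 0-based pixel index running from 0 to N − 1 are my assumptions, not notation from the paper.

    import numpy as np

    def angle_to_pixel(gamma, N, alpha):
        """Project a view angle gamma (radians) onto a pixel index in [0, N-1].

        Implements i = (N-1)/2 + (N-1)/2 * tan(gamma)/tan(alpha/2) for a 1-D
        perspective camera with aperture angle alpha and N pixels.
        """
        half = (N - 1) / 2.0
        return half + half * np.tan(gamma) / np.tan(alpha / 2.0)

    def pixel_to_angle(i, N, alpha):
        """Map a pixel index i back to the corresponding view angle gamma."""
        half = (N - 1) / 2.0
        return np.arctan((i - half) / half * np.tan(alpha / 2.0))

    if __name__ == "__main__":
        N, alpha = 64, np.deg2rad(9.0)          # example slit length and aperture
        gammas = np.deg2rad(np.array([-4.0, 0.0, 4.0]))
        idx = angle_to_pixel(gammas, N, alpha)
        print(idx)                               # edges map near 0 and N-1, centre to (N-1)/2
        print(np.rad2deg(pixel_to_angle(idx, N, alpha)))  # recovers the original angles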

Fig. 3. The 1-D camera model.

And, conversely, the image coordinate i can be mapped back to the view angle γ,

γ = \tan^{-1}\left[ \frac{2i - N + 1}{N - 1} \, \tan(\alpha/2) \right].  (2)

With an image formation model now in place, we can examine the parametric relationship between two corresponding 1-D slits. Exploiting the geometry of normal sections (Figure 1), a simple yet expressive shape and motion model can be devised using only 5 parameters, m = (Ω, δ, η, β, k). The model is shown graphically in Figure 4.

Referring to the diagram in Figure 4, the surface shape along one direction can be characterized by a curvature K, a normal vector N and a distance from the viewpoint, d. Distance d scales all lengths of the diagram, so it is factored out to a canonical representation with unit distance between the first viewpoint V P_1 and the fixation point O on the surface. The surface normal vector N at O is encoded by the angle η with respect to the first view axis V P_1 O. The curvature of the canonical surface, the reciprocal of the radius of the circular approximation to the surface, becomes k = Kd (all lengths, including the radius of curvature R, are scaled by d). The motion model is chosen to minimize image deformation due to translation, defining the second viewpoint V P_2 at a given distance δ, at an angle β from the first view axis V P_1 O. V P_2 is fixated on point Q on the surface, a view rotation Ω away from the first fixation point.

Collectively, the camera and shape and motion models are sufficient to describe the forward mapping of a 3-D contour, parameterized by orientation, η, and curvature, k, onto a 1-D image slit, I_1, and then onto a corresponding image slit, I_2, via translation, Ω, change in distance to viewer, δ, and change in viewpoint direction, β. The role of each model parameter is summarized below in Table 1. This 1-D cross-section shape and motion model can be aligned with multiple

Fig. 4. What does a 1-D slit see of a 3-D world? The 2-D curve model shows the image correspondence of a surface point P imaged in the first view at angle γ_1 and in the second view at angle γ_2 relative to their respective fixation directions.

Ω (View direction translation angle): −0.4, translate left 3/64 pixels; 0, no change; +0.4, translate right 3/64 pixels.

δ (Relative distance to surface): 0.75, zoom in 25%; 1.0, no change; 1.25, zoom out 25%.

η (Surface normal angle): −45°, normal pointing 45° left of V P_1; 0°, normal pointing toward V P_1; +45°, normal pointing 45° right of V P_1.

β (Viewpoint change angle): negative, V P_2 moves to the left of V P_1; 0, V P_2 stays in line with V P_1; positive, V P_2 moves to the right of V P_1.

k (Relative surface curvature): −4, concave surface; 0, flat surface; +4, convex surface.

Table 1. Parameters of the structure from motion model. N and α are known constants from the camera model.

7 orientations at each image position to reconstruct 3-D surface shape and motion. In theory, this model is sufficiently general to locally describe the shape and motion of a surface. Already, by examining only local shape and motion, the local image deformations (and structure from motion) can be expressed with a model of very few parameters. One concern with the use of -D image slits is the inevitability that some image motions will cause the texture seen by a slit at one instant to move perpendicularly to the slit s orientation, effectively leaving the slit with no meaningful correspondence. Such slits are thus observing stimuli that violate the -D shape and motion model, and recovered parameters will have low likelihoods compared to -D slits aligned with the image motion. More conveniently, with multiple orientations of slits at every location in the image, there will be at least one image slit that is aligned with the dominant image motion and satisfies the -D shape and motion model. An alternative would have been to construct a more complicated model to describe the motion of a surface with two views of -D image data. The guiding principle in this work has been to introduce only as much complexity as necessary to recover structure from motion. In this case, the more complicated 3-D shape and motion has been decoupled into a small set of oriented -D detectors. As will be shown in the experimental results of Section 6, the oriented -D slits produce useful structural data from the scenes. Now that an image formation model has been chosen and parameterized, there remains the issue of recognising the image deformations from moving image samples. To this end, the next section introduces the correspondence matrix representation of image deformation and the recognition theory it provides. 3 Correspondence Matrix Theory Given a pair of images I and I at times t and t respectively, the task is now to recover model parameters m = (Ω, δ, η, β, k). Towards this end we propose an appearance-based strategy. In this section we describe a theory for generating a set of image domain operator pairs, with each pair corresponding to a particular instance of m, and responding maximally when I and I correspond to the transformation associated with m. The inverse problem is formulated as template matching. As we shall demonstrate later in Section 4, the complexity of the associated search problem is made tractable by the properties of the parameter space. Indeed, we show that of the 5 parameters, only Ω and δ can be recovered unambiguously. Fortunately this turns out to be sufficient for recovering useful properties of shape and motion. 7
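Before that machinery is developed, it may help to see the shape of the search problem. The sketch below, a schematic only, enumerates a discretized set of model hypotheses m_i (the ranges echo Table 1; the number of levels, the β range, and the score function are placeholder assumptions) and keeps the hypothesis whose operators respond maximally, which is exactly the template-matching formulation described above.

    import itertools
    import numpy as np

    def hypothesis_grid():
        """Discretize the model space m = (Omega, delta, eta, beta, k)."""
        omegas = np.linspace(-0.4, 0.4, 5)        # view direction translation angle
        deltas = np.linspace(0.75, 1.25, 5)       # relative distance to surface
        etas   = np.linspace(-45.0, 45.0, 5)      # surface normal angle
        betas  = np.linspace(-10.0, 10.0, 5)      # viewpoint change angle (range assumed)
        ks     = np.linspace(-4.0, 4.0, 5)        # relative surface curvature
        return list(itertools.product(omegas, deltas, etas, betas, ks))

    def best_hypothesis(I1, I2, hypotheses, score):
        """Template matching: keep the hypothesis whose operators respond maximally.

        `score(m, I1, I2)` is a stand-in for the likelihood developed in Section 3.
        """
        scores = [score(m, I1, I2) for m in hypotheses]
        return hypotheses[int(np.argmax(scores))], max(scores)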

Putting aside the issue of how we sample the parameter space for the moment, to each set of parameters m_i we associate a transformation H_i that accounts for the corresponding image appearance change. Given that I_1 and I_2 are vectors, this can be conveniently represented in matrix form. As will be seen shortly, H_i can be determined procedurally given I_1 and I_2, which are in turn synthesized using the forward and camera models described in the previous section. Thus, for each instance of m we determine a corresponding image pair (I_1, I_2)_i from which H_i is determined. Correspondence Matrix Theory [5,4] specifies both how to compute H_i as well as how to synthesize a pair of operators, U_i and V_i, that respond maximally when presented with a pair of images undergoing the same transformation.

3.1 Building the Correspondence Matrix

As described in [5,4], a correspondence matrix H encodes the correspondence between cells of two spaces due to a forward mapping M and back through the inverse mapping M^{-1},

M : x → y,  x ∈ X, y ∈ Y,
M^{-1} : y → x,  x ∈ X, y ∈ Y.  (3)

The mapping function M may be continuous or even multi-valued, but when the two spaces X and Y are discretized (say, into M and N cells or pixels, respectively), the correspondence matrix H is of finite size M × N.

Fig. 5. General correspondence between regions from two spaces, x ∈ X and y ∈ Y.

As illustrated by the Venn diagram in Figure 5, the probability of correspondence between element x = [X]_i, i ∈ {1,..., M}, in the first image and element y = [Y]_j, j ∈ {1,..., N}, in the second image is related to how well the forward mapping M and the inverse mapping M^{-1} of the two elements coincide,

H_{ij} \propto P\left( [X]_i \leftrightarrow_M [Y]_j \right)
       = \frac{ A\left( M([X]_i) \cap [Y]_j \right) + A\left( [X]_i \cap M^{-1}([Y]_j) \right) }{ A([X]_i) + A([Y]_j) } \in [0, 1],  (4)

where A(·) denotes a measure of area or volume of a neighborhood, i.e. the size of a cell. The general procedure for converting some mapping M into a correspondence matrix is summarized in Table 2.

(1) Discretize the first image space X into M elements.
(2) Discretize the second image space Y into N elements.
(3) Initialize H_{M×N} = H^M = H^{M^{-1}} = 0.
(4) For each i ∈ {1,..., M}: using a uniform distribution of points in [X]_i, compute the distribution of points [Y]_j at or near M([X]_i), and assign H^M_{ij} = A( M([X]_i) ∩ [Y]_j ).
(5) For each j ∈ {1,..., N}: using a uniform distribution of points in [Y]_j, compute the distribution of points [X]_i at or near M^{-1}([Y]_j), and assign H^{M^{-1}}_{ij} = A( [X]_i ∩ M^{-1}([Y]_j) ).
(6) Assign H_{ij} = \frac{ A([Y]_j) H^M_{ij} + A([X]_i) H^{M^{-1}}_{ij} }{ A([X]_i) + A([Y]_j) }.

Table 2. Procedure for synthesizing a correspondence matrix H for a mapping M.

There is now a clear procedure for building the forward mapping, synthesizing the correspondence matrix H to represent the appearance change between two images described by a model vector m. The parameter recovery to be described next will identify the unknown H given two image vectors, and hence the model parameters m.

3.2 Model Parameter Recovery

The image structure from two images separated in time can be encoded by a correspondence matrix H. This section will show that the Singular Value Decomposition of H leads to a unique set of image domain operators that can code for specific scene events as encoded by H.
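A minimal sketch of the Table 2 procedure for a 1-D mapping, assuming unit-area cells on both sides (so the area-weighted combination in step (6) reduces to a simple average) and estimating the overlap measures A(.) by Monte Carlo point sampling. The demo mapping (a scale change about the slit centre) and the function names are illustrative assumptions rather than the authors' implementation.

    import numpy as np

    def correspondence_matrix(fwd, inv, M, N, samples=200, rng=None):
        """Synthesize H (M x N) for a mapping y = fwd(x), x = inv(y) on [0, 1).

        Follows the procedure of Table 2 with unit-area cells, so step (6)
        becomes the average of the forward and inverse overlap estimates.
        """
        rng = np.random.default_rng(rng)
        H_fwd = np.zeros((M, N))
        H_inv = np.zeros((M, N))
        for i in range(M):                       # forward pass: push cell [X]_i through M
            x = (i + rng.random(samples)) / M    # uniform points inside cell i
            j = np.floor(fwd(x) * N).astype(int)
            ok = (j >= 0) & (j < N)              # points mapped outside Y have no correspondent
            np.add.at(H_fwd[i], j[ok], 1.0 / samples)
        for j in range(N):                       # inverse pass: pull cell [Y]_j back through M^-1
            y = (j + rng.random(samples)) / N
            i = np.floor(inv(y) * M).astype(int)
            ok = (i >= 0) & (i < M)
            np.add.at(H_inv[:, j], i[ok], 1.0 / samples)
        return 0.5 * (H_fwd + H_inv)

    if __name__ == "__main__":
        delta = 0.8                                        # e.g. a zoom about the slit centre
        fwd = lambda x: 0.5 + (x - 0.5) / delta
        inv = lambda y: 0.5 + (y - 0.5) * delta
        H = correspondence_matrix(fwd, inv, M=32, N=32)
        print(H.shape, H.max())                            # a sharp band of slope ~1/delta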

The correspondence matrix H relates the elements of a first image I_1 and a second image I_2. The images are first normalized as Ĩ_1, Ĩ_2 to zero mean intensity and unit contrast by finding the images' brightness µ_I and contrast ∆_I:

\mu_I = \frac{1}{2N} \left( \sum_i I_{1,i} + \sum_i I_{2,i} \right), \qquad
\Delta_I = \max_i \left( |I_{1,i} - \mu_I|, \; |I_{2,i} - \mu_I| \right),
\tilde{I}_1 = \frac{I_1 - \mu_I}{\Delta_I}, \qquad
\tilde{I}_2 = \frac{I_2 - \mu_I}{\Delta_I}.  (5)

3.2.1 Image Mapping H

Image correspondence between the normalized images (Ĩ_1, Ĩ_2) is determined by H, but not all of the information contained in Ĩ_1 is necessarily present in Ĩ_2, nor vice versa. This means, for example, that some elements or pixels of Ĩ_2 do not come from Ĩ_1 but come instead from some other source. The image correspondence constraints take the form of a weighted sum of elements in the complementary image:

\left( \sum_j H_{ij} \right) \tilde{I}_{1,i} = \sum_j H_{ij} \tilde{I}_{2,j}, \quad i ∈ {1,..., N},
\left( \sum_i H_{ij} \right) \tilde{I}_{2,j} = \sum_i H_{ij} \tilde{I}_{1,i}, \quad j ∈ {1,..., N}.  (6)

The image correspondence equations can be re-written:

S \tilde{I}_1 = H \tilde{I}_2, \qquad S' \tilde{I}_2 = H^T \tilde{I}_1,  (7)

where

S = \mathrm{diag}\left( \sum_j H_{1j}, \; \sum_j H_{2j}, \; \ldots, \; \sum_j H_{Nj} \right)

11 and S = i H i i H i... i H in. (8) Equation Set 7 deals automatically with non-shared information: note that elements in one space that have no correspondent in the other space will have a zero row sum of H (element of S) or zero column sum of H (element of S ). 3.. Operator Synthesis Given an image pair (Ĩ, Ĩ ), the goal is to find the H that best describes their deformation and identify that H s model parameters. In order to recognise the image events for a given correspondence matrix, there needs to be a transport function to map between the two views. The image pair (Ĩ, Ĩ ) can be represented as two different transforms of a common texture vector w. In order for the elements of texture vector w to be independent and linear, the image synthesis functions must be: Ĩ N A N N w, Ĩ B N N N N w. (9) N where A is orthonormal and B is also orthonormal. A and B are not expected to be exact solutions, because the image formation process does not lend itself to a direct linear mapping between the image pair. Image domain deformation caused by perspective projections of 3-D objects, for example, confounds the surface texture signal and the geometric deformation into one image signal, and the two components are generally not separable. A and B are, however, the best linear approximation to the geometric deformation image correspondence in the sense of minimizing some error. The texture information shared by the two images via H must exist in the space spanned by H, and thus can be expressed compactly in r independent coefficients, where r = rank(h). The first r coefficients of w encode the stationary signal shared between Ĩ and Ĩ that are expressed through H. The remaining N r coefficients encode the non-shared image signals mapped into the null space of H. Because A and B are chosen to be orthonormal and square, they are invertible (by transposition). This means that the texture vector w can be recovered from

12 images (Ĩ, Ĩ ) knowing the correspondence matrix H, A T Aŵ = A T Ĩ ŵ = A T Ĩ, B T Bŵ = B T Ĩ ŵ = B T Ĩ. () 3..3 Solution via Singular Value Decomposition Substituting Equation set 9 into Equation set 7, SA w HB w w, S B w H T A w w. () The optimization problem to solve for A and B is thus: argmin SA w HB w, argmin S B w H T A w. () A,B A,B Because the matrices A and B are expected to work independently of the surface texture encoded in w, the optimization problem can be expressed as argmin SA HB, argmin S B H T A. (3) A,B A,B The minimization terms can be recombined as argmin SAB T H, argmin S BA T H T. (4) A,B A,B This can be visualized more clearly by substituting H with its Singular Value Decomposition, H = U N N Σ N N N N VT N N, (5) where U is the matrix whose columns are the left-hand eigenvectors of H and the columns of V are the right-hand eigenvectors. That is, the columns of U are the eigenvectors of HH T, and the columns of V are the eigenvectors of H T H, both sorted in decreasing order of eigenvalues. The singular values of H are the elements of the diagonal matrix Σ, which are the square root of the eigenvalues from HH T or H T H. Substituting the SVD, one must solve for: argmin SAB T UΣV T, argmin AB T S UΣV T. (6) A,B A,B

13 The remainder of this proof builds A and B one column at a time, using successively better approximations of H. The k th order approximation, H k uses the first k columns of U and V and the first k diagonal elements of Σ. To solve for unknown orthogonal matrices A and B, consider the st order approximation of H, H = σ u v T, SA B T σ u v T, A B T S σ u v T. (7) The only choice for A and B that spans the same space as u and v, satisfying both constraining equations is A = [ u,,..., ], B = [ v,,..., ]. (8) An inductive proof introduces higher k th order approximations H k to solve for the corresponding A k and B k. k SA k B T k σ i u i v i T, i= k SA k B T k σ i u i v i T, i= S(A k B T k A k B T k ) σ k u k v k T. (9) Similarly, (A k B T k A k B T k )S σ k u k v T k. () This implies that A k is A k plus an orthogonal component in the direction of u k and that B k is B k plus an orthogonal component in the direction of v k. Therefore, the k th column of A is the k th column of U and the k th column of B is the k th column of V. Therefore, the least squares error solution for Equation Set, satisfying full rank and orthonormality of A and B is to substitute A = U, B = V. () Informally, the SVD expresses the correspondence matrix H as a linear remapping U T from Ĩ to the same space as a linear remapping VT from Ĩ. The SVD of H optimizes the spectral representation of the transform between a 3

pair of data sets (images), maximizing the independence between channels (elements of w), ranked according to significance in the sense of minimizing the least squares error.

3.2.4 Distance to Data Metric

The maximum likelihood hypothesis H for an image pair (Ĩ_1, Ĩ_2) minimizes the residual error r between the input images and their reconstructions predicted by H. Let the feature vector ŵ represent the image vectors Ĩ_1 and Ĩ_2 as follows:

\hat{w} = \left[ \frac{U_k^T}{\sqrt{2}} \;\; \frac{V_k^T}{\sqrt{2}} \right]
          \left[ \begin{array}{c} \tilde{I}_1 \\ \tilde{I}_2 \end{array} \right],  (22)

where U_k and V_k are the k-th order approximations of U and V respectively, i.e. the first k columns of U and V sorted by singular value. The feature vector ŵ is now the best parameterization of the image pair assuming deformation H. The residual error can be computed by projecting the feature vector back into the image space. If the assumed deformation H is sufficiently close to the scene geometry, then the residual signal error r, the difference between the original image signal and the reconstructed image signal, will be low,

r = \left\| \left[ \begin{array}{c} \tilde{I}_1 \\ \tilde{I}_2 \end{array} \right]
    - \left[ \begin{array}{c} U_k/\sqrt{2} \\ V_k/\sqrt{2} \end{array} \right] \hat{w} \right\|.  (23)

The likelihood of correspondence H given the evidence (Ĩ_1, Ĩ_2) can be expressed as a function L(H | Ĩ_1, Ĩ_2),

L(H \mid \tilde{I}_1, \tilde{I}_2) \propto e^{-r} \in (0, 1].  (24)

The uncertainty of the maximum likelihood choice can be expressed as the entropy h of the likelihoods of all the different hypotheses,

h = -\frac{ \sum_{i=1}^{n} L(H_i \mid \tilde{I}_1, \tilde{I}_2)
            \log\left( L(H_i \mid \tilde{I}_1, \tilde{I}_2) \right) }{ \log(n) } \in (0, 1].  (25)
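The following sketch pulls the operator synthesis and the distance-to-data metric together: for each candidate correspondence matrix H_i it derives U_k and V_k by SVD, projects the normalized image pair onto them, measures the reconstruction residual, and converts the residuals into likelihoods and a normalized entropy. It is a minimal NumPy illustration of Equations (22)-(25) as reconstructed above; the function names, the choice of k, and the normalization of the likelihoods before computing the entropy are assumptions.

    import numpy as np

    def operators(H, k):
        """First k left/right singular vectors of H (the U_k, V_k operators)."""
        U, s, Vt = np.linalg.svd(H)
        return U[:, :k], Vt[:k, :].T

    def residual(H, I1n, I2n, k=8):
        """Reconstruction residual r of the normalized pair under hypothesis H."""
        Uk, Vk = operators(H, k)
        x = np.concatenate([I1n, I2n])                      # stacked image pair
        Q = np.vstack([Uk, Vk]) / np.sqrt(2.0)              # orthonormal columns
        w_hat = Q.T @ x                                     # feature vector (eq. 22)
        return np.linalg.norm(x - Q @ w_hat)                # eq. 23

    def likelihoods_and_entropy(H_bank, I1n, I2n, k=8):
        """Score every hypothesis H_i and report the normalized entropy h."""
        r = np.array([residual(H, I1n, I2n, k) for H in H_bank])
        L = np.exp(-r)                                      # eq. 24
        p = L / L.sum()                                     # normalized for the entropy
        h = -(p * np.log(p + 1e-12)).sum() / np.log(len(H_bank))   # eq. 25
        return L, h

The maximum likelihood hypothesis is then the argmax of L, and (1 − h) serves as its confidence, as used later in Section 5.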

15 To summarize the procedure for parametric recovery, Preparation (off-line) () Discretize parameter space M into n representative parameter vectors m i, i {,..., n}. () Synthesize a correspondence matrix H i to represent the coordinate mapping M( m i ). (3) Synthesize detectors for H i, using SVD. Usage (on-line) () Using the detectors for H i, given the observations Ĩ and Ĩ, compute L ( H i Ĩ, Ĩ ). () Find the maximum likelihood H i Ĩ, Ĩ, and set the maximum likelihood m = m i. (3) Use the distribution of L ( H i Ĩ, Ĩ ) to find the entropy h of the maximum likelihood choice. 4 Observability of Model Parameters Recovery of structure from motion must address the issue of the confounding between surface shape and motion: some parameters describing the scene are not independently observable. For example, different surface orientations and curvatures will look the same if the viewer moves in a certain direction. Also, the interpretation of some motion components, such as movement toward the scene (i.e. along the optical axis) relies on the proper compensation for lateral motion. This situation arises whenever a new model for scene analysis is proposed. The model describes how the scene appearance changes with respect to the various model parameters. To solve the inverse problem, that is, to recover the model parameters after some scene change, one must identify and characterize the relationship between appearance changes and scene model parameters that cause them. In control systems theory, the terms observability and controllability are used to distinguish between the ability to retrieve a model parameter through observations versus the ability to set a model parameter to a desired value or state. For our situation, all parameters of the scene model are controllable during a training process. In contrast, many parameter changes that produce visible appearance changes are not independently observable. Using the correspondence matrix representation of appearance change, one can automatically identify those scene parameters that are directly recoverable, 5

16 those which must be determined jointly, and those that are not recoverable at all from local observations. Szeliski and Kang used first principles to compute what scene observables could be mapped back to scene model parameters[7]. The method used here is much simpler thanks to the correspondence matrix representation of local shape and motion. To begin, we note that photometric changes, such as an overall change in luminance or light source direction, are largely normalized out by virtue of (5). Edges of shadows might not be removed, and may be interpreted as moving objects in their own right. How distinctive are the appearance changes due to each parameter? A correspondence matrix is a fingerprint of appearance change between two views of a scene. One can build a database of correspondence matrices by discretely sampling the parameter space and then measuring the relationship between changes in appearance and changes in parameters. Constructing a correspondence matrix is straightforward. The shape and motion model is controllable for any choice of parameters. Considering one parameter, say Ω, group the correspondence matrices formed by the same value of Ω together, varying all the other parameters. For example, discretizing the 5 parameters Ω, δ, η, k, β to 5 levels each, for a total of 5 5 = 35 correspondence matrices yields 5 classes of 5 4 = 65 correspondence matrices. Other parameters used in the image formation process can be held constant, such as the aperture size N = 3 pixels. The distinctiveness of the appearance changes due to one changing parameter can now be expressed as the distinctiveness of the 5 classes of correspondence matrices. The naive method of measuring the class separability would be to build detectors for each of the 5 classes and actually test many image domain samples and tally the recognition rate achieved. This would only confuse the issue: the method would need to actually build detectors and test them to answer the more fundamental question of whether or not detectors could work. A much simpler and more compact method is to test the class separability of the correspondence matrices by performing Principal Component Analysis (PCA) on the 35 correspondence matrices to reduce the dimensionality of the problem. From here, Most Discriminating Features (MDF)[6], a technique that extends Linear Discriminant Analysis (LDA), will highlight the separability of the 5 classes of a given parameter. Conclusions about the observability of different shape and motion parameters can be drawn from the information imprinted in the correspondence matrices that represent the deformations, rather than indirectly from random testing. The MDF results for parameters Ω and δ are shown in Figure 6 and Figure 7. 6
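The class-separability test just described can be approximated with off-the-shelf tools: vectorize each training correspondence matrix, reduce dimensionality with PCA, and then apply a discriminant analysis with the class labels set to the discrete levels of the parameter under study. The sketch below uses plain LDA from scikit-learn as a stand-in for the MDF technique of [6], so it approximates the analysis rather than reimplementing it; array names and sizes are illustrative.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def separability(H_list, labels, n_pca=20, n_out=3):
        """Project vectorized correspondence matrices into a few discriminating
        dimensions for one parameter (e.g. the discrete levels of Omega).

        H_list : list of (N, N) correspondence matrices from the training sweep.
        labels : the discrete level of the studied parameter for each matrix.
        """
        X = np.array([H.ravel() for H in H_list])           # one row per matrix
        Xp = PCA(n_components=n_pca).fit_transform(X)       # dimensionality reduction
        lda = LinearDiscriminantAnalysis(n_components=n_out)
        return lda.fit_transform(Xp, labels)                # most discriminating axes

Well-separated clusters in the projected coordinates (as in Figures 6 and 7 for Ω and δ) indicate an observable parameter, while overlapping clusters (as in Figure 10 for k, η and β) indicate confounding.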

17 principle components (after MDF on eigenvectors) for o principle components (after MDF on eigenvectors) for o (a) (b) Fig. 6. MDF-enhanced principal components tuned to Ω variations using all 35 training samples and 5 discrete levels of Ω. The first dimensions are shown in (a), the first 3 dimensions in (b)..8.6 principle components (after MDF on eigenvectors) for d principle components (after MDF on eigenvectors) for d (a) (b) Fig. 7. MDF-enhanced principal components tuned to δ variations using all 35 training samples and 5 discrete levels of δ. The first dimensions are shown in (a), the first 3 dimensions in (b). The results obtained show that only 3 dimensions are required to perform a good quality recovery of Ω and δ parameters from within the training set using correspondence maps alone. To test the usefulness of the 3 principal correspondence maps against values outside of the training set, a new set of correspondence maps was generated with denser sampling of Ω. The interpolation between training values is almost linear, as shown by the almost straight black curves in Figure 8. The interpolation between the training values of δ are a bit noisier, as shown in Figure 9. This illustrates that the image deformation appearances of different values of δ are confounded by different shapes (η, k) and viewpoint differences (β). The confounding can be alleviated by recovering the joint distribution of (Ω, δ). That is, the different appearances due to δ can be recovered if Ω is already known. For the remainder of this paper, Ω and δ will be found jointly; detectors will be built to recognise deformations due to combinations of both 7

18 principle components (after MDF on eigenvectors) for o principle components (after MDF on eigenvectors) for o (a) (b) Fig. 8. Interpolation between training values of Ω. In (a) the first dimensions and in (b) the first 3 dimensions, the dark curve traces the intermediate coordinates of the intermediate values of Ω. parameters..5 3 principle components (after MDF on eigenvectors) for d (a) (b) Fig. 9. Interpolation between training values of δ. The first 3 dimensions are shown in (a) where the dark curve traces the intermediate coordinates of the intermediate values of δ. The curve is reproduced in (b) without the training data clusters for clarity. The results on the other model parameters k, η, β were not as promising. Their discrete values did not cluster well, as shown in Figure, and will need to be recovered through inference of neighborhood relations or other information. These observations reveal that the motion parameters Ω and δ are not separable, but can be found jointly; together they produce distinct motion fields. These can be considered first order structure from motion parameters. In contrast, the view angle change β and the shape parameters k and η are observed to be confounded. Even in apertures of N = 3 pixels, there is very little difference in the motion field to distinguish different shapes from a single aperture. These 3 parameters require some form of post-processing, that would combine information from neighborhoods of support, or require a global fit. Rather than pursue such a strategy, we instead consider what can be recovered from 8

19 3 principle components (after MDF on eigenvectors) for k 3 principle components (after MDF on eigenvectors) for e 3 principle components (after MDF on eigenvectors) for b (a) (b) (c) Fig.. MDF-enhanced principal components tuned to (a) k variations, (b) η variations and (c) β variations, using all 35 training samples and 5 discrete levels of each parameter. The first 3 dimensions are shown. Ω and δ. 5 Recoverable Scene Information For image pairs of interest, the views will overlap in the camera aperture of angle α (shown in Figure 3), restricting the useful range of Ω to about ( α/, +α/), and typically, α <. With these angles so small, Ω will be linearly proportional to the translation along the image slit axis. The δ parameter denotes image scale change, defined as the ratio of the distance to the surface seen from V P versus from V P. Given that only of the 5 shape and motion parameters are recoverable locally, the question naturally arises: What, if anything, do these two parameters tell us about the scene? As it turns out, these two parameters provide very useful information that can be extracted as optical flow and estimates of time to collision (TTC). Calculating the optical flow and time to collision requires recovering the δ and Ω parameters. Figure illustrates some of the unique H for specific instances of model parameters (Ω, δ, k, η, β). The mean correspondence matrices H Ω,δ of Figure are synthesized by holding Ω and δ fixed and computing the mean of all the correspondence matrices formed for those specific Ω and δ and all combinations of the other discrete parameters α, k, η, β, {α},{β},{η},{k} H Ω,δ,α,β,η,k H Ω,δ. (6) {α},{β},{η},{k} 9
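A small sketch of how the mean correspondence matrices indexed by (Ω, δ) can be accumulated from the training sweep, assuming the individual matrices are stored in a dictionary keyed by their generating parameters; that data layout is an assumption made for illustration. The reverse-time counterpart of any entry is simply its transpose, as discussed next.

    import numpy as np
    from collections import defaultdict

    def mean_correspondence_matrices(H_by_params):
        """Average H over the nuisance parameters (alpha, beta, eta, k).

        H_by_params maps (omega, delta, alpha, beta, eta, k) -> (N, N) matrix.
        Returns a dictionary mapping (omega, delta) -> mean matrix H_bar.
        """
        groups = defaultdict(list)
        for (omega, delta, *_rest), H in H_by_params.items():
            groups[(omega, delta)].append(H)
        return {key: np.mean(Hs, axis=0) for key, Hs in groups.items()}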

20 Shift left, zoom in Shift left, zoom out No motion Shift right, zoom out Ω =.8667 Ω =.8667 Ω =. Ω =.3333 δ =.9543 δ =.4 δ =. δ =.4 Fig.. Mean correspondence matrices for specific values of angle of translation Ω and scale change δ while varying other parameters of the model (k,η, and β) at N = 64 pixels and α = {4.,..., 5.}. The rows are the index into the first image I, and the columns are the index into the second image I. Ω shifts the white curve horizontally, and δ affects its slope. The concept of a mean correspondence matrix H is a special case of encoding transparent motion. Transparent motion, or layered motion, is the perception of multiple simultaneous motions in one image location due to partial transparency of surfaces undergoing separate motions. Although the -D slit shape and motion model of Section deals only with an opaque surface and is subsequently encoded as a specific H with a sharp, dominant correspondence curve, transparent motion can be encoded as a superposition (or averaging) of multiple correspondence matrices to represent the combination of different shapes and motions. The mean correspondence matrices used here, however, are only concerned with expressing variations of slightly different shapes and motions rather than combinations of distinctly different shapes and motions at any one image location. This does not preclude the possibility that transparent motion can be recovered at the recognition stage by identifying multiple valid candidate shapes and motions at each image location. Another convenient property of the correspondence matrix is the ease with which time can be reversed, i.e. that image samples can be presented in reversechronological order to obtain more information. For a given H representing the forward-time mapping from I to I, the reverse-time mapping from I to I is simply the transpose, H T. As will be elaborated shortly, there is always more information to be gained by combining forward-time and reverse-time measurements, but the correspondence matrix offers one of the rare instances where the computation is straightforward. For the next sections, it is assumed that at every location and orientation θ i, i {,..., n} in the image, a maximum likelihood H Ωi,δ i has been recovered (or equivalently, the values Ω i and δ i ), together with the normalized entropy h i to represent the uncertainty of the maximum likelihood choice. It is further assumed that the analogous reverse-time data has been recovered, specifically

the maximum likelihood H_{Ω′_i, δ′_i} and its entropy h′_i.

5.1 Optical Flow

Optical flow is a measurement of the apparent movement of picture elements in the image plane of an image sequence. An optical flow vector (u, v) can be computed at each image location by combining the estimated displacements ∆_i along each orientation θ_i over i ∈ {1,..., n}. Each orientation has a different confidence (1 − h_i) for the displacement along that direction, so their contributions are weighted accordingly. For the experiments in this paper, n = 6 orientations. The optical flow vector thus has components

u = \frac{ \sum_{i=1}^{n} (1 - h_i) \, \Delta_i \cos(\theta_i) }{ \sum_{i=1}^{n} (1 - h_i) }, \qquad
v = \frac{ \sum_{i=1}^{n} (1 - h_i) \, \Delta_i \sin(\theta_i) }{ \sum_{i=1}^{n} (1 - h_i) }.  (27)

The notion of an entropy h_i for the displacement ∆_i along orientation θ_i maps directly to a probabilistic interpretation of the component of optical flow along that orientation. The maximum likelihood displacement ∆_i is still the one corresponding to the maximum likelihood Ω_i. The specificity of the distribution of L(Ω) over all Ω can be measured by its normalized entropy h_i. The corresponding ∆_i has the same distribution, so the credibility of the maximum likelihood ∆_i is found automatically.

The introduction of the displacement ∆_i formalizes an issue that demands clarification: how does one describe the image plane translation of a patch of an image that is deforming? Should one just use the image plane translation of the point at the center of the receptive field, or express the flow vector as a random variable with a distribution? The experiments in this paper use the former: one maximum likelihood flow vector at the center of the receptive field. At first glance, it would make sense to use the angle Ω to determine the translation, referring to the image formation model presented in Equation (1),

\Delta_i = \frac{N-1}{2} \, \frac{\tan(\Omega_i)}{\tan(\alpha/2)}.  (28)
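A minimal sketch of Equation (27): fusing the per-orientation displacements and their entropies into a single flow vector. How the displacements Δ_i are obtained is deliberately left open; as the text argues next, reading them off the maximum likelihood correspondence matrix is preferable to converting Ω_i alone. Function and variable names are illustrative.

    import numpy as np

    def flow_vector(deltas, entropies, thetas):
        """Combine per-orientation displacements into one optical flow vector.

        deltas    : maximum likelihood displacement along each slit orientation.
        entropies : normalized entropy h_i of each orientation's likelihoods.
        thetas    : slit orientations in radians (e.g. 0, 30, ..., 150 degrees).
        """
        w = 1.0 - np.asarray(entropies)           # confidence weights (1 - h_i)
        d = np.asarray(deltas)
        t = np.asarray(thetas)
        denom = w.sum()
        if denom <= 0.0:                          # no orientation is confident
            return 0.0, 0.0
        u = np.sum(w * d * np.cos(t)) / denom     # eq. (27)
        v = np.sum(w * d * np.sin(t)) / denom
        return u, v

    if __name__ == "__main__":
        thetas = np.deg2rad(np.arange(0, 180, 30))           # the 6 orientations used in the paper
        deltas = np.array([2.0, 1.7, 1.0, 0.0, -1.0, -1.7])  # hypothetical slit displacements
        entropies = np.full(6, 0.3)
        print(flow_vector(deltas, entropies, thetas))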

22 the motion of an entire aperture. The classical optical flow algorithms find the displacement assuming that the surface is planar and in the image plane. Converting Ω itself directly into a translation magnitude would ignore the fact that Ω was determined over the entire aperture of N pixels to represent a range of values for Ω, and does not necessarily reflect the translation occurring at the center of the aperture. More generally, there will be distortions: not all pixels in the aperture are displacing the same amount. Automatically dealing with this issue is one of the principal advantages of using correspondence matrices in the first place. Instead of using the angle Ω to determine translation, it is more informative to use the maximum likelihood correspondence matrix H Ω,δ. This preferred method is to look up the translation of the center pixel (or pixels in a neighborhood N ( N ) near the center) by inspection of the correspondence matrix H Ω,δ, illustrated in Figure, so that (H) r N ( N ) Nc= ch rc r N ( N ) Nc= H rc (N ), (9) i = (H Ωi,δ i ). (3) Already, the optical flow vector is more localized than before. But there is still more information available. i uses estimates from the forward-time H Ωi,δ i. But the reverse-time H Ω i,δ i and its confidence ( h i) can contribute as well. This could be very significant in cases of occlusion and disocclusion, when forward or reverse correspondences individually may have incomplete views. Time reversal is just the transposing of the correspondence matrix. Transposing it back provides a matrix in the same frame as the forward-time version. Therefore, the reverse-time opinion of the forward-time displacement can be expressed as i = (H T Ω i,δ i ). (3) The forward and reverse time information provides a unique opportunity to produce even more robust optical flow vectors by fusing the two together as i i ( h i) + i ( h i) (h, i + h i)

23 Correspondence curve (a) Center I ~ of Linear approximation (using just Ω) Ω (b) H ~ H T (Η) (c) Center (d) I ~ of Fig.. Improved localization of optical flow in one aperture. A correspondence matrix describes the correspondence between each pixel in the first image aperture (vertical axis) with each pixel in the second image aperture (horizontal axis). Examples are shown of 3 3 element correspondence matrices representing -dimensional (a) flat image-plane translation, (b) image-plane translation with scale change, and (c) non-uniform surface velocity. The white squares represent, the dark squares represent. Optical flow is considered to be the displacement of the central pixel in an aperture between two instants. As shown in (d), Typical optical flow assumes uniform displacement in the image plane (thin line) of Ω which would be different from the true central pixel displacement of (H) using a better-fitting non-uniform displacement (thick line). with h i (h i + h i). (3) The improved optical flow vector components (ũ, ṽ) are thus: ni= ( ũ = h i ) i cos(θ i ) ni= ( h, i ) ni= ( ṽ = h i ) i sin(θ i ) ni= ( h. (33) i ) 3

5.2 Time To Collision

Simply put, the time to collision (TTC) is the estimated amount of time before the viewer contacts an obstacle in the scene. This information is especially useful for robot navigation and collision avoidance. An image sequence from a moving camera could be transformed into a map of time to collision for each location in an image. If this information is sufficiently precise and dense, it can be used to reconstruct the 3-D shape of objects in the scene.

Given the local structure from motion model used here, the δ parameter indicates the ratio of the distance from the new viewpoint to the surface over the distance from the old viewpoint to the surface,

\delta = \frac{ \overline{V P_2 \, O_2} }{ \overline{V P_1 \, O_1} }.  (34)

The time between observations, ∆t, is known beforehand. Assuming that the camera's motion relative to the scene continues at constant velocity, one can estimate how much time will elapse before the camera reaches the point on the surface it is looking at and heading toward,

T = \Delta t \, \frac{\delta}{1 - \delta}.  (35)

Recovering Ω and δ locally in forward time can be augmented by recovering Ω′ and δ′ by reversing the sequence of the images. A direct method of computing the time to collision T between two images separated by a delay of ∆t, using both forward and reverse information, is to average the two contributions:

T = \frac{\Delta t}{2} \left( \frac{\delta}{1 - \delta} + \frac{1}{\delta' - 1} \right).  (36)

Although time to collision is customarily only used with positive values, as in the amount of time before contact happens, negative values are still meaningful; they indicate how much time has elapsed since the camera was in contact with the surface (assuming, as above, constant velocity).
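A small sketch of Equations (35) and (36) as reconstructed above, with guards against the degenerate cases δ = 1 and δ′ = 1 (no detectable change in distance); the guards, the default Δt and the function names are my additions.

    import numpy as np

    def ttc_forward(delta, dt=1.0, eps=1e-6):
        """Time to collision from the forward-time scale change delta (eq. 35)."""
        if abs(1.0 - delta) < eps:
            return np.inf                       # no approach detected
        return dt * delta / (1.0 - delta)

    def ttc_fused(delta_fwd, delta_rev, dt=1.0, eps=1e-6):
        """Average the forward- and reverse-time estimates (eq. 36)."""
        if abs(1.0 - delta_fwd) < eps or abs(delta_rev - 1.0) < eps:
            return np.inf
        return 0.5 * dt * (delta_fwd / (1.0 - delta_fwd) + 1.0 / (delta_rev - 1.0))

    if __name__ == "__main__":
        # e.g. the camera covers one tenth of the remaining distance per frame:
        print(ttc_forward(0.9))                 # about 9 frames to contact
        print(ttc_fused(0.9, 1.0 / 0.9))        # ideally the reverse ratio is 1/delta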

5.3 Focus of Expansion

The Focus of Expansion (FOE) is the point in the image plane towards which the camera is moving. The optical flow field radiates away from this point. Using the model of surface shape and motion from Section 2, a patch in the image is a candidate for the Focus of Expansion if and where Ω_θ = 0, ∀θ.

Generally, finding the Focus of Expansion is considered a global problem, requiring information from all over the image to build one globally consistent solution. Some methods [,] use normal optical flow to form concentric rings about the Focus of Expansion; in that approach, candidate points vote for inclusion in a Focus of Expansion cluster. As implemented in this paper, each patch of pixels in the image also votes for itself, but based only on local information, weighted by its confidence that it has Ω = 0.

The Focus of Expansion is computed by a simple weighted voting procedure. The FOE weight for each patch is the sum over orientations θ of the likelihood of having Ω_θ = 0,

w_{ij} = \frac{ \sum_{i=1}^{n} L(\Omega_{\theta_i} = 0)(1 - h_{\theta_i})
              + \sum_{i=1}^{n} L(\Omega'_{\theta_i} = 0)(1 - h'_{\theta_i}) }{ 2n } \in [0, 1].  (37)

The weights w are normalized to produce w̄,

\bar{w}_{ij} = w_{ij} / \max(w) \in [0, 1].  (38)

The weights are then clipped to retain only the highest values,

\hat{w}_{ij} = \begin{cases} \bar{w}_{ij}, & \text{if } \bar{w}_{ij} > 0.75 \\ 0, & \text{otherwise.} \end{cases}  (39)

And finally, the actual coordinate of the Focus of Expansion is a weighted sum of the candidate coordinates,

FOE_x = \frac{ \sum_{ij} x_{ij} \hat{w}_{ij} }{ \sum_{ij} \hat{w}_{ij} }, \qquad
FOE_y = \frac{ \sum_{ij} y_{ij} \hat{w}_{ij} }{ \sum_{ij} \hat{w}_{ij} }.  (40)
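A sketch of the voting procedure of Equations (37)-(40), operating on per-patch, per-orientation likelihood maps of Ω = 0 and their entropies for both time directions. The grid-of-patches data layout and coordinate conventions are assumptions for illustration; the 0.75 clipping threshold is the one quoted in the text.

    import numpy as np

    def focus_of_expansion(L0_fwd, h_fwd, L0_rev, h_rev, xs, ys, thresh=0.75):
        """Weighted vote for the FOE over a grid of patches.

        L0_fwd, L0_rev : arrays (rows, cols, n) of L(Omega_theta = 0) per orientation.
        h_fwd, h_rev   : matching normalized entropies.
        xs, ys         : arrays (rows, cols) of patch centre coordinates in pixels.
        """
        n = L0_fwd.shape[-1]
        w = (np.sum(L0_fwd * (1.0 - h_fwd), axis=-1) +
             np.sum(L0_rev * (1.0 - h_rev), axis=-1)) / (2.0 * n)   # eq. 37
        w_bar = w / w.max()                                         # eq. 38
        w_hat = np.where(w_bar > thresh, w_bar, 0.0)                # eq. 39
        total = w_hat.sum()
        if total == 0.0:
            return None                                             # no confident FOE candidate
        foe_x = np.sum(xs * w_hat) / total                          # eq. 40
        foe_y = np.sum(ys * w_hat) / total
        return foe_x, foe_y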

26 6 Experiments The results of experiments on real and artificial data are now presented in support of the preceding theory. We begin with the synthesis of operators for recovering observable parameters (Ω, δ) and then proceed to investigate the performance of these operators in recovering time to collision and focus of expansion on synthetic data. From here the experiment is repeated on real data acquired with a video camera mounted on a gantry robot system to determine the sensitivity of recovery under realistic imaging conditions. Next we consider the performance of the algorithm on some well-known sequences (Yosemite, Translating Trees) and compare our approach in recovering optical flow against published results. Specifically we show that despite the local nature of the resulting algorithm, competitive flow results are obtained along with reliable estimates of focus of expansion. Finally, we consider a mobile robot scenario and show that fairly accurate time to collision estimates can be obtained using an uncalibrated camera. 6. Synthesis of Operators The model space of (Ω, δ, k, η, β) was discretized into 9 levels for each of Ω and δ, and 5 levels each for k, η, β and α, binning the correspondence matrices into mean correspondence matrices indexed by (Ω, δ) for a total of 36 distinct correspondence matrices. Applying the procedures of Section 3.., 36 U,V pairs are synthesized, the first 4 of which are shown below in Figure 3. Each row i of the upper grayscale images at time t is a deformed sinusoid which corresponds to another deformed sinusoid at time t = t + t, in the corresponding row i in the lower grayscale images. t t + Ω =.8 δ =.9487 Ω =.8 δ =. Ω =. δ =. Ω =. δ =. t Fig. 3. U, V of correspondence matrices. Note that the most significant optimal detectors for this motion model are windowed sinusoids concentrated in the lower spatial frequencies. The SVD 6

27 automatically discovered the appropriate spatial scales for detecting each deformation. The parameters used are: Number of pixels: N 64 View translation : Ω[i].4 ( i 9 9 ), i {,.., 8} Distance to surface: δ[i].3456 ( i 9 9 ), i {,.., 8} Surface normal : η[i] { 45,.5,..., 45 } View change : β[i] {, 5,, 5, } Surface curvature: k[i] { 4,,,, 4} View aperture : α[i] {8, 8.5, 9, 9.5, } These experiments were performed using slits of length N = 64 with widths of 3 pixels. The operators are applied to a 3 3 gridding of the image pair at 6 different orientations θ (, 3, 6, 9,, 5 ). The same procedure is applied by swapping the two images to recover the maximum likelihood for the time reversal, (Ω, δ, θ ). Ω and Ω are useful for optical flow, but the local depth (or time to collision) cues are only available through the δ and δ measures. Ω and δ are found jointly, and the usefulness of δ is sensitive to the correct compensation of Ω. Some neighborhood filtering is required, since δ is noisy. The maximum likelihood values for Ω, δ, Ω, δ in the grid of 3 3 elements are median filtered in a 3 3 neighborhood for Ω, Ω, and a 7 7 neighborhood for δ, δ. 6. Synthetic Images, 4 Boxes To test the Time to Collision detectors, a random dot texture was combined with synthetic range data shown in Figure 4 to generate the synthetic image pair shown in Figure 5. The surface consisted of 4 quadrants of varying distance (45, 5, 55 and 6mm from the camera focal point), and the camera moves by 5mm towards the center of the scene. The ground truth time to collision for the four quadrants are thus 8, 9, and units of time in the future. The camera geometry is modeled as a pinhole perspective camera with 3 4 square pixels where 3 pixels are mapped onto a viewing angle of This means that the image samples of 64 pixels will cover aperture angles α varying from at the center to 8 at the periphery. 7
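For completeness, a sketch of the sampling machinery implied by this setup: extracting a length-N slit from an image at a given centre and orientation by bilinear interpolation, and median-filtering the recovered parameter maps over the grid (3x3 for Ω, 7x7 for δ, as stated above). The single-row slit (ignoring the few-pixel width) and the SciPy interpolation call are simplifying assumptions.

    import numpy as np
    from scipy.ndimage import map_coordinates, median_filter

    def extract_slit(image, cx, cy, theta, N=64):
        """Sample a 1-D slit of N pixels centred at (cx, cy) at orientation theta."""
        s = np.arange(N) - (N - 1) / 2.0
        xs = cx + s * np.cos(theta)
        ys = cy + s * np.sin(theta)
        # map_coordinates expects (row, col) = (y, x); order=1 is bilinear interpolation
        return map_coordinates(image, np.vstack([ys, xs]), order=1, mode="nearest")

    def smooth_parameter_maps(omega_map, delta_map):
        """Median filter the recovered maps: 3x3 for Omega, 7x7 for delta."""
        return median_filter(omega_map, size=3), median_filter(delta_map, size=7)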

28 The recovered Ω, Ω data is almost perfectly planar for each orientation, as shown in Figure 6. This is expected, as the scene motion generates a radial flow field. Ω is proportional to the image plane velocity, with linearly increasing magnitude as it moves further from the focus of expansion. Note that the Ω parameter is signed and directional. Fig. 4. Synthetic texture and range. Black is 45mm from focal point, white is 6 mm. Fig. 5. Synthetic image pair. The camera moves 5mm toward the center of the scene. Ω θ Ω Fig. 6. Recovered Ω and Ω for 6 orientations at 3 3 image locations (black:.4, gray:, white: +.4 ). δ θ δ Fig. 7. Recovered δ and δ for 6 orientations at 3 3 image locations (black:.8, gray:., white:

29 Since Ω, Ω have been cleanly recovered, it is not surprising that the more sensitive δ, δ have formed blobs roughly in the four quadrants of the scene, shown in Figure 7. The local structure from motion problem is at its worst in image sequences with a forward-moving camera. Optical flow methods have no information at or near the focus of expansion, and small errors in the flow field even toward the periphery render depth recovery impractical without fitting a global model to the optical flow field. With this experiment, the time to collision detectors are designed specifically to work best at or near the focus of expansion. Time to collision, 4 boxes from 8 to, Ground Truth Time to collision, 4 boxes from 8 to, Recovered Time to collision, 4 boxes from 8 to, Ground Truth Time to collision, 4 boxes from 8 to, Recovered Fig. 8. Time to Collision, ground truth versus recovered. The surfaces recovered in Figure 8 are noisy. The root-mean-squared error of the recovered time to collision surface is.9337 units of time, and is illustrated in Figure 9. The mean error of TTC is 8.75%. This is an interesting result, indicating that under favorable conditions, two frames separated by one unit of time can compute the time to collision to within one unit of time. This is approximately the limit of granularity of the time to collision calculation. The minimum absolute error is.76 in the centers of the boxes, where neighbors agree on the distance to the surface. The maximum error,.98 units of time, occurred at the focus of expansion at the center of the image, which is where the TTC can be interpreted as any of {8, 9,, } units of time. Depending on which of the 4 possible TTC was chosen at the center, there could have been a difference of up to 4 units of time, and the result would still have been within bounds of legitimate interpretation. Next, the receptive field responses (Ω, δ) θ and (Ω, δ ) θ were combined as above to generate the optical flow field shown in Figure. The focus of expansion was computed to be (F OE x, F OE y ) = (6., 9.5), illustrated in. The time to collision for the FOE was found to be.95 units of time. The ground 9

30 Time to collision, 4 boxes from 8 to, Absolute Error Time to collision, 4 boxes from 8 to, Absolute Error (a) (b) Fig. 9. Time to collision error from synthetic image pair. The absolute time to collision error is rendered in (a) -D and (b) 3-D. The RMS error is.9337 units of time, with a minimum error of.76 near the middle of each of the four boxes, and a maximum error of.98 at the center where multiple interpretations are legitimately possible. truth is units of time. Fig.. Recovered optical flow from synthetic image pair, using the same shape deformation receptive fields and multiple orientations as the previous experiments. The composite flow vectors are the weighted sums of the different orientations maximum likelihood displacements. The results in Figure 8 demonstrate that the recovered time to collision estimates are sufficiently clean to distinguish between regions of the image at 8, 9, and time units in the future. Although the flow field is technically correct, the flow field alone is not sufficient to reconstruct the surface shape or time to collision at specific points. At the center, the flow vectors have zero length, so they give no indication of speed at all. There are no distinct flow field changes to indicate different speeds. And yet the receptive fields were sensitive enough to yield rich 3-D shape and motion information without a surface model regularization stage. This observation further reinforces the fact 3

31 (a) (b) Fig.. Finding the Focus of Expansion for the synthetic image pair from local information only. A normalized likelihood map of FOE, w ij, is shown in (a). The + marker indicates the computed FOE. A contour map of FOE likelihood is superimposed on one of the images in (b). The ground truth FOE is at the image center (6, ), and the computed FOE is at (6., 9.5). that a -D image plane optical flow field is a bottleneck, discarding information that is essential for interesting egomotion situations. 6.3 Natural images, Calibration Grid The experiment was repeated using grayscale images captured by a camera mounted on a gantry robot looking at a flat planar calibration grid. The camera field of view is the same as the earlier synthetic example, but it is not an idealized pinhole camera. The camera lens starts at 5mm from the plane, and moves 5mm closer, for a time to collision of 9 units of time, shown in Figure. The camera optics slightly bend the grid lines; the small squares toward the periphery are a bit smaller than the squares near the center. This spherical aberration makes the grid appear as a textured sphere shown close up, directly in front of the camera. The camera s focal length is 7mm, placing the focal point about 5mm behind the lens from which camera distance was measured. Fig.. Planar calibration grid scene. The detector response in Figure 3 is so sensitive that the recovered time to 3

32 Time to collision, flat grid, Recovered Time to collision, flat grid, Recovered Fig. 3. Time to collision, planar grid. collision captures the spherical aberration effect. The time to collision varies from units of time at the center to 8 units of time at the periphery. The aberration makes the periphery appear further than it actually is, moving faster than the center, hence a lower time to collision. In this case, optical flow in the periphery is uninformative; there appears to be no translation, but there is texture expansion. This experimental result confirms that the local feed-forward operators can extract precise estimates of the real time to collision without a global scene motion model. 6.4 Two Well-Known Image Sequences Next we consider performance of these detectors relative to optical flow recovery via comparison to standard published algorithms [3,5]. Errors are computed using the code provided by Barron et al. in [3]. Optical flow estimates computed using (33) are by themselves not informative without consideration of their corresponding entropy values as related by (5). For the experiments presented in this paper, flows with entropies above an empirically selected threshold are marked as invalid Translating Trees Figure 4 shows two successive frames from the well-known translating tree sequence used to compare optical flow algorithms. The resulting optical flow map is shown in Figure 5a with the corresponding confidence map shown in Figure 5b. As can be seen from the confidence map, the algorithm has considerable difficulty producing reliable estimates in regions of shadow and low texture. This is perhaps not too surprising given the local nature of the algorithm. Measured error, relative to ground truth, is Mean 6., SD 6.3, which is considerably worse than the Mean.33, SD.44 reported in [5]. However, using the associated confidences to reject estimates falling below 3

33 threshold, error is reduced to Mean 8.79, SD which we consider quite reasonable for a local operator with limited support. The thresholded optical flow map is shown in Figure 6. (a) (b) Fig. 4. First two frames of the translating trees sequence. (a) (b) Fig. 5. (a) Computed optical flow, full density (b) Associated confidence map (brighter = higher confidences) 6.4. Yosemite Canyon In contrast to the translating tree sequence, the algorithm fares much better on the Yosemite canyon sequence, another well-known reference for optical flow comparison. Two successive frames are shown in Figure 7 with the re- Fig. 6. Computed optical flow, low confidence flows eliminated 33

Measured error relative to ground truth is Mean .5, SD 9.9, with no confidence thresholding of the data. In comparison to the best method, Liu et al. [5], with Mean 6.84, SD .88, we consider the results quite competitive given the local nature of the computation.

Fig. 7. Frames and 3 of the Yosemite sequence.

To further illustrate the quality of the recovered (Ω, δ) parameters, time to collision is estimated using (36) and shown in 2-D in Figure 9 and as a 3-D range image in Figure 3. This challenging image set yields excellent structural information about the scene, in particular the shape of the mountain range nearest the camera viewpoint. The Focus of Expansion was computed to be (FOE_x, FOE_y) = (43.5, .85), illustrated in Figure 3. The time to collision at the FOE was found to be units of time. Interestingly, the instantaneous FOE is in the farther mountain range a bit to the right, suggesting that the aircraft is not heading straight forward along the camera view axis, but is instead heading a bit to the right of the camera center.

Fig. 8. Computed optical flow for Yosemite sequence, full density.
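Equation (36) is not reproduced in this excerpt, but the conversion it performs can be illustrated with a standard constant-velocity model: if the recovered δ is interpreted as the frame-to-frame depth ratio Z(t+1)/Z(t), then under a constant closing speed the remaining number of frames before contact is 1/(1 - δ). The function below is a sketch under that assumed interpretation of δ, not necessarily the paper's exact formula.

```python
import numpy as np

def time_to_collision(delta, eps=1e-6):
    """Convert a per-pixel depth-change ratio into time to collision (in frames).

    delta : array of frame-to-frame depth ratios, delta = Z(t+1) / Z(t);
            delta < 1 means the surface is approaching, delta > 1 retreating.
    Returns times to collision; retreating or static points are returned as +inf.

    Assumes constant closing speed, so Z(t) = Z(0) - v*t and
    T = Z(t) / v = 1 / (1 - delta).  Illustrative model only.
    """
    delta = np.asarray(delta, dtype=float)
    closing = 1.0 - delta                      # fraction of depth lost per frame
    ttc = np.full_like(delta, np.inf)
    approaching = closing > eps
    ttc[approaching] = 1.0 / closing[approaching]
    return ttc

# A patch with delta = 0.8 loses 20% of its depth per frame: collision in 5 frames.
print(time_to_collision([0.8, 1.0, 1.25]))     # [5.0, inf, inf]
```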

Fig. 9. Time to collision for Yosemite image pair. Frame is shown in (a) for comparison with the time to collision map (b). The units of time are the number of image frames before collision. Areas more than 4 units of time away are clipped for clarity. The outer border is marked as invalid.

Fig. 3. Yosemite surface reconstruction from time to collision. The shape of the nearest mountain is clearly extracted and rendered in the lower left. The range data is simply the time to collision. Areas more than 4 units of time away are clipped for clarity. The outer borders are marked as invalid.

6.5 Mobile Robots

As a final demonstration we consider a video sequence obtained from a mobile robot moving along a known trajectory with known constant velocity. The first and last images of the frame sequence are shown in Figure 3. The robot's position advances cm between each image, hence a velocity of cm per unit of time t. The map of maximum likelihood δ̂ and the map of time to collision T were computed over a grid of slits at 6 orientations and are rendered in Figure 33. Computation time on an Intel Pentium 4 66 MHz workstation was approximately 9 seconds per image frame pair.
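The δ̂ and T maps above are computed from image slits, i.e. 1-D intensity cross-sections sampled on a regular grid at 6 orientations. Purely as an illustration of that sampling pattern, the sketch below extracts bilinearly interpolated slits around each grid point; the slit length, grid spacing, and function names are assumptions, and the receptive-field filtering that the actual system applies to these slits is not shown.

```python
import numpy as np

def sample_slit(img, cx, cy, theta, length=32):
    """Bilinearly sample a 1-D intensity slit of `length` pixels centred on
    (cx, cy) at orientation `theta` (radians)."""
    t = np.linspace(-(length - 1) / 2.0, (length - 1) / 2.0, length)
    xs = np.clip(cx + t * np.cos(theta), 0, img.shape[1] - 1.001)
    ys = np.clip(cy + t * np.sin(theta), 0, img.shape[0] - 1.001)
    x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
    fx, fy = xs - x0, ys - y0
    top = img[y0, x0] * (1 - fx) + img[y0, x0 + 1] * fx
    bot = img[y0 + 1, x0] * (1 - fx) + img[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bot * fy

def slit_grid(img, spacing=16, n_orient=6, length=32):
    """Sample slits at n_orient orientations on a regular grid of centres.
    Returns a dict mapping (cy, cx, k) -> 1-D slit, k being the orientation index."""
    slits = {}
    for cy in range(spacing, img.shape[0] - spacing, spacing):
        for cx in range(spacing, img.shape[1] - spacing, spacing):
            for k in range(n_orient):
                theta = np.pi * k / n_orient      # 6 orientations spanning 180 degrees
                slits[(cy, cx, k)] = sample_slit(img, cx, cy, theta, length)
    return slits
```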

Fig. 3. Finding the Focus of Expansion for Yosemite canyon image pair from local information only. A likelihood map of FOE, w_ij, is shown in (a). The + marker indicates the computed FOE. A contour map of FOE likelihood is superimposed on one of the images in (b).

Fig. 3. Lab sequence, source images: Frame (t ) and Frame (t ).

In a practical implementation, the redundancy in the filters would be reduced using linear combinations of Principal Components, potentially reducing computation time by a factor of . Using a sparser sampling of the image plane would further reduce the computation time, bringing it closer to real time. One significant observation is that although the floor tile pattern expands closer to the camera due to perspective and the texture of the floor is moving toward the robot, the floor is heading underneath the camera and does not appear to be on a collision course with the camera. Because the camera line of sight is parallel to the floor, the algorithm has effectively classified the floor motion as maintaining constant distance from the camera, and thus not an obstacle. The algorithm performs a literal figure-ground separation, and the obstacle blobs are at least qualitatively useful for identifying the location of the nearest obstacles in the image.

Next, the quantitative estimates are examined. Note that in Figure 33 there are holes in the time map between the table and chair legs, where the back wall is beyond the detector's range. The chair near the center of the image started 4cm from the camera and by the eleventh frame is cm from the camera. The chair has a large opening in it, letting a view of the background through. The image slits are based on a model of a continuous surface, so some uncertainty in the maximum likelihood estimates in this situation is unavoidable.
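The proposed speed-up is a standard one: approximate the full bank of receptive-field filters by a small number of principal components, correlate the image with those components once, and reconstruct each filter's response as a linear combination of the component responses. The sketch below is a generic illustration of that idea using NumPy SVD and frequency-domain correlation; the filter shapes, number of components, and all names are assumptions rather than details taken from the paper.

```python
import numpy as np
from numpy.fft import rfft2, irfft2

def compress_filter_bank(filters, n_components):
    """Approximate a stack of filters (n, h, w) by n_components principal
    components plus per-filter mixing coefficients."""
    n, h, w = filters.shape
    flat = filters.reshape(n, h * w)
    mean = flat.mean(axis=0)
    u, s, vt = np.linalg.svd(flat - mean, full_matrices=False)
    basis = vt[:n_components].reshape(n_components, h, w)   # principal filters
    coeffs = u[:, :n_components] * s[:n_components]         # mixing weights, shape (n, k)
    return mean.reshape(h, w), basis, coeffs

def apply_compressed_bank(image, mean, basis, coeffs):
    """Filter the image with the compressed bank: one correlation per principal
    component (plus the mean filter) instead of one per original filter."""
    shape = image.shape
    F = rfft2(image)
    responses = [irfft2(F * np.conj(rfft2(b, s=shape)), s=shape) for b in basis]
    mean_resp = irfft2(F * np.conj(rfft2(mean, s=shape)), s=shape)
    stack = np.stack(responses)                              # (k, H, W)
    return mean_resp[None] + np.tensordot(coeffs, stack, axes=([1], [0]))
```

Because correlation is linear in the filter, reconstructing responses from the component responses is exact up to the truncation error of the SVD, so the trade-off between speed and fidelity is controlled entirely by `n_components`.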

Fig. 33. Results from Lab sequence: source images, δ maps, and time maps for successive frames (t through t 5). For the δ map, δ = .8 (rapid approach) is rendered as black, δ = . (no depth change) is middle gray and δ = .5 (rapid retreat) is white. The time map indicates proximity (either about to touch or recently touched) as brightness. Dark patches are more than 5 units of time away, either in the future or the past.

The usefulness of the estimates is shown in Figure 34, which compares the mean estimated time to collision over the region around that chair to ground truth from actual measurements.

Fig. 34. Time to collision estimate for the neighborhood of the central chair in the Lab sequence (time to collision plotted against image frame). Image patches from frames , 3, 6, and 9 are shown above. Below, the mean time to collision with this patch, shown as small circles, is compared with the ground truth (line).
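The comparison in Figure 34 amounts to averaging the recovered time-to-collision map over an image patch and plotting it against the ground-truth count of frames remaining, which is known because the robot advances by a fixed step along a measured trajectory. A minimal sketch of that bookkeeping is given below; the patch coordinates, distances, and step size are placeholders, not the actual lab measurements.

```python
import numpy as np

def ground_truth_ttc(start_distance_cm, step_cm, n_frames):
    """Frames remaining before contact for an object approached at constant speed:
    distance left at frame k divided by the per-frame step."""
    distances = start_distance_cm - step_cm * np.arange(n_frames)
    return distances / step_cm

def mean_patch_ttc(ttc_maps, row_slice, col_slice):
    """Mean of each recovered time-to-collision map over a fixed image patch,
    ignoring invalid (non-finite) entries."""
    means = []
    for ttc in ttc_maps:
        patch = ttc[row_slice, col_slice]
        means.append(np.nanmean(np.where(np.isfinite(patch), patch, np.nan)))
    return np.array(means)

# Placeholder example: an object starting 240 cm away, robot advancing 20 cm per frame.
gt = ground_truth_ttc(start_distance_cm=240.0, step_cm=20.0, n_frames=11)
print(gt)   # 12, 11, 10, ... frames remaining
```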

6.6 Lab Sequence

A second mobile robot travelled a different linear trajectory in the same room, sampling 4 images, each 5cm apart, shown in Figure 35.

Fig. 35. Lab sequence, source images: Frame (t ) and Frame (t ).

As before, the floor tiles are ignored, and the real obstacles near the far wall are highlighted. The textureless patches of the far wall are uninformative and of high uncertainty: they are rendered here as high times to contact. The results of Figure 36 ignore patches that appeared to be moving away from the moving camera. Note the shape of the table, chair and coat rack that stand out against the background.

Fig. 36. Results from Lab sequence: source images, δ maps, and time maps for successive frames (t 6 onward). For the δ map, δ = .8 (rapid approach) is rendered as black, δ = . (no depth change) is middle gray and δ = .5 (rapid retreat) is white. The time map indicates proximity (either about to touch or recently touched) as brightness. Dark patches are more than 5 units of time in the future. Negative times are ignored.
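The rendering rules described in the captions of Figures 33 and 36 reduce to simple masking and grey-level mapping of the δ and time maps: retreating patches (negative times to collision) and points far in the future are suppressed, and the remaining obstacle blobs stand out against the background. The sketch below reproduces that kind of post-processing generically; the grey-level anchors, the 5-unit horizon, the assumption that δ = 1 means no depth change, and all array names are illustrative choices consistent with, but not taken verbatim from, the captions.

```python
import numpy as np

def render_delta_map(delta, lo=0.8, mid=1.0, hi=1.5):
    """Map depth-change ratios to grey levels: rapid approach (delta = lo) -> black,
    no depth change (delta = mid, assumed 1.0) -> middle grey,
    rapid retreat (delta = hi) -> white."""
    return np.clip(np.interp(delta, [lo, mid, hi], [0.0, 0.5, 1.0]), 0.0, 1.0)

def render_time_map(ttc, horizon=5.0):
    """Brightness encodes proximity; patches more than `horizon` units of time away,
    or with negative / invalid times, are rendered dark."""
    ttc = np.asarray(ttc, dtype=float)
    valid = np.isfinite(ttc) & (ttc >= 0.0) & (ttc <= horizon)
    brightness = np.zeros_like(ttc)
    brightness[valid] = 1.0 - ttc[valid] / horizon   # nearer in time -> brighter
    return brightness

def obstacle_mask(delta, ttc, approach_thresh=0.95, horizon=5.0):
    """Crude figure-ground separation: keep pixels that are both approaching
    and within the time horizon."""
    ttc = np.asarray(ttc, dtype=float)
    return (np.asarray(delta) < approach_thresh) & np.isfinite(ttc) & (ttc >= 0) & (ttc <= horizon)
```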
