Representation and Detection of Deformable Shapes

Representation and Detection of Deformable Shapes Pedro F. Felzenszwalb Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge, MA 02139 pff@ai.mit.edu Abstract We present a new method for detecting deformable shapes in images. The main difficulty with deformable template models is the very large (or infinite) number of possible non-rigid transformations of the templates. This makes the problem of finding an optimal match of a deformable template to an image incredibly hard. Using a new representation for deformable shapes we show how to efficiently find a global optimal solution to the non-rigid matching problem. Our matching algorithm can minimize a large class of energy functions, making it applicable to a wide range of problems. We present experimental results of detecting shapes in medical and natural images. Because we don t rely on local search techniques, our method is very robust, yielding good matches even in images with high clutter. 1 Introduction In this paper we address the problem of detecting deformable shapes in images. Our approach falls within the framework of deformable template matching, where one wants to find a non-rigid transformation that maps a model to an image. In this setting, an energy function associates a cost with each potential transformation of the model, and we want to find a transformation with the lowest possible cost. Typically the energy function is a sum of two terms, the first one attracts the deformed model towards salient image features, while the second one penalizes large deformations of the model. Most of the existing non-rigid matching techniques either require initialization near the final solution or are too slow for practical use. This is because the number of possible transformations of a template is very large. In contrast, we present an algorithm that can quickly find a global optimal non-rigid transformation without any kind of initialization. The search over transformations is done efficiently by exploiting special properties of a new representation for deformable shapes. We represent shapes using triangulated polygons, as illustrated in Figure 1. The triangles in a shape form a structure that can be used to model shape deformation. In fact, by picking a special triangulation we obtain a representation that is related to Blum s symmetric axis transform [4]. We note that the triangles in any triangulated polygon are connected together in a special way. This allows us to obtain an efficient dynamic programming algorithm to find the best match of a model shape to an image, where we allow each triangle in the model to undergo an arbitrary affine transformation. The quality of a match is given by an energy function that is a sum of terms, one for each triangle. This type of energy function is quite general, and can be used to represent a wide range of deformable template models. In particular, we can represent both the boundary and the internal structure of a deformable object. Our experimental results illustrate the robustness of our method, showing accurate detection of deformable shapes even in highly cluttered scenes. The quality of the results is a consequence of using a good representation for deformable shapes together with an algorithm that can find a global optimal solution to the non-rigid matching problem. We show results both on medical and natural images, demonstrating the wide applicability of our techniques. 1.1 Related work The basic idea of matching a deformable model to an image goes back to Fischler and Elschlager [8] and Widrow [16]. More recently, Grenander [10] introduced a framework which provides a very general setting to represent deformable objects. Other influential models were presented in [11] and [5]. A few efficient and provably good matching algorithms have been developed for restricted sets of deformable models. For example, in [6] a dynamic programming algorithm was used to detect open deformable contours in images. Dynamic programming was also used in [1] to match models consisting of a number of landmarks with positions con-

Figure 1: Rabbit, pear, and hand shapes represented by triangulated polygons. strained by a decomposable graphical model. Efficient algorithms also exist for the related problem of computing a non-rigid match between two pre-segmented objects (such as [2] and [14]). Here we present the first efficient algorithm that can find optimal non-rigid matches between a deformable template and an image, where both boundary and region information of the template can be taken into account. Even when we use an energy function with a data term that depends only on the boundary of a shape we take into account region information when measuring shape deformation. In this way we obtain more realistic models of deformation than is possible using only boundary models. 2 Triangulated polygons As mentioned in the introduction, we represent shapes using triangulated polygons. The polygonal boundaries approximate the boundary of each shape, and the triangulations provide a decomposition of the shapes into parts. We restrict ourselves to simple polygons, which are polygons without holes. A triangulation of a polygon P is a decomposition of P into triangles, defined by diagonals. Each diagonal is a line segment that connects two vertices of P and lies in the interior of the polygon. Moreover, no two diagonals cross. We know (see [7]) that every simple polygon can be triangulated, and any triangulation of a polygon with n vertices consists of exactly n 2 triangles. Let T be a triangulation of a polygon P. By considering T as a decomposition of P into parts, we obtain a simple and intuitive way to non-rigidly deform the polygon. For example, a finger of the hand in Figure 1 can be bent by changing the shape of a single triangle. In practice we use a particular triangulation of the polygon known as the constrained Delaunay triangulation. The constrained Delaunay triangulation can be computed efficiently and yields a decomposition of the polygon into meaningful parts. In fact, it yields a representation that is closely related to the medial axis of a shape (see [13]). Any triangulation of a simple polygon has an important property which is the key to our efficient matching algorithm. We denoted by G T the dual graph of a triangulation T. The nodes of G T correspond to the triangles in T, and two nodes are connected when the corresponding triangles share an edge. Figure 2 illustrates a triangulated polygon and its dual graph. The important property is that G T is a tree. To see this, just note that each diagonal in T cuts the polygon into two disjoint parts, so removing an edge from G T disconnects the graph. Since G T is a tree, there is a nice order of elimination for the vertices and triangles of T. We use this fact to compute an optimal match of a model to an image using dynamic programming. Note that every tree has a leaf, and a leaf in G T corresponds to a triangle with some vertex v that is not in any other triangle. If we delete v and its triangle from T we obtain a new triangulated polygon. Repeating this procedure we get an order of elimination for the vertices (and triangles) of T. The order is such that when eliminating the i-th vertex, it is in exactly one triangle of the current triangulated polygon. If we consider the graph structure defined by T, the ordering described above is a perfect elimination scheme for the vertices of the graph. Graphs which admit perfect elimination schemes are known as decomposable, chordal or triangulated graphs (see [9]). Decomposable graphs are important because many problems which are hard to solve for general graphs can be solved efficiently in this restricted class. 3 Matching Let P be a simple polygon representing a model shape. An embedding of P in the plane is defined by a continuous function f :P R 2. We consider a set of embeddings which are extensions of maps g :V R 2, where V are the vertices of P. A triangulation of P gives a

Each term c ijk should take into account the shape of the embedded triangle and the image data covered by the embedding. Our current implementation uses a simple energy function similar to typical deformable template matching costs. For each triangle t, a deformation cost measures how far the corresponding affine map f t is from a similarity transformation (this makes our shape models invariant to translations, rotations and uniform scalings). A data cost attracts the boundary of the embedded polygon to image locations with high gradient magnitude. More details are given in Section 3.1. While the implementation described here uses a fairly simple energy function, the formulation can handle richer concepts. For example, the deformation costs could be tuned for individual triangles, taking into account that different parts of the shape may be more flexible than others. In fact, we have developed a method to learn deformation models from training data (this will be described in a longer report). Also, the data costs could take into account the whole area covered by the embedded polygon. For example, if we have a grayscale template associated with a shape we can use the correlation between the deformed grayscale template and the image to obtain a data cost. Figure 2: Triangulated polygon and dual graph. natural extension of g to all of the polygon as a piecewise affine map f. The function f sends each triangle (v 1, v 2, v 3 ) T to the triangle (g(v 1 ), g(v 2 ), g(v 3 )) using linear interpolation. In this way, the restriction of f to each triangle t T is an affine map f t. To see that f is well defined (and continuous) just note that if two triangles a, b T touch, then f a and f b agree along the intersection of a and b. What may seem surprising is that all embeddings which map each triangle according to an affine transformation are extensions of some g. This follows from the fact that an affine transformation is defined by the image of three non-collinear points. We define an energy function which assigns a cost to each map g, relative to an image I. The matching problem is to find g with minimum energy (which corresponds to the best location for the deformable shape in the image). We consider energy functions with the following structural form: E(g, I) = c ijk (g(v i ), g(v j ), g(v k ), I). (1) (v i,v j,v k ) T 3.1 Energy function In our framework, each triangle in a shape is mapped to the image plane using an affine transformation. In matrix form, we can write the affine transformation as h(x) = Ax + a. We restrict our attention to transformations which preserve orientation (det(a) > 0). This ensures that the corresponding embedding f is locally one-to-one. Let α and β be the singular values of A. The transformation h takes a unit circle to an ellipse with major and minor axes of length α and β. The value log(α/β) is called the log-anisotropy of h and is commonly used as a measure of how far h is from a similarity transform (see [15]). We use the log-anisotropy measure to assign a deformation cost for each affine map (and let the cost be infinity if the affine map is not orientation preserving). The deformation costs are combined with a data cost that attracts the shape boundary to locations in the image that have high gradient magnitude, E(g, I) = log 2 (α t /β t ) t T ( I f)(s) f (s) λ P f ds, (s) where log(α t /β t ) is the log-anisotropy of f t. The term ( I f)(s) f (s) is the component of the image gradient that is perpendicular to the shape boundary at f(s). We divide the gradient term in the integral by f (s) to make the energy scale invariant. Note that the integral can be broken up into an integral for each edge in the polygon. This allows us to write the energy function in the form of equation (1), where the cost for each triangle will be the deformation cost plus one integral term for each boundary edge that belongs to the triangle. 4 Algorithm As discussed in the last section, the matching problem is to find a map g :V R 2 with lowest possible energy. The only approximation we make is to consider

a discrete set of possible locations for each vertex. Let G R 2 be a grid of locations in the image. In the discrete setting g maps each vertex v i to a location l i G. The form of the energy function in equation (1) is quite general, and we depend on the structure of T to be able to find an optimal g efficiently. As noted in Section 2, there is a nice order of elimination for the vertices and triangles of T. The order is such that when eliminating the i-th vertex, it is in exactly one triangle of the current triangulated polygon. The matching algorithm works by sequentially eliminating the vertices of T, using the nice elimination order. This is an instance of a well known dynamic programming technique (see [3]). After eliminating v 1,..., v i 1, vertex v i is in exactly one triangle, say with nodes v j and v k. The two nodes v j and v k are the parents of v i, which we indicate by letting p[i].a = j and p[i].b = k. We compute the cost of the best placement for v i as a function of the locations for v j and v k. This cost is stored in V [j, k](l j, l k ). When we get to the last two vertices we can solve for their best location and trace back to find the best location of the other vertices, as is typical in dynamic programming. Algorithm Match(I) ( Find the best embedding of a shape in an image ) 1. for i = 1 to n 2 2. ( Eliminate the i-th vertex ) 3. j p[i].a 4. k p[i].b 5. for each pair of locations l j and l k in G 6. V [j, k](l j, l k ) min li G c ijk (l i, l j, l k, I) +V [i, j](l i, l j )+V [i, k](l i, l k ) 7. Pick l n 1 and l n minimizing V [n 1, n] and trace back to obtain the other optimal locations. This algorithm runs in O(nm 3 ) time and uses O(nm 2 ) space, where n is the number of vertices in the polygon and m is the number of possible locations for each vertex. In practice we can speed up the algorithm by noting that given positions l j and l k for the parents of the i-th vertex there is a unique similarity transformation taking v j and v k to the respective locations. This similarity transformation defines an ideal location for v i. We only need to consider locations for v i that are near this ideal location, because locations that are far introduce too much deformation in the model. With this optimization the running time of the algorithm is essentially O(nm 2 ). Note that in line 7 of the matching algorithm each entry in V [n 1, n] corresponds to the quality of an optimal embedding for the deformable shape given particular locations for v n 1 and v n. We can detect multiple instances of a shape in a single image by finding peaks in V [n 1, n]. We simply trace back from each peak that has a value above some given fixed threshold. 5 Experiments We present experimental results of our matching algorithm on both medical and natural images. In each case we used a binary picture of the target object to build a model. First we computed a polygonal approximation of the example shape, and then the Delaunay triangulation of the resulting polygon. For the matching results shown here we used a grid of 60 60 possible locations in the image for the vertices of the models. Our matching algorithm took approximately five minutes in each image when running on a standard workstation. In Figure 3 we show a model for the corpus callosum generated from a manually segmented shape. The best match of the model to several MR images is shown in Figure 4. Note how these images have very low contrast, and the shape of the corpus callosum varies quite a bit. The quality of our results is similar to the quality of results obtained using the best available methods for model based segmentation of medical images (such as [12]). The main advantage of our method is that it does not require any initialization. Figure 5 shows a model for maple leaves, constructed from a binary silhouette. The best match of the model to a few images is shown in Figure 6. The leaves in each image are different, and the viewing direction varies. Note how our method can handle the variation in shape even in the presence of occlusion and clutter. In particular, the last image shows how we can hallucinate the location of a large occluded part of the leaf. Techniques that rely on local search to find non-rigid transformations tend to fail on cluttered images because they get stuck on local optimum solutions. To check the performance of our algorithm on difficult inputs we corrupted one of the leaf images with random Gaussian noise. Figure 7 shows the corrupted images with increasing amounts of noise (corresponding to σ = 50, 150 and 250) and the matching results for each input. We can identify the approximate location of the leaf even when it is barely visible. In Figure 8 we demonstrate how our matching algorithm can be used to detect multiple instances of an object in an image. As discussed in the last section we simply selected peaks in V [n 1, n] with value above a pre-determined threshold to generate each detection.

Figure 3: A model for the corpus callosum generated from a binary picture. Figure 4: Matching the corpus callosum model to different images.

Figure 5: A model for maple leaves generated from a binary picture. Figure 6: Matching the leaf model to different cluttered images. Each image is of a different maple leaf. Figure 7: Matching the leaf model to an image corrupted by increasing amounts of Gaussian noise.

Figure 8: Detection of multiple leaves in one image. 6 Conclusion We have described a new representation for deformable shapes based on triangulated polygons. Our models are invariant to translation, rotation and scale and can be non-rigidly deformed in an intuitive way. The problem of matching a shape to an image was defined in terms of an energy function to be minimized, as is typical for deformable template models. We concentrated on a particular class of energy functions that can be used to represent both the boundary and the internal structure of a deformable object. In contrast to previous work we can efficiently find a global optimal solution to the matching problem. This allows us to detect deformable objects without any kind of initialization, even in the presence of occlusion and background clutter. Acknowledgments We would like to thank Eric Grimson and Dan Huttenlocher for their comments and support. This work was supported in part by NSF-ITR award #0085836 and DARPA contract N00014-00-1-0907. References [1] Y. Amit and A. Kong. Graphical templates for model registration. IEEE PAMI, 18(3):225 236, 1996. [2] R. Basri, L. Costa, D. Geiger, and D. Jacobs. Determining the similarity of deformable shapes. Vision Research, 38:2365 2385, 1998. [3] U. Bertele and F. Brioschi. Nonserial Dynamic Programming. Academic Press, 1972. [4] H. Blum. Biological shape and visual science. Theoretical Biology, 38:205 287, 1973. [5] T. Cootes, C. Taylor, D. Cooper, and J. Graham. Active shape models: Their training and application. CVIU, 61(1):38 59, 1995. [6] J. Coughlan, A. Yuille, C. English, and D. Snow. Efficient deformable template detection and localization without user initialization. CVIU, 78(3):303 319, 2000. [7] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry. Springer-Verlag, 1997. [8] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE TC, 22(1):67 92, 1973. [9] M. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, 1980. [10] U. Grenander. Elements of Pattern Theory. Johns Hopkins University Press, 1996. [11] A. Jain, Y. Zhong, and S. Lakshmanan. Object matching using deformable templates. IEEE PAMI, 18(3):267 278, 1996. [12] M. Leventon, W. Grimson, and O. Faugeras. Statistical shape influence in geodesic active contours. In IEEE CVPR, pages 316 323, 2000. [13] L. Prasad. Morphological analysis of shapes. CNLS Newsletter 139, Center for Nonlinear Studies, Los Alamos National Laboratory, July 1997. [14] T. Sebastian, P. Klein, and B. Kimia. Recognition of shapes by editing shock graphs. In IEEE ICCV, pages 755 762, 2001. [15] C. Small. The Statistical Theory of Shapes. Springer-Verlag, 1996. [16] B. Widrow. The rubber mask technique. PR, 5(3):174 211, 1973.