Transactions on Information and Communications Technologies vol 16, 1996 WIT Press, ISSN


Shape-invariant object detection with large scale changes

John DeCatrel
Department of Computer Science, Florida State University, Tallahassee, FL 32306-4019
EMail: decatrel@cs.fsu.edu

Abstract

This paper reports extensions to an innovative object detection and pose estimation method for non-analytic shapes, potentially useful to machine or robotic vision systems. Recently we demonstrated a shape-invariant recognition technique based upon the generalized Hough transform that is invariant to large planar changes in object position and rotation, as well as small changes in scale. The method works even for cases where objects are moderately occluded. It is of comparatively low complexity, O(n²), where n is the edge length in pixels. The original technique has now been extended to detect and report large scale changes and multiple instances of the transformed prototype. Preliminary results have been obtained and are demonstrated. These improvements do not increase the original time complexity. Hallmarks of the original technique include the following. A two-stage process first hypothesizes target locations. The second stage (visual confirmation) reports estimated pose (position and orientation) with respect to a prototype model. The calculation of rotationally invariant R-table indices uses polar edge pixel pairs: pairs of pixels found on a line normal to an object's edge. To incorporate the detection of moderate scale changes, distances between all valid edgel polar pairs in the image are acquired. Each distance is compared to a corresponding distance in the prototype and stored as evidence of a scale change. For each target instance, a modal average is the reported scale. Multiple targets are found by adaptively setting a threshold in Hough parameter space. This threshold responds to an inference about object perimeter size, i.e., an edge pixel count.

1 Introduction

One useful machine vision objective is the automatic detection of objects that have been transformed by changes in perspectivity. Suppose that an object model is stored by the vision system, perhaps in a model library. One or more instances of the object may appear in a digitized image, acquired by a video camera. With increasing distance, the two-dimensional image is scaled down due to perspective projection. For sufficiently long camera-to-object distances, this transformation can be accurately approximated by isotropic scaling.

The technique and implementation described herein extend our recently reported method to identify objects that are moved and rotated in the plane with respect to a canonically posed prototype [1]. Such apparent changes may also be effected by a change in viewpoint rather than object movement. The extension allows much greater scale changes than previously reported. Furthermore, multiple instances of target objects are now allowed, and these can be individually transformed. Tolerance to moderate amounts of occlusion is preserved. Note that this technique adds scale and rotation invariance to the generalized Hough transform's (GHT) position invariance. This is performed at a lower time complexity than the few previously reported methods of comparable capability. Performance degrades as occlusion, scale change, and scene complexity increase, but it is sufficiently good for many object detection applications found in vision research and engineering environments.

A two-stage process first hypothesizes target locations. The second stage (visual confirmation) reports estimated pose (position and orientation) and scale with respect to a prototype model. The second stage efficiently reuses software code from the first stage. Unlike some other GHT methods, no specialized searching of the parameter (voting) space is needed for many applications. Adaptive thresholding provides for the location of multiple, individually transformed object instances automatically. The method has been implemented with close attention to computational efficiency.

1.1 Assumptions

Target detection assumes pre-processing to obtain decent binary images [2]. We proceeded with this limitation to make the problem more tractable, and various binarization techniques are readily available. One tradeoff in using binary edges is that quantization of the edge gradient direction can be much poorer. In most gradient edge detectors, gray-level changes along vertical and (separately) horizontal directions provide for many possible gradient directions. This problem was resolved to a sufficient degree by computing the tangent of non-analytic contours over a small neighborhood of pixels for each edgel. (In this paper, edgel is restrictively defined to mean edge pixel.)
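For concreteness, the following minimal sketch shows one readily available way to produce such a binary edge map: thresholding a central-difference gradient magnitude. It is an illustrative stand-in only; the pre-processing actually used follows [2], and the threshold value here is an arbitrary assumption.

import numpy as np

def binary_edge_map(gray, thresh=40.0):
    # Crude binary edge map: central-difference gradient magnitude above a cutoff.
    # 'gray' is a 2D grayscale array; 'thresh' is an illustrative value only.
    # A real pipeline (e.g., the DRF detector of [2]) would also smooth and thin.
    gray = gray.astype(float)
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = (gray[:, 2:] - gray[:, :-2]) / 2.0   # horizontal gradient
    gy[1:-1, :] = (gray[2:, :] - gray[:-2, :]) / 2.0   # vertical gradient
    return (np.hypot(gx, gy) > thresh).astype(np.uint8)  # 1 = edgel, 0 = background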

One complication is that internal object edges are undifferentiated from hull edges. Some internal edges and edgels can be negotiated with little difficulty. However, extensive internal edges, especially as would arise from internal reflection interactions or sudden surface variations, should be eliminated by pre-processing. A moderate amount of random noise is thwarting and should also be pre-filtered. We presume that these limitations are implicit in the comparative methods as well, which are mentioned later. To increase the utility of our software, it is advisable to obtain or design good pre-processing methods; the objective is to obtain useful, low-noise silhouettes automatically. Confirmation of highly cluttered or occluded objects might proceed by employing reasoning or knowledge about object prototypes and domain scenes.

2 Methodology

2.1 Overview

This GHT begins as usual by constructing an R-table from a prototype object. Very similar descriptions apply to building the table and to using the table during a detection trial. Essentially, the R-table is a lookup table wherein each edge point can find where a fixed reference point (L) is located with respect to that edge point. To index into the table, an edgel must first find an opposite (polar) edgel in the pixel's normal direction. The difference of the corresponding edgel-pair arctangents provides a rotationally invariant index. The table stores a vector to L. Often, several polar-pair edgels have the same arctangent difference value; then several row entries must be stored, one entry for each edgel pair. To decrease sensitivity to finding the exact same edgel pair at construction time and at trial time, we index into several neighboring rows of the R-table, to examine a small range of tangent differences. In a standard GHT scheme, each edgel would be required to vote for every entry in each row indexed. Hence, to suppress excessive voting, each entry in each indexed row is examined by secondary indices: local edge curvatures at the polar points. Only entries that match these additional features within a reasonable range are used as votes.

To make the method scale invariant, we make opportunistic use of data that has previously been determined. For example, upon determining a polar pair, the distance between its members is then known; the distance is known for the image instance(s) as well as the model prototype. Hence, for each edgel, distance ratios between model and image instance can be used as evidence for scale change. Specifically, each edgel votes for one or more L locations that have been scaled by the intra-pair distance ratio. Note that the curvature values in the R-table used for secondary indexing are scaled as well. This tacitly assumes that curves are approximately circular arcs.
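To make the two key quantities of this overview concrete, the short sketch below (Python, with illustrative names; not the authors' implementation) computes the rotation-invariant row index from the arctangent difference of a polar pair, and the intra-pair distance ratio used as evidence of scale change.

def rtable_row(theta_a_deg, theta_b_deg, n_rows=180):
    # Rotation-invariant R-table row index for a polar edgel pair.
    # theta_a_deg and theta_b_deg are the tangent (arctangent) angles of the two
    # edgels in degrees, defined on [0, 180). Rotating the whole shape adds the
    # same amount to both angles, so their difference modulo 180 is unchanged.
    diff = (theta_a_deg - theta_b_deg) % 180.0
    return int(round(diff)) % n_rows          # one-degree rows, as in the paper

def scale_evidence(rho_image, rho_model):
    # Ratio of image intra-pair distance to prototype intra-pair distance.
    # A value of 0.5, for instance, suggests a target at half the prototype scale.
    return rho_image / rho_model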

To detect multiple target instances, a threshold is set in the Hough parameter/voting space. This threshold is set automatically by comparing the vote count at a candidate L point to the expected number of voters, i.e., the instance edgel count, which is the same as the perimeter size for unoccluded objects. For each target, the expected vote count varies linearly with scale change, the latter having been computed during the pose estimation stage. Local maxima above the threshold are reported as transformed targets.

2.2 Building the R-table

Figure 1: (a) Geometry and angle definitions for R-table construction; an edgel polar pair at points A and B. (b) Vote cast by one edgel in a scaled, rotated object. Although the arctangents of a polar pair change, their angular difference is constant.

Many real shapes will provide a sufficient number of polar pairs for instance detection, even when moderately occluded. A binary edge image is input into the tangent-finding routine. For each edgel, a tangent value is calculated and stored as an image, registered with the input. Incident and exit positions of edges are obtained over a small local neighborhood with sub-pixel accuracy [1]. A second pass smooths neighboring tangent values, while obtaining arctangents from a lookup table. In our current implementation, the angle value can range from 0 to 179 degrees. At the same time, the degree of curvature (κ) in a tangent neighborhood is determined; this routine is similar to tangent-finding, with the tangent image used as an input.
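The tangent routine above relies on sub-pixel incident and exit positions [1]. As a rough, hedged stand-in for that idea, the sketch below estimates an edgel's tangent by a principal-axis (least-squares) fit over the edge pixels in a small window, quantized to one-degree resolution on [0, 180). The window size and the fitting method are illustrative choices, not the published routine.

import numpy as np

def local_tangent_deg(edges, y, x, half=3):
    # Estimate the tangent angle (degrees, in [0, 180)) at edgel (y, x) from the
    # principal axis of the edge pixels inside a (2*half+1) x (2*half+1) window.
    # This is a crude substitute for the sub-pixel incident/exit method of [1].
    win = edges[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1]
    ys, xs = np.nonzero(win)
    if len(xs) < 2:
        return None                                   # not enough support to fit
    pts = np.stack([xs - xs.mean(), ys - ys.mean()], axis=1)
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    dx, dy = vt[0]                                    # principal direction
    return int(round(np.degrees(np.arctan2(dy, dx)) % 180.0)) % 180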

A location or reference point (L) is chosen for the prototype shape. In our trials, the location point was simply the midpoint of the maximum and minimum vertical and horizontal (x, y) extents of the object. During the target detection phase, votes by edgels should accumulate at a point corresponding to L (in registered parameter space), even when the test image contains a scaled, rotated, and partially occluded prototype.

For each edgel's corresponding tangent, the parameters of a line normal to the tangent are found. Using the midpoint version of the Bresenham line drawing technique [3], pixels are visited along the line's locus until another edgel is encountered (forming a polar pair), or the visit goes beyond either the x or y limits of the image (figure 1, a). If a pair is formed, the intra-pair distance (ρ) is recorded. The difference between the two arctangents of a polar pair (Δθ) is a rotation-invariant index into the R-table. In the referenced figure, denote by θ_A the arctangent at point A in degrees, and by φ the angle between the x-axis and the line AL. The angle between the edgel's tangent and its vector to L, α, is then easily calculated. We have quantized tangent resolution to one degree; therefore the R-table has 180 rows and a variable number of columns. R-table record entries, then, are 6-tuples (θ_A, α, λ, κ_A, κ_B, ρ). Recall that θ_A is the tangent at point A; α and λ are the direction and distance to L, respectively; κ_A and κ_B are the local curvatures at edge point A and the corresponding polar point B; and ρ is the intra-pair distance. The maximum number of entries in the table is n, the number of edgels in the edge image. The actual number is typically smaller than n, which reflects the redundancy of edge attributes found in typical imaged shapes.

2.3 Processing the target image

The target image is first processed with the edge-producing and tangent routines. For each non-zero pixel, the tangent and normal are found. A line of pixels is visited again, in the two normal directions from the edgel. Should a non-zero pixel be encountered, its tangent is found. The difference angle (Δθ) is calculated, the intra-pair distance (ρ') is noted, and the curvature measures are recorded. Δθ selects a row of entries from the R-table. For each entry that is found, the two curvature measures are compared for reasonability, i.e., whether the values fall within a pre-specified range. For scale invariance, the curvatures in each entry are scaled by ρ'/ρ; this operation tacitly assumes that curves are approximated by circular arcs. If the scaled curvature values are reasonable, the location point angle (α) is read from the table. The distance to L', the image of L, is calculated as λ' = (ρ'/ρ)λ. α and λ' are used to vote for the position of L'. Some edgels in the target image will produce several votes, due to false polar-pair matching. With infrequent exceptions, only one vote per edgel will register at or near the location point (L'). Hence, L' should receive nearly as many votes as there are object edge pixels in the target image, provided that there is indeed a match between the prototype shape and the target shape. This is employed to automatically set a threshold in the voting space.
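The vote-casting step just described can be summarized in the following hedged Python sketch (illustrative data structures and tolerances; the published implementation is integer-only C and walks the normal with the midpoint Bresenham technique of [3]). For one target edgel whose polar partner has been found, the rows near Δθ are consulted, the scaled curvatures are checked for reasonability, and a scaled vote is cast for L'.

import math
from collections import defaultdict

def cast_votes(accumulator, rtable, edgel_xy, theta_img, delta_theta, rho_img,
               kappa_img_a, kappa_img_b, curv_tol=0.2, row_spread=1):
    # accumulator: dict mapping integer (x, y) locations to vote counts.
    # rtable:      length-180 list; rtable[row] holds prototype entries
    #              (theta_a, alpha, lam, kappa_a, kappa_b, rho) from section 2.2;
    #              theta_a is carried here but used only in the pose stage (2.4).
    # theta_img:   tangent angle (degrees) at the target edgel.
    # delta_theta: arctangent difference of the target polar pair (integer row).
    # rho_img:     intra-pair distance rho' measured in the target image.
    # curv_tol and row_spread are illustrative tolerances, not the paper's values.
    x, y = edgel_xy
    for row in range(delta_theta - row_spread, delta_theta + row_spread + 1):
        for theta_a, alpha, lam, kappa_a, kappa_b, rho in rtable[row % 180]:
            s = rho_img / rho                          # scale evidence rho'/rho
            # secondary indices: entry curvatures scaled by rho'/rho (section 2.3)
            if abs(kappa_img_a - s * kappa_a) > curv_tol:
                continue
            if abs(kappa_img_b - s * kappa_b) > curv_tol:
                continue
            lam_img = s * lam                          # lambda' = (rho'/rho) * lambda
            ang = math.radians(theta_img + alpha)      # assumed convention: direction to L' = tangent + alpha
            lx = int(round(x + lam_img * math.cos(ang)))
            ly = int(round(y + lam_img * math.sin(ang)))
            accumulator[(lx, ly)] += 1                 # one scaled vote for L'

# Usage sketch: acc = defaultdict(int); call cast_votes once per target edgel that
# has a polar partner, then report local maxima of acc above the adaptive threshold.

For brevity the sketch ignores the 180-degree ambiguity of the tangent; in practice both candidate directions to L' would need to be tried.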

If edges are significantly occluded, then this automatic threshold setting fails, unless additional information is allowed. For example, the user might specify that expected target perimeters may be P% occluded; in that case, threshold setting would be considered less than fully automatic.

Tangent calculations necessarily incorporate some imprecision, largely due to the limited resolution of typical digitized targets. Hence valid votes for L' may occur over a small neighborhood. Therefore we sum votes over a small neighborhood of the parameter space, which effectively narrows and enhances peaks.

Rotational invariance in the plane is an objective and result of the method. The normal to the tangent at any point on the edge of a shape should go through the corresponding polar point. This is true no matter to what degree the shape is rotated, assuming minimal object distortion due to image digitization or digital rotation algorithms. Invariance to isotropic scaling is also achieved, which is a common approximation of perspective projection (weak perspectivity).

2.4 Pose estimation

The voting process described in section 2.3 is repeated in this, the second, stage. However, instead of casting votes, an edgel arctangent (β_i) that would have voted exactly for L' indexes the R-table. θ_A is retrieved, and the difference between β_i and θ_A is an estimate of the target rotation angle. The rotation estimates obtained from all β_i are stored in a histogram, and the peak histogram value is the reported rotation angle. Concurrently, for each β_i, the target intra-polar-pair distance (ρ') is calculated, and ρ is retrieved from the R-table. The measure ρ'/ρ is stored in a separate histogram, and that peak is the reported scale.
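A hedged sketch of this pose-estimation pass (Python; the tuples of contributing edgels are assumed to have been collected during the first stage, and the bin widths are illustrative):

from collections import Counter

def estimate_pose(contributors, angle_bin_deg=1.0, scale_bin=0.05):
    # contributors: list of (beta_i, theta_a, rho_img, rho_model) tuples, one per
    # edgel that voted for the detected reference point L'.
    # Returns the modal rotation (degrees) and modal scale, as in section 2.4.
    if not contributors:
        return None, None
    rot_hist, scale_hist = Counter(), Counter()
    for beta_i, theta_a, rho_img, rho_model in contributors:
        rot_hist[round(((beta_i - theta_a) % 360.0) / angle_bin_deg)] += 1
        scale_hist[round((rho_img / rho_model) / scale_bin)] += 1
    rotation = rot_hist.most_common(1)[0][0] * angle_bin_deg
    scale = scale_hist.most_common(1)[0][0] * scale_bin
    return rotation, scale

(The modulo-360 reduction glosses over the 180-degree ambiguity of tangent angles; resolving it requires additional bookkeeping that is omitted here.)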

3 Performance

Two important engineering concerns in coding are the size of memory required and the speed of computation. Although there is often a tradeoff between the two, in our method they are almost entirely independent of one another [1]. Denote by N the image size in pixels and by n the prototype object edgel count. Disregarding a small overhead, the data-space memory requirement in bytes is 4N + 7n: two bytes are allotted for each of the input and output image pixels, and seven bytes are required per R-table entry.

Speed of computation is dictated by the nature of the method and by the implementation. Our implementation is coded in C, and it employs no floating-point calculations and no divisions. While this speeds up many uniprocessor platforms, it should also prove useful for porting routines to some specialized parallel processors. Tangents and arctangents are calculated by lookup tables that have integer entries. The Bresenham line technique involves only additions and comparisons. Distance calculations do require some integer multiplications.

The algorithm's computation time is O(n²), where n is the number of edgels in the target image. For a typical case, the time for each stage is

t = n(t_0 + p·t_p + r·t_r + r'·v·t_v) + N·t_N,

where the symbols are defined as follows:

t_0   general overhead for processing an edgel
p     the number of pixels visited along the line that searches for a polar-pair edgel
t_p   the time to visit a pixel on that line
r     the average number of R-table entries per point
t_r   the processing time required to determine whether a table match is to be used for voting
r'    the average number of R-table entries per edgel that will vote
v     the average number of (parameter space) pixels required for each voting line
t_v   the time to add a vote in the voting line
n     the number of edgels in the target image
N     the number of pixels in the target image
t_N   the time to visit a pixel while raster scanning

Hence, the time for both stages (target detection and pose estimation) is approximately 2t. Note that N·t_N is the time to raster scan an input image. For most real images, it is small enough to be neglected compared to edgel processing; for example, in a 512 x 512 pixel image, N·t_N contributes about 5% to t. The values of p and v depend upon the size of the target object's image and the amount of scaling being allowed, and each is O(n). The values of r and r' are shape dependent; for a boundary case, the circle, both will be close to one. Experimentally we find r to be about 4 for random cases. Currently a polar-pair match is sought in both normal directions of an edge point; it should be feasible to first determine which direction points towards the object's interior.

Once the edges and tangents are calculated, the remainder of the process can be carried out by an arbitrary number of processors, as long as they have either individual or common access to the R-table, the target image, and the output (vote accumulator) image. The order in which the edgels of the target image are processed does not matter.
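As a concrete illustration of the storage figure at the start of this section (a worked example only; the prototype edgel count is assumed here to be of the same order as the roughly 5 × 10³ edgels reported for the Figure 2 input), a 512 × 512 input image requires about one megabyte of data space:

\[
4N + 7n \approx 4(512 \times 512) + 7(5 \times 10^{3})
            = 1\,048\,576 + 35\,000
            \approx 1.08 \times 10^{6}\ \text{bytes}.
\]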

4 Experimental results

Two examples are described to illustrate the technique's capability. Figure 2 describes an attempt to locate multiple instances of the single-hole key. The input is about 0.25 × 10⁶ pixels and contains about 5 × 10³ edgels. In figure 3, the test image is composed of clip-art silhouettes. The bear shape is easily found, even when about half of its area is occluded. The occlusion is a deformation which might arise from shadowing as well as from other objects (in this case, a cat shape). The input image is about 150 × 10⁶ pixels, and the edgel count is about 3 × 10³.

Our method yields the best results for irregular shapes, i.e., ones with a variety of difference angles. It does work for more regular shapes, but these tend to yield less pronounced peaks [1]. The method is useful for images that have incomplete edge patterns, since no edge following is used. The method fails for targets that have one side completely obscured: in order to perform a lookup in the R-table, two points are needed, and if a large number of pixels have no matching pixels along their normals, then there will not be sufficient information available for detection.

5 Related research

The generalized Hough transform is a well-known method for detecting arbitrary shapes, first proposed by Ballard [4]. Extensions to the GHT have been proposed more recently to include scale and rotation invariance. Davies [5] provides a thorough discussion of the GHT and reviews some proposed ways to lower the GHT's computational complexity from brute force (O(N⁴)) when scale or rotation invariance is incorporated; here, N is one dimension of the Hough parameter (voting) space. That discussion is largely limited to the processing of analytic shapes, i.e., shapes described by closed-form elementary functions. More recently, a few researchers have described methods to implement a scale- and rotation-invariant GHT [6], [7]. Each method suffers from one or more limitations compared to ours: higher time complexity, much less computational efficiency, or no tolerance of occlusion. These methods are more fully summarized in [1].

6 Conclusion

We have described a GHT object detection and pose estimation method that is invariant to moderate changes in object position, orientation, and scale. The method is of comparatively low time complexity (O(n²)). Our algorithm is distinguished from other methods in several ways, resulting in a software tool that we hope may be practical and useful within myriad vision research and engineering settings. For example, the method is not thwarted by a fair amount of object occlusion. Multiple object instances can be found at once. The implemented software pays close attention to computational efficiency. Initial results seem promising; however, more experiments are required to quantify performance and to characterize performance degradation with image complexity.

References

1. Weinstein, L. and DeCatrel, J. Scale- and rotation-invariant Hough transform for fast silhouette object detection. CS Tech. Rept. no. 95-051, Dept. of Computer Science, Florida State University, 1995.
2. Shen, J. and Castan, S. Further results on DRF method for edge detection. Proc. 9th Int'l Conf. Pattern Recognition, Rome, 1988.
3. Foley, J. et al. Computer Graphics: Principles and Practice, pp. 74-78, Addison-Wesley, Reading, MA, 1990.
4. Ballard, D. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 1981, 13, 111-122.
5. Davies, E. R. Machine Vision: Theory, Algorithms, Practicalities. Academic Press, San Diego, 1990.
6. Ser, P. K. and Siu, W. C. Non-analytic object recognition using the Hough transform with the matching technique. IEE Proc.-Comput. Digit. Tech., 1994, 141(1), 11-16.
7. Jeng, S. C. and Tsai, W. H. Scale- and orientation-invariant generalized Hough transform: a new approach. Pattern Recognition, 1991, 24(11), 1037-1051.


Figure 2: (a) Prototype silhouette of a digitized key; the localization point (L) is marked by +. (b) Input image containing two partly occluded, transformed target instances. (c) Pose module reports one instance at -25 degrees rotation, scaled at 0.8; another instance is reported at 177 degrees, scaled at 0.5. The algorithm has superimposed its pose estimates on the input edge image. (d) Elevation plot (histogram) of the Hough parameter space.

Figure 3: (a) Clip-art shapes, including a bear prototype (upper left) and another instance with 50% of its area occluded (by a cat shape). (b) Plot of votes, where the two highest peaks correspond to the two instances.