Image Mining by Matching Exemplars Using Entropy


Clark F. Olson
University of Washington, Bothell
Computing and Software Systems
Campus Way NE, Campus Box
Bothell, WA

ABSTRACT

This paper describes a methodology for off-line extraction of content from image and video libraries in order to support complex queries at run-time. The basic technique is to detect matches for exemplars in the images using an entropy-based technique that is invariant to many, and insensitive to others, of the imaging phenomena that can cause appearance-based techniques to fail. The technique has been successful in locating matches for exemplars even when the images are captured with completely different sensors (e.g., CCD versus FLIR). Current matching techniques are efficient for single images, but are likely to be prohibitively expensive for massive databases. We outline a hierarchical processing strategy that we believe has potential for searching such databases.

Categories and Subject Descriptors
H.3.7 [Information Systems]: Information Storage and Retrieval, Digital Libraries

General Terms
Algorithms

Keywords
Data mining, image database, template matching, entropy

The copyright of these papers belongs to the paper's authors. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. MDM/KDD '03, August 27, 2003, Washington, DC, USA.

1. INTRODUCTION

Many current image and video databases are databases in name only. They are unmanaged collections of data, rather than being managed by a database management system (DBMS) that provides a facility for forming and processing complex queries on the data. Considerable work on content-based image retrieval has studied methods to extract general information from images in order to support some queries at run-time [20]. However, the generality of these methods precludes them from supporting specific, complex queries, such as "In which video sequences does Tiger Woods appear after Annika Sorenstam?" or "Which patients have MRI images that contain two tumors of different types?" The primary objective of this work is to develop algorithms and tools that create true databases from collections of multimedia data (specifically images and video sequences) by mining content from the data off-line, so that complex queries can be supported efficiently at run-time.

Shape-based image retrieval is a difficult problem, because the segmentation needed to extract shape information from complex image data is intractable. Extending such retrieval to video gains the benefit (and the curse) of orders of magnitude more data to examine. For this reason, we believe it is likely that the most common method for extracting content from image and video databases will be matching against appearance-based exemplars.

Appearance-based matching has achieved good results in many areas of computer vision, including face recognition [19] and object classification [11]. However, appearance-based techniques face difficulties when objects vary from the expected appearance due to differences in sensor, lighting, configuration, etc. We build upon previous work on robust image matching [13, 14] to extract matches in the database using image exemplars.

Figure 1 shows an example of appearance-based recognition using an exemplar. In this example, a location in an orbital image is detected using an aerial image of the same location, even though the images were captured with very different sensors, at different resolutions, and from slightly different viewing directions.
This example illustrates that successful extraction can be performed even when the exemplar has a different appearance from the image to which it is matched.

The goal of this work is to develop a matching measure that is robust to changing sensors, lighting conditions, and image motion, together with an efficient search strategy for locating good matches according to the measure. For example, this technique could be used to detect all instances of a particular location in a database of geographical images using an exemplar.

Maximization of mutual information [10, 21] has been successful in robust matching. However, it lacks an efficient search strategy for large search spaces. In the example of Figure 1 (in which mutual information fails to find the correct result), matching was performed by transforming both the exemplar and the search image into new images by replacing each pixel with a local entropy measurement. The scoring function for each position of the exemplar examines how well the local entropies are correlated between the exemplar and the corresponding locations in the search image. The optimal match was detected over affine transformations of the exemplar image.

Figure 1: Motivating example. (a) Orbital image of the Avawatz Mountains and Silurian Valley in California. (b) Aerial image showing a detail of (a). (c) Annotated image. The box shows the location of (b) in the orbital image.

2. RELATED WORK

A large body of previous work exists on retrieving images satisfying various characteristics from an image collection [20]. Influential early systems include QBIC [5], Photobook [15], and the Virage engine [1].

The QBIC (Query By Image Content) system uses image frames and objects that have been segmented (manually or automatically) from an image. Images and the objects segmented from them are represented by sets of features describing color, texture, shape, location, and sketch. Queries can be performed on entire images or individual objects using these features.

Photobook similarly allows searches to be performed using extracted image features. Three types of image properties are used: appearance (for example, segmented faces), two-dimensional shape (tools, fish, etc.), and texture (wood, honeycomb, grass). Each of the image properties is represented in the database using a small set of coefficients (30 to 100) from which the original image can be reconstructed with little error. The image properties can also be combined to form more complex searches.

Virage provides an image search engine that has three functional parts: image analysis, image comparison, and management. Preprocessing operations and primitive extraction are performed during image analysis. During image comparison, similarity measures are applied to pairs of primitive vectors that have been extracted from images, and the results are combined using weights. The management function performs initialization and manages the weights and primitive vector data.
Recent systems include NeTra [9], which first segments the image into distinct regions. Queries are performed through indexing using color, texture, shape, and spatial location.

PicHunter [4] uses relevance feedback to predict the image desired by the user given their actions. The overall approach is Bayesian, with an explicit model of the user's actions. PicHunter uses annotated images; however, the annotations are hidden from the user to reduce confusion on the user's part.

PicToSeek [6] uses color and shape invariants to retrieve images from a database. New color models are used to gain invariance to viewpoint, geometry, and illumination. Shape is extracted through color-invariant edge detection. The shape and color invariants are combined into a single high-dimensional feature set for discriminating between objects.

Note that such content-based image retrieval systems invariably require a user in the loop in order to refine queries and locate the desired images in the database. The techniques described here extract content from multimedia databases automatically. User involvement is required only for queries on the database, after the relevant information has been extracted from the multimedia data.

3. MATCHING WITH ENTROPY

The basic technique that we use for detecting matches is to compare the local entropy of an image against that of an exemplar image over some space of relative positions between the images [13]. For a discrete random variable A, with marginal probability distribution p_A(a), the entropy is defined as:

H(A) = -\sum_{a} p_A(a) \log p_A(a).   (1)

Note that 0 log 0 is taken to be zero, since:

\lim_{x \to 0} x \log x = 0.   (2)

We apply an entropy transformation to both the template and the reference image as follows. For each image location (x, y), we examine the intensity values in an image window (of some specified size k × k) centered at (x, y). The intensities are histogrammed and the entropy for the window is computed according to Eq. (1) above.
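As a rough illustration, the entropy transformation described above can be sketched in Python with NumPy. The function name local_entropy, the default window size, the 64-level quantization of 8-bit intensities, the base-2 logarithm, and the edge-replicating border handling are all assumptions of this sketch rather than details confirmed by the text, and any smoothing of the histogram is omitted for brevity.

```python
import numpy as np

def local_entropy(image, k=5, bins=64):
    """Replace each pixel with the entropy of the k x k window centered on it.

    Hypothetical helper: assumes 8-bit intensities (0..255) and an odd k.
    """
    img = np.asarray(image, dtype=np.float64)
    # Quantize intensities into `bins` histogram levels.
    q = np.clip((img / 256.0 * bins).astype(int), 0, bins - 1)
    # Replicate border pixels so every location has a full window (an assumption).
    pad = k // 2
    q = np.pad(q, pad, mode="edge")
    out = np.zeros(img.shape)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            window = q[y:y + k, x:x + k]
            hist = np.bincount(window.ravel(), minlength=bins).astype(np.float64)
            p = hist / hist.sum()
            nz = p > 0  # 0 log 0 is taken to be zero, as in Eq. (2)
            out[y, x] = -np.sum(p[nz] * np.log2(p[nz]))  # Eq. (1), in bits
    return out
```

Under these assumptions, a constant image maps to an all-zero entropy image, while a window straddling an intensity edge yields a positive value. The double loop is written for clarity, not speed; a practical implementation would update the histogram incrementally as the window slides.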
In practice, it is useful to smooth the histograms prior to using Eq. (1). For efficiency, we use a histogram with 64 bins (6 bits of information) and smooth the histogram using a Gaussian (σ = 1.0 bins).

Figure 2 shows an example of the entropy images generated at a variety of scales. It can be observed that the entropy transformation captures the amount of local variation at each location in the image. As the size of the window increases, the local entropies are spread and smoothed over a larger area. This property is useful in the generation of a multi-resolution search strategy.

Figure 2: Entropy images at different scales. (a) Original image. (b) Entropy image for 3 × 3 window. (c) Entropy image for 5 × 5 window. (d) Entropy image for 11 × 11 window. (e) Entropy image for [...] window.

While these images are not invariant to all illumination changes and sensor characteristics, they are insensitive to many issues in multi-sensor matching while retaining much image information. The entropy images are invariant to bias in the original images, and the correlation peaks are invariant to gain in the entropy images. Of course, illumination changes and different sensors cause more complex transformations than these, but we have found that the locations of the correlation peaks change little.

Now, we want to maximize the normalized correlation between the entropy images over the search space. In general, this involves rotating and rescaling the template entropy image according to the pose parameters. (We can also include shearing and other deformations in the search space. Our examples use the six degree-of-freedom affine transformation.) The shape and scale of the template will thus vary over the search space, and this must be accounted for in the normalization. To deal with this in the search for the best position, whenever we generate a rescaled template, we generate another template of the same shape that stores the weight of each position in the correlation.
Most positions will have a weight of one, but positions at the edges of the template will have lower values, since bilinear interpolation does not have four pixels to interpolate from there. This weight template can be used with the Fast Fourier Transform (FFT) to quickly determine the normalization values for each translation of the template in the search strategy described below.

4. TOWARDS EFFICIENT SEARCH

Since multimedia databases can store terabytes of information, it is necessary to use highly efficient strategies for processing the data. In some cases it will be unacceptable for the mining to require days to operate, but it may be acceptable in large databases to mine the data over the course of hours (or overnight).

Current strategies for robust appearance-based matching are computationally intensive. Matching using mutual information [10, 21] is a leading technique for such matching. However, developing a search strategy that is both general and efficient is difficult for this technique.

Our current approach combines fast two-dimensional matching using the FFT, coarse sampling of the remaining pose parameters, and iterative refinement to detect the optimal position(s) [13]. Search over the two degrees of freedom corresponding to translation in the image can be performed very quickly in the frequency domain using the FFT, since this allows cross-correlation to be performed in O(n log n) time, where n is the size of the image. In more complex search spaces, including similarity and affine transformations, we sample the remaining parameters on a grid to ensure that a sample point is close to every point in the search space. For each sample point, the translation parameters are searched with the FFT and strong candidates are further refined using Powell's iterative optimization method [16]. This methodology yields good results for matching in a single image, but it is likely to be prohibitively expensive when matching against a very large database of images and video.
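The translation search described above can be illustrated with a short NumPy sketch. It is only one plausible reading of the text, not the author's implementation: the function names correlate_fft and normalized_correlation, the way the weight template enters the normalization, and the threshold used to skip flat regions are all assumptions made for illustration.

```python
import numpy as np

def correlate_fft(reference, template):
    """Cross-correlate template with reference at every fully overlapping translation."""
    H, W = reference.shape
    h, w = template.shape
    # Multiplying spectra convolves; flipping the template turns that into correlation.
    F = np.fft.rfft2(reference, s=(H, W))
    T = np.fft.rfft2(template[::-1, ::-1], s=(H, W))
    full = np.fft.irfft2(F * T, s=(H, W))
    # Keep only offsets where the template lies entirely inside the reference.
    return full[h - 1:, w - 1:]

def normalized_correlation(reference, template, weights=None):
    """Weighted normalized correlation score for each translation (hypothetical helper)."""
    r = np.asarray(reference, dtype=np.float64)
    t = np.asarray(template, dtype=np.float64)
    m = np.ones_like(t) if weights is None else np.asarray(weights, dtype=np.float64)
    msum = m.sum()
    t_mean = (m * t).sum() / msum
    tz = m * (t - t_mean)                      # weighted, zero-mean template
    t_energy = (m * (t - t_mean) ** 2).sum()
    num = correlate_fft(r, tz)                 # numerator at every translation
    r_sum = correlate_fft(r, m)                # weighted sum of reference under the template
    r_sq = correlate_fft(r * r, m)             # weighted sum of squares under the template
    r_energy = np.maximum(r_sq - r_sum ** 2 / msum, 0.0)
    denom = np.sqrt(t_energy * r_energy)
    score = np.zeros_like(num)
    ok = denom > 1e-9                          # leave flat regions at a score of zero
    score[ok] = num[ok] / denom[ok]
    return score
```

In this sketch the inputs would be the entropy images of the reference and of one transformed template; the peak of the returned array gives the best translation for that sample of the remaining pose parameters. Because the score is normalized, a template that differs from the matching region only by gain and bias still scores close to one.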
A promising technique is guaranteed search for image matching [14, 18]. This method is able to prune large sections of the search space using little computation. These techniques have not yet been applied to image matching using statistical measures, such as mutual information or entropy matching. However, previous work on entropy matching [13] can be framed as a multi-resolution search problem, making such techniques feasible.

Video databases, with both spatial and temporal properties, are even more complicated. One approach that has been suggested [3] is the use of 3D strings in the x, y, and time dimensions. Liu and Chen [8] use three-dimensional pattern matching techniques for this purpose. Their approach scales linearly with the number of video objects.

In order to achieve efficient operations on very large databases, it will be necessary to obtain a sub-linear relationship with the number of video objects. This may be achieved using a hierarchy of video objects. The top of the hierarchy would be coarse, but would allow many video objects to be removed from consideration. Lower levels of the hierarchy would add detail and remove more video objects, until the remaining video objects could be examined explicitly.

This methodology is similar to work in object recognition that uses a hierarchy of positions in the space of object positions (i.e., the pose space) [2, 7, 12]. At the upper levels of the hierarchy, the method tries to prove that each cell of the pose space cannot contain a position meeting the search criteria. If successful for some cell, the entire cell can be removed from the search. Otherwise, the cell is subdivided into smaller cells (lower in the hierarchy) and the process is applied recursively. This technique succeeds since it never prunes a cell that could contain a position meeting the search criteria. Application of similar ideas to video databases has the potential to yield gains in efficiency when dealing with large databases.

5. RESULTS

Figures 3-5 show examples of the application of our method to real images. In Fig. 3, the template is an image captured from a helicopter at an elevation of approximately 800 meters and the reference image is an orbital image that encompasses the same location. The correct location of the template is not easy to detect manually. Successful matching was performed with the affine transformation and the correct location was determined in 11.4 seconds on a 500 MHz workstation with a [...] pixel template and a [...] pixel reference image.

Figure 3: Location detection between aerial and orbital images. (a) Aerial image of California desert. (b) Detected location in an orbital image.

Figure 4 shows an example where a military vehicle was detected in a CCD image using an exemplar from an infrared image. We have also applied these techniques to images from Mars (Fig. 5). Both the template and the reference image were captured with the Imager for Mars Pathfinder (IMP) cameras. The images were captured at different times during the day, so the illumination and the shadowing of the terrain differ between the images. Furthermore, the reference image has undergone a non-linear warping operation to remove lens distortion, while the template has not.
The matching algorithm was able to succeed despite these differences and required [...] seconds with a [...] pixel template and a [...] pixel reference image for the problem in Fig. 5.

Figure 4: Recognition using entropy with an exemplar from a different sensor. (a) Tank exemplar from FLIR image. (b) Extracted tank location from a CCD image.

Figure 5: Recognition using images taken with different filters and with different illumination. (a) CCD image of a Martian rock. (b) Detected location of rock in a CCD image with a different lens filter.

6. FUTURE WORK

In addition to research to find highly efficient search strategies for matching against exemplars, we have identified two other important areas for future work: highly robust tracking in video images and grouping of extracted objects.

6.1 Tracking

Tracking objects in video is easier than detecting or recognizing them. However, highly robust methods are necessary to maintain query integrity. It is unacceptable for the tracking procedure to be fooled into locking onto a background object or to drift off the correct target. Since the tracked object will change appearance over the course of the video sequence, the template used to track the object must be periodically (or continually) updated. However, each time the template is updated, the target on which the template is centered drifts slightly from the desired position. For the tracking to be successful over a long sequence, localization must be performed with high precision each time the template is updated.

One approach [17] is to compute the motion not just relative to the previous frame, but relative to a set of previous frames. We intend to strengthen this approach using a robust template matching framework and high-precision subpixel estimation techniques [14].

Another important modification is to update the target template only when necessary, in order to reduce the opportunities for drift. The decision of when to update the target template can be made using a quality-of-match measure between the current template and video frame. For example, we may decide that the template must be updated when an estimate of the standard deviation in the localization or the probability of failure has risen above a pre-determined value.

6.2 Grouping

If objects are mined from video using exemplars, then the identity of the objects is known and they are trivially grouped according to the exemplar that generated them. On the other hand, techniques that do not use exemplars can be integrated into this framework. In addition, techniques for applications such as face recognition may use general face exemplars, without distinguishing between faces belonging to different people.

For cases where there is no obvious partitioning of the detected objects, we wish to classify the objects according to some meaningful characteristic; in a video containing people, we should be able to detect cases where a particular face leaves the image sequence and subsequently returns. This requires grouping the detected face instances according to the appearance of each face, rather than classifying them as different objects.

Our initial approach to matching separate face images, while allowing differences in lighting and affine transformation, uses techniques similar to those used for extraction with exemplars [13, 14]. With this methodology, the extracted objects (such as faces) are treated as exemplars for matching against other extracted objects. For highly non-planar objects, this approach can be extended to use shape information inferred from the object class and image.
For example, the three-dimensional structure of a face can be well approximated using a small number of parameters extracted from an image of the face. This shape information can be used to improve the robustness of matching between two images of a face observed from different viewpoints, since a non-linear image warping that makes use of the shape information brings the images into better alignment.

7. SUMMARY

Our goal in this paper is to demonstrate that data can be mined from image and video databases by performing matching against exemplars. The technique that we use for matching is based on comparing the local entropies in the images at some relative position that can encompass translation, rotation, scaling, and skew. This allows matches to be detected even when the images are taken by different sensors and from different viewpoints. Search strategies for performing matching are currently efficient only for small sets of images. We have suggested techniques for future work that may lead to efficient search in large databases.

8. REFERENCES

[1] J. R. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphry, R. Jain, and C.-F. Shu. Virage image search engine: An open framework for image management. In Storage and Retrieval for Image and Video Databases, Proc. SPIE 2670, pages 76-87.
[2] T. M. Breuel. Fast recognition using adaptive subdivisions of transformation space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[3] S.-F. Chang, W. Chen, H. J. Meng, H. Sundaram, and D. Zhong. VideoQ: An automated content based video search system using visual cues. In Proceedings of the ACM Multimedia Conference.
[4] I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos. The Bayesian image retrieval system, PicHunter: Theory, implementation and psychophysical experiments. IEEE Transactions on Image Processing, 9(1):20-37.
[5] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. The QBIC system. IEEE Computer, 28(9):23-32, Sept.
[6] T. Gevers and A. W. M. Smeulders. PicToSeek: Combining color and shape invariant features for image retrieval. IEEE Transactions on Image Processing, 9(1), Jan.
[7] D. P. Huttenlocher and W. J. Rucklidge. A multi-resolution technique for comparing images using the Hausdorff distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[8] C.-C. Liu and A. L. P. Chen. 3D-List: A data structure for efficient video query processing. IEEE Transactions on Knowledge and Data Engineering, 14(1).
[9] W.-Y. Ma and B. S. Manjunath. NeTra: A toolbox for navigating large image databases. Multimedia Systems, 7(3).
[10] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens. Multimodality image registration by maximization of mutual information. IEEE Transactions on Medical Imaging, 16(2), Apr.
[11] H. Murase and S. K. Nayar. Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision, 14:5-24.
[12] C. F. Olson. A probabilistic formulation for Hausdorff matching. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[13] C. F. Olson. Image registration by aligning entropies. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2.
[14] C. F. Olson. Maximum-likelihood image matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(6), June.
[15] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Content-based manipulation of image databases. International Journal of Computer Vision, 18(3), June.
[16] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. Cambridge University Press.
[17] A. Rahimi, L.-P. Morency, and T. Darrell. Reducing drift in parametric motion tracking. In Proceedings of the International Conference on Computer Vision, volume 1.
[18] W. J. Rucklidge. Efficient guaranteed search for gray-level patterns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[19] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[20] R. C. Veltkamp and M. Tanase. Content-based image retrieval systems: A survey, revised from technical report UU-CS, Dept. of Computing Science.
[21] P. Viola and W. M. Wells. Alignment by maximization of mutual information. International Journal of Computer Vision, 24(2), 1997.
