Using Redundant Bit Vectors for Near-Duplicate Image Detection


Jun Jie Foo and Ranjan Sinha
School of Computer Science & IT, RMIT University, Melbourne, Australia, 3001

Abstract. Images are amongst the most widely proliferated forms of digital information due to affordable imaging technologies and the Web. In such an environment, the use of digital watermarking for image copyright infringement detection is a challenge. For such tasks, near-duplicate image detection is increasingly attractive because it allows automated content analysis; moreover, its application domain extends to data management. The application of PCA-SIFT features and Locality-Sensitive Hashing (LSH) for indexing and retrieval has been shown to be highly effective for this task. In this work, we prune the number of PCA-SIFT features and introduce a modified Redundant Bit Vector (RBV) index. This is the first application of the RBV index that shows near-perfect effectiveness. Using the best parameters of our RBV approach, we observe an average recall and precision of 91% and 98%, respectively, with a query response time of under 10 seconds on a collection of 20,000 images. Compared to the baseline (the LSH index), the RBV index answers queries 12 times faster and is 126 times smaller. Compared to a brute-force sequential scan, the RBV index rapidly reduces the search space to 1/80.

Keywords: Near-duplicate Image Detection, Redundant Bit-Vectors, RBV.

1 Introduction

Many digital images available on the Web are copies or variants of each other; these include the scaled-down thumbnails kept by web search engines and differing versions of a single image made available by different news portals. Online images can be appropriated without acknowledgment of the source and, accidentally or otherwise, disguised through simple processing.
Common modifications include conversion to greyscale, change in color balance, rescaling, rotating, cropping, and filtering operations. For reasons such as copyright infringement detection [14] and collection management [5], it is attractive to identify such variants (near-duplicates) with a reasonable degree of reliability.

In recent work, Qamra et al. [14] propose the Perceptual Distance Functions (PDF) for near-duplicate detection using color and texture image features. However, only mediocre effectiveness is observed when these functions are used with approximate indexing structures such as the Locality-Sensitive Hashing (LSH) index [6]. To our knowledge, the highest reported accuracy of a near-duplicate image detection system is that of Ke et al. [11] (henceforth referred to as the KSH system), which uses the PCA-SIFT descriptors [10] (an extension of the SIFT interest points [12]) and the LSH index. Near-perfect precision is reported for retrieval of various altered images (cropped, scaled, rotated, and subjected to various affine transformations). While high precision has been reported, scalability has not been explored, and the reported interactive query response times are observed on a tiny image collection. In this work, we use the KSH system as the baseline.

In this paper, we propose pruning strategies on the SIFT interest points (characterized by the PCA-SIFT local descriptors) along with the use of modified Redundant Bit Vectors (RBV) [8] for this application domain. While RBV has been reported to be efficient for negative queries on audio fingerprint detection, it has yet to be demonstrated to work for positive queries. Here, we propose a novel scheme that extends the RBV index for positive queries. Using our proposed scheme on pruned PCA-SIFT descriptors, we report an almost lossless level of effectiveness for this particular application. We also demonstrate that this index can be highly compact compared to the LSH index (as used in the KSH system).

R. Kotagiri et al. (Eds.): DASFAA 2007, LNCS 4443. © Springer-Verlag Berlin Heidelberg 2007

2 Distinctive Interest Points

Given an image, the idea of local descriptors is to detect image regions (centered around interest points) that possess properties invariant to geometric variation and photometric changes, so that distinctive local descriptors can be computed for each region [9,13]. In this work, we use the popular SIFT (scale invariant feature transform) detector [12], which has been demonstrated to outperform existing detectors [13].
We apply the PCA-SIFT descriptors on the SIFT interest points instead of the original SIFT descriptors, as they are shown to be both highly distinctive [10] and highly effective for near-duplicate image detection [11]. In this work, for convenience, the SIFT keypoint detector and the PCA-SIFT local descriptor are together referred to as a PCA-SIFT feature.

There are four major stages of computation in the SIFT detector for extracting a set of image features, namely scale-space extrema detection, keypoint localization, orientation assignment, and generation of the keypoint local descriptor. In the first phase of the SIFT keypoint detector, the difference-of-Gaussian (DoG) function is used to identify candidate points in various locations and scales using a Gaussian pyramid. This is achieved by finding local peaks (keypoints) in a series of DoG images. In the second phase, poorly localized and unstable keypoints below a threshold level are rejected. After all stable keypoints are identified, each keypoint is assigned a dominant orientation for rotation invariance in the third phase. Additional keypoints are generated if there are multiple orientations within an 80% threshold of the dominant orientation; thus, there can be multiple keypoints with identical scale and location but different orientations.

The PCA-SIFT descriptors are computed using the same information as the original SIFT descriptor, that is, location, scale, and dominant orientations. PCA-SIFT concatenates the horizontal and vertical gradient maps for the 41 × 41 region centered around the keypoint, rotated to align its orientation to a canonical direction, to produce a 2 × 39 × 39 = 3042 element local descriptor (feature vector). In cases where there are multiple dominant orientations, a separate vector is calculated for each. Each vector is then projected using principal component analysis, a common technique for dimensionality reduction, to a low-dimensional feature space using a pre-computed eigenspace (see footnote 1). Ke et al. [11] have empirically determined that a feature space of n = 36 dimensions for the local descriptor performs well for near-duplicate image retrieval, wherein any two PCA-SIFT local descriptors are deemed similar (a match) if they lie within a fixed Euclidean distance ($L_2$-norm) threshold. In this work, we use the same settings.

The problem of applying PCA-SIFT features is that each image consists of hundreds to thousands of high-dimensional local descriptors, and a reliable match between two images requires at least 3 to 5 descriptor matches [12]. The KSH index uses Locality-Sensitive Hashing (LSH) [6] for indexing these PCA-SIFT features. We refer interested readers to the work of Ke et al. [11] for further discussion.

3 Keypoint Reduction

Querying in high-dimensional space is a challenging problem due to the curse of dimensionality [1]; this is further amplified because the evaluation of a query image using PCA-SIFT features requires multiple point matches in high-dimensional space, simulating multiple point queries. We observe an average of 1,400 keypoints per image for our image collection, similar to the reported average of 1,100 in the work of Ke et al. [11]. In practice, using PCA-SIFT features, each image can generate from a few hundred up to a few thousand local descriptors of 36 dimensions each, depending on the complexity of the image. Hence, reducing the number of keypoints that SIFT generates per image is key to a scalable system.
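The matching criterion described above (descriptors within an $L_2$ threshold, and a handful of descriptor matches per image pair) can be sketched with a brute-force comparison. This is a minimal illustration only: the function names are hypothetical, the threshold of 3,000 is the upper end of the range quoted elsewhere in the paper, the minimum of 3 matches is the lower end of the "3 to 5" guideline, and the geometric verification used by the actual systems is omitted. The toy descriptors are 2-dimensional for readability rather than 36-dimensional.

```python
import math

L2_THRESHOLD = 3000.0  # assumed: upper end of the 2,200 to 3,000 range quoted in the paper
MIN_MATCHES = 3        # assumed: lower end of the "3 to 5 descriptor matches" guideline


def l2(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def count_descriptor_matches(query_desc, cand_desc, threshold=L2_THRESHOLD):
    """Count query descriptors with at least one candidate descriptor within threshold."""
    return sum(
        1
        for q in query_desc
        if any(l2(q, c) <= threshold for c in cand_desc)
    )


def images_match(query_desc, cand_desc, threshold=L2_THRESHOLD, min_matches=MIN_MATCHES):
    """Declare two images near-duplicates if enough descriptors match."""
    return count_descriptor_matches(query_desc, cand_desc, threshold) >= min_matches


# Toy usage with 2-D stand-ins for PCA-SIFT descriptors.
query = [[0, 0], [10000, 0], [0, 10000], [20000, 20000]]
candidate = [[100, 100], [10000, 500], [0, 9000]]
print(count_descriptor_matches(query, candidate))  # 3
print(images_match(query, candidate))              # True
```

The nested loop is quadratic in the number of descriptors, which is exactly the cost that an index such as LSH or RBV is meant to avoid.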
Given that the SIFT interest point detector was originally proposed as a distinctive feature for matching objects or image scenes with high variance [12], it is apparent that all keypoints are required for robust matching in that setting. We hypothesize that near-duplicate image detection requires only a subset of the keypoints, as we only consider images that are derived from the same source, where the level of variance is limited.

In the second stage of the SIFT keypoint detector algorithm, Lowe [12] has empirically observed that a contrast threshold value of 0.03, used to eliminate keypoints with low contrast, yields good results. This is an important parameter, as a higher threshold will result in fewer keypoints being generated. To exploit this observation for our application, we select only the top N most significant keypoints by their contrast. By setting an upper bound on the number of keypoints that are selected in this phase, we immediately prune more than 80% (on average) of the keypoints generated for each image. Images that do not have N detected keypoints are not pruned. Since some keypoints may share sub-pixel location and scale information with multiple orientations, we expect approximately 15% additional keypoints to be generated in the subsequent phase [12]. This approach has been demonstrated to achieve comparable effectiveness to using all keypoints for this application, while simultaneously reducing response times considerably during query evaluation [4].

Footnote 1: The eigenspace used in this work is provided by Ke et al. [10].

4 Redundant Bit Vectors

Goldstein et al. [8] propose Redundant Bit Vectors (RBV) for high-dimensional search over multimedia data. The algorithm consists of three key ideas: 1) approximate high-dimensional spherical regions by tightened hyper-rectangles, 2) partition the query space to promote redundancy in the index, and 3) represent each partition with an efficient bit vector.

The conventional nearest-neighbor matching problem is usually formulated as a point query over spheres of fixed or variable radius. The $\epsilon$-range search can be applied to return all objects with distances within an $\epsilon$ threshold [1] if more than one object is required. The distance between two objects is commonly measured in some $L_p$ metric. Goldstein et al. [7] formulate this as a rectangle search problem, where each point p in a d-dimensional space is replaced by the smallest hypercube c enclosing the hyper-sphere with center point p. With this approach, the data space is searched using approximated rectangular regions instead of spheres.

To create an RBV index, all data points are mapped onto data (hyper) rectangles, where each dimension of these rectangles is partitioned into m bins. The choice of m is determined by the number of disjoint intervals between the data rectangles [7]. Every dimension of the rectangle is projected onto its respective axis, where each partition is a bit vector that reflects the overlap test between the interval boundaries. Each bit in a spatial bit vector corresponds to a data rectangle within the collection.
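The rectangle formulation can be illustrated with a small sketch. It assumes the simplest form of the approximation, a hypercube of half-side $\epsilon$ around each data point, and uses hypothetical function names; it is not the implementation of Goldstein et al. The point of the example is that the per-dimension hypercube test never misses a true $\epsilon$-range match but may admit false positives, which an exact distance check then removes.

```python
import math


def in_hypercube(query, center, half_side):
    """True if query lies inside the axis-aligned hypercube centered on center."""
    return all(abs(q - c) <= half_side for q, c in zip(query, center))


def in_sphere(query, center, radius):
    """True if query lies inside the hyper-sphere centered on center."""
    return math.sqrt(sum((q - c) ** 2 for q, c in zip(query, center))) <= radius


def range_search(query, points, eps):
    """Return (candidates from the hypercube test, matches after L2 verification)."""
    candidates = [i for i, p in enumerate(points) if in_hypercube(query, p, eps)]
    matches = [i for i in candidates if in_sphere(query, points[i], eps)]
    return candidates, matches


points = [(1.0, 1.0), (2.5, 2.5), (5.0, 0.0)]
candidates, matches = range_search((0.0, 0.0), points, eps=3.0)
print(candidates)  # [0, 1]  point 1 is inside the cube but outside the sphere
print(matches)     # [0]
```

Because each coordinate is tested independently, the hypercube test decomposes into one interval test per dimension, which is what allows it to be answered with per-dimension bit vectors.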
Adjacent bit vectors (within the same dimension) may have identical bits set to 1 if a data rectangle overlaps the interval boundary, hence giving rise to redundant bit vectors. To search for approximate neighbors of a given point, we select, in each dimension, the spatial bit vector of the partition that includes the query point; every data rectangle that overlaps this partition has its corresponding bit set to 1. The resultant spatial bit vector, obtained by bitwise ANDing the selected vectors from every dimension, identifies the data rectangles that contain the query point.

Goldstein et al. demonstrate that the RBV index excels at applications where there is a large fraction of negative queries. For such applications, they report significant gains in efficiency, and a reduction in memory requirement in comparison to LSH [8]. However, their approach is unsuitable for applications with a large fraction of positive queries, given that accuracy is sacrificed considerably for efficiency gains; we propose a novel scheme that extends the current RBV index to cater for such applications.

5 Extending the RBV Index

Using a scheme similar to that of Ke et al. [11], all indexed images are stored in a file table (FT), and each PCA-SIFT feature is mapped using a keypoint table (KT) where each entry is 92 bytes and contains the location, scale, orientation, and local descriptor of a keypoint. For every keypoint in a query image (a query keypoint), we approximate the potential matching keypoints using an index and verify the short-listed matched pairs using geometric verification (RANSAC) [3]. The key differences are the indexing technique employed and the number of features used. All query keypoints are read into memory during query evaluation, whereas every matched keypoint is fetched from disk. All disk reads are linearized for efficiency.

Given a collection of images, we index only the 36-dimensional local descriptors (keypoints). Each keypoint can be represented by $k_i = (x_1, \ldots, x_d)$, $i = 1 \ldots N$, where N is the total number of keypoints in the collection, and $x_d$ is the coordinate in dimension d. During RBV index construction, each point $k_i$ is mapped to a data rectangle using the smallest hypercube c that encloses the hyper-sphere centered on $k_i$ with radius $\epsilon$ for $\epsilon$-range search. For two PCA-SIFT keypoint descriptors to be deemed a match, an $L_2$ norm ranging from 2,200 to 3,000 ($\epsilon$) yields high effectiveness [10,11]. Hence, each keypoint is converted into a hypercube with a hypercube half-side length (HCS) of $\epsilon$, so that the side length is $c = 2\epsilon$. We use $m_i = x_{i,\mathrm{range}} / \epsilon$ partitions to create the desired number of disjoint intervals that cover the entire axis of a single dimension, where $x_{i,\mathrm{range}} = x_{i,\max} - x_{i,\min}$ for dimension i. Thus, the choice of HCS is critical, given that it determines the granularity of the partitioning scheme. To create the partitions, we first sort the boundaries of all data hypercubes in a given dimension along its axis. Each partition is represented using a bit vector where each bit reflects the index position in KT.
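As a rough illustration of the partitioning step for a single dimension, the sketch below computes the number of partitions from the coordinate range and $\epsilon$, and sets bit k in every partition that the interval of keypoint k overlaps. Two simplifications are assumptions of this sketch and not the paper's method: the dividers are placed at equal-width intervals rather than selected from sorted hypercube boundaries, and Python integers stand in for the packed integer arrays. The function name `build_dimension` is hypothetical.

```python
def build_dimension(values, eps):
    """Build dividers and bit vectors for one dimension.

    values[k] is the coordinate of keypoint k (its position in the keypoint
    table) in this dimension. Returns (dividers, bitvectors), where
    bitvectors[j] has bit k set if [values[k] - eps, values[k] + eps]
    overlaps partition j.
    """
    lo, hi = min(values), max(values)
    m = max(1, int((hi - lo) / eps))       # number of partitions, range / eps
    width = (hi - lo) / m if hi > lo else 1.0
    dividers = [lo + width * j for j in range(1, m)]  # m - 1 dividers (equal width: assumption)
    edges = [float("-inf")] + dividers + [float("inf")]

    bitvectors = []
    for j in range(m):
        left, right = edges[j], edges[j + 1]
        bits = 0
        for k, v in enumerate(values):
            # Overlap test between the keypoint's interval and the partition.
            if v - eps <= right and v + eps >= left:
                bits |= 1 << k
        bitvectors.append(bits)
    return dividers, bitvectors


dividers, bitvectors = build_dimension([0.0, 1.0, 5.0, 10.0], eps=2.0)
print(dividers)                            # [2.0, 4.0, 6.0, 8.0]
print([format(b, "04b") for b in bitvectors])
# ['0011', '0111', '0100', '1100', '1000']
```

Keypoint 2 (coordinate 5.0) appears in three adjacent partitions in this toy run, which is the redundancy that gives the index its name.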
Following [8], we then select $m_i - 1$ dividers from the sorted hypercube boundary values and partition the dimension using the overlap test for each interval, where each bit in a bit vector reflects the predicate (1 or 0). The collection of data hypercubes (keypoints) can be represented efficiently, as each integer holds one bit for each of 32 or 64 keypoints, depending on the system architecture (machine word size). Each bit vector is represented using an array of integers, where the bit vectors are constructed in memory and written to disk, one dimension at a time. Each bit vector is stored using an array of N/32 (4-byte) integers, where N is the total number of keypoints in our collection. We also store $m_i - 1$ (4-byte) dividers for the axis of each dimension for query evaluation. Thus, the size of the index for the entire image collection is approximately

\[ \sum_{i=1}^{D} \left( 4 \cdot \frac{N}{32} \cdot m_i + 4\,(m_i - 1) \right) \text{ bytes,} \]

where D is the number of dimensions; in our application D is 36.

In the original RBV indexing scheme, for efficiency gains, Goldstein et al. [8] sort the data points in ascending order of the most selective dimension (smallest amount of overlap) prior to constructing the bit vectors. This is done to organize the RBV index such that the first dimension will have data hypercubes closely located along the axis of its dimension, resulting in tightly packed runs of 1 bits between the low and high range. Since the number of bitwise AND operations can be reduced to the number of integers between this range, the most selective dimension is used as the first queried dimension. The order in which dimensions are queried is important, since the first dimension always dictates the resultant list of matching keypoints under the bitwise AND operation. Querying the most selective dimension first will short-list the potential matching keypoints rapidly. However, given that each dimension is a very coarse approximation of the distance in the hypercube space, retrieval accuracy suffers. As our aim is to maximize the number of positive matches, the index is modified to be less restrictive for this application. In our proposed scheme, the most selective dimension is not pre-determined, and no prior sorting of the data points is required; consequently, we do not utilize the low and high range for bit vector processing. A summary of the process for constructing the modified RBV index is as follows:

Require: Database of N D-dimensional local descriptors x in KT, $\epsilon$ = HCS.
for i = 1 to D do
    for k = 1 to N do
        Calculate hypercube boundaries $x_{ki} \pm \epsilon$.
    Sort boundaries on i-axis; calculate $m_i = (i_{\max} - i_{\min}) / \epsilon$.
    Select $m_i - 1$ dividers from sorted boundaries.
for i = 1 to D do
    for k = 1 to N do
        Create overlap tests to create $m_i$ partitions (bit-vectors).
    Store $m_i - 1$ dividers and $m_i$ bit-vectors to disk.

Querying the modified RBV index. Instead of fixing the most selective dimension during index construction, we determine the order of dimensions dynamically during query evaluation, thereby eliminating the need for pre-processing the data points. For each coordinate $x_i$, $i = 1 \ldots d$, of a query keypoint, we determine the normalized distance to the mean using $|x_\mu - x_i| / x_\mu$. We sort the distances in ascending order, and use the sorted order of dimensions for query evaluation. Thus, the dimensions are dynamically selected to maximize the potential keypoint matches to the query coordinates.
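The dynamic ordering of dimensions can be sketched as follows. The sketch rests on two readings of the notation that should be treated as assumptions: the mean is taken over the query descriptor's own coordinates, and the distance is an absolute difference divided by the magnitude of that mean. The function name `dimension_order` is hypothetical.

```python
def dimension_order(query):
    """Return dimension indices sorted by normalized distance to the mean, ascending.

    Assumption: the mean is computed over the coordinates of the query
    descriptor itself, and distances are absolute values.
    """
    mean = sum(query) / len(query)
    if mean == 0:
        raise ValueError("normalized distance is undefined when the mean is zero")
    distances = [abs(mean - x) / abs(mean) for x in query]
    # sorted() is stable, so ties keep their original dimension order.
    return sorted(range(len(query)), key=lambda i: distances[i])


# Mean is 6.0, so the distances are 4/6, 2/6, 1/6 and 3/6.
print(dimension_order([10.0, 4.0, 7.0, 3.0]))  # [2, 1, 3, 0]
```

The returned list gives the order in which per-dimension bit vectors would be fetched and ANDed; truncating it corresponds to pruning dimensions at query time.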
In this approach, the search space is not immediately pruned with the first queried dimension but is instead narrowed progressively by processing each subsequent dimension. During query evaluation, the required partition for each dimension can be calculated in memory by using the $m_i - 1$ dividers of that dimension to determine which partitions to retrieve from disk. Given that the bit vectors are bitwise ANDed one dimension at a time, and that the ordering of dimensions can be pre-processed, we can bulk-process the query keypoints simultaneously. This is achieved by using a temporary resultant bit vector in memory for each query keypoint. Hence, the order in which the required partitions are read can also be sorted to allow sequential access to disk. Bulk-processing of keypoints in memory is enabled by keypoint reduction, since the feature space of the query images is pruned; without it, the memory requirement would be high.
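A back-of-the-envelope estimate shows why keypoint reduction matters for bulk-processing. The sketch assumes one result bit vector per query keypoint with one bit per indexed keypoint, and plugs in figures quoted in the paper (20,000 images, roughly 128 keypoints per image after pruning, roughly 1,400 without). The function names are hypothetical and the totals are estimates, not measurements.

```python
def result_vector_bytes(num_indexed_keypoints):
    """Bytes needed for one temporary result bit vector (one bit per indexed keypoint)."""
    return (num_indexed_keypoints + 7) // 8


def bulk_memory_bytes(num_images, keypoints_per_image, query_keypoints):
    """Memory for holding one result bit vector per query keypoint simultaneously."""
    total_indexed = num_images * keypoints_per_image
    return query_keypoints * result_vector_bytes(total_indexed)


pruned = bulk_memory_bytes(20_000, 128, 128)
unpruned = bulk_memory_bytes(20_000, 1_400, 1_400)
print(pruned)    # 40960000   (about 41 MB)
print(unpruned)  # 4900000000 (about 4.9 GB)
```

Under these assumptions the unpruned case would exceed the 3.8 GB of main memory on the machine used in the experiments, while the pruned case fits comfortably.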

Compared to the original RBV indexing scheme [8], we trade off speed to maximize the potential matches (candidate pool) and perform the bitwise AND operation on the entire bit vector. This results in a larger number of false positives in the pool of candidate keypoint matches and consequently results in more computation. To reduce the cost of CPU computation (bitwise operations), we prune the number of processed dimensions during query evaluation to narrow the search space gradually while minimizing the number of false negatives. The number of dimensions to prune depends on the partition granularity (HCS), since these two parameters are coupled; that is, a change in one parameter will inevitably affect the other. We empirically evaluate the effects of varying HCS and the number of dimensions pruned on retrieval speed and accuracy in Section 7. A summary of the process for querying the modified RBV index (henceforth referred to simply as the RBV index) is as follows:

Require: Database of N items, Q local descriptors q of query, D dimensions i of $m_i - 1$ dividers and $m_i$ bit-vectors, temporary resultant bit vectors $R_Q$ (one for each q), temporary container T[D].
for j = 1 to Q do
    for k = 1 to D do
        $T[k] = |q_{j\mu} - q_{jk}| / q_{j\mu}$
    Sort T in ascending order.
    for k = 1 to |T| (or fewer than |T| if pruned) do
        Get partition p using $m_k - 1$ dividers; calculate $R_j = R_j \,\&\, m_p$.
Perform $L_2$ verification on matches (ON bits) in resultant bit vector.

6 Evaluation Methodology

We demonstrate the effectiveness of our approach using a series of experiments. First, we evaluate the effectiveness of keypoint reduction on matching near-duplicate images by comparing the original number of detected keypoints (1,400 on average) with subsets of the 500 and 100 most significant keypoints. We report the percentage of keypoint matches, relative to the keypoints in the query image, between a query image and each of its image alterations.
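The query procedure summarized above can be sketched for a single query keypoint. This is a hedged illustration rather than the authors' implementation: the index layout (a list of `(dividers, bitvectors)` pairs, one per dimension, with Python integers as bit vectors), the function name `query_rbv`, and the use of `bisect` to locate the partition are all assumptions, and the dimension order is taken as an input. The hand-built toy index covers four 2-dimensional points with $\epsilon$ = 2.

```python
import bisect
import math


def query_rbv(q, index, points, eps, order, max_dims=None):
    """Return (candidate ids after ANDing, ids that pass L2 verification).

    index[d] = (dividers, bitvectors) for dimension d; bit k of a bit vector
    refers to points[k]. max_dims limits how many dimensions are processed.
    """
    n = len(points)
    result = (1 << n) - 1  # start with every keypoint as a candidate
    dims = order if max_dims is None else order[:max_dims]
    for d in dims:
        dividers, bitvectors = index[d]
        p = bisect.bisect_right(dividers, q[d])  # partition containing q[d]
        result &= bitvectors[p]
    candidates = [k for k in range(n) if (result >> k) & 1]
    matches = [k for k in candidates if math.dist(q, points[k]) <= eps]
    return candidates, matches


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (10.0, 0.0)]
index = [
    ([2.0, 4.0, 6.0, 8.0], [0b0011, 0b0111, 0b0100, 0b1100, 0b1000]),  # dimension 0
    ([2.5], [0b1011, 0b0110]),                                          # dimension 1
]

print(query_rbv((0.5, 0.5), index, points, 2.0, order=[0, 1]))
# ([0, 1], [0, 1])
print(query_rbv((0.5, 0.5), index, points, 2.0, order=[1, 0], max_dims=1))
# ([0, 1, 3], [0, 1])
```

Processing only one dimension leaves a larger candidate pool (point 3 survives the AND), and the exact distance check then removes it; this is the trade-off between bitwise work and verification work that the number of processed dimensions controls.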
For accurate evaluation, we use a sequential scan for the nearest-neighbor search on the collection of keypoints, as both the LSH and RBV indexes are approximate nearest-neighbor algorithms. For this experiment, due to the exhaustive computation involved in using several keypoint thresholds, we use only 100 random queries on 10,000 images (Image Collection B as described below).

Second, we compare our approach (keypoint reduction with the modified RBV index) against the KSH system; the authors have provided their source code, allowing us to perform a direct comparison to their approach. We evaluate both the efficiency and the effectiveness of these approaches on a much larger collection of 20,000 images (Image Collection A as described below). We use a framework identical to that of Ke et al. [11], the only differences being the index structure and the number of PCA-SIFT features used. The original dataset is not used as it was not available; however, we generate a dataset using the same set of alterations as used in their work. A small number of images are selected at random from the collection as queries; the relevant answers are generated by transforming each query image using the set of alterations. Only altered versions of their respective original images are considered relevant answers; everything else in the collection is treated as noise. We evaluate our approach using the standard recall and precision metrics. All experiments are run on a two-processor Xeon 3 GHz machine with 3.8 GB of main memory running Linux 2.4.

Image Dataset. To generate our dataset, we select 250 images at random from Volume Twelve of the Corel Photo CD collection [2]; each image is altered using 40 alterations, creating a total of 10,000 images. We also include 10,000 images from the TRECVID 2005 collection, which consists of keyframes from various news broadcasts. We scale all images to 512 pixels along the longer edge. Together with the 10,000 altered images, we create a test collection of 20,000 images forming Image Collection A. Image Collection B is created using half of Image Collection A; consequently, we use 125 queries for this collection. As the PCA-SIFT algorithm does not use color information, all images are converted to greyscale after the altered image set is created. As in the work of Ke et al. [11] and Qamra et al. [14], the list of alterations is as follows: colorize (3), contrast (2), severe contrast (2), crop (3), frame (4), scale up (3), scale down (3), despeckle (1), saturation (4), intensity (6), shear (3), resize (3), and rotate (3). The number in parentheses indicates the number of variants of each alteration.
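For reference, the standard recall and precision metrics can be computed per query as in the sketch below, where the relevant set is the collection of altered versions of the query image and everything else counts as noise. The function name and the toy identifiers are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: ids returned by the system.
    relevant:  ids of the altered versions of the query image.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Toy query: 4 relevant alterations, 3 images returned, 2 of them relevant.
recall, precision = recall_precision(["a", "b", "x"], ["a", "b", "c", "d"])
print(recall)     # 0.5
print(precision)  # 0.6666666666666666
```

Averaging these per-query values over all queries gives figures comparable in form to the average recall and precision reported in this paper.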
7 Results and Analysis

In Table 1, the effects of keypoint reduction at every keypoint threshold value on 40 unique alterations are shown; the percentages are an average over 100 queries. Columns 1, 5, and 9 indicate the different types of alteration. The rest of the columns show the percentage of matching nearest-neighbors within the $L_2$ norm of 3,000 between the original image and its corresponding near-duplicate image. The average number of keypoints per image is close to 1,400. We experiment with thresholds of 500 and 100, reducing the average keypoints per image to 550 and 128, respectively. The last column for each alteration (4, 8, and 12) shows the percentage of matches using the same criterion with approximately 10% of the keypoints.

Table 1. Percentage of keypoint matches within the $L_2$ norm threshold at every level of reduction. Columns 1, 5, and 9 indicate the different alterations (Alt). Keypoint thresholds of 500 and 100 are used. Default indicates the original number of keypoints (average of 1,400). [Table body not preserved.]

The variation in percentages between image alterations in Table 1 is expected, given that some alterations severely affect the properties of the local descriptors. The reason for the slight increase observed with threshold values of 500 and 100, as compared to the original number of keypoints, is that the percentage of keypoint matches is relative to the number of detected keypoints in the image (that is, 550 and 128), indicating that most of the matching keypoints within the $L_2$-norm of the set of alterations share similar contrast values. This is an important finding, as it is the criterion by which both the LSH and RBV indexes approximate matching keypoints. The relatively similar percentages of matching keypoints for different levels of reduction across all alterations lead us to believe that even a small subset of keypoints is sufficient for this application. Subsequent experiments on the RBV index incorporate keypoint reduction with a threshold of 100 on the number of indexed features.

Fig. 1. Average (a) recall and (b) precision (over 250 queries) of the modified RBV index for variations of HCS and number of dimensions. LSH is the baseline. [Plots: average recall (%) and average precision (%) against number of dimensions, for HCS = 1,000, 1,500, 2,000, 2,500 and LSH.]

In Figures 1a and 1b, the effectiveness of the RBV index is measured using recall and precision, averaged over 250 queries. We experiment with varying the HCS parameter and the number of pruned dimensions; each increment of four dimensions is shown. LSH is used as a baseline with recall and precision of 99% and 98%, respectively. The highest observed recall and precision with RBV are 97% and 99%, respectively, with an HCS of 1,500 when only one dimension is processed. We observe that even after processing 16 dimensions, recall remains at 91% and precision at 98%. We do not experiment with a smaller number of dimensions for HCS of 2,000 and 2,500, as we achieve near-perfect recall and precision after processing 24 dimensions. HCS of 2,500 achieves recall and precision of 88% and 99%, respectively, even after processing all 36 dimensions. As expected, using HCS of 1,000, we observe a dramatic drop in recall and precision if more than 4 dimensions are processed, which implies that the boundaries in hypercube space are too tight, resulting in high partition granularity. Hence, the choice of HCS is critical for the RBV index. We observe that given a large enough HCS, the drop in recall and precision is less abrupt, since the majority of the answers are still within the hypercube boundary of a single dimension, resulting in fewer eliminated matches. We have thus shown that our modified RBV index is highly effective given a suitable HCS value.

Fig. 2. (a) Average running time (over 250 queries) of the RBV index for variations of HCS and number of dimensions. LSH is the baseline. (b) Effectiveness (250 queries) of search space reduction of the RBV index. Sequential scan is the baseline. [Plots: average elapsed time (in secs) and average keypoint pairs examined (base 10) against number of dimensions, for HCS = 1,000, 1,500, 2,000, 2,500.]

Retrieval Efficiency. The timing results for query evaluation using the modified RBV index are presented in Figure 2a. The total running (elapsed) time for evaluating a single query is measured; all timings are averaged over the 250 queries. They are compared against the KSH baseline, which is observed to have an average running time of approximately 124 seconds. Since the KSH implementation of LSH and our RBV implementation can both be further optimized in terms of in-memory data structures, we do not emphasize the factors of improvement over the baseline.
With our RBV index, the fastest recorded running time is approximately 9 seconds with an HCS of 1,500 and 16 dimensions; this setting was also observed to have high effectiveness. As expected, the running time decreases as more dimensions are processed; the pool of candidate matches becomes smaller, requiring fewer keypoints to be retrieved from disk. This is evident from the much higher running time of 136 seconds with HCS of 1,500 when processing only one dimension; this effectively reduces to an on-disk sequential scan. Using HCS of 1,000, we observe that there is a slight increase in running time from 5 to 8 seconds when the number of dimensions is more than 12. We believe this is due to the increased cost of processing more dimensions (CPU operations required for bitwise ANDing and fetching bit vectors from disk) without a corresponding decrease in the number of keypoint pairs. Finally, for HCS of 3,000 the running time for processing all 36 dimensions is comparable to that of the HCS of 1,500 while still showing high effectiveness.

Further Studies. It is instructive to examine the effectiveness of the RBV index in reducing the search space. Figure 2b shows the total number of keypoints being processed by the RBV index. This is an independent study of the RBV index on the same collection; no comparisons are made against the LSH index since we do not experiment with keypoint reduction on the KSH system. Instead, we compare it to sequential scan to illustrate the reductions in search space using the RBV index. All numbers are reported as an average over 250 queries. The sequential scan always requires the worst-case number of keypoints, as it performs a brute-force search to find k-nearest-neighbors within the $L_2$ threshold. Using HCS of 1,500, the candidate pool is quickly reduced from approximately 260 million to 120 million keypoints after processing only 4 dimensions. For 16 dimensions, only 4 million keypoints remain in the candidate pool. Naturally, a smaller number of candidate matches translates to higher efficiency, since fewer keypoints need to be fetched and examined. Indeed, this result shows that the RBV index is effective at reducing the search space using only a few dimensions, while minimizing the number of false negatives.

Fig. 3. (a) Effects of HCS on the RBV index size. LSH is the baseline. (b) Growth factors of candidate pool size and query run-time between image collections B and A (observed using HCS of 1,500). [Plots: index size (MB) against HCS length for the LSH index (baseline) and the RBV index; growth factor of candidate pool size and query response time against number of dimensions.]
In Figure 3a, we show the effects on index size of using different HCS values; these sizes are observed for an index of 20,000 images. This clearly shows that HCS dictates the number of partitions, which determines the number of bit vectors that are stored on disk. It is also interesting to note that using the baseline method without keypoint reduction, as described in the work of Ke et al. [11], the size of the index is considerably larger than that of RBV. Finally, as shown in Figure 3b, we observe that the candidate pool size increases by only a factor of 1.4 (using 16 dimensions), even though the collection size increases by a factor of 2 (Image Collections B to A); the slight growth of the query response time is attributed to the increase in in-memory bitwise processing.

8 Conclusion

We have presented an approach to near-duplicate image detection with pruned SIFT keypoints (using PCA-SIFT local descriptors) using our proposed modified RBV indexing scheme. An almost lossless retrieval performance is observed using only 10% of the original keypoint features, thereby reducing index size and improving scalability. We show that, unlike the original approach that was initially designed for negative queries, near-perfect effectiveness can be achieved using our modified approach for positive queries as well. We demonstrate that this indexing scheme performs as well as the KSH system in terms of effectiveness and runs in a little under ten seconds on average for a single query, on a collection of 20,000 images. Importantly, the RBV index is highly compact, over 100 times smaller than the original LSH index as used in the KSH system. As observed in our experiments, our approach has shown the highest efficiency thus far in this domain (a factor-of-12 speed-up over the KSH system). Hence, this approach offers a promising and viable alternative indexing strategy to the predominant LSH approach. We intend to explore the limitations and scalability of these schemes in future work.

Acknowledgments

This project was supported by the Australian Research Council. We thank Justin Zobel for his suggestions.

References

1. C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3).
2. Corel Corporation. Corel professional photos CD-ROMs.
3. M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6).
4. J. J. Foo and R. Sinha. Pruning SIFT for scalable near-duplicate image matching. In Proc. ADC Australian Database Conference, January.
5. J. J. Foo, R. Sinha, and J. Zobel. Discovery of image versions in large collections. In Proc. MMM Int. Conf. on Multimedia Modelling. Springer, January.
6. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. VLDB Int. Conf. on Very Large Data Bases, Edinburgh, Scotland, UK, September. Morgan Kaufmann.
7. J. Goldstein, J. C. Platt, and C. J. C. Burges. Indexing high dimensional rectangles for fast multimedia identification. Technical report, Microsoft Research, Redmond, WA, USA.
8. J. Goldstein, J. C. Platt, and C. J. C. Burges. Redundant bit vectors for quickly searching high-dimensional regions. In Deterministic and Statistical Methods in Machine Learning, First International Workshop, Sheffield, UK, September 7-10, 2004, Revised Lectures. Springer.
9. K. Grauman and T. Darrell. Efficient image matching with distributions of local invariant features. In Proc. CVPR Int. Conf. on Computer Vision and Pattern Recognition, June.
10. Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In Proc. CVPR Int. Conf. on Computer Vision and Pattern Recognition, Washington, DC, USA, June-July. IEEE Computer Society.
11. Y. Ke, R. Sukthankar, and L. Huston. An efficient parts-based near-duplicate and sub-image retrieval system. In Proc. MM Int. Conf. on Multimedia, New York, NY, USA, October. ACM Press.
12. D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal of Computer Vision, 60(2):91-110.
13. K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In Proc. CVPR Int. Conf. on Computer Vision and Pattern Recognition, June.
14. A. Qamra, Y. Meng, and E. Y. Chang. Enhanced perceptual distance functions and indexing for image replica recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(3), 2005.
