Query Expansion for Hash-based Image Object Retrieval
Yin-Hsi Kuo 1, Kuan-Ting Chen 2, Chien-Hsing Chiang 1, Winston H. Hsu 1,2
1 Dept. of Computer Science and Information Engineering, 2 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan

ABSTRACT
An efficient indexing method is essential for content-based image retrieval given the exponential growth of large-scale videos and photos. Recently, hash-based methods (e.g., locality sensitive hashing, LSH) have been shown to be efficient for similarity search. We extend such hash-based methods to retrieve images represented by bags of (high-dimensional) feature points. Though promising, hash-based image object search suffers from low recall rates. To boost the hash-based search quality, we propose two novel expansion strategies: intra-expansion and inter-expansion. The former expands more target feature points similar to those in the query, while the latter mines feature points that should co-occur with the search targets but are not present in the query. We further exploit variations of the proposed methods. Experimenting on two consumer-photo benchmarks, we show that the proposed expansion methods are complementary to each other and can collaboratively contribute up to 76.3% (average) relative improvement over the original hash-based method.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Experimentation, Performance

Keywords
Locality sensitive hashing (LSH), Query expansion

1. INTRODUCTION
The exponential growth of photos and videos, whether on media-sharing sites, in business stockings, or in personal collections, has created the need for efficient content-based image retrieval (CBIR), which helps locate similar images in large-scale collections. In recent years, researchers have focused on the more challenging problem of image object retrieval [21][23].
Image object retrieval aims to retrieve images that contain the visual object shown in the object query image, for example, searching for images that contain the Torre Pendente di Pisa or the Starbucks logo (cf. Figure 7(c)(d)). Such techniques also motivate many promising applications, such as exploring photo collections in 3D [25], photo-based question answering [31], video advertising by image matching [21], annotation by search [27], etc. Traditional solutions for CBIR employ global low-level features like color and texture. By selecting proper feature representations and distance metrics, the similarities between the query and database images can be calculated and a ranking list generated accordingly. Prominent CBIR systems that use these techniques include QBIC (Query By Image Content) [15] and VisualSEEK [24], etc. However, for image object retrieval, the query can be a full image or just part of the whole image, which we call an object-level image. As shown in Figure 2(a), the red rectangle represents an object-level query image. Similarly, the object may occupy only part of the target database images. Global color and texture features become limited under these conditions. To capture the local image information that is essential in object retrieval, Lowe proposed the scale-invariant feature transform (SIFT) [19]. The SIFT feature works by first detecting salient regions in an image and then describing each region with a 128-d feature vector.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM '09, October 19-24, 2009, Beijing, China. Copyright 2009 ACM /09/10...$
Its advantage is that both the spatial and appearance information of each local region is recorded, with built-in invariance to modest changes in object scale or camera viewpoint. As a result, an image can be viewed as a bag of feature points, and any object within the image is a subset of those points. With the bag-of-feature-points representation, image retrieval is carried out by comparing all feature points in the query image to those from all images in the database. Since an image typically contains hundreds to thousands of feature points, an image database can easily have millions or even billions of points. Image retrieval based on the bag-of-feature-points representation is therefore a large-scale many-to-many matching problem in a high-dimensional space, and indexing the data so that similarity search can be performed both effectively and efficiently is crucial. A hash-based method known as locality sensitive hashing (LSH) [16] has been shown to perform similarity search successfully on various types of data, including text [26], audio [8][6], images [18], and videos [13]. LSH has the important characteristic that, given a distance metric, hash functions are designed so that similar data have a much higher probability of hashing into the same bucket than dissimilar data. As a result, when we evaluate a query, only a small portion of the data points, those that hash into the same bucket as the query point, need to be examined. Also, multiple hash functions and hash tables are employed to improve robustness against false negatives. LSH runs in sublinear time and scales well with data dimensionality [16]; see Section 3 for more details. The combination of the SIFT feature and LSH has been shown effective in image duplicate detection [18]. Figure 1(a) illustrates searching by LSH over the bag-of-feature-points representation. The red rectangle is the object query image, which contains four query feature points (A, B, C, and D).
Each query point is used to retrieve database points that collide (hash into the same bucket) with it in any of the hash tables. After examining all query points, the retrieved database points are analyzed to identify the database images associated with them and the degree of relevance of these images to the query. For simplicity of illustration, relevance here is defined as the number of matching feature point pairs between the query and database images. Although this retrieval model succeeds at retrieving target images with high precision, it suffers from low recall rates. Figure 2(c) shows a real retrieval case with basic LSH. While most of the top-ranked images are correct, some false positives rank high in the list, too. The recall is low because even though LSH guarantees a high probability of hashing similar data into the same buckets, this probability is not 1. It is unavoidable that some similar features are hashed into neighboring buckets, and LSH fails to retrieve them at query time. Also, there may be features that strongly characterize the query object but are not present in the query image or simply appear very different from the query features. This can result from changes in lighting conditions, camera angle, occlusion effects, etc.

Figure 1: Query expansion for hash-based image retrieval. (a) An illustration of search by hashing over bags of feature points (i.e., {A, B, C, D}) and ranking by the number of matched pairs (e.g., 4, 2, 2, 1, 1); the query image is on the left. Since only the exact (same-bucket) feature points are considered, the original LSH suffers from a low recall rate. We tackle the problem by proposing two expansion strategies for hashing. (b) Intra-expansion expands more target feature points (e.g., A', A'', B', C', etc.) similar to those in the query but mis-hashed to other buckets. (c) Inter-expansion mines those feature points (e.g., {E, F, G}) that shall co-occur with the search targets by exploiting the related hash buckets from the initial search result; more diverse (and related) results can be retrieved. Note that the two expansion strategies can be combined iteratively to boost the search results, and the expansions are realized efficiently by merely looking up the hash buckets.

To tackle this problem, we introduce the query expansion 1 technique for hash-based image object retrieval. We propose two expansion strategies: intra-expansion and inter-expansion. Intra-expansion uses existing query features to obtain more similar features as matching targets. For example, in Figure 1(b), features A' and A'' are similar to feature A (their distance to A is smaller than a threshold) but are not found by basic LSH. Intra-expansion discovers A' and A'' by inspecting buckets neighboring A's, and once identified they can be used as target features. Inter-expansion obtains new query features that are not present in the query image but are mined from the initial search results. To do this, we choose some likely correct images (e.g., top-ranked images) and issue their relevant features as new queries. For example, in Figure 1(c), features E, F, and G are not present in the query image, but they appear alongside correctly matched target features in the initial search results. By inter-expansion, they are considered relevant to the query and issued as new queries. Their query results are factored into the final rank list returned to the user.

Intra- and inter-expansion expand two different types of new queries. Both can improve retrieval performance significantly, and they can be combined iteratively to obtain even better results. Note that such expansion methods are realized by searching over hash tables and buckets only and incur little extra time while obtaining new query features. Experimenting over photo search benchmarks, we show that the proposed methods can achieve up to 76.3% (average) relative improvement over the state-of-the-art hash-based image search method (i.e., [18]); the recall is greatly improved as well, as shown in Figure 2.

The primary contributions of this paper include:
- The first proposal of intra- and inter-expansion for hash-based image object retrieval.
- Investigating effective and efficient algorithms for implementing the proposed expansion methods (Section 4).
- Conducting extensive experiments on large-scale benchmarks (Sections 5 and 6) and exemplifying the significant gains of the proposed methods.

1 In the information retrieval community, query expansion involves evaluating a user's input (the words typed into the search query area and sometimes other types of data) and expanding the search query to match additional documents [28].

2. RELATED WORK
In most multimedia retrieval works, researchers face two important issues: the feature representation and the means to calculate similarity between data objects effectively and efficiently. Repeatedly solving the nearest neighbor (NN) problem between features is often required while addressing these issues. Researchers have argued and proved that by allowing a small error bound in solving NN, time efficiency can be greatly improved while performance degradation remains acceptable [4]. NN with the addition of an error bound is known as the approximate nearest neighbor (ANN) problem. Two popular methods for solving ANN are KD-trees [4] and locality sensitive hashing (LSH) [16].

A KD-tree is a search tree that splits according to the data distribution along a single dimension at a time. Data points are stored at leaf nodes, and internal nodes store splitting criteria that can be used to speed up the verification of pruning opportunities. However, when the number of data dimensions exceeds 20, KD-trees suffer from the curse of dimensionality and their pruning mechanism becomes ineffective [16]. On the other hand, LSH has gained popularity in recent years because of its ability to deal with data features in an even higher number of dimensions while keeping the running time satisfactory. The basic idea of LSH is to hash data points such that close points in the feature space have a higher chance of collision than far-apart points. LSH was first proposed in [17] along with a rigorous mathematical description. [16] implements LSH in Hamming space and evaluates its performance with low-level multimedia features in high dimensions. Ke et al. [18] apply LSH in Hamming space to a variant of the SIFT feature and
achieved very good performance in detecting near-duplicate images. However, the ground truth images in their experimental dataset were obtained by performing various image transformations on the query images. In consumer photos, rather than duplicates, the ground truth images are photographs of real-world objects taken under much more diversified capturing conditions (as depicted in Figure 7). Meanwhile, we are interested in retrieving such designated objects in consumer photos. This results in far greater differences in feature points for the same object. A direct LSH approach to finding similar points is not sufficient to deal with the vast variability present in consumer photos. We introduce query expansion techniques to mend this problem.

Figure 2: For query image (a), basic LSH returns result (c), which has high precision but low recall. With the proposed query expansion methods, more accurate and diverse results can be retrieved, as shown in (d), (e), and (f). The number below each image is its rank in the retrieval result; the number in parentheses is its rank with basic LSH. (b) compares the PR curves for these results. It is clear that the proposed expansion methods can greatly improve search performance.

LSH was later extended to work with Euclidean distance by adopting a class of p-stable distribution functions [12]. E2LSH [1] is a publicly available software package that implements the algorithms in Euclidean space. One disadvantage of LSH is that, by using multiple hash tables, the space and time requirements become a burden. The authors of [20] propose multi-probe LSH, where multiple buckets in a hash table are inspected to retrieve more diversified results; in exchange, fewer hash tables and less running time are needed to achieve the same performance. [13] employs LSH to project feature points to an auxiliary space and represents images as random histograms therein. Support vector machines (SVMs) are used to learn classifiers for object classes in this auxiliary space. The use of random histograms bypasses the many-to-many matching problem that arises when features are compared directly for similarity. However, the random histogram is a global summary of local features that does not preserve important spatial and appearance information, and spatial verification is not possible. While appropriate for classification, it is not suitable for image object retrieval, where the local distribution of features is the key rather than global compositions.

Another popular approach in object retrieval is to use the bag-of-visual-words representation and adopt traditional retrieval techniques for textual words [23]. In this approach, a set of training feature points is clustered and the centroid of each resulting cluster is defined as a visual word. Once a set of visual words is obtained, every feature can be represented by its most similar visual word. An image is thus a bag of visual words, described by a frequency histogram of the visual words, which in turn is used to calculate similarity scores. Due to the nature of clustering algorithms, the quantization of feature points into visual words can be a noisy process [11]. To attenuate the problem, [33][34] both consider quantization-related (soft-assignment) visual words when calculating the global visual word histogram. Object retrieval with the visual word model still suffers from low recall rates. The authors of [11] proposed multiple-image-resolution expansion, where correct entries in the initial retrieval result are analyzed to obtain latent images, which are estimated visual word histograms of the query object as if shot under a different resolution. The latent images are issued as new queries to retrieve more diversified results. This is similar to the inter-expansion proposed in our work, in that both expand by the characteristics of verified correct images in the initial result. The difference is that instead of using estimated visual word histograms as new queries, we use feature points that are verified to be important for the query object. Also, a visual word histogram is an aggregation of local features. If an image has multiple salient objects or a complex background with many interfering visual words, its histogram does not describe any object well. By working with the SIFT features directly, we avoid introducing extra noise from the quantization of SIFT features or the aggregation of visual words. Meanwhile, we propose novel expansion methods integrating both inter-expansion and intra-expansion strategies, which will be shown to be complementary to each other and to significantly improve hash-based image object retrieval (up to 76.3%).

3. SYSTEM OVERVIEW AND LSH
For image object retrieval, we adopt the bag-of-feature-points image representation and employ a hash-based indexing method. To improve search recall, we extend the hash method (i.e., LSH) with two novel expansion strategies. We first provide an LSH overview in Section 3.1 and explain the matching for bags of feature points in Section 3.2. Two ranking criteria for image object retrieval are introduced in Section 3.3. Based on them, we detail the hash-based query expansion methods in Section 4.

3.1 LSH Overview
LSH [16] has been shown effective and efficient for many multimedia (high-dimensional) retrieval applications [8][6][18][13]. The essence is the hash functions, which ensure that similar data have a much higher probability of hashing into the same bucket than dissimilar data. There are plenty of hash functions that can achieve this goal, such as transformation into Hamming space [17][16], L1 distance [3], min-wise hashing [5], random projection [10], or the stable distribution method [12], etc. Applicable to the L2 distance metric, the stable distribution hash function [12] is widely adopted. In this work, without loss of generality, we base our approach on the hash framework of [12] and its implementation [1].
Figure 3: An illustration of LSH with feature points (K=2, L=2, cf. Section 3.1); e.g., the triangle is hashed into bucket (2, 2) in hash table 1 and bucket (1, 2) in hash table 2. It also shows the need for intra-expansion. Assume feature points with the same shape (or color) are from the same image. Given the star as a query feature, we can retrieve the two features that collide with it. A third point is actually missed because it resides in a nearby bucket, though it is still within (Euclidean) distance R of the query. This example motivates intra-expansion by further checking neighboring buckets (cf. Section 4.1.2) or using co-occurrence across hash tables (cf. Section 4.1.1); for example, a point missed in hash table 2 can be retrieved through a point that collided with the query in hash table 1.

For LSH with the stable distribution method [12], the hash function h(v) is defined as follows:

    h(v) = ⌊ (a · v + b) / W ⌋        (1)

where v represents a feature point, a is a vector sampled from a Gaussian distribution, W is the window size, and b is a random offset ranging from 0 to W. h(v) computes the inner product a · v, projecting the input feature point v onto a, and then windows the result by W, as illustrated in Figure 3. Generally, K hash functions are concatenated to provide discriminability among high-dimensional data points. For a large K, two feature points are very likely close to each other (in the original feature space) if they are still hashed to the same bucket through all K hash functions. However, false negatives, i.e., true neighbors mis-hashed to other buckets, are commonly observed; for example, see the feature points in bucket (2, 2) and bucket (3, 2) of hash table 1 in Figure 3; the two feature points are actually close to each other but hashed to different buckets. To remedy this problem, multiple hash tables (parameterized by L) are used to improve robustness against such false negatives; for example, the two points collide in the same bucket in hash table 2.
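The stable-distribution scheme of Eq. (1), with K concatenated hash functions per table and L tables, can be sketched as follows. This is a minimal illustration, not the E2LSH implementation used in the paper; the class name, parameter defaults, and pure-Python data layout are all assumptions for clarity:

```python
import random
from collections import defaultdict

class StableLSH:
    """L hash tables, each keyed by K stable-distribution hashes
    h(v) = floor((a . v + b) / W), cf. Eq. (1). Illustrative sketch."""

    def __init__(self, dim, K=2, L=2, W=4.0, seed=0):
        rng = random.Random(seed)
        self.W = W
        # For each table: K pairs (a, b), a ~ Gaussian vector, b uniform in [0, W).
        self.funcs = [[([rng.gauss(0.0, 1.0) for _ in range(dim)],
                        rng.uniform(0.0, W)) for _ in range(K)]
                      for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, t, v):
        # Concatenate the K bucket indices into one composite bucket key.
        return tuple(int((sum(ai * vi for ai, vi in zip(a, v)) + b) // self.W)
                     for a, b in self.funcs[t])

    def insert(self, point_id, v):
        for t in range(len(self.tables)):
            self.tables[t][self._key(t, v)].append(point_id)

    def query(self, v):
        # Union of same-bucket collisions across all L tables.
        out = set()
        for t in range(len(self.tables)):
            out.update(self.tables[t].get(self._key(t, v), ()))
        return out
```

An identical point always lands in the same bucket of every table; nearby points collide with high probability in at least one table, which is exactly the false-negative trade-off governed by K and L.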
Aggregating multiple tables, a query point can treat the feature points residing in the same buckets across the hash tables as its (approximate) nearest neighbors; e.g., in Figure 3, the points in bucket (2, 2) of hash table 2 and bucket (2, 2) of hash table 1 are neighbors of the query. It is efficient, since retrieving these buckets generally requires constant time. Empirically, the effectiveness and efficiency of LSH are parameterized by the number of hash functions K and the number of hash tables L. A larger K leaves fewer feature points colliding in the same bucket; similarly, increasing L increases the number of candidate nearest neighbors, since more buckets are included across the tables. We will evaluate the impacts of K and L on image object retrieval in Section 6.

3.2 Feature Points Matching in LSH
Figure 4: Illustration of image search by hashing over bags of feature points. Each feature point of a database image is associated with (a) the image id it belongs to and (b) a unique feature id (i.e., A-H), and then hashed into (c) multiple hash tables (cf. Section 3.1). To retrieve the images similar to the query image with a single feature point, we simply hash the query feature point(s) into the buckets across the hash tables (e.g., the grey buckets in (c)). An intuitive way to score target image similarity is simply to count the number of its feature points that reside in the same buckets as the query feature points, as shown in (d).

Originally, hash-based methods support problems where either the query or a target in the database is represented by a single (global) feature. Retrieval is then intuitive, by one-to-many search in LSH, and the candidate targets are those that collide with the query in the buckets across hash tables. With the bag-of-feature-points representation, image retrieval is essentially a many-to-many matching problem in a high-dimensional space.
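The many-to-many matching via per-point LSH lookups and the naive collision-count ranking of Figure 4(d) can be sketched as follows. The `lsh` object, its `query` method, and the id mappings are illustrative assumptions, not the paper's actual interfaces:

```python
from collections import Counter

def retrieve_images(query_points, lsh, point_to_image):
    """Hash each query feature point independently and score database images
    by the number of colliding feature points (naive ranking of Fig. 4(d)).
    `lsh.query(v)` is assumed to return the ids of database points sharing a
    bucket with v in any hash table; `point_to_image` maps a database point
    id to the id of the image it belongs to."""
    scores = Counter()
    for v in query_points:
        for point_id in lsh.query(v):
            scores[point_to_image[point_id]] += 1
    # Rank images by their (noisy) number of candidate matches.
    return scores.most_common()
```

As the text notes, this count-based ranking is noisy; the filtering of Section 3.3 refines it.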
An image typically contains hundreds to thousands of feature points (see a few of them in Figure 6). The image database generally contains millions or even billions of feature points (i.e., 128-dimensional SIFT features). Feature points of the same image are associated with a unique image id. We then hash all feature points into the L hash tables. When querying an image, we issue an LSH query for each feature point independently. Each query point is used to retrieve database points that collide (hash into the same bucket) with it in any of the hash tables. After examining all query points, the retrieved database points are analyzed to identify the database images associated with them. For each retrieved image, we then know which of its feature points have initial matches to those in the query image. The naive way to rank these database images is by the number of possible matches between the query and database images. However, such a ranking criterion is poor, since feature point matching through bucket lookup alone is noisy. We will further inspect the candidate matches between the query and database images and then filter out the false positives (Section 3.3.1).

3.3 Filtering and Ranking Functions
For each database image retrieved, we have a set of initial (noisy) matches to the corresponding feature points in the query image (cf. Figure 4). To improve the matching quality, we employ two filtering methods: (1) inspecting the matches between feature pairs by feature distance and (2) enforcing spatial consistency between (true) matching pairs. We also introduce two image ranking measures. Note that the filtering methods are conducted one by one for each retrieved image, with reference to the query.

3.3.1 Nearest neighbor filtering and similarities
A naive way to filter out the (noisy) matches between the query and a retrieved image is simply to reject candidate points whose (Euclidean) distance is greater than a threshold.
However, such distance-thresholding filtering is not adaptive, and it is difficult to determine a proper threshold. It has been observed that a correct match needs to have its closest matching point significantly closer than the closest incorrect match, while false matches have a certain number of other close false matches, due to the high dimensionality of the feature space [19]. We therefore filter out matching pairs whose ratio of the distances to the first and second nearest neighbors is larger than a threshold (i.e., 0.8). For the matched feature pairs, we transform the feature distance into a similarity score. The ranking score R_S, the matching similarity of the query image Q to a database image I, is defined as:

    R_S(I, Q) = Σ_{v_i ∈ I, v_j ∈ Q} exp( -d(v_i, v_j) / σ )        (2)

where (v_i, v_j) are matched pairs between images I and Q, d(v_i, v_j) is the L2 distance of the match, and σ = 200 is empirically determined 2. Generally, a database image has a higher similarity score if it has more matches to the query image.

3.3.2 Spatial verification and matching inliers
Another important cue for image object retrieval is the spatial relationship between the matched image objects. Besides filtering by feature point similarities, we also investigate filtering by spatial verification, exploiting a geometry model between matching candidates to reject false positive matches. After spatial verification, we keep only the feature points that follow the spatial consistency (e.g., only 11 remain in Figure 6). The approach is closely related to the well-known RANdom SAmple Consensus (RANSAC) algorithm [14]. A basic assumption is that the data consists of inliers and outliers: the former are consistent with the estimated (spatial) model and can be explained by some set of model parameters; the latter are items that do not fit the model. The method builds on the intuition that if the set used for model estimation contains an outlier, the estimated geometry model will not gain much support. With the RANSAC algorithm, we can estimate the possible geometry model between the query Q and the inspected image I and use the model to determine the inliers and outliers among the candidate feature point matches.
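The nearest-neighbor-ratio filtering and the similarity score R_S of Eq. (2) in Section 3.3.1 can be sketched as follows. The per-image data layout (`candidate_matches` as a map from query point id to a distance-sorted candidate list) is an assumption made for illustration:

```python
import math

def filter_and_score(candidate_matches, sigma=200.0, ratio=0.8):
    """NN-ratio filtering (Sec. 3.3.1) and similarity score of Eq. (2).
    `candidate_matches` maps each query point id to a distance-sorted list of
    (database point id, L2 distance) candidates within one retrieved image.
    Returns the surviving matched pairs and the image's R_S score."""
    score = 0.0
    kept = []
    for q_id, cands in candidate_matches.items():
        if not cands:
            continue
        best_id, d1 = cands[0]
        # Reject ambiguous matches: the first NN must be clearly closer
        # than the second (ratio test with threshold 0.8).
        if len(cands) > 1 and d1 > ratio * cands[1][1]:
            continue
        kept.append((q_id, best_id))
        score += math.exp(-d1 / sigma)  # this pair's contribution to R_S
    return kept, score
```

A match surviving the ratio test contributes exp(-d/σ) to R_S, so closer matches, and more of them, yield a higher-ranked image.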
Meanwhile, we can also estimate the matched region in the target image for further applications; see [19] for more details. A matched region is illustrated in Figure 6. Hence, after spatial verification, another ranking score between the query image Q and the target image I is given by the number of inliers:

    R_L(I, Q) = |inliers(I, Q)|        (3)

For example, Figure 6 illustrates the 11 matches between the query object and a retrieved image after spatial verification; i.e., R_L(I, Q) = 11. Note that spatial verification is still very time-consuming. In our experiments, we mainly adopt the prior nearest-neighbor-ratio filtering for the baseline; for efficiency, we perform spatial verification only on the few top-ranked initial results, as in [23], where an efficient spatial verification method is also proposed.

2 Experiments show that σ is not sensitive within a reasonable range. We take σ = 200 for the following experiments.

4. QUERY EXPANSION
To boost hash-based image object retrieval, we propose two expansion strategies, intra-expansion and inter-expansion, explained in the following sections. We will show that combining both expansions can collaboratively boost the performance gains over conventional hash-based image object retrieval (Section 6.3).

4.1 Intra-expansion
LSH-like methods suffer from low recall rates [23][18][11], as similar feature points are likely to be mis-hashed to other buckets. Traditionally, a large number of hash tables is used to ease this problem. However, extra hash tables consume huge amounts of memory and are infeasible for large-scale image object retrieval over bags of feature points. Instead, we propose novel intra-expansion methods that aim to expand more target feature points similar to those in the query.
To optimize performance, we investigate variant methods for effective intra-expansion: (1) associating feature points by co-occurrence across hash tables, (2) probing buckets neighboring the initially hashed bucket, and (3) using meta-features to associate related feature points. The impacts of the proposed methods are evaluated in Section 6.1. Note that such expansion methods merely look up hash buckets and require no extra hashing overhead beyond the initial image object query. The expansion increases the number of candidate feature point matches; as before, we can reject the false positives effectively through the nearest neighbor ratio introduced in Section 3.3.1.

4.1.1 Matched point in LSH (MP)
In this method, we use matched feature points to locate buckets that might accommodate similar feature points mis-hashed elsewhere. Each query feature point is hashed into one bucket per table. The candidate feature points that collide with a query feature point can serve as seeds to associate the buckets in other hash tables into which candidate matches might have been hashed. The intuition is that for a matched feature point A, which collided with the query in one hash table, the feature points that collide with A in the other hash tables might also be candidate matches. See Figure 3 for an illustration: the red star is a feature point of the query image. We can find only two feature points colliding in the same buckets as the query point across the two hash tables. There is actually still a blue circle point similar to the query (i.e., within a certain Euclidean distance). With MP, we can associate the missing feature point by inspecting where the purple triangle, matched in hash table 1, has been hashed in hash table 2. As a result, we can retrieve the blue feature point, which might be associated with a candidate database image.

4.1.2 Multi-probe LSH (MPL)
The prior method associates candidate buckets by feature point co-occurrence.
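The MP co-occurrence lookup can be sketched as follows. The index accessors `bucket(t, key)` and `keys_of(pid)` are illustrative assumptions about the LSH index, not the paper's API:

```python
def mp_expand(query_keys, lsh):
    """Matched-point (MP) intra-expansion sketch (Sec. 4.1.1).
    `query_keys[t]` is the query point's bucket key in hash table t;
    `lsh.bucket(t, key)` is assumed to return the point ids stored in that
    bucket, and `lsh.keys_of(pid)` the per-table bucket keys of a stored
    database point. Points colliding with the query in any table become
    seeds, and the contents of the seeds' buckets in the other tables are
    added as extra candidate matches."""
    seeds = set()
    for t, key in enumerate(query_keys):
        seeds.update(lsh.bucket(t, key))
    expanded = set(seeds)
    for pid in seeds:
        for t, key in enumerate(lsh.keys_of(pid)):
            # Where has this matched point been hashed in the other tables?
            expanded.update(lsh.bucket(t, key))
    return expanded
```

In Figure 3's terms: the purple triangle (a seed matched in hash table 1) pulls in the blue circle from its bucket in hash table 2, even though the circle never collided with the query directly.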
It is less efficient, since with point-based association the time complexity grows linearly with the number of feature points in the same bucket. Another perspective is to locate the buckets neighboring the initial bucket into which the query point is hashed, as motivated by the multi-probe LSH proposed in [20] 3. This is intuitive, since the neighboring buckets are geometrically close to the query feature point and are very likely to accommodate mis-hashed candidate features. The probing sequence for neighboring buckets is not chosen randomly but considers the distance to the initially hashed bucket. See the example in Figure 3: the feature point star is hashed to bucket (2, 2) in hash table 1; its first bucket to probe in MPL is bucket (2, 3) (above), since it is the closest to the query feature point; the next is the bucket to the right, (3, 2), etc. See [20] for more details. Note that the number of probes for neighboring buckets is an important factor for performance and efficiency; we conduct a sensitivity test in Section 6.

3 Note that in [20] multi-probe LSH is designed to reduce the hash table size. Instead, in this work we use it to locate more of the likely buckets in which candidate matches might reside.

Figure 5: An illustration of the need for inter-expansion. Each rectangle represents an image with certain feature points; those of the same shape are assumed matched. The query image can retrieve image A (4 matched pairs) by baseline LSH but cannot find image B (0 matches). However, we can still retrieve image B through image A by including the feature points of image A as expanded feature points, i.e., inter-expansion. Note that intra-expansion is optional here.

4.1.3 Meta-feature (MF)
The prior intra-expansion methods, MP and MPL, both depend on cues provided by the hash structure. We investigate another intra-expansion method that associates feature points by discovering meta- (or representative) features in the original feature space (i.e., SIFT). A meta-feature matched to a query feature point is then used to bring in the set of feature points that the meta-feature represents; that is, one feature point represents a set of similar feature points in the original feature space. Clustering all the feature points of the database images is required to locate the meta-features. We mark each cluster center as a meta-feature and record the feature points belonging to it (i.e., those in the same cluster). The meta-features are then hashed into the hash tables. In the query phase, we retrieve all meta-features that collide with the query and use the matched meta-features as seeds to include the candidate feature points associated with them.
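The meta-feature lookup just described reduces to a table join between matched cluster centers and their members. A minimal sketch, where the offline `meta_to_members` map (built by the clustering step) is an assumed data structure:

```python
def mf_expand(matched_meta_ids, meta_to_members):
    """Meta-feature (MF) intra-expansion sketch (Sec. 4.1.3): each cluster
    center (meta-feature) retrieved from the hash tables brings in the
    member feature points of its cluster. `meta_to_members` maps a
    meta-feature id to the ids of the feature points it represents."""
    candidates = set()
    for m in matched_meta_ids:
        candidates.update(meta_to_members.get(m, ()))
    return candidates
```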
Note that since clustering a large number of high-dimensional data points is very time-consuming, we adopt hierarchical k-means (HKM) [22] as the clustering method; it has been shown to be much more efficient than conventional (flat) k-means. Empirically, we take 10K cluster centers in HKM as meta-features.

4.2 Inter-expansion

Due to changes in lighting conditions, camera angles, occlusions, etc., there are feature points that strongly characterize the query object but are not present in the query image, or that appear very different from the query features. Such issues cannot be solved by the intra-expansion methods mentioned above. We propose to seek the solution from the initial query results by inter-expansion.

Figure 6: Matches and regions in the target (database) image for inter-expansion. There are two candidate regions for inter-expansion: region A (red), the (projected) matched region, and region B, the whole image. Note that feature point matches are usually noisy and random; the 11 matches shown are retained as inliers after spatial verification.

Figure 5 illustrates the intuition for inter-expansion. For example, the query image can robustly retrieve image A through LSH methods (or enhanced by the intra-expansion methods of Section 4.1) with 4 matches, but fails to retrieve image B (no matches). However, we can associate image B by taking image A as a new query image. Such behaviors have been observed in pseudo-relevance feedback (PRF) for text-based retrieval [7][28] and video retrieval [29].
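The intuition of Figure 5, reaching image B only through image A, can be sketched with a toy retrieval function. Everything here is illustrative: images are sets of "feature ids" and similarity is the number of shared ids, a stand-in for the paper's feature point matching.

```python
# Toy inter-expansion: image B shares nothing with the query but overlaps
# with image A, so B becomes reachable only when A is re-issued as a query.
def search(database, query, k):
    # rank images by a toy similarity (number of shared feature ids),
    # keeping only images with at least one match
    scored = [(len(set(img) & set(query)), img) for img in database]
    ranked = [img for score, img in sorted(scored, key=lambda t: -t[0])
              if score > 0]
    return ranked[:k]

def inter_expand(database, query, k=2):
    initial = search(database, query, k)
    expanded = {tuple(img) for img in initial}
    for seed in initial:                      # each seed becomes a new query
        expanded.update(tuple(img) for img in search(database, seed, k))
    return expanded
```

The expanded pool contains images reachable in two hops, which is exactly what the baseline single-query search misses.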
For optimizing inter-expansion in hash-based image object retrieval, we investigate several factors parameterizing the retrieval results: (1) whether a filtering process is needed to determine the seeded images from the initial search result for expansion (more details in Section 3.3.1); (2) the proper region for expansion, i.e., the region of interest or the entire image (more details in Figure 6); (3) effective fusion methods and similarity measures for fusing the multiple ranking lists expanded from the images in the initial search result (cf. Figure 5). Note that such inter-expansion methods are mostly realized by searching for related feature points in the buckets; the most time-consuming part is spatial verification (cf. Section 3.3.2) for evaluating matching inliers.

4.2.1 Pseudo-relevance feedback (PRF)

PRF is the most intuitive method for inter-expansion. It was initially introduced in [7], where the top-ranking documents are used to rerank the retrieved documents, assuming that a significant fraction of the top-ranked documents will be relevant. This is in contrast to relevance feedback, where users explicitly provide feedback by labeling the top results as positive or negative. For inter-expansion, we automatically expand the retrieved images by issuing new queries with the top-ranking images from the initial search result, since some characteristic feature points relevant to the search targets might not exist in the query image but only in the search results. Figure 5 demonstrates the need and the process for PRF (or inter-expansion). PRF simply assumes that the top retrieved images are correct, which might hold for text retrieval but not for photo or video search [29]; the latter generally involves noisier high-dimensional data and often mistakenly includes false positives as the seeded images for expansion. We will therefore investigate whether image filtering (i.e., spatial verification) is needed.
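The filtering alternative to blindly trusting top ranks can be sketched in a few lines. This is a hedged sketch: the inlier counts are given as input, whereas in the paper they come from the spatial verification step (Section 3.3.2), and the threshold name `delta` mirrors the δ used later in the text.

```python
# Seed selection for inter-expansion: instead of taking the top-k results
# as-is (PRF), keep only results whose spatially verified inlier count
# exceeds a threshold delta.
def select_seeds(initial_results, delta):
    """initial_results: list of (image_id, inlier_count) in ranking order.
    Returns the image ids to re-issue as expansion queries."""
    return [img for img, inliers in initial_results if inliers > delta]
```

Images with few verified inliers (likely false positives) are thus excluded from seeding further expansion.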
Meanwhile, it is not intuitive to select the number of top-ranked images for expansion, and the choice is mostly query-dependent. The fusion of multiple expanded lists is discussed below.
Figure 7: Examples from the two photo datasets for evaluating image object retrieval. Oxford landmarks, (a) balliol and (b) ashmolean, in the Oxford Building dataset; (c) Torre Pendente Di Pisa and (d) Starbucks in the Flickr11K dataset, which includes multiple buildings and small logos and is more complicated and diverse. Rectangles highlight the query objects.

4.2.2 Full image by spatial verification (SVFI)

Rather than blindly taking the top images as seeds for inter-expansion as in PRF, we select those images from the initial search result whose matched inliers exceed a threshold δ, as shown in Figure 5. Naturally, images with more inliers to the query are more likely to be true targets. Here the number of inliers between the query image and a target image is determined by spatial verification, discussed in Section 3.3.2. See the example in Figure 6, where we have 11 inliers between the target and query images; region B, the entire target image, is used for further image expansion in SVFI. The threshold δ for determining correct images is experimented with in Section 6.4.

4.2.3 Matched region by spatial verification (SVMR)

Similar to SVFI, we conduct spatial verification and filter out incorrect images from the initial search result by the inlier threshold δ. However, the region for inter-expansion is the matched region corresponding to the query object, i.e., region A in Figure 6. As explained in Section 3.3.2, this region generally corresponds to the region of query interest and can be estimated during the spatial verification process.

4.2.4 Fusion methods

As shown in Figure 5, multiple images or image regions from the initial search result will be used for expansion. Generally, each image generates a ranking list through LSH, possibly further enhanced by the intra-expansion methods introduced in Section 4.1. To maximize the performance of fusing the multiple ranking lists from the seeded inter-expansion images, we consider several fusion methods and similarity measures as suggested in [32].

Average Score (AVG): the expanded image in each ranking list (cf. Figure 5) is scored by the matching similarity RS in Equation 2. For inter-expansion, the query image is now the seeded image from the initial search result (e.g., image A). The final ranking score for each expanded image is the average of its scores over the ranking lists. Note that since LSH-based retrieval returns only a subset of the database images, we assume a zero score if an image is not within an expanded ranking list; for example, image B exists only in ranking list 2 but not in ranking list 3, so it is assumed to score zero in ranking list 3. Note that, for AVG, spatial verification is not employed for the expanded ranking lists, which makes it efficient.

Maximum Score (MAX): same as AVG, except that the final ranking score for each expanded image is the maximum over the ranking lists.

Average Inliers (AVG_IL): similar to AVG, except that the inlier score RL in Equation 3 is adopted. This fusion method requires applying spatial verification to the expanded ranking lists and hence extra computation time.

Maximum Inliers (MAX_IL): same as AVG_IL, except that the final ranking score for each expanded image is the maximum over the ranking lists.

Borda Score (Borda): instead of the matching similarity score RS or the inlier score RL, the Borda score rates each expanded image by its ranking position in its list: for an image at rank i among N in total, its Borda score is 1 - i/N. The final ranking score for each expanded image is the average of its Borda scores across the expanded ranking lists.

4.3 Iteration for Expansion

The expanded photos can then iteratively serve as the seeded photos for another round of inter-expansion (optionally enhanced with intra-expansion). The iteration stops when no more new photos are considered relevant (thresholded by the inlier number δ) in the expansion. For example, in Figure 5, from expanded ranking lists 2 and 3 we choose the retrieved images with more than δ matching inliers as the new seeded images for the next inter- and intra-expansion. The expanded correct images are collected and finally ranked by their corresponding ranking measures when the iteration stops.

5. EXPERIMENTAL SETUPS

We experiment with the proposed methods on two photo retrieval benchmarks, Oxford Building [23] and Flickr11K, a subset of Flickr550 [30]. Some of the queries and their ground-truth images are exemplified in Figure 7. Note that we do not aim to optimize the retrieval performance on the benchmarks but to investigate the relative improvements of the proposed expansion methods over LSH.

5.1 Datasets

Oxford Building dataset: The Oxford Buildings dataset [23] consists of 5,062 images collected by issuing Oxford landmark names as search keywords on the Flickr website. The dataset includes 11 query categories with 5 queries each. We use the cropped image objects from the authors [23] for the 55 query photos, illustrated in Figure 7 (a) and (b) (query objects are in the red rectangles). The images are downsized to a quarter of the original size to match the image dimensions in our second dataset; the average image size is about 512x374 pixels. There are in total 7,162,122 feature points in the Oxford dataset, on average 1,415 per photo.

Flickr11K dataset: Flickr11K is a larger dataset consisting of 11,282 medium-resolution (500x360) images. It is a subset provided by the authors of the Flickr550 dataset [30], downloaded from Flickr in the European Travel group. We modify the queries and ground truth defined by the authors to suit the
objective of content-based photo search. The result is a total of 1,282 ground-truth images in 7 query categories, such as Torre Pendente Di Pisa and Starbucks in Figure 7 (c) and (d). We then add 10K images randomly sampled from Flickr550 to form our Flickr11K dataset.

5.2 Performance Metrics

To evaluate the retrieval performance, we use average precision (AP) as the major performance metric. Widely adopted in large-scale photo/video retrieval benchmarks such as TRECVID [2], Oxford Buildings [23], and Flickr550 [30], AP approximates the area under the non-interpolated precision-recall curve for a query. Since AP only shows the performance for a single image query, we compute the mean average precision (MAP) to represent the system performance over all queries: 55 query images (across 11 categories) in the Oxford dataset and 56 query images (across 7 categories) in the Flickr11K dataset.

Table 1: Comparisons of intra-expansion methods (columns: Baseline, MP, MPL, MF; rows: MAP and relative improvement %), which all improve the baseline hash-based image retrieval on the Oxford dataset. MPL (100 probes) is the most effective, as it probes more buckets neighboring the initially hashed bucket. MP and MF are limited due to loosely associating other candidate feature points by co-occurrence in other hash tables or by sparse meta-features. % stands for relative improvement over the hash baseline.

5.3 Baseline LSH Configurations

The experiments for LSH expansion are based on the E2LSH implementation [1]. For object-level image retrieval with bags of feature points, the default configurations need to be adjusted to optimize efficiency and effectiveness. Through a pilot experiment on a held-out dataset, we found that decreasing K or increasing L helps locate more matched feature point candidates, since a smaller K puts looser constraints and hashes more feature points into the same bucket, and more hash tables (larger L) help accumulate more collided feature points.
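The effect of K and L can be seen from the standard LSH collision identity: if each elementary hash agrees on a true match with probability p, the probability that a pair collides in at least one of L tables of K concatenated hashes is 1 - (1 - p^K)^L. A quick numerical check (the value of p here is illustrative, not measured):

```python
# Standard LSH collision probability for K concatenated hashes and L hash
# tables, assuming each elementary hash matches with probability p.
# Illustrates the text: a smaller K or a larger L retrieves more candidates.
def collision_prob(p, K, L):
    return 1.0 - (1.0 - p ** K) ** L
```

For example, with p = 0.9, halving K from 10 to 5 or doubling L from 10 to 20 both raise the collision probability, i.e., more candidate matches fall into probed buckets.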
For baseline LSH, we set K=10, L=10, and W=400 through a sensitivity test on the held-out set, balancing time efficiency and effectiveness. (According to our preliminary experiments, more hash tables can slightly improve the retrieval performance, but at the cost of huge memory and slow response time; we therefore retain a small number of hash tables.) Generally, each query feature point retrieves on average 0.01% of the total number of feature points as candidates. For the Oxford dataset, we have 7,162,122 feature points in LSH; on average, each table has around 1,192,240 buckets. We also set the threshold for the matched-pair filtering ratio to 0.8, as suggested in [19]. Note that if a query feature point has only one nearest neighbor, we treat the two points as matched. The MAPs of the hash-based baseline on the Oxford and Flickr11K datasets are listed in the second row of Table 3. With the configurations mentioned above, the average image query time on the two datasets is 0.8 and 1.2 seconds, respectively, on a regular Linux server with an Intel CPU. We also applied spatial verification to the hash-based baselines and found the improvement marginal, due to the very sparse matching between photo pairs, as commonly observed in the bag-of-visual-words paradigm as well [11]. We will show later that introducing inter- and intra-expansion can significantly boost the hash-based baseline by bringing in more target-related and context-related feature points.

Table 2: Comparisons of the inter-expansion variants and fusion methods on the Oxford dataset (columns: PRF, SVFI, SVMR, for inter-expansion only and for intra + inter expansion; rows: AVG, MAX, Borda, MAX_IL, AVG_IL); most of them outperform the hash-based baseline (in MAP). The best performer for inter-expansion only is SVMR with MAX_IL fusion, and SVMR with AVG fusion when further considering intra-expansion. See more explanations in Section 6.2. Note that MPL is used for intra-expansion.

6.
RESULTS AND DISCUSSIONS

6.1 Intra-expansion on LSH

We first experiment with the variant methods for intra-expansion, which aims to locate similar feature points mis-hashed to other buckets. According to Table 1, the expansion methods (i.e., MP, MPL, and MF, cf. Section 4.1) all improve the hash-based baseline for object-level image retrieval on the Oxford dataset: the expansion methods do include more related feature points that help the matching of bags of feature points. Note that the candidate feature points are later filtered by the nearest-neighbor ratio (cf. Section 3.3.1). Among them, multi-probe LSH (MPL) outperforms MP and MF because MPL can still locate neighboring hash buckets, where candidate feature points might reside, even if there are no matched points in the initially hashed bucket. MP, in contrast, associates candidate feature points by feature co-occurrence, and the association between feature points can be sparse. Similar deficiencies are observed for MF, which utilizes meta-features to associate candidate feature points; however, the total number of buckets is so huge that the meta-features might not be hashed into the same buckets as the query feature points.

6.2 Inter-expansion and Fusion Methods

The goal of inter-expansion is to retrieve more target-related photos given the initial search results, since there might be certain context-related feature points not observed in the given query. We compare the three inter-expansion methods (PRF, SVFI, and SVMR) under different fusion schemes, which are necessary since multiple photos in the initial search result are generally seeded for expansion and their respective expanded lists need to be fused effectively, as illustrated in Figure 5. We also examine the impact of further applying intra-expansion. For inter-expansion only (cf. Table 2) on the Oxford dataset, PRF is the worst and improves the hash baseline by at most 6.1% relatively (by MAX fusion).
This is natural since PRF simply assumes that the top returns are correct and issues them for expansion directly. Such a limitation of PRF-like methods has also been observed in video retrieval [9][29]. If the initial seeded photos are further filtered by spatial verification (cf. Section 3.3.2), the expanded results mostly show salient gains (e.g., up to 23%) over
the hash baseline (see SVFI in Table 2) across fusion methods. Apparently, spatial verification helps select effective photos for expansion, since the search targets are often spatially correlated for these object or landmark photos. Another factor is whether the photo area used for expansion (either the matched region only or the entire photo) affects the expansion quality. For inter-expansion only, SVFI and SVMR do not differ much; the former uses the entire image for expansion and the latter uses the matched region only (cf. Figure 6). As more matched feature points are brought in by intra-expansion, matching the region of interest (i.e., SVMR) with proper fusion methods (i.e., AVG, AVG_IL) saliently outperforms inter-expansion by the whole seeded photo, which is generally contaminated by background noise not relevant to the target photos. Meanwhile, as shown in Table 2, SVMR with AVG fusion and intra-expansion has the largest gain (MAP improved to 0.460, a 76.3% relative improvement) among all configurations on the Oxford dataset, and likewise on Flickr11K.

Table 3: The performance gain boosts saliently when combining intra- and inter-expansion on both the Oxford and Flickr11K datasets (rows: Baseline, Intra, Inter, Intra + Inter, Intra + Inter + Iteration; columns: MAP and relative improvement % for each dataset).

As for the fusion methods, AVG, AVG_IL, and MAX_IL are generally on par across the inter-expansion configurations in Table 2. However, AVG_IL and MAX_IL are time-consuming, as evaluating matched inliers requires extra spatial verification over the expanded photo lists, whereas AVG simply averages the feature point matching similarities across the expanded photo lists and is very efficient. Borda fusion, taking only the ranking order from the expanded photo lists, is less effective since it ignores the matched inliers, spatial information, and feature point similarities.
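The fusion rules compared above can be sketched directly. This is a hedged sketch: score dictionaries stand in for the expanded ranking lists, the similarity and inlier scores (RS, RL) are assumed precomputed, and images absent from a list contribute a zero score, as described for AVG.

```python
# Sketch of the AVG, MAX, and Borda fusion rules over expanded ranking
# lists. Each list maps image id -> score; absent images score zero.
def fuse_avg(lists):
    images = {img for lst in lists for img in lst}
    return {img: sum(lst.get(img, 0.0) for lst in lists) / len(lists)
            for img in images}

def fuse_max(lists):
    images = {img for lst in lists for img in lst}
    return {img: max(lst.get(img, 0.0) for lst in lists) for img in images}

def borda_scores(ranked_list, n_total):
    # an image at rank i (1-based) among N in total scores 1 - i/N
    return {img: 1.0 - i / n_total
            for i, img in enumerate(ranked_list, start=1)}
```

AVG_IL and MAX_IL follow the same AVG/MAX shapes with inlier counts as scores, which is why they incur the extra spatial verification cost noted above.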
6.3 Combining Intra- and Inter-Expansion

We have shown that both intra- and inter-expansion improve the hash-based baseline for image object retrieval. More interestingly, the performance boosts significantly when both expansion strategies are combined: more reliable matching pairs are retrieved through intra-expansion, and more context-related photos are yielded through inter-expansion from the retrieved image lists. Besides the comparisons by MAP, a sample query under the different expansion methods is illustrated in Figure 2, where some results ranked low by the hash-based baseline are boosted to the top by the expansion methods, as also demonstrated by the precision-recall (PR) curves in Figure 2(b). As shown in Table 3, combining intra-expansion (MPL with 100 probes) and inter-expansion (SVMR with AVG fusion) improves MAP by 76.3% relatively over the LSH baseline on the Oxford dataset.

Figure 8: Performance breakdown on the Oxford dataset, with 100-probe MPL for intra-expansion and SVMR with AVG fusion for inter-expansion. All expansion methods improve the hash-based baseline across query categories. Only a few query categories degrade slightly under inter-expansion because some incorrect images are seeded for expansion. Interestingly, "inter + intra" and "inter + intra + iteration" outperform all others.

Another interesting observation is that intra- and inter-expansion seem to work orthogonally, and the two expansion methods collaboratively boost the performance gains. For example, on the Oxford dataset we have a 52% relative improvement for intra and 13.1% for inter; ideally, the multiplied gain from the two methods would be around 72% (0.72 ≈ (1 + 0.52) × (1 + 0.131) − 1); in practice we obtain 76.3%. Similarly, for the Flickr11K dataset we have 67.3% for intra and 5.2% for inter; ideally, the multiplied gain would be 76% (0.76 ≈ (1 + 0.673) × (1 + 0.052) − 1), and empirically it is 75%.
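The independence estimate above is simple compounding of relative gains, and the two reported figures can be checked directly:

```python
# If intra- and inter-expansion acted independently, their relative MAP
# gains would compound multiplicatively: (1 + g_intra)(1 + g_inter) - 1.
def combined_gain(g_intra, g_inter):
    return (1 + g_intra) * (1 + g_inter) - 1

oxford_ideal = combined_gain(0.52, 0.131)    # ~0.72, vs. 0.763 measured
flickr_ideal = combined_gain(0.673, 0.052)   # ~0.76, vs. 0.75 measured
```

The measured gains landing close to (Oxford slightly above, Flickr11K slightly below) the multiplicative estimate supports the claim that the two strategies are largely orthogonal.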
Figure 8 shows the performance breakdown over the 11 query categories in the Oxford dataset for the major expansion methods. All expansion methods improve the hash-based baseline across query categories. Only a few query categories (e.g., Balliol) degrade marginally under inter-expansion, because some incorrect images from the hash baseline are seeded for expansion due to unreliable feature point matches. Interestingly, when we combine intra- and inter-expansion (either "inter + intra" or "inter + intra + iteration"), the former brings in more robust feature point matches, and the combination significantly outperforms the hash baseline across queries. Iteratively expanding the retrieval results with inter- and intra-expansion is slightly helpful (2%-5% relative improvement), as shown in the last row of Table 3; however, it takes quite some time to conduct the spatial verification for selecting the seeded photos for expansion.

6.4 Parameter Sensitivity

6.4.1 Number of probes for MPL intra-expansion

For intra-expansion, the most effective method is MPL, whose essential parameter is the number of probes to the neighboring buckets (cf. Section 4.1.2). Experimenting on the Oxford dataset, we test different numbers of probes and find that the performance (in MAP) saturates between 100 and 350 probes, as shown in Figure 9. Generally, the performance increases as more buckets are probed, which is natural since it improves the recall of mis-hashed feature points. However, probing more than 350 buckets degrades the performance due to including more noisy feature points. Note that the number of candidate feature points increases as more buckets are probed. Balancing efficiency and effectiveness, we choose 100 probes for MPL intra-expansion; even so, the total number of candidate feature points inspected for a feature point query remains small, on average about 0.1% of the total.
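The distance-ordered probing behind these numbers can be sketched as follows. This is a simplified illustration, not the full algorithm of [20]: only single-coordinate perturbations are scored, and the query-to-boundary distances are given as input rather than derived from the hash projections.

```python
# Simplified multi-probe ordering: neighboring buckets are visited in
# increasing order of the query's distance to their boundary, not randomly.
# Full multi-probe LSH also scores multi-coordinate perturbations [20].
def probe_sequence(bucket, boundary_dist, n_probes):
    """bucket: tuple of hash indices; boundary_dist: maps a (dim, +1 or -1)
    perturbation to the query's distance to that bucket boundary.
    Returns the first n_probes neighboring buckets to visit."""
    candidates = []
    for (dim, step), dist in boundary_dist.items():
        neighbor = list(bucket)
        neighbor[dim] += step
        candidates.append((dist, tuple(neighbor)))
    candidates.sort()                      # nearest boundary probed first
    return [b for _, b in candidates[:n_probes]]
```

In the Figure 3 example, a query hashed to bucket (2, 2) that lies nearest the upper boundary would probe (2, 3) first and (3, 2) next, matching the described probing order.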
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationA Miniature-Based Image Retrieval System
A Miniature-Based Image Retrieval System Md. Saiful Islam 1 and Md. Haider Ali 2 Institute of Information Technology 1, Dept. of Computer Science and Engineering 2, University of Dhaka 1, 2, Dhaka-1000,
More informationover Multi Label Images
IBM Research Compact Hashing for Mixed Image Keyword Query over Multi Label Images Xianglong Liu 1, Yadong Mu 2, Bo Lang 1 and Shih Fu Chang 2 1 Beihang University, Beijing, China 2 Columbia University,
More informationA Systems View of Large- Scale 3D Reconstruction
Lecture 23: A Systems View of Large- Scale 3D Reconstruction Visual Computing Systems Goals and motivation Construct a detailed 3D model of the world from unstructured photographs (e.g., Flickr, Facebook)
More informationTextural Features for Image Database Retrieval
Textural Features for Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington Seattle, WA 98195-2500 {aksoy,haralick}@@isl.ee.washington.edu
More informationSpecular 3D Object Tracking by View Generative Learning
Specular 3D Object Tracking by View Generative Learning Yukiko Shinozuka, Francois de Sorbier and Hideo Saito Keio University 3-14-1 Hiyoshi, Kohoku-ku 223-8522 Yokohama, Japan shinozuka@hvrl.ics.keio.ac.jp
More informationDimension Reduction CS534
Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of
More information2. LITERATURE REVIEW
2. LITERATURE REVIEW CBIR has come long way before 1990 and very little papers have been published at that time, however the number of papers published since 1997 is increasing. There are many CBIR algorithms
More informationFace detection and recognition. Detection Recognition Sally
Face detection and recognition Detection Recognition Sally Face detection & recognition Viola & Jones detector Available in open CV Face recognition Eigenfaces for face recognition Metric learning identification
More informationClassifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao
Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped
More informationCS 223B Computer Vision Problem Set 3
CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.
More informationVolume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Internet
More informationSpeed-up Multi-modal Near Duplicate Image Detection
Open Journal of Applied Sciences, 2013, 3, 16-21 Published Online March 2013 (http://www.scirp.org/journal/ojapps) Speed-up Multi-modal Near Duplicate Image Detection Chunlei Yang 1,2, Jinye Peng 2, Jianping
More informationRanking Clustered Data with Pairwise Comparisons
Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances
More informationLarge Scale Nearest Neighbor Search Theories, Algorithms, and Applications. Junfeng He
Large Scale Nearest Neighbor Search Theories, Algorithms, and Applications Junfeng He Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School
More informationString distance for automatic image classification
String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,
More informationProblem 1: Complexity of Update Rules for Logistic Regression
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1
More informationImgSeek: Capturing User s Intent For Internet Image Search
ImgSeek: Capturing User s Intent For Internet Image Search Abstract - Internet image search engines (e.g. Bing Image Search) frequently lean on adjacent text features. It is difficult for them to illustrate
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationNearest Neighbor with KD Trees
Case Study 2: Document Retrieval Finding Similar Documents Using Nearest Neighbors Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox January 22 nd, 2013 1 Nearest
More informationFast Efficient Clustering Algorithm for Balanced Data
Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut
More informationImage Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18.
Image Analysis & Retrieval CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W 4-5:15pm@Bloch 0012 Lec 18 Image Hashing Zhu Li Dept of CSEE, UMKC Office: FH560E, Email: lizhu@umkc.edu, Ph:
More informationRobot localization method based on visual features and their geometric relationship
, pp.46-50 http://dx.doi.org/10.14257/astl.2015.85.11 Robot localization method based on visual features and their geometric relationship Sangyun Lee 1, Changkyung Eem 2, and Hyunki Hong 3 1 Department
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
More informationLearning independent, diverse binary hash functions: pruning and locality
Learning independent, diverse binary hash functions: pruning and locality Ramin Raziperchikolaei and Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced
More informationChapter 5: Outlier Detection
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr.
More informationLeveraging Transitive Relations for Crowdsourced Joins*
Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,
More informationDSH: Data Sensitive Hashing for High-Dimensional k-nn Search
DSH: Data Sensitive Hashing for High-Dimensional k-nn Search Jinyang Gao, H. V. Jagadish, Wei Lu, Beng Chin Ooi School of Computing, National University of Singapore, Singapore Department of Electrical
More informationAdaptive Binary Quantization for Fast Nearest Neighbor Search
IBM Research Adaptive Binary Quantization for Fast Nearest Neighbor Search Zhujin Li 1, Xianglong Liu 1*, Junjie Wu 1, and Hao Su 2 1 Beihang University, Beijing, China 2 Stanford University, Stanford,
More informationLarge-scale visual recognition The bag-of-words representation
Large-scale visual recognition The bag-of-words representation Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline Bag-of-words Large or small vocabularies? Extensions for instance-level
More informationA Keypoint Descriptor Inspired by Retinal Computation
A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement
More informationBeyond Mere Pixels: How Can Computers Interpret and Compare Digital Images? Nicholas R. Howe Cornell University
Beyond Mere Pixels: How Can Computers Interpret and Compare Digital Images? Nicholas R. Howe Cornell University Why Image Retrieval? World Wide Web: Millions of hosts Billions of images Growth of video
More informationTask Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document
More informationClassifiers and Detection. D.A. Forsyth
Classifiers and Detection D.A. Forsyth Classifiers Take a measurement x, predict a bit (yes/no; 1/-1; 1/0; etc) Detection with a classifier Search all windows at relevant scales Prepare features Classify
More informationArtificial Intelligence. Programming Styles
Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to
More informationMetric Learning for Large-Scale Image Classification:
Metric Learning for Large-Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Florent Perronnin 1 work published at ECCV 2012 with: Thomas Mensink 1,2 Jakob Verbeek 2 Gabriela Csurka
More informationModule 1 Lecture Notes 2. Optimization Problem and Model Formulation
Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization
More informationHand Posture Recognition Using Adaboost with SIFT for Human Robot Interaction
Hand Posture Recognition Using Adaboost with SIFT for Human Robot Interaction Chieh-Chih Wang and Ko-Chih Wang Department of Computer Science and Information Engineering Graduate Institute of Networking
More informationContent Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features
Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features 1 Kum Sharanamma, 2 Krishnapriya Sharma 1,2 SIR MVIT Abstract- To describe the image features the Local binary pattern (LBP)
More informationClustering Part 4 DBSCAN
Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of
More informationShort Run length Descriptor for Image Retrieval
CHAPTER -6 Short Run length Descriptor for Image Retrieval 6.1 Introduction In the recent years, growth of multimedia information from various sources has increased many folds. This has created the demand
More informationCRF Based Point Cloud Segmentation Jonathan Nation
CRF Based Point Cloud Segmentation Jonathan Nation jsnation@stanford.edu 1. INTRODUCTION The goal of the project is to use the recently proposed fully connected conditional random field (CRF) model to
More informationCS 231A Computer Vision (Fall 2012) Problem Set 3
CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest
More informationSYDE Winter 2011 Introduction to Pattern Recognition. Clustering
SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned
More informationMobile Human Detection Systems based on Sliding Windows Approach-A Review
Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London What Is Text Clustering? Text Clustering = Grouping a set of documents into classes of similar
More informationA Pivot-based Index Structure for Combination of Feature Vectors
A Pivot-based Index Structure for Combination of Feature Vectors Benjamin Bustos Daniel Keim Tobias Schreck Department of Computer and Information Science, University of Konstanz Universitätstr. 10 Box
More informationCombining Appearance and Topology for Wide
Combining Appearance and Topology for Wide Baseline Matching Dennis Tell and Stefan Carlsson Presented by: Josh Wills Image Point Correspondences Critical foundation for many vision applications 3-D reconstruction,
More informationImage Matching Using Run-Length Feature
Image Matching Using Run-Length Feature Yung-Kuan Chan and Chin-Chen Chang Department of Computer Science and Information Engineering National Chung Cheng University, Chiayi, Taiwan, 621, R.O.C. E-mail:{chan,
More informationFace Recognition using Eigenfaces SMAI Course Project
Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract
More informationImproved Coding for Image Feature Location Information
Improved Coding for Image Feature Location Information Sam S. Tsai, David Chen, Gabriel Takacs, Vijay Chandrasekhar Mina Makar, Radek Grzeszczuk, and Bernd Girod Department of Electrical Engineering, Stanford
More informationTop-K Entity Resolution with Adaptive Locality-Sensitive Hashing
Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing Vasilis Verroios Stanford University verroios@stanford.edu Hector Garcia-Molina Stanford University hector@cs.stanford.edu ABSTRACT Given
More informationNearest neighbor classification DSE 220
Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000
More information