Distributed similarity search algorithm in distributed heterogeneous multimedia databases


Information Processing Letters 75 (2000) 35-42

Ju-Hong Lee a,1, Deok-Hwan Kim a,2, Seok-Lyong Lee a,3, Chin-Wan Chung b,*, Guang-Ho Cha c,4

a Department of Information and Communication Engineering, Korea Advanced Institute of Science and Technology, 373-1, Kusong-Dong, Yusong-Gu, Taejon 305-701, South Korea
b Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusong-Dong, Yusong-Gu, Taejon 305-701, South Korea
c IBM Almaden Research Center, San Jose, CA, USA

* Corresponding author. chungcw@islab.kaist.ac.kr
1 jhlee@islab.kaist.ac.kr
2 dhkim@islab.kaist.ac.kr
3 sllee@islab.kaist.ac.kr
4 ghcha@almaden.ibm.com

Received 5 November 1999; received in revised form 30 March 2000
Communicated by K. Iwama

Abstract

The collection fusion problem in multimedia databases is concerned with merging the results retrieved by content-based retrieval from distributed heterogeneous multimedia databases in order to find the objects most similar to a query object. We propose distributed similarity search algorithms, two heuristic algorithms and an algorithm using linear regression, to solve this problem. To our knowledge, these algorithms are the first research results in the area of distributed content-based retrieval for heterogeneous multimedia databases. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Distributed similarity search algorithms; Collection fusion; Multimedia databases; Information retrieval

1. Introduction

Along with the current growth of the Internet and the Web, access to distributed multimedia databases has emerged as an important research issue. To retrieve information from numerous data sources, a global server is needed to integrate various resources and process queries in a distributed manner [2]. It distributes user queries to local databases, integrates results to fit user requirements, and also provides the illusion of a single database.
A key problem is how to extract relevant objects for a query from distributed heterogeneous databases that use different similarity measures. This issue is called the collection fusion problem. It has been studied extensively for text databases [1,3,5,8], but not for multimedia databases. The problem arises from the difference of similarity measures in a heterogeneous environment. The detailed scenario is as follows. At the global server, a user wants to retrieve objects similar to a query object from local databases using a global similarity measure. However, a local database does not support the global similarity measure, only its own local similarity measure. When the global similarity measure is completely different from the local similarity measure, for instance when the global similarity measure uses color and the local similarity measure uses texture, a user cannot get an appropriate result for a query. Therefore, a global similarity measure must be correlated with a local similarity measure.

In this paper, we show that there are cases in which a linear relationship holds between two similarity measures, and we propose novel distributed similarity search algorithms to solve the collection fusion problem for such cases in distributed heterogeneous multimedia databases.

This paper is organized as follows. Section 2 defines the collection fusion problem and provides assumptions for the problem. In Section 3, we propose distributed similarity search algorithms to solve the collection fusion problem. The experimental results are shown in Section 4. Concluding remarks are made in Section 5.

2. Collection fusion for distributed multimedia databases

We discuss several assumptions concerning the global server and local databases. The algorithms proposed in this paper are developed based on these assumptions.

Assumption 1. The global server selects local databases supporting similarity measures that are correlated with a global similarity measure, and then submits the query to them.

Assumption 2. Local databases support incremental similarity ranking, such as the method using a get-more-objects facility described in [7].

Assumption 3. For a given query, local databases return the objects locally most similar to the query object together with their local similarity values as the query result.

The following are the formal definition and objectives of the collection fusion problem.

Definition. The collection fusion problem in multimedia databases is how to retrieve and merge the results from distributed heterogeneous multimedia databases to find relevant objects, that is, the k objects most similar to a query object using a global similarity measure.

Objectives.
For distributed similarity search of a given query $Q$, let $R_Q^i$ be the set of relevant objects in the $i$th local database and $I_Q^i$ be the set of irrelevant objects in the $i$th local database. Then $R_Q^i \cap I_Q^i = \emptyset$ and $R_Q^i \cup I_Q^i = \{\text{all objects in the } i\text{th local database}\}$. Let $V_Q^i$ be the set of objects retrieved from the $i$th local database. We have the constraint that the total number of objects retrieved from the local databases is fixed, $\sum_{i=1}^{n} |V_Q^i| = ck$, where $c$ is a constant larger than 1, $k$ is the number of relevant objects that a user wants to retrieve, $n$ is the number of local databases, and $|S|$ is the number of elements of set $S$. The objectives of the collection fusion problem with this constraint are as follows:

(1) The ratio of retrieved objects among relevant objects should be maximized. That is, maximize $\sum_{i=1}^{n} |R_Q^i \cap V_Q^i| \,/\, \sum_{i=1}^{n} |R_Q^i|$ subject to the constraint $\sum_{i=1}^{n} |V_Q^i| = ck$.

(2) The ratio of irrelevant objects among retrieved objects should be minimized. That is, minimize $\sum_{i=1}^{n} |I_Q^i \cap V_Q^i| \,/\, \sum_{i=1}^{n} |V_Q^i|$ subject to the constraint $\sum_{i=1}^{n} |V_Q^i| = ck$.

Objective (1) maximizes the recall and objective (2) maximizes the precision, because the precision is $1 - \sum_{i=1}^{n} |I_Q^i \cap V_Q^i| \,/\, \sum_{i=1}^{n} |V_Q^i|$.

Since we assume that servers are independent and autonomous, their similarity measures may differ from each other. Therefore the similarity value that a local similarity measure assigns to an object from a local database and a query object may differ from the value assigned by the global similarity measure at the global server.

There are many similarity measures for content-based image retrieval, and some of them are correlated. To show such cases, we present examples using the RGB color and the RGB texture.

Example 1. Let the global server and the local database support image similarity search using color. The global server extracts average RGB color features for the 6 × 6 subimages of an image and measures its similarity value against a query image using the inter-feature normalization described in MARS [6]. The local database extracts average RGB color features for the 4 × 4 subimages of an image and measures its similarity value as the global server does. From 3016 images, 3000 arbitrary pairs of images are selected. For each pair, the local similarity value x and the global similarity value y are measured. The scatter diagram of the set of (x, y) values for the 3000 selected pairs is shown in Fig. 1. In this case, the diagram shows the shape of a straight line.

Fig. 1. Scatter diagram for the average RGB color 4 × 4 and the average RGB color 5 × 5 in the case that arbitrary pairs among images are chosen.

Example 2. In Fig. 2, the similarity values of the y coordinate are obtained using the average RGB color of 5 × 5 subimages while those of the x coordinate are obtained using the RGB texture of 6 × 6 subimages. Contrary to the previous case, the scatter diagram does not show any relationship between the two similarity measures, which use different attributes.

Fig. 2. Scatter diagram for the average RGB color 5 × 5 and the RGB texture 6 × 6 in the case that arbitrary pairs among images are chosen.

Although similarity measures differ between the global server and local databases, we observed that the scatter diagram of similarity values for some pairs of similarity measures showed the shape of a straight line. Since the relationship cannot be proved analytically, we instead carried out extensive experiments that showed the linear relationship.
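The feature extraction step of Example 1 can be illustrated with a minimal sketch. It assumes an image is given as rows of (R, G, B) tuples and that the grid cells are cut by integer division; the function name `average_rgb_features` and this cell layout are illustrative assumptions, not taken from the paper, and the similarity computation with the inter-feature normalization of [6] is deliberately left out.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)


def average_rgb_features(image: Sequence[Sequence[Pixel]], grid: int) -> List[float]:
    """Return the average R, G, B of each cell of a grid x grid partition.

    Cells are visited row by row, so the result has 3 * grid * grid values.
    Hypothetical helper: the paper only states that average RGB color
    features are extracted per subimage.
    """
    height, width = len(image), len(image[0])
    if grid < 1 or grid > min(height, width):
        raise ValueError("grid must be between 1 and the smaller image dimension")
    features: List[float] = []
    for gy in range(grid):
        top, bottom = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            left, right = gx * width // grid, (gx + 1) * width // grid
            cell = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
            for channel in range(3):
                features.append(sum(p[channel] for p in cell) / len(cell))
    return features
```

Calling the function with grid sizes 4 and 6 on the same image would give the two feature vectors that the local database and the global server of Example 1 work with.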
Table 1 shows three groups of features, that is, RGB colors, RGB textures, and RGB colors & textures, to be used for similarity measures. We used the inter-feature normalization described in MARS [6] to calculate similarity values. The statistical linear regression method is used to obtain the equation of a straight line, and a test of statistical hypothesis is used to verify the linear relationship between two similarity measures. As test indicators, we used the scatter diagram, the sample coefficient of determination ($r^2$), and the analysis of variance ($F_0$, $F(\alpha)$), where $r^2$ is given by (sum of squares due to linear regression)/(total variance), $F_0$ is given by (mean square due to linear regression)/(mean square of residual), and $F(\alpha)$ is obtained from the F-distribution for a level of significance $\alpha$. If the linear regression model is effective for two similarity measures, the scatter diagram should show the shape of a straight line, $r^2$ ($0 < r^2 < 1$) should be close to 1, and $F_0$ should be larger than $F(\alpha)$ [9,10].

Table 2 shows the results of the experiments for pairs of similarity measures. For similarity measures from the same group, the scatter diagram shows the shape of a straight line, the $r^2$ value is close to 1, and $F_0$ is much larger than $F(\alpha)$. For similarity measures from different groups, however, the scatter diagram does not show the shape of a straight line and the $r^2$ value is close to 0. Although $F_0$ is still larger than $F(\alpha)$ in this case, it is much smaller than the $F_0$ obtained when the linear relationship is satisfied. We can therefore say that such pairs of similarity measures do not satisfy the linear relationship.
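The test indicators described above can be sketched as follows. This is a minimal sketch of ordinary least squares for one predictor, assuming the textbook definitions with 1 and n - 2 degrees of freedom for $F_0$; the function name `linear_regression_test` is illustrative, and the critical value $F(\alpha)$ is not computed here because it has to be looked up from the F-distribution.

```python
from typing import Sequence, Tuple


def linear_regression_test(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float, float]:
    """Fit y = alpha + beta * x and return (alpha, beta, r2, f0).

    r2 = (sum of squares due to regression) / (total sum of squares)
    f0 = (mean square due to regression) / (mean square of residual)
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three (x, y) pairs of equal length")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_regression = beta * sxy          # explained by the fitted line
    ss_residual = syy - ss_regression   # left unexplained
    r2 = ss_regression / syy
    # A perfect fit leaves no residual, so the F statistic is unbounded.
    f0 = float("inf") if ss_residual <= 0 else ss_regression / (ss_residual / (n - 2))
    return alpha, beta, r2, f0
```

A pair of similarity measures would be judged linear when the returned r2 is close to 1 and f0 clearly exceeds the tabulated $F(\alpha)$ for the chosen significance level.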

Table 1
The description of three groups of features to be used for similarity measures

Feature name   Feature description
RGB color features
feat1          average RGB color feature for 2 × 2 subimages
feat2          average RGB color feature for 3 × 3 subimages
feat3          average RGB color feature for 4 × 4 subimages
feat4          average RGB color feature for 5 × 5 subimages
feat5          average RGB color feature for 6 × 6 subimages
RGB texture features
feat6          average RGB texture feature for 2 × 2 subimages
feat7          average RGB texture feature for 3 × 3 subimages
feat8          average RGB texture feature for 4 × 4 subimages
feat9          average RGB texture feature for 5 × 5 subimages
feat10         average RGB texture feature for 6 × 6 subimages
RGB color & texture
feat11         average RGB color & texture feature for 2 × 2 subimages
feat12         average RGB color & texture feature for 3 × 3 subimages
feat13         average RGB color & texture feature for 4 × 4 subimages
feat14         average RGB color & texture feature for 5 × 5 subimages

Table 2
Test of statistical hypothesis for linear relationship between two similarity measures

Features used for      Scatter diagram   Correlation ρ   r^2   F_0   F(0.05)   Result
similarity measures
feat1 : feat2          straight line                                           linear
feat1 : feat4          straight line                                           linear
feat6 : feat8          straight line                                           linear
feat3 : feat5          straight line                                           linear
feat8 : feat10         straight line                                           linear
feat11 : feat14        straight line                                           linear
feat7 : feat9          straight line                                           linear
feat12 : feat13        straight line                                           linear
feat1 : feat9          scattered                                               nonlinear
feat6 : feat12         scattered                                               nonlinear
feat5 : feat10         scattered                                               nonlinear
feat1 : feat10         scattered                                               nonlinear

If any two similarity measures satisfy the linear relationship, we can use that property for the distributed similarity search.

3. Distributed similarity search algorithm

The distributed similarity search algorithm retrieves the k most similar objects using a global similarity measure from n local databases, LD_i, i = 1, ..., n. The algorithm must achieve high recall and high precision to meet the objectives of the collection fusion problem stated in Section 2. We suggest two heuristic algorithms and an algorithm using linear regression for distributed similarity search. Table 3 shows the parameters used in the algorithms.

Table 3
Parameters used in the algorithms

q      query object of distributed similarity search
k      number of objects to find
c      multiplication ratio (>1) when more than k objects are retrieved
n      number of local databases
r      number of retrievals for one local database
LD_i   ith local database
p_i    number of objects to be retrieved from LD_i in one step

Heuristic Algorithm (q, c, k, n, LD_1, ..., LD_n)
(1) send a query object q to all LDs
(2) for each LD_i, initialize p_i
(3) While (number of retrieved objects < ck)
(4)     for each LD_i, get_more_objects(q, p_i, LD_i);
        let result_i be the set of objects that are retrieved from LD_i
(5)     merge_results(result_1, ..., result_n)
(6)     for each LD_i, recalculate p_i using the heuristic estimator of LD_i
(7) EndWhile

Here, merge_results(result_1, ..., result_n) merges and ranks the results retrieved from all local databases using a global similarity measure, and get_more_objects(q, p_i, LD_i) requests LD_i to return p_i more objects similar to the query q using the local similarity measure of LD_i, as described in [7].

If the global server retrieves exactly k objects from the local databases, the recall will be less than 1 because there will be some irrelevant objects among the retrieved objects. Therefore the global server must get more than k objects, that is, ck (c > 1) objects. The precision, however, decreases as c increases. There is thus a tradeoff between the recall and the precision.
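The tradeoff between recall and precision can be made concrete with a small sketch of the two measures as defined in the objectives of Section 2, summed over all local databases, together with the product of the two that Section 4 uses as a combined metric. The function name `recall_precision` and the representation of each database as a set of object identifiers are assumptions made for illustration.

```python
from typing import Hashable, List, Set, Tuple


def recall_precision(
    relevant: List[Set[Hashable]], retrieved: List[Set[Hashable]]
) -> Tuple[float, float, float]:
    """Return (recall, precision, recall * precision) over all local databases.

    relevant[i]  plays the role of R_Q^i (relevant objects in database i),
    retrieved[i] plays the role of V_Q^i (objects retrieved from database i).
    """
    hits = sum(len(r & v) for r, v in zip(relevant, retrieved))
    total_relevant = sum(len(r) for r in relevant)
    total_retrieved = sum(len(v) for v in retrieved)
    # Retrieved objects that are not relevant are exactly the irrelevant ones.
    irrelevant_retrieved = total_retrieved - hits
    recall = hits / total_relevant
    precision = 1 - irrelevant_retrieved / total_retrieved
    return recall, precision, recall * precision
```

Retrieving a superset of the current result, which is what a larger c amounts to, can never lower the recall, but every additional irrelevant object lowers the precision.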
If all local databases show the same recall and the same precision, it is sufficient for the global server to get $p_i = [ck/n]$ objects only once from each local database, where $[\,\cdot\,]$ is the rounding operator. However, the values differ for each $LD_i$ and cannot be known in advance. Therefore we must refine them repeatedly. If the number of repetitions is $r$, the initial value of $p_i$ is given by $p_i = [ck/(rn)]$. Step (6) of the above algorithm assigns a large value to $p_i$ of a local database whose heuristic estimator is high in order to increase the recall and the precision.

3.1. Average ranking heuristic

A heuristic estimator $\alpha_i$ is defined as:

$\alpha_i = M_i \Big/ \sum_{j=1}^{M_i} Rank_{ij}$,

where $Rank_{ij}$ is the merged rank of the $j$th object retrieved from the $i$th local database and $M_i$ is the number of objects retrieved in the last retrieval from the $i$th local database. This value is the reciprocal of the average merged rank of the objects retrieved from the $i$th local database. The global server gets more objects from a local database with a high value of $\alpha_i$ and fewer objects from one with a low value of $\alpha_i$. $p_i$ of the heuristic algorithm is given as follows:

$p_i = \left[ \frac{k}{r} \cdot \frac{\alpha_i}{\alpha_1 + \cdots + \alpha_n} \right]$.

3.2. Average global similarity heuristic

This is similar to the average ranking heuristic. The rank is an integer, so adjacent ranked objects always differ by the same amount. However, the similarity difference between adjacent objects may not be uniform. So, the heuristic estimator $\beta_i$ is defined as:

$\beta_i = \sum_{j=1}^{M_i} Global\_Similarity_{ij} \Big/ M_i$.

This value is the average global similarity of the objects retrieved from the $i$th local database. $p_i$ of the heuristic algorithm is given as follows:

$p_i = \left[ \frac{k}{r} \cdot \frac{\beta_i}{\beta_1 + \cdots + \beta_n} \right]$.
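The two heuristic estimators and the recalculation of $p_i$ in step (6) can be sketched as below. This is a minimal sketch that reads the allocation formula as $[(k/r) \cdot e_i / (e_1 + \cdots + e_n)]$ for an estimator $e$ and implements the rounding operator as round-half-up; the function names are illustrative and not taken from the paper.

```python
from typing import List, Sequence


def alpha_estimator(merged_ranks: Sequence[int]) -> float:
    """Average ranking heuristic: reciprocal of the average merged rank."""
    return len(merged_ranks) / sum(merged_ranks)


def beta_estimator(global_similarities: Sequence[float]) -> float:
    """Average global similarity heuristic: mean global similarity."""
    return sum(global_similarities) / len(global_similarities)


def allocate(estimators: Sequence[float], k: int, r: int) -> List[int]:
    """Split roughly k / r objects among databases in proportion to their estimators.

    Rounding is done half-up via int(x + 0.5), an assumption about the
    rounding operator; the rounded shares need not sum to exactly k / r.
    """
    total = sum(estimators)
    return [int(k / r * e / total + 0.5) for e in estimators]
```

For example, a database whose last objects landed at merged ranks 1, 2, 3 gets an estimator of 0.5, while one whose objects landed at ranks 4, 5, 6 gets 0.2, so the first database is asked for more objects in the next round.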

3.3. Distributed similarity search algorithm using the linear regression

In Section 2, we observed that there are similarity measures with a linear relationship between them. For these cases, we can apply linear regression analysis to distributed similarity search. The linear equation, $\hat{y} = \hat{\alpha} + \hat{\beta}x$, of the straight line in a scatter diagram is obtained by linear regression analysis. The algorithm retrieves a predefined number $p$ of objects from each local database and analyzes the retrieved objects to find the linear equation and the global threshold (GT) corresponding to the local threshold (LT). The smallest local similarity value among the retrieved objects becomes the local threshold. The algorithm uses three different global thresholds, $gt^u$, $gt^m$, $gt^l$, corresponding to the local threshold. In Fig. 3, $gt^m$ is the y-coordinate of the intersection point of $\hat{y} = \hat{\alpha} + \hat{\beta}x$ and $x = LT$; $gt^u$ is that of the intersection point of $\hat{y} = \hat{\alpha} + \hat{\beta}x + d_y$ and $x = LT$, where $d_y$ is the $100(1 - \delta)\%$ confidence interval of $y$; and $gt^l$ is that of the intersection point of $\hat{y} = \hat{\alpha} + \hat{\beta}x - d_y$ and $x = LT$. $T$ indicates the type of the global threshold, one of $gt^u$, $gt^m$, $gt^l$.

Algorithm (p, c, k, n, q, T, LD_1, ..., LD_n)
(1) for each LD_i, i = 1, ..., n, get_more_objects(q, p, LD_i)
(2) for each LD_i, i = 1, ..., n, analyze the objects retrieved from LD_i using linear regression analysis,
    obtain the equation ŷ_i = α̂_i + β̂_i x_i, and obtain gt_i,
    where gt_i is one of gt_i^u, gt_i^m, gt_i^l according to T
(3) let LD_l be the local database which has the largest GT among all local databases and let its GT be gt_l
(4) if (the total number of retrieved objects with a global similarity value greater than gt_l) ≥ k
    or (the total number of retrieved objects) ≥ ck then stop
(5) select the LD_l which has the largest GT among all local databases
(6) get_more_objects(q, p, LD_l)
(7) analyze the objects from LD_l using linear regression
(8) goto step (3)
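The threshold computation and the database selection of the linear regression algorithm can be sketched as follows. This is a sketch under stated assumptions: each database's retrieved objects are given as (local similarity, global similarity) pairs, the confidence half-width $d_y$ is supplied by the caller because its exact formula is not given here, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # (local similarity x, global similarity y)


def fit_line(pairs: Sequence[Pair]) -> Tuple[float, float]:
    """Least-squares fit y = alpha + beta * x; returns (alpha, beta)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("local similarity values must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


def global_threshold(pairs: Sequence[Pair], d_y: float, kind: str = "m") -> float:
    """Global threshold at x = LT, where LT is the smallest local similarity."""
    alpha, beta = fit_line(pairs)
    local_threshold = min(x for x, _ in pairs)
    gt_m = alpha + beta * local_threshold
    return {"u": gt_m + d_y, "m": gt_m, "l": gt_m - d_y}[kind]


def select_database(results: List[Sequence[Pair]], d_y: float, kind: str = "m") -> Tuple[int, float]:
    """Steps (3) and (5): index and GT of the database with the largest GT."""
    thresholds = [global_threshold(pairs, d_y, kind) for pairs in results]
    best = max(range(len(results)), key=lambda i: thresholds[i])
    return best, thresholds[best]


def should_stop(results: List[Sequence[Pair]], gt: float, k: int, c: float) -> bool:
    """Step (4): enough objects above the threshold, or retrieval budget used up."""
    total = sum(len(pairs) for pairs in results)
    above = sum(1 for pairs in results for _, y in pairs if y > gt)
    return above >= k or total >= c * k
```

One way to read the selection rule is that the database with the largest global threshold is the one whose not-yet-retrieved objects are predicted to have the highest global similarity, so it is the most promising source for the next batch.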
This algorithm selects the local database that has the largest global threshold and retrieves objects from the selected database next time. The above algorithm uses one of the three global thresholds $gt^m$, $gt^u$, $gt^l$. With $gt^u$, the recall of the result is high and the precision is low. With $gt^m$, the recall is lower than with $gt^u$ while the precision is higher. With $gt^l$, the recall is the lowest among the three cases and the precision is the highest.

Fig. 3. Three global thresholds corresponding to the local threshold.

4. Experiment

In order to measure the effectiveness and performance of the proposed distributed similarity search algorithms, we conducted comprehensive experiments in an environment containing a large number of images and various queries. The test data consists of 3016 images with 256 RGB color bitmaps. The contents of the test images are shown in Table 4.

Table 4
The contents of test images

Category       # of images   Area
plants         720           flower, leaves, grass
pattern        680           glass, brick, woods
architecture   820           house, building
scene          796           water, sky, cloud

In order to show the preciseness of the linear regression of partly retrieved objects, we present experimental results in Table 5, indicating that the partial results approach the final result gradually. The features used for the similarity measures are chosen from the RGB color group. As the number of retrieved objects increases, $r^2$, $\alpha$, and $\beta$ approach the final values.

Table 5
The preciseness of the linear regression of partly retrieved objects

# of retrieved objects   MSE*   r^2   α   β
Total objects

* MSE (mean square error) is (residual sum of squares)/(number of retrieved objects).

We evaluated the effectiveness of the algorithms using the precision, the recall, and a combined metric that is the product of the recall and the precision. The combined metric measures the overall effectiveness. We made 10 queries for each test using various parameters and averaged their results.

We assume four local databases with one global server, where the images are distributed over the local databases. To allocate images to these local databases, we use two approaches: (1) random allocation and (2) clustered allocation. In the random allocation, all images are distributed randomly into the four databases. In the clustered allocation, similar images are likely to be allocated to the same local database. An equal number of images is allocated to each local database. Clusters are generated with centers randomly distributed. Each local database contains 4 to 5 clusters. About 60% of the data are allocated to clusters, while the rest are distributed randomly. These two cases are evaluated separately. The other test parameter values are a 99.9% confidence level for estimating the confidence interval of y, the upper type of the global threshold $gt^u$, 1.2 for the value of c, and 10 for p and the initial $p_i$.

Fig. 4. The precision and the recall of each algorithm in the clustered distribution.

Fig. 5. P × R of each algorithm in the clustered distribution.

Fig. 6. P × R of each algorithm in the random distribution.

The graphs of the precision, the recall, and their combined metric P × R for the three algorithms are summarized in Figs. 4-6. For the clustered distribution, the algorithm using the linear regression outperforms the average ranking heuristic algorithm (alpha)

and the average global similarity heuristic algorithm (beta) because the algorithm using the linear regression (linear) reflects the clustering effect of the data distribution well. For the random distribution, the algorithm using the linear regression shows slightly better results than the other algorithms. In a real situation, however, data distributions of databases on the Web are generally clustered. Therefore, the algorithm using the linear regression is expected to be the more practical choice.

5. Conclusion

In this paper, we proposed novel distributed similarity search algorithms that solve the collection fusion problem for multimedia databases in a distributed heterogeneous environment such as the Web. Experiments show that the algorithm using the linear regression performs best. To our knowledge, this is the first study of the collection fusion problem for distributed heterogeneous multimedia databases, and the first to present algorithms as solutions. Searching multimedia databases on the Web is becoming a very important issue, so the algorithms proposed in this paper can serve as a basis for future research in this area.

References

[1] J. Callan, Z. Lu, W. Croft, Searching distributed collection with inference networks, in: Proc. 18th Annual Internat. ACM/SIGIR Conference, 1995.
[2] W. Chang, G. Sheikholeslami, J. Wang, A. Zhang, Data resource selection in distributed visual information systems, IEEE Trans. Knowledge Data Engrg. 10 (6) (1998).
[3] L. Gravano, H. Garcia-Molina, Merging ranks from heterogeneous internet sources, in: Proc. 23rd Internat. Conf. on Very Large Data Bases, August 1997.
[4] J.H. Lee, D.H. Kim, C.W. Chung, Multi-dimensional selectivity estimation using compressed histogram information, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data, June 1999.
[5] W. Meng, K.L. Liu, C. Yu, X. Wang, Y. Chang, N. Rishe, Determining text databases to search in the Internet, in: Proc. Internat. Conf. on Very Large Data Bases, August 1998.
[6] M. Ortega, K. Chakrabarti, K. Porkaew, S. Mehrotra, Supporting ranked Boolean similarity queries in MARS, IEEE Trans. Knowledge Data Engrg. 10 (6) (1998).
[7] T. Seidl, H. Kriegel, Optimal multi-step k-nearest neighbor search, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data, June 1998.
[8] E. Voorhees, N. Gupta, B. Johnson-Laird, The collection fusion problem, in: Proc. 3rd Text Retrieval Conference (TREC-3), 1994.
[9] R.V. Hogg, E.A. Tanis, Probability & Statistical Inference, MacMillan Publishing Co., New York.
[10] S.H. Park, Regression Analysis, DaeYoung Publishing Co., 1985.
