Similarity Searching: Towards Bulk-loading Peer-to-Peer Networks

Size: px

Start display at page:

Download "Similarity Searching: Towards Bulk-loading Peer-to-Peer Networks"

Ariel Tyler
6 years ago
Views:

1 Similarity Searching: Towards Bulk-loading Peer-to-Peer Networks Vlastislav Dohnal Jan Sedmidubský Pavel Zezula David Novák Faculty of Informatics Masaryk University Botanická 68a, Brno, Czech Republic {dohnal, xsedmid, zezula, Abstract Due to the exponential growth of digital data and its complexity, we need a technique which allows us to search such collections efficiently. A suitable solution seems to be based on the peer-to-peer (P2P) network paradigm and the metric-space model of similarity. During the building phase of the distributed structure, the peers often split as new peers join the network. During a peer split, the local data is halved and one half is migrated to the new peer. In this paper, we study the problem of efficient splits of metric data locally organized by an M-tree and we propose a novel algorithm for speeding the splits up. In particular, we focus on the metric-based structured P2P network called the M-Chord. In experimental evaluation, we compare the proposed algorithm with several straightforward solutions on a real network organizing 10 million images. Our algorithm provides a significant performance boost. 1. Introduction Current data processing applications often use data with considerably less structure and search the data with much less precise queries than traditional database systems. An example is multimedia data like images or video clips that offer query-by-example search. This situation is what has given rise to similarity searching. The most general approach to similarity search, still allowing construction of index structures, is modeled by a metric space. Many metricbased index structures were developed and surveyed recently [25, 30]. The latest efforts in this area focus on the design of distributed access structures which exploit more computational and storage resources [2, 13, 4, 3] in order to cope with the problem of exponential growth of the data to be processed. A suitable solution arises from peer-to-peer (P2P) networks which are scalable. In this digital-explosion age, a real P2P-based data structure organizes non-trivial amounts of data, e.g., tens of millions of data objects (records) or even more. In this respect, each peer of this network has to also maintain a large number of data objects, so the search within the peer is usually sped up by a local (centralized) index structure. During the life cycle of the P2P structure, peers are split as the data collection grows and new peers join the network. Consequently, the data content of the overloaded peer must be halved, that is the local index structure must be split. In this paper, we focus on effective splits of local metricbased indices. Namely, we study this problem on the P2P index structure called the M-Chord [22] which uses the M- tree [8] to index peer s local data. This M-tree is enriched with extensions of the Slim-tree [28] and the PM-tree [27]. The paper is structured as follows. After related work, Section 2 contains description of the architecture of our image retrieval system. In Section 3, we present four algorithms to split peers. The paper concludes with a performance comparison Related work Multimedia data is often modeled by high-dimensional spaces, so a variety of high-dimensional index structures were proposed, e.g., the R-tree, the TV-tree, the SS-tree or the X-tree. A more generic approach to index such data is to exploit the metric space model, e.g. the M-tree [8], the Slim-tree [28], the GNAT [6], the SAT [21], or the D- index [11]. Due to the need to organize large databases, many techniques to bulk-load index structures were proposed. The Hilbert R-tree [17] is probably the first modification of R- tree aiming at optimized insertion of a large number of data records. It is based on sorting the elements first and then building the tree bottom-up. Earlier, similar techniques were proposed for the B-tree [18] and the quadtree [23, 15], for illustration. A bulk-loading algorithm which also optimizes the tree in order to improve search efficiency is proposed in [5]. It is demonstrated on the X-tree but it can be applied to any R-tree-like structure. Another variant of R-

2 tree called the Priority R-tree [1] is slightly less efficient in bulk-loading than the Hilbert R-tree, however, the resulting tree has better query performance. A bulk-loading strategy which sorts the data items first cannot be applied to structures based purely on the metric space model because there is no linear ordering defined in the general metric space. A generic bulk-loading algorithm that does not sort new items beforehand, is presented in [10]. This proposal was further improved in [9] and applied to the Slim-tree. For the original M-tree, two bulk-loading algorithms were proposed. The first one [7] decreases the construction costs, while retaining the same query performance. The second one [26] focuses on improving query performance without considering building costs. In [14], a general algorithm for bulk-loading applicable to height-unbalanced structures is proposed. In parallel environments, the problem of bulk-loading was studied for GridFiles [20, 19]. Papadopoulos and Manolopoulos investigated this problem in parallel database systems [24]. To the best of our knowledge, the bulkloading operations have not been explored in the area of structured peer-to-peer systems yet mainly from the splits of peers point of view Contributions In this paper, we propose a novel strategy suitable for any peer-to-peer network which models data as a metric space and uses the M-tree to index the peer s content. This strategy is utilized when a peer is split as the dataset grows and new peers join the network. We propose a sophisticated algorithm which optimizes the process of splitting. This algorithm is compared with other algorithms detailed in the paper. In addition, we enhance the M-Chord with this sophisticated algorithm and compare it with the original version. 2. The similarity search system The retrieval system studied in this work is a peer-to-peer index structure called the M-Chord [22] which organizes data modeled as a metric space. The core idea of the M- Chord is to use a modification of the method called the idistance [29] to map the metric space into a one-dimensional domain. The standard peer-to-peer protocols like Chord or Skip Graphs are then used to divide the transformed data domain among the peers and to provide navigation within the system. Each peer organizes its assigned data in the M-tree. We use an extended version of M-tree which encapsulates (1) the node-split algorithm based on building a spanning tree taken from the Slim-tree and (2) filtering using a set of independent pivots taken from the PM-tree. Part 2 Part 1 (a) d m p Part 1 p 1 Part 2 (b) Figure 2. Examples of partitioning: (a) the ball partitioning and (b) the generalized hyperplane partitioning Metric space fundamentals Formally, the metric space is a pair M = (D, d), where D is the domain of objects and d is the total distance function satisfying the metric postulates: d(x, y) 0 (nonnegativity), d(x, y) = d(y, x) (symmetry) and d(x, z) d(x, y)+d(y, z) (triangle inequality). The distance function expresses the dissimilarity of two objects. Following the query-by-example paradigm, the range query R(q, r) retrieves all objects o within the range r from q (d(q, o) r), and the nearest neighbors query knn (q, k) returns the k objects with the smallest distances to q M-tree The M-tree [8] is a dynamic balanced tree-like structure. Similarly to B-trees, it is built in a bottom-up fashion by splitting over-filled nodes. The M-tree is a k-ary tree, so every internal node consists of up to k entries. Besides the reference to a subtree, the entry contains a pivot p (selected object) and a covering radius r c which together define a balllike region R(p, r c ). This region covers all objects stored within the subtree. The leaf nodes store data objects accompanied with their distances to the pivot in the parent node. The internal nodes keep distances to the parent node s pivot as well. All these distances are utilized in order to prune some tree branches by search/insert/delete algorithms. An illustrative example of M-tree is shown in Figure 1. The ball regions of the leaves and the internal nodes are identified by dashed circles and solid circles, respectively. 3. Splitting the M-tree In this section, we analyze the problem of splitting a peer of P2P network, where the peer organizes its data in the M- tree. In general, partitioning principles are based either on ball partitioning or generalized-hyperplane partitioning (see p 2

3 o 3 o 10 o 5 o 7 o 11 o 2 o 8 o 6 o 1 o o 4 9 o o o o o o o o o o o o o o o o o o Figure 1. Example of an M-tree consisting of three levels. Above, a 2-D representation of clustering. Pivots are denoted by crosses and the circles around pivots correspond to the values of covering radii. Figure 2). Though there are other partitioning policies, they build on these two basic variants in principle. The algorithms presented below can be applied to splitting the M- tree in any P2P index structure assuming that the P2P index splits peers by a ball or generalized-hyperplane partitioning. This assumption holds for currently available native and transformation indices for metric spaces. In particular, the M-Chord splits a peer by halving a one-dimensional interval to which the peer s data is mapped. This interval corresponds to values of distance from a pivot. So, by halving the interval we define a ball partitioning. The M-CAN [12] proceeds similarly. The GHT and VPT [3] split data directly by a generalized-hyperplane and ball partitioning, respectively. To split an instance of M-tree, we define a general algorithm which traverses the tree from its root to leaves and processes each object based on the decision to which partition the object belongs. The pseudo-code of this algorithm is in Algorithm 1. The first parameter is a tree node, initially set to the tree s root. The second parameter match specifies the partitioning. The match function returns an integer value which corresponds to the partition identification. For a ball partitioning defined by a pivot p and a medium distance d m, we have match BP. For a generalized-hyperplane partitioning defined by a pair of pivots p 1 and p 2, we have ( 1 if d(o, p) d m match BP (o) = 2 otherwise 8 >< 1 if d(q, p) d m r c match BP (R(q, r c )) = 2 if d(q, p) d m + r c >: 0 otherwise ( 1 if d(o, p 1) d(o, p 2) match GHP (o) = 2 otherwise 8 >< 1 if d(q, p 2) d(q, p 1) 2r c match GHP (R(q, r c )) = 2 if d(q, p 1) d(q, p 2) 2r c >: 0 otherwise Table 1. Definitions of the match function for a ball and generalized-hyperplane partitioning. match GHP. In order to optimize the process of splitting, we apply the match function to both individual objects and ball-like regions R(q, r c ). Because a ball-like region can intersect both the partitions, the match function returns 0 in this case. The application of match to ball-like regions has a great advantage in speeding an algorithm up. As an example, we consider a partitioning that requires all objects falling into the partition 2 to be eliminated. If a ball region

4 Algorithm 1 DivideAndProcess Algorithm Input: node, match, P rocess 1, P rocess 2 1: if node is a leaf node then 2: for all objects o in node do 3: if match(o) = 1 then 4: P rocess 1({o}) 5: else 6: P rocess 2({o}) 7: end if 8: end for 9: else 10: for all node entries [q, r c, subtree] in node do 11: if match(r(q, r c )) = 1 then 12: P rocess 1({all objects stored in subtree}) 13: else if match(r(q, r c )) = 2 then 14: P rocess 2({all objects stored in subtree}) 15: else if match(r(q, r c )) = 0 then 16: DivideAndP rocess(subtree, match, P rocess 1, P rocess 2) 17: end if 18: end for 19: end if is completely contained in the partition 2, we can straightforwardly detach the whole subtree which corresponds to this ball region instead of traversing the subtree and deleting objects one by one. The individual cases of the match function are defined in Table 1. In the M-tree, distances to parent pivots are stored. We exploit these distances by defining corresponding lower and upper bounds on distances d(o, p), which allows to prune objects without actually computing the distance d(o, p) [16]. In the rest of this section, we define four specific split algorithms. The first three use the general structure of Algorithm 1 and differ in the specification of the P rocess 1 and P rocess 2 procedures. The forth is more sophisticated and defines its own algorithm structure DoubleInsert algorithm The first variant is a baseline algorithm which builds two new trees M-tree 1 and M-tree 2 by using an ordinary insert procedure. Algorithm 1 is applied with straightforward actions P rocess 1, P rocess 2 which simply insert objects into the corresponding tree. Intuitively, the construction of the new trees can be optimized by a bulk-loading algorithm, but we do not apply such optimization in this paper DeleteInsert algorithm This algorithm keeps objects of Part 1 in the original tree and the objects of Part 2 moves to a new tree. So, P rocess 1 is defined as void while P rocess 2 deletes objects from the original tree and inserts into the new tree. We implemented the delete operation as a reverse operation to insertion of an object, i.e., we locate the leaf node which contains the object to delete, delete the object and update the covering radii on the path to the root if necessary. We do not handle node under-flows by merging with neighboring nodes as in [26] but we issue reinsertion of all objects in the under-flown node and delete this node. Due to the application of the match function to ball regions, the delete operation is performed on whole subtrees, if all objects in the subtree are to be deleted. However, the insertion of the deleted objects into the new M-tree is done one by one CloneDelete algorithm In the M-tree, the insertion is an expensive operation because of lots of distance computations performed. This algorithm avoids an insert procedure by creating a clone of the original tree and deleting the respective objects from the trees. The action P rocess 1 deletes objects from the clone, while P rocess 2 deletes objects from the original tree. Again both the delete operations are optimized on the level of ball regions, so the whole subtree can be deleted at once. This process is expensive from the memory point of view, since it requires the double amount of memory LeafBall algorithm This advanced algorithm operates on leaf-level nodes represented by ball regions. The algorithm separates the regions into two sets respecting the given partitioning. If a region R(q, r c ) intersects both the partitions, the objects covered by it are separated and two new regions with the same pivot q are created. For each set of ball regions, a new M-tree is built the entries representing the leaves are all inserted into a new node. This node forms the root of the tree. If the root node is overflowing, a regular Slim-tree s split is applied and repeated until no node in the tree is overflowing. The pseudo-code is available in Algorithm 2. If some leaf nodes in the resulting trees are slim (storing only one object), these objects are reinserted in the regular way and the empty nodes are deleted. This optimization is not included in the pseudo-code for legibility reasons. The LeafBall algorithm can be further generalized to operate on non-leaf nodes, but, from our experience, the probability that ball regions overlap rapidly grows. Thus the chance for locating a ball region which is completely included in one of the split partitions is getting lower for nodes closer to the root node. 4. Performance evaluation In order to verify properties of the proposed algorithms, we use a real-life dataset consisting of ten million images

5 Algorithm 2 LeafBall Algorithm Input: node, match 1: S 1 = S 2 = 2: for all leaf nodes N in the tree rooted at node do 3: get the covering region R(q, r c ) of N 4: if match(r(q, r c )) = 1 then 5: add N to S 1 6: else if match(r(q, r c )) = 2 then 7: add N to S 2 8: else if match(r(q, r c )) = 0 then 9: create new nodes N 1, N 2 10: for all objects o in N do 11: add o to N i, where i = match(o) 12: end for 13: establish covering regions R(q, r1), c R(q, r2) c for N 1, N 2 14: add N 1 to S 1 and N 2 to S 2 15: end if 16: end for 17: {create M-tree 1 and M-tree 2} 18: for each set S in {S 1, S 2} do 19: create an empty M-tree 20: insert all leaves in the tree while having node splits disabled 21: while a node is overflowing do 22: split the node by a common node-split procedure 23: end while 24: end for gathered from the Flickr Photo Sharing System. From these images, five MPEG-7 descriptors were extracted and the images are represented as the five vectors having together 280 dimensions. In particular, the descriptors are the color structure (CS), color layout (CL), scalable color (SC), edge histogram (EH), and homogeneous texture (HT). CS, CL, and SC reflect color characteristics of the image. Local density of edge elements and their directions (sometimes called the structure or layout) is represented by EH and acts as a simple and robust substitution for shapes. Finally, HT is a texture descriptor. The distance function which measures dissimilarity of two images, is represented by a weighted sum of dissimilarity measures defined for individual MPEG-7 descriptors. The weights are 3, 2, 2, 4 and frequency (%) pairwise distance Figure 3. Distance density of the image dataset. 1 cap. 2 1 cap. 2 cap. Algorithm Dist. comp. Dist. comp. Dist. comp. Real time Real time Real time DoubleInsert 3,979,339 4,618,104 6,208, s s s DeleteInsert 2,127,120 2,425,752 3,200, s 55.0 s 71.6 s CloneDelete 418, , , s 40.7 s 40.0 s LeafBall 323, , , s 6.3 s 5.6 s Table 2. Comparison of M-tree split algorithms for the generalized-hyperplane and varying node capacity. 1 cap. 2 1 cap. 2 cap. Algorithm Dist. comp. Dist. comp. Dist. comp. Real time Real time Real time DoubleInsert 3,912,917 4,546,455 6,021, s s s DeleteInsert 1,973,954 2,250,393 3,071, s 52.3 s 67.6 s CloneDelete 210, , , s 36.9 s 36.7 s LeafBall 220, , , s 4.2 s 4.6 s Table 3. Comparison of M-tree split algorithms for the ball partitioning and varying node capacity. 0.5 for CS, CL, SC, EH and HT, respectively. The distance measure for CS and SC is the simple L 1 metric, while the measures for the others are tailored to their specific nature. The distance density of the 10-million dataset is depicted in Figure 3. Firstly, we compare the individual algorithms on 100,000 images organized in a single M-tree. In order to make the comparison objective, we used three different capacities of internal and leaf nodes. The first configuration (denoted in the tables as 1 2 cap.) allows 21 and 84 objects stored in internal and leaf nodes, respectively. The resulting tree had 1,951 leaf nodes. In the second configuration (1 cap.), we used 42 and 168 objects, which resulted in 1,168 leaf nodes. The parameters for the third configuration (2 cap.) are 84 and 336 objects having resulted in 578 leaf nodes. We used five different pairs of pivots for the generalized-hyperplane partitioning to split the M-tree.

6 In Table 2, there are results in number of computations of the distance function d and real time averaged over five runs. The same experiment was conducted for the ball partitioning where five different pivots with appropriate values of radius were selected. These results are available in Table 3. All settings of partitioning produced almost equallysized partitions. The experiments were executed on a regular PC (Athlon64 2.6GHz, 2GB RAM). From the results, we read that DoubleInsert is by far slower than the others. In particular, LeafBall was 19 and 32 times faster in distance computations and 17 and 25 times in real time for the middle capacities. In all algorithms, the content of the original tree was separated by calling the match function. Nonetheless, neither of the optimizations (precomputed distances to parent pivots and match defined on regions) helped this function to avoid many distance evaluations. So, all algorithms needed almost 200,000 and 100,000 distance computations for separating objects in the case of the generalizedhyperplane and ball partitioning, respectively. By subtracting these numbers from the total distance computations, we get 100 times performance gain comparing LeafBall and DoubleInsert. The trends when the capacity of nodes is changed are interesting. In the case of the insertion-based algorithms (DoubleInsert and DeleteInsert), the split costs correspond proportionally to the node capacity. Such behavior is induced by the split algorithm of M-tree nodes which computes a spanning tree internally. This algorithm requires more distance computations if the node capacity increases. On the other hand, the costs of CloneDelete and LeafBall are inversely proportional to the node capacity. The smaller node capacity implies more leaf nodes, so these two algorithms must inspect more leaf nodes and also more internal nodes. From all the results, we can also induce that the insert operation is very expensive due to frequent node splits. The costs of insertion can be decreased by applying a bulk-loading algorithm. We have not applied any bulkloading technique, however, the authors of [7] claim a 25% gain in CPU costs only. The other algorithm [26] to bulkload M-tree is not studied from the CPU costs point of view, but it decreases the I/O costs by 50%. Secondly, we tested the efficiency of resulting trees in terms of querying. We run 20 range and 20 knn queries on the resulting trees M-tree 1 and M-tree 2 which both have the capacity of internal and leaf nodes set to 42 and 168 objects, respectively. We fixed the radius to 0.8 and the number of neighbors to 20 for all the queries. The range queries returned 20 objects on average. Table 4 contains average values of distance computations and real time over the batch of queries. The search performance of the trees is almost identical. Then, we inserted additional 100,000 objects respecting the partitioning which produced the trees M-tree 1 and M-tree 2. The insertion costs are included in the table and they support the claim that the insertion is expensive, because 32 distance computations were needed to insert one object, on average. After populating the trees, we repeated the same querying. Obviously, the response times increased because the content of each tree was doubled, but the efficiency of trees was retained without any regards to the algorithm used to construct the tree. We experienced the same behavior when the ball partitioning was applied and for the limited space we did not include them. Finally, we incorporated the LeafBall algorithm in the M-Chord and built a P2P network on 10 million images. The M-Chord used twenty pivots for the transformation to the one-dimensional domain and the same twenty pivots were exploited in additional filtering in the M-tree (extension taken from the PM-tree [27]). The node capacities of M-tree were the same as above. All M-Chord peers together needed 200 million distance computations to compute the transformation and we do not include these costs in comparison. First, we built the network with the DoubleInsert algorithm for splitting the peers. In total, peer split procedure was called 499 times, so the network consisted of 500 peers. The overall costs of inserting all objects, including the split costs, was 849,822,669 distance computations. Next, we replaced the split algorithm with LeafBall and the overall cost dropped down to 611,667,738 distance computations, which is an improvement by 28 %. However, these costs include insertions of objects one by one, thus, the stepwise process of building each M-tree, which needed about 603 million distance computations. The remaining portion is the costs induced by splits resulting in 28 times performance boost of the LeafBall algorithm. Consequently, one M-tree split took 20 s on average, which was reduced to 1.2 s by the LeafBall algorithm. This is a very important gain because the peer is blocked during the split, so that it cannot accept new data nor answer queries. 5. Conclusions and future work In this paper, we have focused on algorithms to efficiently split data stored in metric-based index structures of the M-tree family. The work was primarily motivated by peer splits in structured peer-to-peer networks. We presented four algorithms and compared them on a real-life image database. The high efficiency of the proposed Leaf- Ball algorithm was confirmed by a large-scale experiment, where we built the M-Chord peer-to-peer index structure on 10 million images. The LeafBall algorithm compared to the baseline DoubleInsert algorithm provides about 28 times performance boost. As the future work, we want to study the possibility that the M-tree itself would find a suitable partitioning. The partitioning is then reported to the peer instead of the current procedures when the peer dictates the partitioning to the M- tree. This may bring further improvement since the M-tree

7 Algorithm Dist. comp. Real time Dist. comp. Real time DoubleInsert M-tree 1 M-tree 2 Range search #1 24,433 23, s 0.58 s knn search #1 39,612 42, s 1.00 s Insert 3,253,968 3,274, s s Range search #2 46,680 44, s 1.09 s knn search #2 75,448 81, s 2.11 s DeleteInsert M-tree 1 M-tree 2 Range search #1 24,236 24, s 0.60 s knn search #1 39,603 42, s 1.01 s Insert 2,927,322 3,291, s s Range search #2 46,489 44, s 1.10 s knn search #2 75,531 81, s 1.98 s CloneDelete M-tree 1 M-tree 2 Range search #1 24,236 24, s 0.58 s knn search #1 39,603 42, s 1.02 s Insert 2,943,171 2,938, s s Range search #2 46,540 44, s 1.10 s knn search #2 75,634 81, s 1.97 s LeafBall M-tree 1 M-tree 2 Range search #1 24,213 24, s 0.58 s knn search #1 39,581 42, s 1.02 s Insert 3,102,343 3,041, s s Range search #2 46,204 44, s 1.17 s knn search #2 75,321 81, s 1.94 s Table 4. Search performance and insertion costs of the proposed algorithms in the case of generalized-hyperplane partitioning. can exploit its internal structures to optimize the whole process. Another research challenge is to generalize the algorithms to partitioning principles which produce more than two partitions, e.g. the Voronoi-like partitioning Acknowledgements This work was partially supported by: the EU IST FP6 project (SAPIR), the national research project 1ET , the Czech Science Foundation projects 201/07/P240 and 201/06/1338. We are grateful to our SAPIR partners from ISTI-CNR, Pisa, who crawled the image dataset and extracted the MPEG-7 features. References [1] L. Arge, M. de Berg, H. J. Haverkort, and K. Yi. The priority r-tree: a practically efficient and worst-case optimal r-tree. In Proceedings of the ACM SIGMOD International Conference on Management of data (SIGMOD 2004), pages , New York, NY, USA, ACM. [2] F. Banaei-Kashani and C. Shahabi. SWAM: A family of access methods for similarity-search in peer-to-peer data networks. In Proceedings of the Thirteenth ACM conference on Information and Knowledge Management (CIKM 2004), pages ACM Press, [3] M. Batko, D. Novak, F. Falchi, and P. Zezula. On scalability of the similarity search in the world of peers. In Proceedings of First International Conference on Scalable Information Systems (INFOSCALE 2006), Hong Kong, May 30 - June 1, pages ACM Press, [4] M. Bender, S. Michel, P. Triantafillou, G. Weikum, and C. Zimmer. MINERVA: Collaborative P2P search. In Proceedings of the 31st international conference on Very large data bases (VLDB 2005), pages VLDB Endowment, [5] S. Berchtold, C. Böhm, and H.-P. Kriegel. Improving the query performance of high-dimensional index structures by bulk-load operations. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT 1998), pages , London, UK, Springer- Verlag. [6] S. Brin. Near neighbor search in large metric spaces. In U. Dayal, P. M. D. Gray, and S. Nishio, editors, Proceedings of the 21th International Conference on Very Large Data Bases (VLDB 1995), Zurich, Switzerland, September 11-15, 1995, pages Morgan Kaufmann, [7] P. Ciaccia and M. Patella. Bulk loading the M-tree. In Proceedings of the 9th Australasian Database Conference (ADC 1998), Perth, Australia, February 2-3, 1998, volume 20(2) of Australian Computer Science Communications, pages Springer, [8] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, editors, Proceedings of the 23rd International Conference on Very Large Data

8 Bases (VLDB 1997), Athens, Greece, August 25-29, 1997, pages Morgan Kaufmann, [9] J. V. den Bercken and B. Seeger. An evaluation of generic bulk loading techniques. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001), pages , San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. [10] J. V. den Bercken, B. Seeger, and P. Widmayer. A generic approach to bulk loading multidimensional index structures. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 1997), pages , San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. [11] V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-Index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1):9 33, [12] F. Falchi, C. Gennaro, and P. Zezula. A contentaddressable network for similarity search in metric spaces. In Proceedings of the the 2nd International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2005), Trondheim, Norway, August 28-29, 2005, pages , [13] P. Ganesan, B. Yang, and H. Garcia-Molina. One torus to rule them all: Multi-dimensional queries in P2P systems. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB 2004), pages 19 24, New York, NY, USA, ACM Press. [14] T. M. Ghanem, R. Shah, M. F. Mokbel, W. G. Aref, and J. S. Vitter. Bulk operations for space-partitioning trees. In Proceedings of the 20th International Conference on Data Engineering (ICDE 2004), page 29, Washington, DC, USA, IEEE Computer Society. [15] G. R. Hjaltason and H. Samet. Improved bulk-loading algorithms for quadtrees. In Proceedings of the 7th International Symposium on Advances in Geographic Information Systems (ACM GIS 1999), November 2-6, 1999, Kansas City, USA, pages ACM Press, [16] G. R. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems (TODS 2003), 28(4): , [17] I. Kamel and C. Faloutsos. Hilbert R-tree: An Improved R- tree using Fractals. In Proceedings of the 20th International Conference on Very Large Databases (VLDB 1994), pages , Santiago, Chile, [18] T. M. Klein, K. J. Parzygnat, and A. L. Tharp. Optimal b- tree packing. Information Systems, 16(2): , [19] S. T. Leutenegger and D. M. Nicol. Efficient bulk-loading of gridfiles. IEEE Transactions on Knowledge and Data Engineering, 09(3): , [20] J. Li, D. Rotem, and J. Srivastava. Algorithms for loading parallel grid files. SIGMOD Record, 22(2): , [21] G. Navarro. Searching in metric spaces by spatial approximation. In Proceedings of the 6th International Symposium on String Processing and Information Retrieval (SPIRE 1999), Cancun, Mexico, September 21-24, 1999, pages IEEE Computer Society, [22] D. Novak and P. Zezula. M-Chord: A scalable distributed similarity search structure. In Proceedings of First International Conference on Scalable Information Systems (IN- FOSCALE 2006), Hong Kong, May 30 - June 1, pages IEEE Computer Society, [23] M. H. Overmars and J. van Leeuwen. Dynamic multidimensional data structures based on quad- and kd trees. Acta Informatica, 17(3): , [24] A. Papadopoulos and Y. Manolopoulos. Parallel bulkloading of spatial data. Parallel Computing, 29(10): , [25] H. Samet. Foundations of Multidimensional And Metric Data Structures. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, [26] A. P. Sexton and R. Swinbank. Bulk loading the m-tree to enhance query performance. In M. H. Williams and L. M. MacKinnon, editors, Proceedings of 21st British National Conference on Databases (BNCOD 2004), Edinburgh, UK, July 7-9, 2004, volume 3112 of Lecture Notes in Computer Science, pages Springer, [27] T. Skopal. Pivoting M-tree: A metric access method for efficient similarity search. In V. Snášel, J. Pokorný, and K. Richta, editors, Proceedings of the Annual International Workshop on DAtabases, TExts, Specifications and Objects (DATESO 2004), Desna, Czech Republic, April 14-16, 2004, volume 98 of CEUR Workshop Proceedings. Technical University of Aachen (RWTH), [28] C. Traina, Jr., A. J. M. Traina, B. Seeger, and C. Faloutsos. Slim-Trees: High performance metric trees minimizing overlap between nodes. In C. Zaniolo, P. C. Lockemann, M. H. Scholl, and T. Grust, editors, Proceedings of the 7th International Conference on Extending Database Technology (EDBT 2000), Konstanz, Germany, March 27-31, 2000, volume 1777 of Lecture Notes in Computer Science, pages Springer, [29] C. Yu, B. C. Ooi, K.-L. Tan, and H. V. Jagadish. Indexing the distance: An efficient method to knn processing. In P. M. G. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, and R. T. Snodgrass, editors, Proceedings of 27th International Conference on Very Large Data Bases (VLDB 2001), Roma, Italy, September 11-14, 2001, pages Morgan Kaufmann, [30] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach, volume 32 of Advances in Database Systems. Springer, 2005.

Crawling, Indexing, and Similarity Searching Images on the Web

Crawling, Indexing, and Similarity Searching Images on the Web (Extended Abstract) M. Batko 1 and F. Falchi 2 and C. Lucchese 2 and D. Novak 1 and R. Perego 2 and F. Rabitti 2 and J. Sedmidubsky 1 and