Similarity Searching: Towards Bulk-loading Peer-to-Peer Networks

Size: px
Start display at page:

Download "Similarity Searching: Towards Bulk-loading Peer-to-Peer Networks"

Transcription

1 Similarity Searching: Towards Bulk-loading Peer-to-Peer Networks Vlastislav Dohnal Jan Sedmidubský Pavel Zezula David Novák Faculty of Informatics Masaryk University Botanická 68a, Brno, Czech Republic {dohnal, xsedmid, zezula, Abstract Due to the exponential growth of digital data and its complexity, we need a technique which allows us to search such collections efficiently. A suitable solution seems to be based on the peer-to-peer (P2P) network paradigm and the metric-space model of similarity. During the building phase of the distributed structure, the peers often split as new peers join the network. During a peer split, the local data is halved and one half is migrated to the new peer. In this paper, we study the problem of efficient splits of metric data locally organized by an M-tree and we propose a novel algorithm for speeding the splits up. In particular, we focus on the metric-based structured P2P network called the M-Chord. In experimental evaluation, we compare the proposed algorithm with several straightforward solutions on a real network organizing 10 million images. Our algorithm provides a significant performance boost. 1. Introduction Current data processing applications often use data with considerably less structure and search the data with much less precise queries than traditional database systems. An example is multimedia data like images or video clips that offer query-by-example search. This situation is what has given rise to similarity searching. The most general approach to similarity search, still allowing construction of index structures, is modeled by a metric space. Many metricbased index structures were developed and surveyed recently [25, 30]. The latest efforts in this area focus on the design of distributed access structures which exploit more computational and storage resources [2, 13, 4, 3] in order to cope with the problem of exponential growth of the data to be processed. A suitable solution arises from peer-to-peer (P2P) networks which are scalable. In this digital-explosion age, a real P2P-based data structure organizes non-trivial amounts of data, e.g., tens of millions of data objects (records) or even more. In this respect, each peer of this network has to also maintain a large number of data objects, so the search within the peer is usually sped up by a local (centralized) index structure. During the life cycle of the P2P structure, peers are split as the data collection grows and new peers join the network. Consequently, the data content of the overloaded peer must be halved, that is the local index structure must be split. In this paper, we focus on effective splits of local metricbased indices. Namely, we study this problem on the P2P index structure called the M-Chord [22] which uses the M- tree [8] to index peer s local data. This M-tree is enriched with extensions of the Slim-tree [28] and the PM-tree [27]. The paper is structured as follows. After related work, Section 2 contains description of the architecture of our image retrieval system. In Section 3, we present four algorithms to split peers. The paper concludes with a performance comparison Related work Multimedia data is often modeled by high-dimensional spaces, so a variety of high-dimensional index structures were proposed, e.g., the R-tree, the TV-tree, the SS-tree or the X-tree. A more generic approach to index such data is to exploit the metric space model, e.g. the M-tree [8], the Slim-tree [28], the GNAT [6], the SAT [21], or the D- index [11]. Due to the need to organize large databases, many techniques to bulk-load index structures were proposed. The Hilbert R-tree [17] is probably the first modification of R- tree aiming at optimized insertion of a large number of data records. It is based on sorting the elements first and then building the tree bottom-up. Earlier, similar techniques were proposed for the B-tree [18] and the quadtree [23, 15], for illustration. A bulk-loading algorithm which also optimizes the tree in order to improve search efficiency is proposed in [5]. It is demonstrated on the X-tree but it can be applied to any R-tree-like structure. Another variant of R-

2 tree called the Priority R-tree [1] is slightly less efficient in bulk-loading than the Hilbert R-tree, however, the resulting tree has better query performance. A bulk-loading strategy which sorts the data items first cannot be applied to structures based purely on the metric space model because there is no linear ordering defined in the general metric space. A generic bulk-loading algorithm that does not sort new items beforehand, is presented in [10]. This proposal was further improved in [9] and applied to the Slim-tree. For the original M-tree, two bulk-loading algorithms were proposed. The first one [7] decreases the construction costs, while retaining the same query performance. The second one [26] focuses on improving query performance without considering building costs. In [14], a general algorithm for bulk-loading applicable to height-unbalanced structures is proposed. In parallel environments, the problem of bulk-loading was studied for GridFiles [20, 19]. Papadopoulos and Manolopoulos investigated this problem in parallel database systems [24]. To the best of our knowledge, the bulkloading operations have not been explored in the area of structured peer-to-peer systems yet mainly from the splits of peers point of view Contributions In this paper, we propose a novel strategy suitable for any peer-to-peer network which models data as a metric space and uses the M-tree to index the peer s content. This strategy is utilized when a peer is split as the dataset grows and new peers join the network. We propose a sophisticated algorithm which optimizes the process of splitting. This algorithm is compared with other algorithms detailed in the paper. In addition, we enhance the M-Chord with this sophisticated algorithm and compare it with the original version. 2. The similarity search system The retrieval system studied in this work is a peer-to-peer index structure called the M-Chord [22] which organizes data modeled as a metric space. The core idea of the M- Chord is to use a modification of the method called the idistance [29] to map the metric space into a one-dimensional domain. The standard peer-to-peer protocols like Chord or Skip Graphs are then used to divide the transformed data domain among the peers and to provide navigation within the system. Each peer organizes its assigned data in the M-tree. We use an extended version of M-tree which encapsulates (1) the node-split algorithm based on building a spanning tree taken from the Slim-tree and (2) filtering using a set of independent pivots taken from the PM-tree. Part 2 Part 1 (a) d m p Part 1 p 1 Part 2 (b) Figure 2. Examples of partitioning: (a) the ball partitioning and (b) the generalized hyperplane partitioning Metric space fundamentals Formally, the metric space is a pair M = (D, d), where D is the domain of objects and d is the total distance function satisfying the metric postulates: d(x, y) 0 (nonnegativity), d(x, y) = d(y, x) (symmetry) and d(x, z) d(x, y)+d(y, z) (triangle inequality). The distance function expresses the dissimilarity of two objects. Following the query-by-example paradigm, the range query R(q, r) retrieves all objects o within the range r from q (d(q, o) r), and the nearest neighbors query knn (q, k) returns the k objects with the smallest distances to q M-tree The M-tree [8] is a dynamic balanced tree-like structure. Similarly to B-trees, it is built in a bottom-up fashion by splitting over-filled nodes. The M-tree is a k-ary tree, so every internal node consists of up to k entries. Besides the reference to a subtree, the entry contains a pivot p (selected object) and a covering radius r c which together define a balllike region R(p, r c ). This region covers all objects stored within the subtree. The leaf nodes store data objects accompanied with their distances to the pivot in the parent node. The internal nodes keep distances to the parent node s pivot as well. All these distances are utilized in order to prune some tree branches by search/insert/delete algorithms. An illustrative example of M-tree is shown in Figure 1. The ball regions of the leaves and the internal nodes are identified by dashed circles and solid circles, respectively. 3. Splitting the M-tree In this section, we analyze the problem of splitting a peer of P2P network, where the peer organizes its data in the M- tree. In general, partitioning principles are based either on ball partitioning or generalized-hyperplane partitioning (see p 2

3 o 3 o 10 o 5 o 7 o 11 o 2 o 8 o 6 o 1 o o 4 9 o o o o o o o o o o o o o o o o o o Figure 1. Example of an M-tree consisting of three levels. Above, a 2-D representation of clustering. Pivots are denoted by crosses and the circles around pivots correspond to the values of covering radii. Figure 2). Though there are other partitioning policies, they build on these two basic variants in principle. The algorithms presented below can be applied to splitting the M- tree in any P2P index structure assuming that the P2P index splits peers by a ball or generalized-hyperplane partitioning. This assumption holds for currently available native and transformation indices for metric spaces. In particular, the M-Chord splits a peer by halving a one-dimensional interval to which the peer s data is mapped. This interval corresponds to values of distance from a pivot. So, by halving the interval we define a ball partitioning. The M-CAN [12] proceeds similarly. The GHT and VPT [3] split data directly by a generalized-hyperplane and ball partitioning, respectively. To split an instance of M-tree, we define a general algorithm which traverses the tree from its root to leaves and processes each object based on the decision to which partition the object belongs. The pseudo-code of this algorithm is in Algorithm 1. The first parameter is a tree node, initially set to the tree s root. The second parameter match specifies the partitioning. The match function returns an integer value which corresponds to the partition identification. For a ball partitioning defined by a pivot p and a medium distance d m, we have match BP. For a generalized-hyperplane partitioning defined by a pair of pivots p 1 and p 2, we have ( 1 if d(o, p) d m match BP (o) = 2 otherwise 8 >< 1 if d(q, p) d m r c match BP (R(q, r c )) = 2 if d(q, p) d m + r c >: 0 otherwise ( 1 if d(o, p 1) d(o, p 2) match GHP (o) = 2 otherwise 8 >< 1 if d(q, p 2) d(q, p 1) 2r c match GHP (R(q, r c )) = 2 if d(q, p 1) d(q, p 2) 2r c >: 0 otherwise Table 1. Definitions of the match function for a ball and generalized-hyperplane partitioning. match GHP. In order to optimize the process of splitting, we apply the match function to both individual objects and ball-like regions R(q, r c ). Because a ball-like region can intersect both the partitions, the match function returns 0 in this case. The application of match to ball-like regions has a great advantage in speeding an algorithm up. As an example, we consider a partitioning that requires all objects falling into the partition 2 to be eliminated. If a ball region

4 Algorithm 1 DivideAndProcess Algorithm Input: node, match, P rocess 1, P rocess 2 1: if node is a leaf node then 2: for all objects o in node do 3: if match(o) = 1 then 4: P rocess 1({o}) 5: else 6: P rocess 2({o}) 7: end if 8: end for 9: else 10: for all node entries [q, r c, subtree] in node do 11: if match(r(q, r c )) = 1 then 12: P rocess 1({all objects stored in subtree}) 13: else if match(r(q, r c )) = 2 then 14: P rocess 2({all objects stored in subtree}) 15: else if match(r(q, r c )) = 0 then 16: DivideAndP rocess(subtree, match, P rocess 1, P rocess 2) 17: end if 18: end for 19: end if is completely contained in the partition 2, we can straightforwardly detach the whole subtree which corresponds to this ball region instead of traversing the subtree and deleting objects one by one. The individual cases of the match function are defined in Table 1. In the M-tree, distances to parent pivots are stored. We exploit these distances by defining corresponding lower and upper bounds on distances d(o, p), which allows to prune objects without actually computing the distance d(o, p) [16]. In the rest of this section, we define four specific split algorithms. The first three use the general structure of Algorithm 1 and differ in the specification of the P rocess 1 and P rocess 2 procedures. The forth is more sophisticated and defines its own algorithm structure DoubleInsert algorithm The first variant is a baseline algorithm which builds two new trees M-tree 1 and M-tree 2 by using an ordinary insert procedure. Algorithm 1 is applied with straightforward actions P rocess 1, P rocess 2 which simply insert objects into the corresponding tree. Intuitively, the construction of the new trees can be optimized by a bulk-loading algorithm, but we do not apply such optimization in this paper DeleteInsert algorithm This algorithm keeps objects of Part 1 in the original tree and the objects of Part 2 moves to a new tree. So, P rocess 1 is defined as void while P rocess 2 deletes objects from the original tree and inserts into the new tree. We implemented the delete operation as a reverse operation to insertion of an object, i.e., we locate the leaf node which contains the object to delete, delete the object and update the covering radii on the path to the root if necessary. We do not handle node under-flows by merging with neighboring nodes as in [26] but we issue reinsertion of all objects in the under-flown node and delete this node. Due to the application of the match function to ball regions, the delete operation is performed on whole subtrees, if all objects in the subtree are to be deleted. However, the insertion of the deleted objects into the new M-tree is done one by one CloneDelete algorithm In the M-tree, the insertion is an expensive operation because of lots of distance computations performed. This algorithm avoids an insert procedure by creating a clone of the original tree and deleting the respective objects from the trees. The action P rocess 1 deletes objects from the clone, while P rocess 2 deletes objects from the original tree. Again both the delete operations are optimized on the level of ball regions, so the whole subtree can be deleted at once. This process is expensive from the memory point of view, since it requires the double amount of memory LeafBall algorithm This advanced algorithm operates on leaf-level nodes represented by ball regions. The algorithm separates the regions into two sets respecting the given partitioning. If a region R(q, r c ) intersects both the partitions, the objects covered by it are separated and two new regions with the same pivot q are created. For each set of ball regions, a new M-tree is built the entries representing the leaves are all inserted into a new node. This node forms the root of the tree. If the root node is overflowing, a regular Slim-tree s split is applied and repeated until no node in the tree is overflowing. The pseudo-code is available in Algorithm 2. If some leaf nodes in the resulting trees are slim (storing only one object), these objects are reinserted in the regular way and the empty nodes are deleted. This optimization is not included in the pseudo-code for legibility reasons. The LeafBall algorithm can be further generalized to operate on non-leaf nodes, but, from our experience, the probability that ball regions overlap rapidly grows. Thus the chance for locating a ball region which is completely included in one of the split partitions is getting lower for nodes closer to the root node. 4. Performance evaluation In order to verify properties of the proposed algorithms, we use a real-life dataset consisting of ten million images

5 Algorithm 2 LeafBall Algorithm Input: node, match 1: S 1 = S 2 = 2: for all leaf nodes N in the tree rooted at node do 3: get the covering region R(q, r c ) of N 4: if match(r(q, r c )) = 1 then 5: add N to S 1 6: else if match(r(q, r c )) = 2 then 7: add N to S 2 8: else if match(r(q, r c )) = 0 then 9: create new nodes N 1, N 2 10: for all objects o in N do 11: add o to N i, where i = match(o) 12: end for 13: establish covering regions R(q, r1), c R(q, r2) c for N 1, N 2 14: add N 1 to S 1 and N 2 to S 2 15: end if 16: end for 17: {create M-tree 1 and M-tree 2} 18: for each set S in {S 1, S 2} do 19: create an empty M-tree 20: insert all leaves in the tree while having node splits disabled 21: while a node is overflowing do 22: split the node by a common node-split procedure 23: end while 24: end for gathered from the Flickr Photo Sharing System. From these images, five MPEG-7 descriptors were extracted and the images are represented as the five vectors having together 280 dimensions. In particular, the descriptors are the color structure (CS), color layout (CL), scalable color (SC), edge histogram (EH), and homogeneous texture (HT). CS, CL, and SC reflect color characteristics of the image. Local density of edge elements and their directions (sometimes called the structure or layout) is represented by EH and acts as a simple and robust substitution for shapes. Finally, HT is a texture descriptor. The distance function which measures dissimilarity of two images, is represented by a weighted sum of dissimilarity measures defined for individual MPEG-7 descriptors. The weights are 3, 2, 2, 4 and frequency (%) pairwise distance Figure 3. Distance density of the image dataset. 1 cap. 2 1 cap. 2 cap. Algorithm Dist. comp. Dist. comp. Dist. comp. Real time Real time Real time DoubleInsert 3,979,339 4,618,104 6,208, s s s DeleteInsert 2,127,120 2,425,752 3,200, s 55.0 s 71.6 s CloneDelete 418, , , s 40.7 s 40.0 s LeafBall 323, , , s 6.3 s 5.6 s Table 2. Comparison of M-tree split algorithms for the generalized-hyperplane and varying node capacity. 1 cap. 2 1 cap. 2 cap. Algorithm Dist. comp. Dist. comp. Dist. comp. Real time Real time Real time DoubleInsert 3,912,917 4,546,455 6,021, s s s DeleteInsert 1,973,954 2,250,393 3,071, s 52.3 s 67.6 s CloneDelete 210, , , s 36.9 s 36.7 s LeafBall 220, , , s 4.2 s 4.6 s Table 3. Comparison of M-tree split algorithms for the ball partitioning and varying node capacity. 0.5 for CS, CL, SC, EH and HT, respectively. The distance measure for CS and SC is the simple L 1 metric, while the measures for the others are tailored to their specific nature. The distance density of the 10-million dataset is depicted in Figure 3. Firstly, we compare the individual algorithms on 100,000 images organized in a single M-tree. In order to make the comparison objective, we used three different capacities of internal and leaf nodes. The first configuration (denoted in the tables as 1 2 cap.) allows 21 and 84 objects stored in internal and leaf nodes, respectively. The resulting tree had 1,951 leaf nodes. In the second configuration (1 cap.), we used 42 and 168 objects, which resulted in 1,168 leaf nodes. The parameters for the third configuration (2 cap.) are 84 and 336 objects having resulted in 578 leaf nodes. We used five different pairs of pivots for the generalized-hyperplane partitioning to split the M-tree.

6 In Table 2, there are results in number of computations of the distance function d and real time averaged over five runs. The same experiment was conducted for the ball partitioning where five different pivots with appropriate values of radius were selected. These results are available in Table 3. All settings of partitioning produced almost equallysized partitions. The experiments were executed on a regular PC (Athlon64 2.6GHz, 2GB RAM). From the results, we read that DoubleInsert is by far slower than the others. In particular, LeafBall was 19 and 32 times faster in distance computations and 17 and 25 times in real time for the middle capacities. In all algorithms, the content of the original tree was separated by calling the match function. Nonetheless, neither of the optimizations (precomputed distances to parent pivots and match defined on regions) helped this function to avoid many distance evaluations. So, all algorithms needed almost 200,000 and 100,000 distance computations for separating objects in the case of the generalizedhyperplane and ball partitioning, respectively. By subtracting these numbers from the total distance computations, we get 100 times performance gain comparing LeafBall and DoubleInsert. The trends when the capacity of nodes is changed are interesting. In the case of the insertion-based algorithms (DoubleInsert and DeleteInsert), the split costs correspond proportionally to the node capacity. Such behavior is induced by the split algorithm of M-tree nodes which computes a spanning tree internally. This algorithm requires more distance computations if the node capacity increases. On the other hand, the costs of CloneDelete and LeafBall are inversely proportional to the node capacity. The smaller node capacity implies more leaf nodes, so these two algorithms must inspect more leaf nodes and also more internal nodes. From all the results, we can also induce that the insert operation is very expensive due to frequent node splits. The costs of insertion can be decreased by applying a bulk-loading algorithm. We have not applied any bulkloading technique, however, the authors of [7] claim a 25% gain in CPU costs only. The other algorithm [26] to bulkload M-tree is not studied from the CPU costs point of view, but it decreases the I/O costs by 50%. Secondly, we tested the efficiency of resulting trees in terms of querying. We run 20 range and 20 knn queries on the resulting trees M-tree 1 and M-tree 2 which both have the capacity of internal and leaf nodes set to 42 and 168 objects, respectively. We fixed the radius to 0.8 and the number of neighbors to 20 for all the queries. The range queries returned 20 objects on average. Table 4 contains average values of distance computations and real time over the batch of queries. The search performance of the trees is almost identical. Then, we inserted additional 100,000 objects respecting the partitioning which produced the trees M-tree 1 and M-tree 2. The insertion costs are included in the table and they support the claim that the insertion is expensive, because 32 distance computations were needed to insert one object, on average. After populating the trees, we repeated the same querying. Obviously, the response times increased because the content of each tree was doubled, but the efficiency of trees was retained without any regards to the algorithm used to construct the tree. We experienced the same behavior when the ball partitioning was applied and for the limited space we did not include them. Finally, we incorporated the LeafBall algorithm in the M-Chord and built a P2P network on 10 million images. The M-Chord used twenty pivots for the transformation to the one-dimensional domain and the same twenty pivots were exploited in additional filtering in the M-tree (extension taken from the PM-tree [27]). The node capacities of M-tree were the same as above. All M-Chord peers together needed 200 million distance computations to compute the transformation and we do not include these costs in comparison. First, we built the network with the DoubleInsert algorithm for splitting the peers. In total, peer split procedure was called 499 times, so the network consisted of 500 peers. The overall costs of inserting all objects, including the split costs, was 849,822,669 distance computations. Next, we replaced the split algorithm with LeafBall and the overall cost dropped down to 611,667,738 distance computations, which is an improvement by 28 %. However, these costs include insertions of objects one by one, thus, the stepwise process of building each M-tree, which needed about 603 million distance computations. The remaining portion is the costs induced by splits resulting in 28 times performance boost of the LeafBall algorithm. Consequently, one M-tree split took 20 s on average, which was reduced to 1.2 s by the LeafBall algorithm. This is a very important gain because the peer is blocked during the split, so that it cannot accept new data nor answer queries. 5. Conclusions and future work In this paper, we have focused on algorithms to efficiently split data stored in metric-based index structures of the M-tree family. The work was primarily motivated by peer splits in structured peer-to-peer networks. We presented four algorithms and compared them on a real-life image database. The high efficiency of the proposed Leaf- Ball algorithm was confirmed by a large-scale experiment, where we built the M-Chord peer-to-peer index structure on 10 million images. The LeafBall algorithm compared to the baseline DoubleInsert algorithm provides about 28 times performance boost. As the future work, we want to study the possibility that the M-tree itself would find a suitable partitioning. The partitioning is then reported to the peer instead of the current procedures when the peer dictates the partitioning to the M- tree. This may bring further improvement since the M-tree

7 Algorithm Dist. comp. Real time Dist. comp. Real time DoubleInsert M-tree 1 M-tree 2 Range search #1 24,433 23, s 0.58 s knn search #1 39,612 42, s 1.00 s Insert 3,253,968 3,274, s s Range search #2 46,680 44, s 1.09 s knn search #2 75,448 81, s 2.11 s DeleteInsert M-tree 1 M-tree 2 Range search #1 24,236 24, s 0.60 s knn search #1 39,603 42, s 1.01 s Insert 2,927,322 3,291, s s Range search #2 46,489 44, s 1.10 s knn search #2 75,531 81, s 1.98 s CloneDelete M-tree 1 M-tree 2 Range search #1 24,236 24, s 0.58 s knn search #1 39,603 42, s 1.02 s Insert 2,943,171 2,938, s s Range search #2 46,540 44, s 1.10 s knn search #2 75,634 81, s 1.97 s LeafBall M-tree 1 M-tree 2 Range search #1 24,213 24, s 0.58 s knn search #1 39,581 42, s 1.02 s Insert 3,102,343 3,041, s s Range search #2 46,204 44, s 1.17 s knn search #2 75,321 81, s 1.94 s Table 4. Search performance and insertion costs of the proposed algorithms in the case of generalized-hyperplane partitioning. can exploit its internal structures to optimize the whole process. Another research challenge is to generalize the algorithms to partitioning principles which produce more than two partitions, e.g. the Voronoi-like partitioning Acknowledgements This work was partially supported by: the EU IST FP6 project (SAPIR), the national research project 1ET , the Czech Science Foundation projects 201/07/P240 and 201/06/1338. We are grateful to our SAPIR partners from ISTI-CNR, Pisa, who crawled the image dataset and extracted the MPEG-7 features. References [1] L. Arge, M. de Berg, H. J. Haverkort, and K. Yi. The priority r-tree: a practically efficient and worst-case optimal r-tree. In Proceedings of the ACM SIGMOD International Conference on Management of data (SIGMOD 2004), pages , New York, NY, USA, ACM. [2] F. Banaei-Kashani and C. Shahabi. SWAM: A family of access methods for similarity-search in peer-to-peer data networks. In Proceedings of the Thirteenth ACM conference on Information and Knowledge Management (CIKM 2004), pages ACM Press, [3] M. Batko, D. Novak, F. Falchi, and P. Zezula. On scalability of the similarity search in the world of peers. In Proceedings of First International Conference on Scalable Information Systems (INFOSCALE 2006), Hong Kong, May 30 - June 1, pages ACM Press, [4] M. Bender, S. Michel, P. Triantafillou, G. Weikum, and C. Zimmer. MINERVA: Collaborative P2P search. In Proceedings of the 31st international conference on Very large data bases (VLDB 2005), pages VLDB Endowment, [5] S. Berchtold, C. Böhm, and H.-P. Kriegel. Improving the query performance of high-dimensional index structures by bulk-load operations. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT 1998), pages , London, UK, Springer- Verlag. [6] S. Brin. Near neighbor search in large metric spaces. In U. Dayal, P. M. D. Gray, and S. Nishio, editors, Proceedings of the 21th International Conference on Very Large Data Bases (VLDB 1995), Zurich, Switzerland, September 11-15, 1995, pages Morgan Kaufmann, [7] P. Ciaccia and M. Patella. Bulk loading the M-tree. In Proceedings of the 9th Australasian Database Conference (ADC 1998), Perth, Australia, February 2-3, 1998, volume 20(2) of Australian Computer Science Communications, pages Springer, [8] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, editors, Proceedings of the 23rd International Conference on Very Large Data

8 Bases (VLDB 1997), Athens, Greece, August 25-29, 1997, pages Morgan Kaufmann, [9] J. V. den Bercken and B. Seeger. An evaluation of generic bulk loading techniques. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001), pages , San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. [10] J. V. den Bercken, B. Seeger, and P. Widmayer. A generic approach to bulk loading multidimensional index structures. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 1997), pages , San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. [11] V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-Index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1):9 33, [12] F. Falchi, C. Gennaro, and P. Zezula. A contentaddressable network for similarity search in metric spaces. In Proceedings of the the 2nd International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P 2005), Trondheim, Norway, August 28-29, 2005, pages , [13] P. Ganesan, B. Yang, and H. Garcia-Molina. One torus to rule them all: Multi-dimensional queries in P2P systems. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB 2004), pages 19 24, New York, NY, USA, ACM Press. [14] T. M. Ghanem, R. Shah, M. F. Mokbel, W. G. Aref, and J. S. Vitter. Bulk operations for space-partitioning trees. In Proceedings of the 20th International Conference on Data Engineering (ICDE 2004), page 29, Washington, DC, USA, IEEE Computer Society. [15] G. R. Hjaltason and H. Samet. Improved bulk-loading algorithms for quadtrees. In Proceedings of the 7th International Symposium on Advances in Geographic Information Systems (ACM GIS 1999), November 2-6, 1999, Kansas City, USA, pages ACM Press, [16] G. R. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems (TODS 2003), 28(4): , [17] I. Kamel and C. Faloutsos. Hilbert R-tree: An Improved R- tree using Fractals. In Proceedings of the 20th International Conference on Very Large Databases (VLDB 1994), pages , Santiago, Chile, [18] T. M. Klein, K. J. Parzygnat, and A. L. Tharp. Optimal b- tree packing. Information Systems, 16(2): , [19] S. T. Leutenegger and D. M. Nicol. Efficient bulk-loading of gridfiles. IEEE Transactions on Knowledge and Data Engineering, 09(3): , [20] J. Li, D. Rotem, and J. Srivastava. Algorithms for loading parallel grid files. SIGMOD Record, 22(2): , [21] G. Navarro. Searching in metric spaces by spatial approximation. In Proceedings of the 6th International Symposium on String Processing and Information Retrieval (SPIRE 1999), Cancun, Mexico, September 21-24, 1999, pages IEEE Computer Society, [22] D. Novak and P. Zezula. M-Chord: A scalable distributed similarity search structure. In Proceedings of First International Conference on Scalable Information Systems (IN- FOSCALE 2006), Hong Kong, May 30 - June 1, pages IEEE Computer Society, [23] M. H. Overmars and J. van Leeuwen. Dynamic multidimensional data structures based on quad- and kd trees. Acta Informatica, 17(3): , [24] A. Papadopoulos and Y. Manolopoulos. Parallel bulkloading of spatial data. Parallel Computing, 29(10): , [25] H. Samet. Foundations of Multidimensional And Metric Data Structures. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, [26] A. P. Sexton and R. Swinbank. Bulk loading the m-tree to enhance query performance. In M. H. Williams and L. M. MacKinnon, editors, Proceedings of 21st British National Conference on Databases (BNCOD 2004), Edinburgh, UK, July 7-9, 2004, volume 3112 of Lecture Notes in Computer Science, pages Springer, [27] T. Skopal. Pivoting M-tree: A metric access method for efficient similarity search. In V. Snášel, J. Pokorný, and K. Richta, editors, Proceedings of the Annual International Workshop on DAtabases, TExts, Specifications and Objects (DATESO 2004), Desna, Czech Republic, April 14-16, 2004, volume 98 of CEUR Workshop Proceedings. Technical University of Aachen (RWTH), [28] C. Traina, Jr., A. J. M. Traina, B. Seeger, and C. Faloutsos. Slim-Trees: High performance metric trees minimizing overlap between nodes. In C. Zaniolo, P. C. Lockemann, M. H. Scholl, and T. Grust, editors, Proceedings of the 7th International Conference on Extending Database Technology (EDBT 2000), Konstanz, Germany, March 27-31, 2000, volume 1777 of Lecture Notes in Computer Science, pages Springer, [29] C. Yu, B. C. Ooi, K.-L. Tan, and H. V. Jagadish. Indexing the distance: An efficient method to knn processing. In P. M. G. Apers, P. Atzeni, S. Ceri, S. Paraboschi, K. Ramamohanarao, and R. T. Snodgrass, editors, Proceedings of 27th International Conference on Very Large Data Bases (VLDB 2001), Roma, Italy, September 11-14, 2001, pages Morgan Kaufmann, [30] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach, volume 32 of Advances in Database Systems. Springer, 2005.

Crawling, Indexing, and Similarity Searching Images on the Web

Crawling, Indexing, and Similarity Searching Images on the Web Crawling, Indexing, and Similarity Searching Images on the Web (Extended Abstract) M. Batko 1 and F. Falchi 2 and C. Lucchese 2 and D. Novak 1 and R. Perego 2 and F. Rabitti 2 and J. Sedmidubsky 1 and

More information

Pivoting M-tree: A Metric Access Method for Efficient Similarity Search

Pivoting M-tree: A Metric Access Method for Efficient Similarity Search Pivoting M-tree: A Metric Access Method for Efficient Similarity Search Tomáš Skopal Department of Computer Science, VŠB Technical University of Ostrava, tř. 17. listopadu 15, Ostrava, Czech Republic tomas.skopal@vsb.cz

More information

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague Praha & EU: Investujeme do vaší budoucnosti Evropský sociální fond course: Searching the Web and Multimedia Databases (BI-VWM) Tomáš Skopal, 2011 SS2010/11 doc. RNDr. Tomáš Skopal, Ph.D. Department of

More information

Scalability Comparison of Peer-to-Peer Similarity-Search Structures

Scalability Comparison of Peer-to-Peer Similarity-Search Structures Scalability Comparison of Peer-to-Peer Similarity-Search Structures Michal Batko a David Novak a Fabrizio Falchi b Pavel Zezula a a Masaryk University, Brno, Czech Republic b ISTI-CNR, Pisa, Italy Abstract

More information

MESSIF: Metric Similarity Search Implementation Framework

MESSIF: Metric Similarity Search Implementation Framework MESSIF: Metric Similarity Search Implementation Framework Michal Batko David Novak Pavel Zezula {xbatko,xnovak8,zezula}@fi.muni.cz Masaryk University, Brno, Czech Republic Abstract The similarity search

More information

MUFIN Basics. MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic SEMWEB 1

MUFIN Basics. MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic SEMWEB 1 MUFIN Basics MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic mufin@fi.muni.cz SEMWEB 1 Search problem SEARCH index structure data & queries infrastructure SEMWEB 2 The thesis

More information

Nearest Neighbor Search by Branch and Bound

Nearest Neighbor Search by Branch and Bound Nearest Neighbor Search by Branch and Bound Algorithmic Problems Around the Web #2 Yury Lifshits http://yury.name CalTech, Fall 07, CS101.2, http://yury.name/algoweb.html 1 / 30 Outline 1 Short Intro to

More information

A Pivot-based Index Structure for Combination of Feature Vectors

A Pivot-based Index Structure for Combination of Feature Vectors A Pivot-based Index Structure for Combination of Feature Vectors Benjamin Bustos Daniel Keim Tobias Schreck Department of Computer and Information Science, University of Konstanz Universitätstr. 10 Box

More information

Benchmarking the UB-tree

Benchmarking the UB-tree Benchmarking the UB-tree Michal Krátký, Tomáš Skopal Department of Computer Science, VŠB Technical University of Ostrava, tř. 17. listopadu 15, Ostrava, Czech Republic michal.kratky@vsb.cz, tomas.skopal@vsb.cz

More information

X-tree. Daniel Keim a, Benjamin Bustos b, Stefan Berchtold c, and Hans-Peter Kriegel d. SYNONYMS Extended node tree

X-tree. Daniel Keim a, Benjamin Bustos b, Stefan Berchtold c, and Hans-Peter Kriegel d. SYNONYMS Extended node tree X-tree Daniel Keim a, Benjamin Bustos b, Stefan Berchtold c, and Hans-Peter Kriegel d a Department of Computer and Information Science, University of Konstanz b Department of Computer Science, University

More information

Branch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits

Branch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits Branch and Bound Algorithms for Nearest Neighbor Search: Lecture 1 Yury Lifshits http://yury.name Steklov Institute of Mathematics at St.Petersburg California Institute of Technology 1 / 36 Outline 1 Welcome

More information

Clustered Pivot Tables for I/O-optimized Similarity Search

Clustered Pivot Tables for I/O-optimized Similarity Search Clustered Pivot Tables for I/O-optimized Similarity Search Juraj Moško Charles University in Prague, Faculty of Mathematics and Physics, SIRET research group mosko.juro@centrum.sk Jakub Lokoč Charles University

More information

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University Using the Holey Brick Tree for Spatial Data in General Purpose DBMSs Georgios Evangelidis Betty Salzberg College of Computer Science Northeastern University Boston, MA 02115-5096 1 Introduction There is

More information

A Distributed Incremental Nearest Neighbor Algorithm

A Distributed Incremental Nearest Neighbor Algorithm A Distributed Incremental Nearest Neighbor Algorithm Fabrizio Falchi, Claudio Gennaro, Fausto Rabitti and Pavel Zezula ISTI-CNR, Pisa, Italy Faculty of Informatics, Masaryk University, Brno, Czech Republic

More information

Outline. Other Use of Triangle Inequality Algorithms for Nearest Neighbor Search: Lecture 2. Orchard s Algorithm. Chapter VI

Outline. Other Use of Triangle Inequality Algorithms for Nearest Neighbor Search: Lecture 2. Orchard s Algorithm. Chapter VI Other Use of Triangle Ineuality Algorithms for Nearest Neighbor Search: Lecture 2 Yury Lifshits http://yury.name Steklov Institute of Mathematics at St.Petersburg California Institute of Technology Outline

More information

Exploring intersection trees for indexing metric spaces

Exploring intersection trees for indexing metric spaces Exploring intersection trees for indexing metric spaces Zineddine KOUAHLA Laboratoire d informatique de Nantes Atlantique (LINA, UMR CNRS 6241) Polytechnic School of the University of Nantes, Nantes, France

More information

Approximate Similarity Search in Metric Spaces using Inverted Files

Approximate Similarity Search in Metric Spaces using Inverted Files Approximate Similarity Search in Metric Spaces using Inverted Files Giuseppe Amato ISTI-CNR Via G. Moruzzi, 56, Pisa, Italy giuseppe.amato@isti.cnr.it ABSTRACT We propose a new approach to perform approximate

More information

Finding the k-closest pairs in metric spaces

Finding the k-closest pairs in metric spaces Finding the k-closest pairs in metric spaces Hisashi Kurasawa The University of Tokyo 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, JAPAN kurasawa@nii.ac.jp Atsuhiro Takasu National Institute of Informatics 2-1-2

More information

Speeding up Queries in a Leaf Image Database

Speeding up Queries in a Leaf Image Database 1 Speeding up Queries in a Leaf Image Database Daozheng Chen May 10, 2007 Abstract We have an Electronic Field Guide which contains an image database with thousands of leaf images. We have a system which

More information

Spatial Data Management

Spatial Data Management Spatial Data Management [R&G] Chapter 28 CS432 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite imagery, where each pixel stores a measured value

More information

A Scalable Index Mechanism for High-Dimensional Data in Cluster File Systems

A Scalable Index Mechanism for High-Dimensional Data in Cluster File Systems A Scalable Index Mechanism for High-Dimensional Data in Cluster File Systems Kyu-Woong Lee Hun-Soon Lee, Mi-Young Lee, Myung-Joon Kim Abstract We address the problem of designing index structures that

More information

Multidimensional Indexing The R Tree

Multidimensional Indexing The R Tree Multidimensional Indexing The R Tree Module 7, Lecture 1 Database Management Systems, R. Ramakrishnan 1 Single-Dimensional Indexes B+ trees are fundamentally single-dimensional indexes. When we create

More information

Similarity Searching in Structured and Unstructured P2P Networks

Similarity Searching in Structured and Unstructured P2P Networks Similarity Searching in Structured and Unstructured P2P Networks Vlastislav Dohnal and Pavel Zezula Faculty of Informatics, Masaryk University Botanicka 68a, 602 00 Brno, Czech Republic {dohnal,zezula}@fi.muni.cz

More information

The Effects of Dimensionality Curse in High Dimensional knn Search

The Effects of Dimensionality Curse in High Dimensional knn Search The Effects of Dimensionality Curse in High Dimensional knn Search Nikolaos Kouiroukidis, Georgios Evangelidis Department of Applied Informatics University of Macedonia Thessaloniki, Greece Email: {kouiruki,

More information

D-Index: Distance Searching Index for Metric Data Sets

D-Index: Distance Searching Index for Metric Data Sets Multimedia Tools and Applications, 21, 9 33, 23 c 23 Kluwer Academic Publishers. Manufactured in The Netherlands. D-Index: Distance Searching Index for Metric Data Sets VLASTISLAV DOHNAL Masaryk University

More information

Fully Dynamic Metric Access Methods based on Hyperplane Partitioning

Fully Dynamic Metric Access Methods based on Hyperplane Partitioning Fully Dynamic Metric Access Methods based on Hyperplane Partitioning Gonzalo Navarro a, Roberto Uribe-Paredes b,c a Department of Computer Science, University of Chile, Santiago, Chile. gnavarro@dcc.uchile.cl.

More information

A Self-Adaptive Insert Strategy for Content-Based Multidimensional Database Storage

A Self-Adaptive Insert Strategy for Content-Based Multidimensional Database Storage A Self-Adaptive Insert Strategy for Content-Based Multidimensional Database Storage Sebastian Leuoth, Wolfgang Benn Department of Computer Science Chemnitz University of Technology 09107 Chemnitz, Germany

More information

Improving the Performance of M-tree Family by Nearest-Neighbor Graphs

Improving the Performance of M-tree Family by Nearest-Neighbor Graphs Improving the Performance of M-tree Family by Nearest-Neighbor Graphs Tomáš Skopal, David Hoksza Charles University in Prague, FMP, Department of Software Engineering Malostranské nám. 25, 118 00 Prague,

More information

Spatial Data Management

Spatial Data Management Spatial Data Management Chapter 28 Database management Systems, 3ed, R. Ramakrishnan and J. Gehrke 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite

More information

Visualizing and Animating Search Operations on Quadtrees on the Worldwide Web

Visualizing and Animating Search Operations on Quadtrees on the Worldwide Web Visualizing and Animating Search Operations on Quadtrees on the Worldwide Web František Brabec Computer Science Department University of Maryland College Park, Maryland 20742 brabec@umiacs.umd.edu Hanan

More information

Improving the Query Performance of High-Dimensional Index Structures by Bulk Load Operations

Improving the Query Performance of High-Dimensional Index Structures by Bulk Load Operations Improving the Query Performance of High-Dimensional Index Structures by Bulk Load Operations Stefan Berchtold, Christian Böhm 2, and Hans-Peter Kriegel 2 AT&T Labs Research, 8 Park Avenue, Florham Park,

More information

Bulk-loading Dynamic Metric Access Methods

Bulk-loading Dynamic Metric Access Methods Bulk-loading Dynamic Metric Access Methods Thiago Galbiatti Vespa 1, Caetano Traina Jr 1, Agma Juci Machado Traina 1 1 ICMC - Institute of Mathematics and Computer Sciences USP - University of São Paulo

More information

Geometric data structures:

Geometric data structures: Geometric data structures: Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Sham Kakade 2017 1 Announcements: HW3 posted Today: Review: LSH for Euclidean distance Other

More information

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Department ofradim Computer Bača Science, and Technical David Bednář University of Ostrava Czech

More information

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

Introduction to Indexing R-trees. Hong Kong University of Science and Technology Introduction to Indexing R-trees Dimitris Papadias Hong Kong University of Science and Technology 1 Introduction to Indexing 1. Assume that you work in a government office, and you maintain the records

More information

Using Natural Clusters Information to Build Fuzzy Indexing Structure

Using Natural Clusters Information to Build Fuzzy Indexing Structure Using Natural Clusters Information to Build Fuzzy Indexing Structure H.Y. Yue, I. King and K.S. Leung Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, New Territories,

More information

SIMILARITY SEARCH The Metric Space Approach. Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

SIMILARITY SEARCH The Metric Space Approach. Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko SIMILARITY SEARCH The Metric Space Approach Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko Table of Contents Part I: Metric searching in a nutshell Foundations of metric space searching

More information

Striped Grid Files: An Alternative for Highdimensional

Striped Grid Files: An Alternative for Highdimensional Striped Grid Files: An Alternative for Highdimensional Indexing Thanet Praneenararat 1, Vorapong Suppakitpaisarn 2, Sunchai Pitakchonlasap 1, and Jaruloj Chongstitvatana 1 Department of Mathematics 1,

More information

Compression of the Stream Array Data Structure

Compression of the Stream Array Data Structure Compression of the Stream Array Data Structure Radim Bača and Martin Pawlas Department of Computer Science, Technical University of Ostrava Czech Republic {radim.baca,martin.pawlas}@vsb.cz Abstract. In

More information

Principles of Data Management. Lecture #14 (Spatial Data Management)

Principles of Data Management. Lecture #14 (Spatial Data Management) Principles of Data Management Lecture #14 (Spatial Data Management) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s Notable News v Project

More information

LOBS: Load Balancing for Similarity Peer-to-Peer Structures

LOBS: Load Balancing for Similarity Peer-to-Peer Structures LOBS: Load Balancing for Similarity Peer-to-Peer Structures David Novák and Pavel Zezula Masaryk University, Brno, Czech Republic xnovak8@fi.muni.cz zezula@fi.muni.cz Abstract. The concept of peer-to-peer

More information

On Support of Ordering in Multidimensional Data Structures

On Support of Ordering in Multidimensional Data Structures On Support of Ordering in Multidimensional Data Structures Filip Křižka, Michal Krátký, and Radim Bača Department of Computer Science, VŠB Technical University of Ostrava 17. listopadu 15, Ostrava, Czech

More information

Foundations of Multidimensional and Metric Data Structures

Foundations of Multidimensional and Metric Data Structures Foundations of Multidimensional and Metric Data Structures Hanan Samet University of Maryland, College Park ELSEVIER AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE

More information

Issues in Using Knowledge to Perform Similarity Searching in Multimedia Databases without Redundant Data Objects

Issues in Using Knowledge to Perform Similarity Searching in Multimedia Databases without Redundant Data Objects Issues in Using Knowledge to Perform Similarity Searching in Multimedia Databases without Redundant Data Objects Leonard Brown The University of Oklahoma School of Computer Science 200 Felgar St. EL #114

More information

HISTORICAL BACKGROUND

HISTORICAL BACKGROUND VALID-TIME INDEXING Mirella M. Moro Universidade Federal do Rio Grande do Sul Porto Alegre, RS, Brazil http://www.inf.ufrgs.br/~mirella/ Vassilis J. Tsotras University of California, Riverside Riverside,

More information

Foundations of Nearest Neighbor Queries in Euclidean Space

Foundations of Nearest Neighbor Queries in Euclidean Space Foundations of Nearest Neighbor Queries in Euclidean Space Hanan Samet Computer Science Department Center for Automation Research Institute for Advanced Computer Studies University of Maryland College

More information

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 A Real Time GIS Approximation Approach for Multiphase

More information

A Simple, Efficient, Parallelizable Algorithm for Approximated Nearest Neighbors

A Simple, Efficient, Parallelizable Algorithm for Approximated Nearest Neighbors A Simple, Efficient, Parallelizable Algorithm for Approximated Nearest Neighbors Sebastián Ferrada 1, Benjamin Bustos 1, and Nora Reyes 2 1 Institute for Foundational Research on Data Department of Computer

More information

Benchmarking a B-tree compression method

Benchmarking a B-tree compression method Benchmarking a B-tree compression method Filip Křižka, Michal Krátký, and Radim Bača Department of Computer Science, Technical University of Ostrava, Czech Republic {filip.krizka,michal.kratky,radim.baca}@vsb.cz

More information

Surrounding Join Query Processing in Spatial Databases

Surrounding Join Query Processing in Spatial Databases Surrounding Join Query Processing in Spatial Databases Lingxiao Li (B), David Taniar, Maria Indrawan-Santiago, and Zhou Shao Monash University, Melbourne, Australia lli278@student.monash.edu, {david.taniar,maria.indrawan,joe.shao}@monash.edu

More information

Combining Local and Global Visual Feature Similarity using a Text Search Engine

Combining Local and Global Visual Feature Similarity using a Text Search Engine Combining Local and Global Visual Feature Similarity using a Text Search Engine Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro, Fausto Rabitti {giuseppe.amato, fabrizio.falchi, paolo.bolettieri,

More information

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: KH 116 Fall 2017 First Grading for Reading Assignment Weka v 6 weeks v https://weka.waikato.ac.nz/dataminingwithweka/preview

More information

Обработка и оптимизация запросов

Обработка и оптимизация запросов Обработка и оптимизация запросов Б.А. Новиков Осенний семестр 2014 h=p://www.math.spbu.ru/user/boris novikov/courses/ 20 сентября 2014 1 Antonin Gu=man. 1984. R- trees: a dynamic index structure for spazal

More information

Finding Shortest Path on Land Surface

Finding Shortest Path on Land Surface Finding Shortest Path on Land Surface Lian Liu, Raymond Chi-Wing Wong Hong Kong University of Science and Technology June 14th, 211 Introduction Land Surface Land surfaces are modeled as terrains A terrain

More information

PARALLEL STRATEGIES FOR MULTIMEDIA WEB SERVICES

PARALLEL STRATEGIES FOR MULTIMEDIA WEB SERVICES PARALLEL STRATEGIES FOR MULTIMEDIA WEB SERVICES Gil-Costa Verónica 1, Printista Marcela 1 and Mauricio Marín 2 1 DCC, University of San Luis, San Luis, Argentina {gvcosta,mprinti}@unsl.edu.ar 2 Yahoo!

More information

Clustering-Based Similarity Search in Metric Spaces with Sparse Spatial Centers

Clustering-Based Similarity Search in Metric Spaces with Sparse Spatial Centers Clustering-Based Similarity Search in Metric Spaces with Sparse Spatial Centers Nieves Brisaboa 1, Oscar Pedreira 1, Diego Seco 1, Roberto Solar 2,3, Roberto Uribe 2,3 1 Database Laboratory, University

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

So, we want to perform the following query:

So, we want to perform the following query: Abstract This paper has two parts. The first part presents the join indexes.it covers the most two join indexing, which are foreign column join index and multitable join index. The second part introduces

More information

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version Lecture Notes: External Interval Tree Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk This lecture discusses the stabbing problem. Let I be

More information

ISSUES IN SPATIAL DATABASES AND GEOGRAPHICAL INFORMATION SYSTEMS (GIS) HANAN SAMET

ISSUES IN SPATIAL DATABASES AND GEOGRAPHICAL INFORMATION SYSTEMS (GIS) HANAN SAMET zk0 ISSUES IN SPATIAL DATABASES AND GEOGRAPHICAL INFORMATION SYSTEMS (GIS) HANAN SAMET COMPUTER SCIENCE DEPARTMENT AND CENTER FOR AUTOMATION RESEARCH AND INSTITUTE FOR ADVANCED COMPUTER STUDIES UNIVERSITY

More information

A Metric Cache for Similarity Search

A Metric Cache for Similarity Search A Metric Cache for Similarity Search Fabrizio Falchi ISTI-CNR Pisa, Italy fabrizio.falchi@isti.cnr.it Raffaele Perego ISTI-CNR Pisa, Italy raffaele.perego@isti.cnr.it Claudio Lucchese ISTI-CNR Pisa, Italy

More information

SILC: Efficient Query Processing on Spatial Networks

SILC: Efficient Query Processing on Spatial Networks Hanan Samet hjs@cs.umd.edu Department of Computer Science University of Maryland College Park, MD 20742, USA Joint work with Jagan Sankaranarayanan and Houman Alborzi Proceedings of the 13th ACM International

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

Efficient Range Query Processing on Uncertain Data

Efficient Range Query Processing on Uncertain Data Efficient Range Query Processing on Uncertain Data Andrew Knight Rochester Institute of Technology Department of Computer Science Rochester, New York, USA andyknig@gmail.com Manjeet Rege Rochester Institute

More information

Merging R-trees. Abstract. 1. Introduction

Merging R-trees. Abstract. 1. Introduction R-trees Vasilis Vasaitis Alexandros Nanopoulos Panayiotis Bozanis Dept. of Informatics Dept. of Informatics Computer & Commun. Eng. Dept. Aristotle University Aristotle University University of Thessaly

More information

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton

More information

The Unified Segment Tree and its Application to the Rectangle Intersection Problem

The Unified Segment Tree and its Application to the Rectangle Intersection Problem CCCG 2013, Waterloo, Ontario, August 10, 2013 The Unified Segment Tree and its Application to the Rectangle Intersection Problem David P. Wagner Abstract In this paper we introduce a variation on the multidimensional

More information

Nearest neighbors. Focus on tree-based methods. Clément Jamin, GUDHI project, Inria March 2017

Nearest neighbors. Focus on tree-based methods. Clément Jamin, GUDHI project, Inria March 2017 Nearest neighbors Focus on tree-based methods Clément Jamin, GUDHI project, Inria March 2017 Introduction Exact and approximate nearest neighbor search Essential tool for many applications Huge bibliography

More information

SISTEMI PER LA RICERCA

SISTEMI PER LA RICERCA SISTEMI PER LA RICERCA DI DATI MULTIMEDIALI Claudio Lucchese e Salvatore Orlando Terzo Workshop di Dipartimento Dipartimento di Informatica Università degli studi di Venezia Ringrazio le seguenti persone,

More information

A Case for Merge Joins in Mediator Systems

A Case for Merge Joins in Mediator Systems A Case for Merge Joins in Mediator Systems Ramon Lawrence Kirk Hackert IDEA Lab, Department of Computer Science, University of Iowa Iowa City, IA, USA {ramon-lawrence, kirk-hackert}@uiowa.edu Abstract

More information

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis 1 Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis 1 Storing data on disk The traditional storage hierarchy for DBMSs is: 1. main memory (primary storage) for data currently

More information

Comparative Study of Subspace Clustering Algorithms

Comparative Study of Subspace Clustering Algorithms Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Geometric and Thematic Integration of Spatial Data into Maps

Geometric and Thematic Integration of Spatial Data into Maps Geometric and Thematic Integration of Spatial Data into Maps Mark McKenney Department of Computer Science, Texas State University mckenney@txstate.edu Abstract The map construction problem (MCP) is defined

More information

Parallelizing Frequent Itemset Mining with FP-Trees

Parallelizing Frequent Itemset Mining with FP-Trees Parallelizing Frequent Itemset Mining with FP-Trees Peiyi Tang Markus P. Turkia Department of Computer Science Department of Computer Science University of Arkansas at Little Rock University of Arkansas

More information

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase Bumjoon Jo and Sungwon Jung (&) Department of Computer Science and Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 04107,

More information

Geometric Data Structures

Geometric Data Structures Geometric Data Structures 1 Data Structure 2 Definition: A data structure is a particular way of organizing and storing data in a computer for efficient search and retrieval, including associated algorithms

More information

Metric inverted - an efficient inverted indexing method for metric spaces

Metric inverted - an efficient inverted indexing method for metric spaces Metric inverted - an efficient inverted indexing method for metric spaces Benjamin Sznajder, Jonathan Mamou, Yosi Mass, and Michal Shmueli-Scheuer IBM Haifa Research Laboratory, Mount Carmel Campus, Haifa,

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

R-Trees. Accessing Spatial Data

R-Trees. Accessing Spatial Data R-Trees Accessing Spatial Data In the beginning The B-Tree provided a foundation for R- Trees. But what s a B-Tree? A data structure for storing sorted data with amortized run times for insertion and deletion

More information

Lecture 6: External Interval Tree (Part II) 3 Making the external interval tree dynamic. 3.1 Dynamizing an underflow structure

Lecture 6: External Interval Tree (Part II) 3 Making the external interval tree dynamic. 3.1 Dynamizing an underflow structure Lecture 6: External Interval Tree (Part II) Yufei Tao Division of Web Science and Technology Korea Advanced Institute of Science and Technology taoyf@cse.cuhk.edu.hk 3 Making the external interval tree

More information

Multidimensional Indexes [14]

Multidimensional Indexes [14] CMSC 661, Principles of Database Systems Multidimensional Indexes [14] Dr. Kalpakis http://www.csee.umbc.edu/~kalpakis/courses/661 Motivation Examined indexes when search keys are in 1-D space Many interesting

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Solid Modeling. Thomas Funkhouser Princeton University C0S 426, Fall Represent solid interiors of objects

Solid Modeling. Thomas Funkhouser Princeton University C0S 426, Fall Represent solid interiors of objects Solid Modeling Thomas Funkhouser Princeton University C0S 426, Fall 2000 Solid Modeling Represent solid interiors of objects Surface may not be described explicitly Visible Human (National Library of Medicine)

More information

Datenbanksysteme II: Multidimensional Index Structures 2. Ulf Leser

Datenbanksysteme II: Multidimensional Index Structures 2. Ulf Leser Datenbanksysteme II: Multidimensional Index Structures 2 Ulf Leser Content of this Lecture Introduction Partitioned Hashing Grid Files kdb Trees kd Tree kdb Tree R Trees Example: Nearest neighbor image

More information

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague Praha & EU: Investujeme do vaší budoucnosti Evropský sociální fond course: Searching the Web and Multimedia Databases (BI-VWM) Tomáš Skopal, 2011 SS2010/11 doc. RNDr. Tomáš Skopal, Ph.D. Department of

More information

DDS Dynamic Search Trees

DDS Dynamic Search Trees DDS Dynamic Search Trees 1 Data structures l A data structure models some abstract object. It implements a number of operations on this object, which usually can be classified into l creation and deletion

More information

Evaluating find a path reachability queries

Evaluating find a path reachability queries Evaluating find a path reachability queries Panagiotis ouros and Theodore Dalamagas and Spiros Skiadopoulos and Timos Sellis Abstract. Graphs are used for modelling complex problems in many areas, such

More information

CSIT5300: Advanced Database Systems

CSIT5300: Advanced Database Systems CSIT5300: Advanced Database Systems L08: B + -trees and Dynamic Hashing Dr. Kenneth LEUNG Department of Computer Science and Engineering The Hong Kong University of Science and Technology Hong Kong SAR,

More information

Hike: A High Performance knn Query Processing System for Multimedia Data

Hike: A High Performance knn Query Processing System for Multimedia Data Hike: A High Performance knn Query Processing System for Multimedia Data Hui Li College of Computer Science and Technology Guizhou University Guiyang, China cse.huili@gzu.edu.cn Ling Liu College of Computing

More information

Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order

Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order E. Chávez 1, K. Figueroa 1,2, and G. Navarro 2 1 Facultad de Ciencias Físico-Matemáticas, Universidad Michoacana, México.

More information

PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search

PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search Andrea Esuli andrea.esuli@isti.cnr.it Istituto di Scienza e Tecnologie dell Informazione A. Faedo Consiglio

More information

Speeding up Bulk-Loading of Quadtrees

Speeding up Bulk-Loading of Quadtrees Speeding up Bulk-Loading of Quadtrees 0 Gísli R. Hjaltason 1 Hanan Samet 1 Yoram J. Sussmann 2 Computer Science Department, Center for Automation Research, and Institute for Advanced Computer Studies,

More information

I/O-Algorithms Lars Arge Aarhus University

I/O-Algorithms Lars Arge Aarhus University I/O-Algorithms Aarhus University April 10, 2008 I/O-Model Block I/O D Parameters N = # elements in problem instance B = # elements that fits in disk block M = # elements that fits in main memory M T =

More information

Novel Materialized View Selection in a Multidimensional Database

Novel Materialized View Selection in a Multidimensional Database Graphic Era University From the SelectedWorks of vijay singh Winter February 10, 2009 Novel Materialized View Selection in a Multidimensional Database vijay singh Available at: https://works.bepress.com/vijaysingh/5/

More information

Notes on Binary Dumbbell Trees

Notes on Binary Dumbbell Trees Notes on Binary Dumbbell Trees Michiel Smid March 23, 2012 Abstract Dumbbell trees were introduced in [1]. A detailed description of non-binary dumbbell trees appears in Chapter 11 of [3]. These notes

More information

DeLiClu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking

DeLiClu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking In Proc. 10th Pacific-Asian Conf. on Advances in Knowledge Discovery and Data Mining (PAKDD'06), Singapore, 2006 DeLiClu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information