Finding the k-closest pairs in metric spaces


Hisashi Kurasawa, The University of Tokyo, Hitotsubashi, Chiyoda-ku, Tokyo, Japan
Atsuhiro Takasu, National Institute of Informatics, Hitotsubashi, Chiyoda-ku, Tokyo, Japan
Jun Adachi, National Institute of Informatics, Hitotsubashi, Chiyoda-ku, Tokyo, Japan

ABSTRACT

We investigated the problem of reducing the cost of searching for the k closest pairs in metric spaces. In general, a k-closest pair search method initializes the upper bound distance on the kth closest pair to infinity and repeatedly updates it whenever it finds pairs of objects whose distances are shorter than that bound. Furthermore, it prunes dissimilar pairs whose distances are estimated to be longer than the upper bound distance, based on the distances from a pivot to the objects and the triangle inequality. The cost of a k-closest pair query is smaller for a shorter upper bound distance and a sparser distribution of distances between the pivot and the objects. We propose a new divide-and-conquer-based k-closest pair search method in metric spaces, called Adaptive Multi-Partitioning (AMP). AMP repeatedly divides and conquers objects from the sparser part of the distance distribution and speeds up the convergence of the upper bound distance before partitioning the denser part. As a result, AMP can prune many more dissimilar pairs than ordinary divide-and-conquer-based methods. We compare our method with other partitioning methods and show that AMP reduces the number of distance computations.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems, multimedia databases; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing, indexing methods

1. INTRODUCTION

It is important to be able to enumerate similar pairs of objects from a data set. This has various applications such as record linkage, data mining, multimedia databases, and geographical information systems. There are two similar-object-pair enumeration problems: the similarity join and k-closest pair finding. The former finds pairs of objects whose distances are shorter than a specified upper bound, and the latter finds the top-k closest object pairs from a data set for a given number k. A similarity join query may return no pairs, or far too many pairs, when it is difficult to define an appropriate upper bound distance. Even in such a case, the k-closest pair query answers a fixed number of closest pairs. We propose a fast method for the k-closest pair query in metric spaces.

The k-closest pair query can be solved using a nested-loop method. However, this naive method requires $\binom{N}{2}$ distance computations, where $N$ is the number of objects. Such a high search cost leads to poor scalability. A divide-and-conquer method reduces the cost to $O(N (\log N)^{d-1})$ for $d$-dimensional Euclidean spaces.
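Before turning to the divide-and-conquer steps, the following minimal sketch (ours, not the authors' code) shows the naive nested-loop baseline together with the upper-bound mechanism described in the abstract: a max-heap keeps the k closest pairs found so far, and the bound u, the distance of the kth closest pair, only shrinks as the search proceeds. Here `dist` can be any metric, and all names are ours.

    import heapq
    from itertools import combinations

    def nested_loop_k_closest_pairs(objects, k, dist):
        """Naive k-closest-pair search: examine all C(N, 2) pairs while
        maintaining u, the distance of the k-th closest pair found so far."""
        heap, count, u = [], 0, float("inf")
        for x, y in combinations(objects, 2):
            d = dist(x, y)
            if d >= u:
                continue                      # cannot improve the current k pairs
            count += 1                        # unique tie-breaker for the heap
            heapq.heappush(heap, (-d, count, x, y))
            if len(heap) > k:
                heapq.heappop(heap)           # drop the current farthest pair
            if len(heap) == k:
                u = -heap[0][0]               # distance of the k-th closest pair
        pairs = [(x, y, -nd) for nd, _, x, y in sorted(heap, reverse=True)]
        return pairs, u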
The divide-and-conquer method consists of the following steps.

Divide: Partition a region into two sub-regions using a hyperplane that perpendicularly crosses an axis at the mean value.

Conquer: Recursively search for closest pairs in each sub-region and update the upper bound distance to the kth closest pair.

Combine: Find closest pairs with one object in each sub-region that lie within the upper bound distance of the partitioning boundary.

Unfortunately, the divide-and-conquer method is more complicated for metric spaces than for Euclidean spaces. Coordinates are not defined in metric spaces and cannot be used for partitioning a region. Instead, ball partitioning is generally used [8, 9]. Ball partitioning selects one object as a pivot and divides a region based on the distance from the pivot. It requires N - 1 distance computations. Thus, we need to reduce the computational cost in the divide step as well as in the conquer step. Furthermore, the distribution of distances between the pivot and the objects is often skewed: many objects tend to reside near the partitioning boundary when ball partitioning divides a region into two sub-regions at the median distance. We should therefore also take the distance distribution into account. However, existing methods based on ball partitioning do not address these issues. We developed a new partitioning method that can prune more objects with fewer pivots than existing methods by using the distance distribution. We propose a new divide-and-conquer-based k-closest pair search method in metric spaces, called Adaptive Multi-Partitioning (AMP). AMP is based on the following observations:

- The divide-and-conquer method works well for multi-partitioning in which the intervals exceed the upper bound distance on the kth closest pair.
- It is more difficult to prune dissimilar objects in regions where the distances from the pivot are dense than in regions where they are sparse.
- The upper bound distance u is updated and decreases whenever pairs of objects whose distances are shorter than u are found. A shorter upper bound distance can prune more dissimilar objects.

AMP iteratively and recursively divides a region using a sparse-region-first strategy. For a sparse region, it calculates the k closest pairs. It then sets the interval of the next region to the maximum distance among the currently obtained k closest pairs. As a result, the space is partitioned multiple times for each pivot, and the upper bound distance is expected to decrease before the dense regions are searched.

The contributions of this paper are as follows. We propose a novel k-closest pair search method that reduces the number of distance computations. The method uses only the triangle inequality, so it can handle any distance function that satisfies the metric space postulates. We conducted experiments on several real datasets and demonstrated that our method outperforms other methods.

The rest of this paper is organized as follows. Section 2 formally defines the problem we focus on. Section 3 overviews related work. Section 4 describes AMP in detail. Section 5 shows the experimental results, and Section 6 concludes the paper.

2. PROBLEM STATEMENT

Our k-closest pair search method deals with metric spaces, which are defined as follows.

Definition 1 (Metric space). Let $M = (\mathcal{D}, d)$ be a metric space defined by a domain of objects $\mathcal{D}$ and a distance function $d: \mathcal{D} \times \mathcal{D} \to \mathbb{R}$. $M$ satisfies the following postulates [17]:

$\forall x, y \in \mathcal{D}: d(x, y) \ge 0$ (non-negativity)
$\forall x, y \in \mathcal{D}: d(x, y) = d(y, x)$ (symmetry)
$\forall x, y \in \mathcal{D}: x = y \Leftrightarrow d(x, y) = 0$ (identity)
$\forall x, y, z \in \mathcal{D}: d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality)   (1)

Examples of the distance function d are the Minkowski distances, Jaccard's coefficient, the Hausdorff distance, and the edit distance. A k-closest pair query is defined as follows.

Definition 2 (k-closest pair query). Given a set of objects $S \subseteq \mathcal{D}$, a k-closest pair query with threshold k returns an object-pair set $A$ and an upper bound distance $u$ that satisfy

$A \subseteq S \times S$,
$|A| = k$,
$\forall (x, y) \in A, \forall (a, b) \in (S \times S) \setminus A: d(x, y) \le d(a, b)$,
$u = \max \{d(x, y) \mid (x, y) \in A\}$.   (2)

That is, A consists of the k most similar object pairs from S. We focus on the self k-closest pair query case. The k-closest pair query is sometimes called a top-k similarity join query or kCPQ. Table 1 summarizes the symbols used in this article.

Table 1: Notation
  M              metric space
  D              domain of objects
  d              distance function
  k              query parameter of the k-closest pair method
  u              upper bound distance on the kth closest pair
  q              query object
  o, a, b, x, y  objects
  p              pivot
  S, X, Y        object sets
  A              (temporary) result set

3. RELATED WORK

We start with pruning techniques in metric spaces. Because coordinates are not explicitly defined in metric spaces, all pruning techniques use the distance from a pivot to each object and the triangle inequality. A simple partitioning technique is ball partitioning, which uses only one pivot and divides a region into two sub-regions based on the distance from each object to the pivot [15]. It is used to construct similarity search indexes for range searches and k-nearest neighbor searches [7, 5].
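For illustration only (this sketch is ours, not taken from any of the cited papers), plain ball partitioning can be written as follows; `dist` is the metric, and the median of the pivot distances is used as the partitioning radius.

    import random

    def ball_partition(objects, dist):
        """Plain ball partitioning: split a set into an inner and an outer
        region at the median distance from a randomly chosen pivot.
        Assumes at least two objects."""
        pivot = random.choice(objects)
        rest = [o for o in objects if o is not pivot]
        pairs = [(dist(pivot, o), o) for o in rest]
        pairs.sort(key=lambda t: t[0])
        r_p = pairs[len(pairs) // 2][0]              # median radius
        inner = [o for d, o in pairs if d <= r_p]
        outer = [o for d, o in pairs if d > r_p]
        return pivot, r_p, inner, outer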
Quickjoin [9] modifies ball partitioning for similarity join searches as follows.

Definition 3 (Modified ball partitioning). For a metric space $M = (\mathcal{D}, d)$, suppose a pivot p divides a set X of objects in $\mathcal{D}$ into two regions

$X_1 = \{o \in X \mid d(o, p) \le r_p\}$,   (3)
$X_2 = \{o \in X \mid d(o, p) > r_p\}$.   (4)

Furthermore, consider the following subsets:

$X'_1 = \{o \in X_1 \mid d(o, p) > r_p - u\}$,   (5)
$X'_2 = \{o \in X_2 \mid d(o, p) \le r_p + u\}$,   (6)

where u is an upper bound distance. When searching for object pairs in X whose distance is shorter than u, it is sufficient to check object pairs within $X_1$, within $X_2$, and in $X'_1 \times X'_2$. Similarly, suppose p also divides an object set Y in $\mathcal{D}$:

$Y_1 = \{o \in Y \mid d(o, p) \le r_p\}$,   (7)
$Y_2 = \{o \in Y \mid d(o, p) > r_p\}$,   (8)
$Y'_1 = \{o \in Y_1 \mid d(o, p) > r_p - u\}$,   (9)
$Y'_2 = \{o \in Y_2 \mid d(o, p) \le r_p + u\}$.   (10)

When searching for object pairs in $X \times Y$ whose distance is shorter than u, it is sufficient to check object pairs in $X_1 \times Y_1$, $X_2 \times Y_2$, $X'_1 \times Y'_2$, and $X'_2 \times Y'_1$.
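A minimal sketch of the split in Definition 3 (our own illustration; the set names mirror the definition, and `dist`, `r_p`, and `u` are assumed to be given):

    def modified_ball_partition(X, pivot, r_p, u, dist):
        """Split X at radius r_p around the pivot, and also keep the window
        subsets that lie within distance u of the partitioning boundary."""
        X1 = [o for o in X if dist(o, pivot) <= r_p]         # inner region
        X2 = [o for o in X if dist(o, pivot) > r_p]          # outer region
        X1w = [o for o in X1 if dist(o, pivot) > r_p - u]    # X'_1: inner window
        X2w = [o for o in X2 if dist(o, pivot) <= r_p + u]   # X'_2: outer window
        # Pairs closer than u can only occur inside X1, inside X2,
        # or between the two windows X1w and X2w.
        return X1, X2, X1w, X2w

Only the two full regions and the pair of boundary windows need further processing, which is what allows dissimilar cross-boundary pairs to be pruned.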

Modified ball partitioning can prune more dissimilar object pairs when the upper bound distance is smaller and when fewer objects lie near the partitioning boundary.

Now let us briefly survey k-closest pair search methods. Because the similarity join is very similar to the k-closest pair query, we overview studies on both problems. Previously proposed similarity join methods in metric spaces fall into two categories. The first is the index-based approach. A method of this kind constructs an index and inserts all the objects in the dataset into it. For each object, it then searches for other objects whose distances are within the query range by the same procedure as a range search on the index. All existing index-based studies focus on improving the index. While general metric indexes are designed to deal with arbitrary distances in range queries [10, 18, 11], indexes for similarity join queries assume fixed-distance range queries. eD-Index [8] is an extension of D-Index [7]. It differs from D-Index in that, while dividing a region into sub-regions as in modified ball partitioning, it replicates objects whose distances from the partitioning boundary are within the query range. By using this replication technique, eD-Index can execute the similarity join query independently within each separated region. The list of twin clusters (LTC) [13] is another extension of similarity search indexes. LTC is based on the list of clusters (LC) [5]. It is designed to resolve range queries, similarity joins, and k-closest pair queries between two datasets. It builds two lists of overlapping clusters and a distance matrix between the cluster centers of the two datasets, and it uses the matrix to prune objects based on the triangle inequality. For searching the k closest pairs, LTC creates a heap that stores the temporary k closest object pairs and their distances. It also maintains a variable upper bound distance and initializes it to $\infty$. When the size of the heap exceeds k, it removes the longest-distance pair from the heap and sets the upper bound distance to the distance of the kth closest pair. It uses this upper bound distance as the similarity join query range and finds the closest object pairs by updating the heap and decreasing the upper bound distance. Unlike eD-Index, it does not tune the query range. LTC is a general-purpose index for similarity searches between two datasets.

The second category is the divide-and-conquer-based approach. Quickjoin [9] recursively divides and conquers an object set into subsets based on the distances from a pivot by using modified ball partitioning. It prunes all the object pairs across two subsets if the distance between the subsets exceeds the query range. The partitioning boundary of modified ball partitioning is the distance between the pivot and a randomly selected object. As a result, many objects tend to reside near the partitioning boundary, which makes it difficult to prune dissimilar object pairs.

Most k-closest pair methods are designed for a specific data structure. They mainly focus on pruning and search ordering. The top-k set similarity join [16] handles sets; it uses token-based pruning and avoids repeated verification. Approximate k-closest pairs with space-filling curves (ASP) [3] is an approximate k-closest pair method for high-dimensional vectors; it uses Hilbert space-filling curves.
An index-based method for spatial databases [6] prunes object pairs by using an R-tree. These methods are tied to their specific spaces and cannot be extended to general metric spaces. Only a few papers discuss k-closest pair queries in metric spaces. Furthermore, those papers do not exploit the distribution of distances between the pivot and the objects when determining the partitioning distances. By using the distance distribution in the partitioning technique, it may be possible to reduce the number of pivots and improve the pruning effect, which reduces the cost of distance computations between pivots and objects as well as among the objects themselves. We therefore developed a new partitioning technique that uses the distance distribution. Compared with the index-based approach, the divide-and-conquer-based approach is better suited to self-join queries because, as the authors of LTC mention, there is no reason to build an index for a single query. Thus, we adopted the divide-and-conquer-based approach. To the best of our knowledge, LTC is the only previously proposed method that supports the k-closest pair query in metric spaces.

4. ADAPTIVE MULTI-PARTITIONING

AMP is a divide-and-conquer-based k-closest pair search method for metric spaces. It uses multi-ball partitioning to reduce the computational cost in the divide step. Moreover, it uses the convergence of the upper bound distance and the distance distribution in the partitioning procedure, especially for determining the interval size of the partitioning.

4.1 Multi-ball partitioning

The idea of multi-partitioning has been used in various studies for specific spaces, such as the ε-kdB tree [14]. Multi-ball partitioning is defined as follows.

Definition 4 (Multi-ball partitioning). For a metric space $M = (\mathcal{D}, d)$, suppose a pivot p divides an object set X in $\mathcal{D}$ into the regions

$X_0 = \{o \in X \mid d(o, p) < t_0\}$,
$X_i = \{o \in X \mid t_{i-1} \le d(o, p) < t_i\}$ for $1 \le i \le n$,   (11)

where $t_i$ ($0 \le i \le n$) are the partitioning distances of p. For an upper bound distance u, suppose the inequality

$t_{i+1} - t_i \ge u$   (12)

holds. Then, when searching for object pairs in X whose distance is shorter than u, it is sufficient to check the object pairs $\{(a, b) \mid a \in X_i, b \in X_j, |i - j| \le 1\}$. Similarly, suppose p divides an object set Y in $\mathcal{D}$ into the regions

$Y_0 = \{o \in Y \mid d(o, p) < t_0\}$,
$Y_i = \{o \in Y \mid t_{i-1} \le d(o, p) < t_i\}$ for $1 \le i \le n$.   (13)

When searching for object pairs in $X \times Y$ whose distance is shorter than u, it is sufficient to check the object pairs $\{(a, b) \mid a \in X_i, b \in Y_j, |i - j| \le 1\}$.
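A small sketch of this banding (our own illustration, assuming `thresholds` holds $t_0, \ldots, t_n$ in increasing order and `dist` is the metric):

    import bisect

    def multi_ball_partition(X, pivot, thresholds, dist):
        """Multi-ball partitioning: place each object into the band
        X_i = {o | t_{i-1} <= d(o, p) < t_i} around the pivot; X_0 is the
        innermost band.  Objects at distance >= t_n are ignored here."""
        bands = [[] for _ in thresholds]                 # bands X_0 .. X_n
        for o in X:
            i = bisect.bisect_right(thresholds, dist(o, pivot))
            if i < len(bands):
                bands[i].append(o)
        return bands

    def candidate_band_pairs(bands):
        """If every band is at least u wide, pairs closer than u can only
        lie within one band or between adjacent bands (|i - j| <= 1)."""
        for i, band in enumerate(bands):
            yield band, band                             # within-band pairs
            if i + 1 < len(bands):
                yield band, bands[i + 1]                 # adjacent-band pairs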

Figure 1: Adaptive Multi-Partitioning.

Multi-ball partitioning can be adapted to the k-closest pair search if the upper bound distance u exceeds the distance of the kth closest pair. Thus, we should consider the intervals between the partitioning distances. In general, a k-closest pair search method initializes the upper bound distance u to infinity and repeatedly updates u whenever it finds pairs of objects whose distances are shorter than u. That is, u converges to the distance of the kth closest pair as the search proceeds. Our idea is to adjust the partitioning distances $t_i$ by taking this convergence into consideration.

4.2 Partitioning Procedure

Because pivots prune dissimilar object pairs based on the triangle inequality, the distance density with respect to a pivot and the upper bound distance both affect the pruning performance. Objects in a region that is sparse with respect to the distance from the pivot can be pruned more effectively, and a shorter upper bound distance can prune more dissimilar objects. We therefore focus on the convergence of the upper bound distance and on the distance distribution. We believe that dissimilar object pairs in the dense part of the distance distribution should be pruned with an upper bound distance that has already converged. Thus, AMP searches for closest pairs in the sparse part of the distance distribution before the dense part. To detect the shape of the distance distribution, AMP calculates the skewness s of the distance density. Skewness is a measure of the asymmetry of a distribution and is defined as

$s = E\left[ \left( \frac{\chi - \mu}{\sigma} \right)^{3} \right]$,   (14)

where $\chi$ is a random variable, $\mu$ is the mean, $\sigma$ is the standard deviation, and E is the expectation operator. A negative skew of the distance density indicates that objects near the pivot are sparse, so AMP applies divide-and-conquer operations from the near side to the far side of the pivot. Conversely, a positive skew indicates that objects near the pivot are dense, so AMP divides and conquers in the opposite direction.
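A small sketch (ours, with hypothetical helper names) of how the sample skewness of the pivot distances can be computed and turned into a sweep direction:

    def skewness(values):
        """Sample skewness of a non-empty list of pivot distances (Eq. 14)."""
        n = len(values)
        mu = sum(values) / n
        sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
        if sigma == 0.0:
            return 0.0
        return sum(((v - mu) / sigma) ** 3 for v in values) / n

    def sweep_direction(pivot_distances):
        """Sparse-region-first strategy: a negative skew means the near side
        of the pivot is sparse, so sweep from near to far; a positive skew
        means the near side is dense, so sweep from far to near."""
        return "near_to_far" if skewness(pivot_distances) < 0 else "far_to_near"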
Figure 1 shows the concept of AMP. AMP searches for the k closest pairs on two given object sets X and Y ($|X| \ge |Y|$) by performing the following steps. For a metric space $M = (\mathcal{D}, d)$, AMP first creates a heap A that stores the temporary k closest object pairs and their distances. It also maintains a variable upper bound distance u and initializes it to $\infty$. When the size of the heap exceeds k, AMP removes the longest-distance pair from the heap and sets u to the distance of the kth closest pair. AMP also manages a reference distance $Ref_{o,S}$ for each object o in an object set S; the initial value of $Ref_{o,S}$ is nil. Let $t_i$ denote the ith partitioning distance. AMP then recursively divides and conquers the given object sets X and Y, which we call AMP(X, Y). In the following procedure, expr1 ? expr2 : expr3 means to evaluate expr2 if expr1 is true and expr3 otherwise, as in the C language. A simplified code sketch of this procedure is given after the step list.

1. Remove from X and Y every object o for which $Ref_{o,S} \ne \mathrm{nil}$ and $Ref_{o,S} > u$ hold ($S = X, Y$).
2. If $\min\{|X|, |Y|\} \le 3$ holds, search for closest pairs by nested loops and return.
3. Randomly choose a pivot object p from X, and remove p from X.
4. Calculate the distance from p to each object in $X \cup Y$.
5. Set $d_{\min}$, $d_{\max}$, $d_{\mathrm{mean}}$, and s to the minimum, maximum, mean, and skewness of these distances, respectively.
6. Update A and u whenever an object pair $(o_i, p)$ with $o_i \in Y$ and $d(o_i, p) < u$ is found.
7. Set $t_0$ to (s < 0 ? $d_{\min}$ : $d_{\max}$).
8. Set $t_1$ to (s < 0 ? $\min\{t_0 + u, d_{\mathrm{mean}}\}$ : $\max\{t_0 - u, d_{\mathrm{mean}}\}$).
9. Set $S_1$ to (s < 0 ? $\{o \in S \mid t_0 \le d(o, p) < t_1\}$ : $\{o \in S \mid t_1 < d(o, p) \le t_0\}$) ($S = X, Y$).
10. Call AMP($X_1$, $Y_1$).
11. While (s < 0 ? $t_i < d_{\max}$ : $t_i > d_{\min}$) holds:
   (a) Set $t_{i+1}$ to (s < 0 ? $t_i + u$ : $t_i - u$).
   (b) Set $S_{i+1}$ to (s < 0 ? $\{o \in S \mid t_i \le d(o, p) < t_{i+1}\}$ : $\{o \in S \mid t_{i+1} < d(o, p) \le t_i\}$) ($S = X, Y$).
   (c) Call AMP($X_{i+1}$, $Y_{i+1}$).
   (d) Set $S'_i$ to $\{o \in S_i \mid |t_i - d(p, o)| < u\}$ ($S = X, Y$).
   (e) For all $o \in S'_i$, set $Ref_{o,S'_i} = \min\{Ref_{o,S}, |t_i - d(p, o)|\}$ ($S = X, Y$).
   (f) Set $S'_{i+1}$ to $\{o \in S_{i+1} \mid |t_i - d(p, o)| < u\}$ ($S = X, Y$).
   (g) For all $o \in S'_{i+1}$, set $Ref_{o,S'_{i+1}} = \min\{Ref_{o,S}, |t_i - d(p, o)|\}$ ($S = X, Y$).
   (h) Call AMP($X'_i$, $Y'_{i+1}$) and AMP($X'_{i+1}$, $Y'_i$).
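The following is a much simplified, self-contained Python sketch of the self-join case (our own illustration, not the authors' implementation). It keeps the core ideas: one pivot per recursion level, a sweep over distance bands starting from the sparse side chosen by the skewness sign, a band width equal to the current upper bound u, and a cross-check restricted to the windows around each band boundary. The reference-distance bookkeeping of steps (d)-(g) is collapsed into that direct window check, the skewness is recomputed inline for self-containedness, and all names are ours.

    import heapq

    class PairHeap:
        """Max-heap of the k closest pairs found so far.  u() returns the
        distance of the k-th closest pair, or infinity while fewer than k
        pairs have been collected."""
        def __init__(self, k):
            self.k, self.heap, self.n = k, [], 0
        def offer(self, a, b, d):
            if d < self.u():
                self.n += 1                       # unique tie-breaker for the heap
                heapq.heappush(self.heap, (-d, self.n, a, b))
                if len(self.heap) > self.k:
                    heapq.heappop(self.heap)      # drop the current farthest pair
        def u(self):
            return -self.heap[0][0] if len(self.heap) == self.k else float("inf")
        def pairs(self):
            return [(a, b, -nd) for nd, _, a, b in sorted(self.heap, reverse=True)]

    def amp_sketch(objects, k, dist, result=None):
        """Simplified self-join sketch of AMP's sparse-region-first sweep."""
        result = result if result is not None else PairHeap(k)
        if len(objects) <= 3:                     # tiny set: plain nested loops
            for i, a in enumerate(objects):
                for b in objects[i + 1:]:
                    result.offer(a, b, dist(a, b))
            return result
        pivot, rest = objects[0], objects[1:]     # the paper chooses the pivot at random
        dp = {id(o): dist(pivot, o) for o in rest}
        for o in rest:                            # pairs that contain the pivot
            result.offer(pivot, o, dp[id(o)])
        vals = list(dp.values())                  # inline sample skewness (Eq. 14)
        mu = sum(vals) / len(vals)
        sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
        skew = sum(((v - mu) / sd) ** 3 for v in vals) / len(vals)
        sgn = 1.0 if skew < 0 else -1.0           # +1: sweep near-to-far, -1: far-to-near
        order = sorted(rest, key=lambda o: sgn * dp[id(o)])   # sparse side first
        prev_band, prev_cut, i = [], None, 0
        while i < len(order):
            width = result.u()                    # band width = current upper bound u
            start = sgn * dp[id(order[i])]
            cut = min(start + width, sgn * mu) if prev_cut is None else start + width
            band = []
            while i < len(order) and (not band or sgn * dp[id(order[i])] <= cut):
                band.append(order[i])
                i += 1
            amp_sketch(band, k, dist, result)     # conquer this band; u may shrink
            if prev_band:                         # combine across the previous boundary:
                w = result.u()                    # only the windows of width u matter
                win_a = [o for o in prev_band if abs(dp[id(o)] - prev_cut) < w]
                win_b = [o for o in band if abs(dp[id(o)] - prev_cut) < w]
                for a in win_a:
                    for b in win_b:
                        result.offer(a, b, dist(a, b))
            prev_band, prev_cut = band, sgn * cut
        return result

    if __name__ == "__main__":
        import random
        pts = [[random.random() for _ in range(8)] for _ in range(2000)]
        def euclid(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        print(amp_sketch(pts, 10, euclid).pairs()[:3])

For the two-set query AMP(X, Y), the same sweep would be applied to both sets with a shared set of boundaries, as in steps 9 to 11 above.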

We can solve the k-closest pair problem in the case of X = Y with a minor modification; for brevity we omit it.

4.3 Search Cost

The computational cost is the number of distance computations between objects. Let X and Y be the given object sets. The computational cost of AMP(X, Y) is

$\mathrm{AMP}(X, Y) = \underbrace{|X| + |Y| - 1}_{\text{divide step}} + \underbrace{\sum_i \mathrm{AMP}(X_i, Y_i)}_{\text{conquer step}} + \underbrace{\sum_{\{(i,j) \mid |i-j|=1\}} \mathrm{AMP}(X_i, Y_j)}_{\text{combine step}}$,   (15)

where AMP(·, ·) denotes the cost of the k-closest pair query. When X is equal to Y, the cost of AMP(X) is

$\mathrm{AMP}(X) = |X| - 1 + \sum_i \mathrm{AMP}(X_i) + \sum_{\{(i,j) \mid |i-j|=1\}} \mathrm{AMP}(X_i, X_j)$.   (16)

5. EVALUATION

We evaluated the computational cost of finding the k closest pairs on real datasets.

5.1 Outline of Experiments

We used the following methods in the experiment.

- AMP is our k-closest pair search method. It uses the sparse-region-first strategy.
- AMP in reverse order is a comparative variant of AMP. It uses a dense-region-first strategy.
- Binary partitioning is a k-closest pair search method based on modified ball partitioning.
- Nested loops is the naive k-closest pair method, which does not prune any objects and computes the distances of all pairs of objects in the dataset.

We used the following real datasets, all of which are available on the Web. These datasets have been used in many recent related studies [12, 9, 4]. We removed duplicate objects from each dataset.

- NASA [1] is a set of feature vectors made by NASA. It consists of 40,150 vectors in a 20-dimensional feature space. The vectors were compared using the Euclidean distance.
- Corel image features [2] consists of color histogram vectors generated from the Corel image collection. It consists of 68,040 vectors in a 32-dimensional space. The vectors were compared using the Euclidean distance.
- Color histogram [1] consists of color histograms of 112,544 images, represented by vectors in a 112-dimensional space. The vectors were compared using the Euclidean distance.

Figure 3 shows the distance density of each dataset, and Table 2 lists the properties of the distances between the objects in each dataset. Note that NASA has the lowest-dimensional feature space and the lowest skewness together with a wide range of distances, while Color histogram has the highest skewness.

Table 2: Real datasets
                NASA        Corel image features   Color histogram
  Distance      Euclidean   Euclidean              Euclidean
  Dimension     20          32                     112
  Average       -           -                      -
  Variance      -           -                      -
  Skewness      -           -                      -
  Kurtosis      -           -                      -

We implemented AMP and the comparative methods on the Metric Space Library [1], which is written in C. We conducted the experiment on a Linux PC equipped with an Intel Quad-Core Xeon CPU and 64 GB of memory. The library and our code were compiled with GCC 4.4. All datasets were processed in memory for all examined methods.

5.2 Computational Cost

We evaluated how AMP reduces the search cost with respect to the query size. We measured the number of distance computations for k-closest pair queries with k ranging from 1 to 100,000. Each result is the average over 500 queries on its dataset. Figure 2 shows the computational cost. The vertical axis is the number of distance computations for a k-closest pair query divided by the value for Nested loops, i.e., $N(N-1)/2$, where N is the number of objects in the dataset. The horizontal axis is k. None of the methods require an index, so this result shows the total computational cost of the search. A lower percentage indicates a lower computational cost. We can see that AMP reduces the computational cost.
These results show that multi-partitioning prunes more dissimilar object pairs than binary partitioning. Furthermore, AMP works better than AMP in reverse order in all cases, which indicates that the sparse-region-first strategy is better than the dense-region-first strategy. In particular, AMP requires far fewer distance computations than the other methods on the Color histogram dataset. This suggests that the skewness of the Color histogram is large and that the sparse-region-first strategy works well for skewed datasets.

6. CONCLUSION

We investigated the problem of the k-closest pair query in metric spaces. We proposed an efficient k-closest pair search method that prunes dissimilar object pairs based on the triangle inequality. The method repeatedly divides and conquers the objects from the sparser space and speeds up the convergence of the upper bound distance before partitioning the denser space. We are currently conducting experiments using synthetic datasets and theoretically analyzing the performance of our method in detail.

Figure 2: Computational cost. (a) NASA, (b) Corel image features, (c) Color histogram.
Figure 3: Distance density. (a) NASA, (b) Corel image features, (c) Color histogram.

7. REFERENCES

[1] Metric Space Library.
[2] UCI KDD Archive.
[3] F. Angiulli and C. Pizzuti. Approximate k-closest-pairs in large high-dimensional data sets. Journal of Mathematical Modelling and Algorithms, 4(2).
[4] B. Bustos and G. Navarro. Improving the space cost of k-NN search in metric spaces by using distance estimators. Multimedia Tools and Applications, 41(2).
[5] E. Chávez and G. Navarro. A compact space decomposition for effective metric indexing. Pattern Recognition Letters, 24(9).
[6] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. Algorithms for processing k-closest-pair queries in spatial databases. Data & Knowledge Engineering, 49(1):67-104.
[7] V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-Index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1):9-33.
[8] V. Dohnal, C. Gennaro, and P. Zezula. Similarity join in metric spaces using eD-Index. In DEXA.
[9] E. H. Jacox and H. Samet. Metric space similarity joins. ACM Transactions on Database Systems, 33(2):1-38.
[10] H. V. Jagadish, B. C. Ooi, K. L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, 30(2).
[11] H. Kurasawa, D. Fukagawa, A. Takasu, and J. Adachi. Maximal metric margin partitioning for similarity search indexes. In CIKM.
[12] G. Navarro and N. Reyes. Dynamic spatial approximation trees. Journal of Experimental Algorithmics, 12:1-68.
[13] R. Paredes and N. Reyes. Solving similarity joins and range queries in metric spaces with the list of twin clusters. Journal of Discrete Algorithms, 7(1):18-35.
[14] K. Shim, R. Srikant, and R. Agrawal. High-dimensional similarity joins. In ICDE.
[15] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4).
[16] C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE.
[17] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer-Verlag.
[18] Y. Zhuang, Y. Zhuang, Q. Li, L. Chen, and Y. Yu. Indexing high-dimensional data in dual distance spaces: a symmetrical encoding approach. In EDBT, 2008.
