Clustering Algorithms for Spatial Databases: A Survey


Erica Kolatch
Department of Computer Science, University of Maryland, College Park
CMSC 725, 3/25/01
kolatch@cs.umd.edu

1. Introduction

Spatial Database Systems (SDBS) are database systems designed to handle spatial data and the non-spatial information used to identify that data. SDBS are used for everything from geo-spatial data to bio-medical knowledge, and the number of such databases, and of their uses, is increasing rapidly. The amount of spatial data being collected is also increasing exponentially. The complexity of the data contained in these databases means that it is not possible for humans to completely analyze the data being collected. Data mining techniques have been used with relational databases to discover unknown information, searching for unexpected results and correlations. Extremely large databases require new techniques to analyze the data and discover these patterns. Traditional search algorithms can still answer questions about specific pieces of information, but they are no longer capable of finding previously unknown patterns in the data.

The remainder of the paper is organized as follows: Section 2 contains some basic definitions. Section 3 discusses the concept of spatial clustering. Section 4 explains and analyzes nine classic clustering algorithms. Section 5 synthesizes these algorithms in chart form. Section 6 looks at the present and the future and includes five additional clustering techniques. Section 7 concludes the paper.

2. Definitions

Spatial data describes information related to the space occupied by objects. The data consists of geometric information and can be either discrete or continuous. Discrete data might be a single point in multi-dimensional space; discrete spatial data differs from non-spatial data in that it has a distance attribute that is used to locate the data in space. Continuous data spans a region of space and might consist of medical images, map regions, or star fields [Sam94].

Spatial databases are database systems that manage spatial data. They are designed to handle both spatial information and the non-spatial attributes of that data. In order to provide efficient and effective access to spatial data it is necessary to develop indices. These indices are most successful when based on multi-dimensional trees; the structures proposed for them include quad trees, k-d trees, R-trees and R*-trees [Sam94].
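As a small, hedged illustration of the kind of multi-dimensional index these structures provide, the sketch below builds a k-d tree over synthetic two-dimensional points with SciPy and answers a range query and a nearest-neighbor query; the data, the query location, and the radius are invented for the example and are not taken from any particular SDBS.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(0, 100, size=(10_000, 2))   # 10,000 synthetic 2-d spatial objects

tree = cKDTree(points)                           # build the index once

# Range query: all objects within distance 5 of a query location.
query = np.array([50.0, 50.0])
hits = tree.query_ball_point(query, r=5.0)
print(len(hits), "objects within distance 5 of", query)

# Nearest-neighbor query: the 4 closest objects to the query location.
dist, idx = tree.query(query, k=4)
print("4 nearest neighbors at distances", np.round(dist, 2))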

Data mining, or knowledge discovery in databases (KDD), is the technique of analyzing data to discover previously unknown information. The goal is to reveal regularities and relationships that are non-trivial. This is accomplished through an analysis of the patterns that form in the data. Various algorithms have been developed to perform this analysis, but many of these techniques are not scalable to very large databases.

Spatial data mining differs from regular data mining in parallel with the differences between non-spatial data and spatial data. The attributes of a spatial object stored in a database may be affected by the attributes of the spatial neighbors of that object. In addition, spatial location, and implicit information about the location of an object, may be exactly the information that can be extracted through spatial data mining [Fay96].

In order to successfully explore the massive amounts of spatial data being collected it is necessary to develop database primitives to manipulate the data [EFKS00]. The indices developed for spatial databases are also necessary to provide effective search mechanisms. However, the very large size of spatial databases also requires additional techniques for manipulating and cleaning the data in order to prepare it for analysis. Three methods that have been proposed and developed to aid in the preparation of data are spatial characterization, spatial classification, and clustering.

Spatial characterization of an object is the description of the spatial and non-spatial attributes of the object that are typical of similar objects but not necessarily typical of the database as a whole [EFKS98]. To obtain a spatial characterization of an object it is necessary to look at both the properties of the object itself and the properties of its neighbors. The goal of spatial characterization is to discover a set of tuples in which a particular set of types appears with a frequency that is significantly different from the frequency in the database as a whole. However, if the neighborhood is very small, that is, there are very few targets, then spatial characterization may produce misleading results; the significant difference must therefore hold over a sufficiently large target neighborhood. It is interesting to note that an attribute may be significant in a limited neighborhood, but when the neighborhood is expanded the property may no longer be significant.

Spatial trend detection is the regular change of one or more non-spatial attributes while moving on the spatial plane from point x to point y. The regularity of the change is described by performing a regression analysis on the respective attribute values for the objects on the path. Algorithms for performing spatial characterization and spatial trend detection were developed and tested [EFKS98]. Spatial characterization costs increased significantly, O(n^3), as the size of the neighbor set increased, while trend detection was proportional to the number of neighbor operations. These rules provide groupings that can then be further mined for interesting results, although the results suggest that characterization would become prohibitively costly for very large sets.
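To make the regression step concrete, the following is a minimal, hypothetical sketch of spatial trend detection: it regresses a synthetic non-spatial attribute against distance from a chosen start object and reads the fitted slope as the trend. The data, the start point, and the attribute are invented for the example.

import numpy as np

rng = np.random.default_rng(1)
locations = rng.uniform(0, 10, size=(200, 2))     # object positions
start = np.array([5.0, 5.0])                      # object the trend starts from

# Synthetic attribute that decreases with distance from the start object,
# plus noise (e.g. an attribute falling off as one moves away from a center).
distance = np.linalg.norm(locations - start, axis=1)
attribute = 100.0 - 6.0 * distance + rng.normal(0, 3, size=distance.shape)

# Least-squares fit: attribute ~ slope * distance + intercept.
slope, intercept = np.polyfit(distance, attribute, deg=1)
print(f"trend: {slope:.2f} attribute units per unit of distance")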

However, the most interesting and well-developed method of manipulating and cleaning spatial data in order to prepare it for spatial data mining analysis is clustering.

3. Spatial Clustering

Clustering, as applied to large datasets, is the process of creating groups of objects organized on some similarity among the members. In spatial data sets, clustering permits a generalization of the spatial component that allows for successful data mining. There are a number of different methods proposed for performing clustering, but the three main divisions are partitional clustering, hierarchical clustering, and locality-based clustering. Partitional clustering develops a partition of the data such that objects in a cluster are more similar to each other than they are to objects in other clusters; the k-means and k-medoid methods are forms of partitional clustering. Hierarchical clustering performs a sequence of partitioning operations. These can be done bottom-up, performing repeated amalgamations of groups of data until some pre-defined threshold is reached, or top-down, recursively dividing the data until some pre-defined threshold is reached. Hierarchical clustering is frequently used in document and text analysis, and grid-based clusterings are also hierarchical. Locality-based clustering algorithms group objects based on local relationships, and therefore the entire database can be scanned in one pass. Some locality-based algorithms are density-based, while others assume a random distribution.

Although there are similarities between spatial and non-spatial clustering, large databases, and spatial databases in particular, have unique requirements that create special needs for clustering algorithms:

1. An obvious need, considering the large amount of data to be handled, is that algorithms be efficient and scalable.
2. Algorithms need to be able to identify irregular shapes, including those with lacunae or concave sections and nested shapes (see Figure 1).
3. The clustering mechanism should be insensitive to large amounts of noise.
4. Algorithms should not be sensitive to the order of input; that is, clustering results should be independent of data order.
5. No a-priori knowledge of the data or the number of clusters to be created should be required, and therefore no domain knowledge input should be required from the user.
6. Algorithms should handle data with large numbers of features, that is, higher dimensionality.

Figure 1

Over time, a number of clustering algorithms have been developed. Some of these are evolutionary, enhancements of previously developed work; others are revolutionary, introducing new concepts and methods. PAM (Partitioning Around Medoids) [KR90] uses k-medoid clustering to identify clusters. It works efficiently on small data sets, but is extremely costly for larger ones. This led to the development of CLARA. CLARA (Clustering Large Applications) [KR90] creates multiple samples of the data set and then applies PAM to each sample. CLARA chooses the best clustering as the output, basing quality on the similarity and dissimilarity of objects in the entire set, not just the samples.
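The sketch below is a toy illustration of the PAM/CLARA relationship under stated assumptions: an exhaustive k-medoid search (standing in for PAM, and only feasible because the samples are tiny) is run on several random samples, and the medoid set that scores best on the whole data set is kept, which is the CLARA idea. All function names, parameter values, and data are hypothetical.

from itertools import combinations
import numpy as np

def clustering_cost(points, medoids):
    # Total distance from every point to its nearest medoid.
    d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam_like(sample, k):
    # Exhaustive k-medoid search on a small sample (a stand-in for PAM).
    best_idx, best_cost = None, np.inf
    for idx in combinations(range(len(sample)), k):
        cost = clustering_cost(sample, sample[list(idx)])
        if cost < best_cost:
            best_idx, best_cost = idx, cost
    return sample[list(best_idx)]

def clara_like(points, k=3, num_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(num_samples):
        sample = points[rng.choice(len(points), sample_size, replace=False)]
        medoids = pam_like(sample, k)
        cost = clustering_cost(points, medoids)   # judge each candidate on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids

data = np.random.default_rng(2).normal(size=(500, 2))
print(clara_like(data, k=3))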

One of the first clustering algorithms specifically designed for spatial databases was CLARANS [NH94], which uses the k-medoid method of clustering. CLARANS was followed by DBSCAN [EKSX96], a locality-based algorithm relying on the density of objects for clustering. DBCLASD [XEKS98] is also a locality-based algorithm, but it allows for random distribution of the points. Other density- or locality-based algorithms include STING [WYM97], an enhancement of DBSCAN, WaveCluster [SCZ98], a method based on wavelets, and DENCLUE [HK98], which is a generalization of several locality-based algorithms. Three other algorithms, BIRCH [ZRL96], CURE [GRS98] and CLIQUE [AGGR98], are hybrid algorithms, making use of both hierarchical techniques and grouping of related items. The nine clustering algorithms for spatial databases mentioned above will be examined more closely in the next section. In particular, they will be compared to the six requirements for clustering of spatial data mentioned above. Figure 2 shows the relationship between the different types of spatial clustering algorithms under discussion.

Figure 2: taxonomy of the spatial clustering algorithms under discussion. The top-level branches are partitional (k-medoid: PAM/CLARA, CLARANS; k-means), hierarchical (bottom-up, top-down, and grid-based), and locality-based (density-based, and random distribution: DBCLASD) methods; the remaining algorithms (DBSCAN, STING, BIRCH, CURE, DENCLUE, WaveCluster, CLIQUE, STING+, MOSAIC) appear under these branches.

4. Clustering Algorithms

4.1 CLARANS

CLARANS (Clustering Large Applications based on RANdomized Search) [NH94] is a k-medoid algorithm. It stems from the work done on PAM and CLARA, and relies on a randomized search of a graph to find the medoids which represent the clusters. A medoid is the most centrally located data point of a group. The algorithm takes as input maxneighbor and numlocal. Maxneighbor is the maximum number of neighbors of a node that are to be examined; numlocal is the maximum number of local minima that will be collected. CLARANS begins by selecting a random node. It then checks a sample of the neighbors of the node, and if a better neighbor is found based on the cost differential of the two nodes, it moves to that neighbor and continues processing until the maxneighbor criterion is met. Otherwise, it declares the current node a local minimum and starts a new pass to search for other local minima. After a specified number of local minima (numlocal) have been collected, the algorithm returns the best of these local values as the medoids of the clusters.

The values for maxneighbor and numlocal are not necessarily intuitive, and [NH94] describes an experimental method for deriving the best values for these parameters. The lower the value of maxneighbor, the lower the quality of the clusters; the higher the value of maxneighbor, the closer the quality comes to that of PAM. For numlocal, runtime is proportional to the number of local minima found. Experimentation determined that quality was enhanced when numlocal was set to 2 rather than 1, but that values larger than 2 showed no increase in quality and were therefore not cost effective. Numlocal was therefore set to 2 for all experiments using CLARANS.

One of CLARANS's main drawbacks is its lack of efficiency. The inner loop of the algorithm requires an O(N) iteration through the data. Although the authors claim that CLARANS is linearly proportional to the number of points, the time consumed in each step of searching is Ω(kN^2), so the overall performance is at least quadratic. In addition, it may not find a real local minimum due to the searching and trimming activities controlled by the sampling methods. CLARANS also requires that all objects to be included in the clusters reside in main memory, which severely limits the size of the database that can be examined. Focusing techniques proposed in [EKX95] improve CLARANS's ability to deal with data objects that are not in main memory, by clustering a sample of the data set and focusing on relevant data points in order to generate distance and quality updates. Without such extra focusing techniques, CLARANS cannot handle large data sets. Although the algorithm is insensitive to the order of data input, CLARANS can only find simple object shapes, and cannot handle nested objects or non-convex shapes. It was not designed to handle high-dimensionality data. Because of its random sampling methods, the algorithm can be significantly distracted by large amounts of noise, leading it to identify local minima for noise instead of clusters. Although no a-priori knowledge of the number of clusters is required, both maxneighbor and numlocal must be provided to the algorithm.
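The following is a rough, self-contained sketch of the CLARANS search loop as described above, under the assumption that a node is a set of k medoids, a neighboring node differs in exactly one medoid, and the cost of a node is the total distance from every point to its nearest medoid. The data, k, and helper names are illustrative rather than the authors' implementation.

import numpy as np

def node_cost(points, medoid_idx):
    # Total distance from every point to its nearest medoid.
    d = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(points, k, numlocal=2, maxneighbor=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    best_node, best_cost = None, np.inf
    for _ in range(numlocal):                      # collect numlocal local minima
        current = list(rng.choice(n, k, replace=False))
        current_cost = node_cost(points, current)
        examined = 0
        while examined < maxneighbor:
            # Random neighbor: swap one medoid for a randomly chosen non-medoid.
            neighbor = current.copy()
            pos = rng.integers(k)
            non_medoids = np.setdiff1d(np.arange(n), current)
            neighbor[pos] = int(rng.choice(non_medoids))
            neighbor_cost = node_cost(points, neighbor)
            if neighbor_cost < current_cost:       # better neighbor: move and keep searching
                current, current_cost = neighbor, neighbor_cost
                examined = 0
            else:
                examined += 1
        if current_cost < best_cost:               # current node is a local minimum
            best_node, best_cost = current, current_cost
    return points[best_node], best_cost

data = np.random.default_rng(3).normal(size=(300, 2))
medoids, cost = clarans(data, k=3)
print("medoids:\n", np.round(medoids, 2), "\ncost:", round(cost, 1))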

4.2 DBSCAN

Unlike CLARANS, DBSCAN (Density Based Spatial Clustering of Applications with Noise) [EKSX96] is a locality-based algorithm relying on a density-based notion of clustering: within each cluster, the density of the points is significantly higher than the density of points outside the cluster. The algorithm uses two parameters, Eps and MinPts, to control the density of the cluster. The Eps-neighborhood of a point p is defined by N_Eps(p) = { q ∈ D | dist(p, q) ≤ Eps }. The distance function dist(p, q) determines the shape of the neighborhood. MinPts is the minimum number of points that must be contained in the Eps-neighborhood of a point of the cluster. Eps and MinPts must be determined before DBSCAN can be run. With MinPts set to 4 for all databases with 2D data, the system first computes the 4-dist graph for the database and the user then selects an Eps based on the first valley of the graph; 4-dist is the function which maps each point to the distance to its 4th nearest neighbor.

To find the clusters in the database using these two values, DBSCAN starts with an arbitrary point and retrieves all points that are density-reachable from it, using Eps and MinPts as controlling parameters. If the point is a core point, the procedure yields a cluster. If the point is on the border, DBSCAN goes on to the next point in the database. The algorithm may need to be called recursively with a higher value for MinPts if close clusters need to be merged because they are within the same Eps threshold.

DBSCAN is designed to handle the issue of noise and is successful in ignoring outliers, but although it can handle shapes that are hollow, there is no indication that it can handle shapes that are fully nested. The major drawback of DBSCAN is the significant input required from the user: even if MinPts is set globally at a specific number, it is still necessary to manually determine Eps for each run of DBSCAN. The algorithm can handle large amounts of data, and the order of processing does not affect the shape of the clusters. However, the time to calculate Eps is significant, and is not factored into the runtime, which is O(N log N) for the algorithm itself. In addition, the algorithm is not designed to handle higher dimensional data.
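A minimal sketch of DBSCAN as described above is given below, assuming a brute-force Eps-neighborhood query in place of the spatial index a real implementation would use: a point whose Eps-neighborhood contains at least MinPts points is a core point, clusters are grown by expanding from core points, and points that end up in no cluster are noise. Data and parameter values are synthetic.

import numpy as np

def region_query(points, i, eps):
    # Indices of all points within distance eps of point i (its Eps-neighborhood).
    return np.where(np.linalg.norm(points - points[i], axis=1) <= eps)[0]

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = -2, -1
    labels = np.full(len(points), UNVISITED)
    cluster = -1
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:         # not a core point: mark as noise for now
            labels[i] = NOISE
            continue
        cluster += 1                          # start a new cluster from this core point
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:            # border point reached from a core point
                labels[j] = cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is itself a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels                             # points still labelled -1 are noise

rng = np.random.default_rng(4)
data = np.vstack([rng.normal((0, 0), 0.3, size=(100, 2)),
                  rng.normal((4, 4), 0.3, size=(100, 2))])
print(np.unique(dbscan(data, eps=0.5, min_pts=4), return_counts=True))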

4.3 DBCLASD

DBCLASD (Distribution Based Clustering of Large Spatial Databases) [XEKS98] is another locality-based clustering algorithm, but unlike DBSCAN, it assumes that the points inside each cluster are uniformly distributed. The authors observe that the distance from a point to its nearest neighbors is smaller inside a cluster than outside that cluster. Each cluster has a probability distribution of the distances from points to their nearest neighbors, and this distribution is used to define the cluster. A grid-based representation is used to approximate the clusters as part of the probability calculation. The definition of a cluster C used in DBCLASD has three requirements:

1. NNDistSet(C), the nearest-neighbor distance set of cluster C, has the expected distribution with a required confidence level.
2. C is maximal, i.e., each extension of C by neighboring points does not fulfill condition (1).
3. C is connected, i.e., for each pair of points (a, b) of the cluster there is a path of occupied grid cells connecting a and b.

DBCLASD is an incremental algorithm. Points are processed based on the points previously seen, without regard for the points yet to come. This makes the clusters produced by DBCLASD dependent on input order. In order to ameliorate this dependency, the algorithm uses two techniques: first, unsuccessful candidates for a cluster are not discarded, but tried again later, and second, points already assigned to a cluster may switch to another cluster later. However, no experimental tests were done to show that these techniques were successful; multiple runs with differently ordered input would provide confidence in their positive effects, since the techniques act to slow the algorithm down. Although each point is only examined once as input, internally each point may be re-examined several times, which makes the internal loops of DBCLASD computationally expensive. The runtime of DBCLASD is between 1.5 and 3 times the runtime of DBSCAN, and the factor increases as the size of the database increases.

The major advantage of DBCLASD is that it requires no outside input, which makes it attractive for larger data sets and sets with larger numbers of attributes. Nevertheless, it is expensive when compared to later algorithms. It is unaffected by noise, since it relies on a probability-based distance factor. It also can handle clusters of arbitrary shape, although there is still no indication that it could handle nested shapes. Its main drawback is that it assumes uniformly distributed points in a cluster (the authors use landmine placement as an example); this uniformity is not common in spatial data sets, and thus the algorithm's effectiveness is limited. In addition, since its runtime slows as the size of the database increases, even though it requires no outside parameters, it is still of limited use for very large databases.

4.4 STING

STING (Statistical Information Grid-based method) [WYM97] exploits the clustering properties of index structures. It divides the spatial area into rectangular grid cells using, for example, longitude and latitude. This makes STING order independent, i.e., the clusters created in the next step of the algorithm are not dependent on the order in which the values are placed in the grid. A hierarchical structure is used to manipulate the grid: each cell at level i is partitioned into a fixed number k of cells at the next level, similar to spatial index structures. Since the default value chosen for k is 4, the hierarchical structure used in STING is similar to a quadtree structure [Sam90]. The algorithm stores parameters in each cell which are designed to aid in answering certain types of statistically based spatial queries. In addition to storing the number of objects or points in the cell, STING also stores some attribute-dependent values. STING assumes that the attributes have numerical values, and stores the following data:

- m, the mean of all the attribute values in the cell
- s, the standard deviation of all attribute values in the cell
- min, the minimum value of the attribute in the cell
- max, the maximum value of the attribute in the cell
- dist, the type of distribution followed by the attribute values in the cell

Parameters are calculated, and cells populated, in a bottom-up fashion. The value of dist, an enumeration of the type of distribution (for example normal, uniform, or exponential), can be assigned by the user for the base case; for higher level cells, STING follows a set of heuristics to populate the value of dist.
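Before looking at how queries use this structure, the following sketch illustrates the bottom-up parameter computation under simplifying assumptions: leaf cells on a regular grid keep count, sum, and sum of squares of an attribute (so that mean and standard deviation, as well as min and max, can be derived for any cell without revisiting the points), and each parent cell combines a 2x2 block of children. The dist heuristics are omitted, and grid size and data are illustrative.

import numpy as np

def leaf_cells(xy, attr, grid=8, extent=1.0):
    # Per-cell statistics for a grid x grid bottom layer over [0, extent)^2.
    stats = {"n":   np.zeros((grid, grid)),
             "sum": np.zeros((grid, grid)),
             "sq":  np.zeros((grid, grid)),
             "min": np.full((grid, grid), np.inf),
             "max": np.full((grid, grid), -np.inf)}
    cell = np.minimum((xy / extent * grid).astype(int), grid - 1)
    for (cx, cy), a in zip(cell, attr):
        stats["n"][cx, cy]   += 1
        stats["sum"][cx, cy] += a
        stats["sq"][cx, cy]  += a * a
        stats["min"][cx, cy] = min(stats["min"][cx, cy], a)
        stats["max"][cx, cy] = max(stats["max"][cx, cy], a)
    return stats

def parent_layer(child):
    # Combine each 2x2 block of child cells into one parent cell.
    def pool(a, how):
        blocks = a.reshape(a.shape[0] // 2, 2, a.shape[1] // 2, 2)
        return how(blocks, axis=(1, 3))
    return {"n":   pool(child["n"], np.sum),
            "sum": pool(child["sum"], np.sum),
            "sq":  pool(child["sq"], np.sum),
            "min": pool(child["min"], np.min),
            "max": pool(child["max"], np.max)}

rng = np.random.default_rng(5)
xy, attr = rng.uniform(0, 1, size=(2000, 2)), rng.normal(50, 10, size=2000)
layer = leaf_cells(xy, attr)
while layer["n"].shape[0] > 1:                    # climb to the root, level by level
    layer = parent_layer(layer)
mean = layer["sum"] / layer["n"]
std = np.sqrt(layer["sq"] / layer["n"] - mean ** 2)
print("root cell: n =", int(layer["n"][0, 0]),
      "mean =", round(float(mean[0, 0]), 2), "std =", round(float(std[0, 0]), 2))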

Clustering operations are performed using a top-down method, starting with the root. The relevant cells are determined using the statistical information, and only the paths from those cells down the tree are followed. Once the leaf cells are reached, the clusters are formed using a breadth-first search, by merging cells based on their proximity and on whether the average density of the area is greater than some specified threshold. This is similar to DBSCAN, but using cells instead of points; thus STING finds an approximation of the clusters found by DBSCAN. [WYM97] state that as the granularity of the grid approaches 0, the clusters become identical. However, the cost of building the grid then becomes increasingly expensive. In addition to the granularity of the grid, which reduces the quality of the clusters, STING also does not consider the spatial relationship between a cell and its siblings when constructing the parent cell [SCZ98]. This may also cause a degradation in the quality of the clusters.

The runtime complexity of STING is O(K), where K is the number of cells at the bottom layer; [WYM97] assume that K << N. However, the smaller the K, the more approximate the clusters; the lower the granularity, the higher the K, and the slower the algorithm will run. STING in its approximation mode (high granularity) is very fast. Tests showed that its execution rate was almost independent of the number of data points for both generation and query operations. However, because of the approximation, the quality of the clusters is not as good as that of other algorithms. Since STING uses density-based methods to form its clusters, its ability to handle noise is similar to DBSCAN's. Although it can handle large amounts of data and is not sensitive to noise, it cannot handle higher dimensional data without a serious degradation of performance.

4.5 BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [ZRL96] uses a hierarchical data structure called a CF-tree, or Clustering-Feature tree, to incrementally and dynamically cluster the data points as they are introduced. A clustering feature CF is a triple that stores the information maintained about a cluster: CF = (N, LS, SS), containing the number of data points in the cluster, N, the linear sum of the N data points, LS, and the square sum of the N data points, SS. A CF-tree is a height-balanced tree used to store the clustering features. As new data objects are inserted, the tree is built dynamically; similar to a spatial index, it is used to guide a new value into the correct cluster.

One of the main goals of BIRCH is to minimize I/O time, since large datasets will not usually fit into memory. BIRCH does this by performing the clustering in phases. In the pre-clustering phase, the entire database is scanned and an initial in-memory CF-tree is built, representing dense regions of points with compact summaries, or sub-clusters, in the leaf nodes. The pre-clustering algorithm is both incremental and approximate. Phase 2, which is optional, rescans the leaf node entries to build a smaller CF-tree; it can be used to remove outliers and make larger clusters from sub-clusters. Phase 3 attempts to compensate for the order-dependent input. It uses either an existing centroid-based clustering algorithm, or a modification of an existing algorithm, applied to the sub-clusters at the leaves as if these sub-clusters were single points. The algorithm chosen by BIRCH is applied directly to the sub-clusters and takes as input from the user either a desired number of clusters or a threshold for the desired size of the clusters as a diameter or radius. Phase 4, which is again optional, takes the centroids of the clusters found in the previous phase and uses them as seeds to create the final clusters; the remainder of the data points are redistributed to these seed-based clusters based on a nearness factor. Phase 4 can also be used to decrease the number of outliers. As indicated, the second and fourth phases of the algorithm can be used to refine the clusters obtained, but are not required.
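As a small sketch of the clustering feature just described, the class below keeps CF = (N, LS, SS), uses its additivity to merge summaries, and derives a sub-cluster's centroid and radius from the summary alone, without revisiting the points. The CF-tree bookkeeping itself (node splitting, insertion thresholds) is omitted, and the class and example values are illustrative.

import numpy as np

class CF:
    def __init__(self, n, ls, ss):
        self.n = n          # number of points summarized
        self.ls = ls        # linear sum of the points (a vector)
        self.ss = ss        # sum of squared norms of the points (a scalar)

    @classmethod
    def of(cls, point):
        p = np.asarray(point, dtype=float)
        return cls(1, p.copy(), float(p @ p))

    def merge(self, other):
        # CF additivity: the CF of a union is the component-wise sum.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the summarized points from the centroid.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

cluster = CF.of([1.0, 2.0])
for p in [[1.2, 2.1], [0.9, 1.8], [1.1, 2.3]]:
    cluster = cluster.merge(CF.of(p))
print(cluster.n, np.round(cluster.centroid(), 3), round(cluster.radius(), 3))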

BIRCH does a single initial scan of the dataset, so the computational complexity of the algorithm is O(N) for each scan of the data. This assumes N >> K, where K is the number of sub-clusters; as K approaches N, the complexity becomes O(N^2), and thus the thresholds used in Phase 1 must be chosen carefully. One of the major drawbacks of BIRCH is that the algorithm uses a centroid-based method to form the clusters once the initial scan is done, and this causes problems when clusters do not have uniform shapes or sizes. Items at the edge of a very large cluster may actually be closer to the center of a nearby small cluster, and thus will be redistributed to the smaller cluster even though they really belong in the large one. In addition, the centroid method tends to form circular clusters; repeated scans of the data can eliminate this problem and allow the algorithm to find arbitrarily shaped objects, but these repeated scans degrade performance significantly. Finally, the clustering is order dependent, and several parameters must be set manually by the user.

4.6 WaveCluster

WaveCluster [SCZ98] takes a grid-based approach to clustering. It maps the data onto a multi-dimensional grid and applies a wavelet transformation to the feature space rather than to the objects themselves. In the first step of the algorithm, the data points are assigned to units based on their feature values; the number and size of these units affect the time required for clustering and the quality of the output. The algorithm then identifies the dense areas in the transformed domain by searching for connected components. If the feature space is examined from a signal-processing perspective, a group of objects in the feature space forms an n-dimensional signal. Rapid change in the distribution of objects, i.e., the borders of clusters, corresponds to the high-frequency parts of the signal; low-frequency areas with high amplitude correspond to the interiors of clusters; and areas with low frequency and low amplitude are outside the clusters. With a high number of objects, that is, a large database, signal-processing techniques can be used to find areas of low and high frequency, and thus identify the clusters. Wavelet transformation breaks a signal into its different frequency sub-bands, creating a multi-resolution representation, and therefore provides for efficient identification of clusters. [SCZ98] identify four major benefits of using wavelet transformation to identify clusters:

- Unsupervised clustering. The filters emphasize regions where points cluster, but tend to suppress weaker information at the boundaries. Dense regions act as attractors for nearby points while acting as inhibitors for points that are further away. Therefore, clusters stand out automatically, and the connected components of the feature space are easier to identify.
- Effective removal of outliers. Low-pass filters used in the transformation automatically remove outliers.
- Multi-resolution. This property allows the detection of clusters at different levels of accuracy; the wavelet transform can be applied multiple times at different levels of resolution, which results in different granularities of clusters.
- Cost efficiency. Wavelet transformation is very fast, and thus makes this method of locating clusters cost effective.
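A rough sketch of this pipeline is shown below under simplifying assumptions: the points are quantized onto a grid with a 2-d histogram, one level of Haar-style low-pass averaging stands in for the full wavelet transform, the smoothed density is thresholded, and clusters are read off as connected components of the dense units. Grid size, threshold, and data are invented for the example.

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(6)
data = np.vstack([rng.normal((2, 2), 0.4, size=(400, 2)),
                  rng.normal((7, 7), 0.4, size=(400, 2)),
                  rng.uniform(0, 10, size=(100, 2))])       # two clusters plus scattered noise

# Step 1: assign points to grid units (a 64 x 64 grid over the feature space).
grid, _, _ = np.histogram2d(data[:, 0], data[:, 1], bins=64, range=[[0, 10], [0, 10]])

# Step 2: one level of Haar approximation, averaging non-overlapping 2x2 blocks.
approx = grid.reshape(32, 2, 32, 2).mean(axis=(1, 3))

# Step 3: keep dense units and label connected components as clusters.
dense = approx > 1.0
labels, num_clusters = ndimage.label(dense)
print("clusters found:", num_clusters)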

The computational complexity of WaveCluster depends on several factors, including the dimensionality and the number of units created in the first step of the algorithm. The first step runs in O(N), since each data point is assigned to a unit once. Performing the wavelet transform and locating connected components is O(K), where K = m^d, m is the number of units per dimension, and d is the dimensionality. For very large databases N >> K, so O(N) > O(K), and the overall time complexity is O(N). However, if additional dimensions are added, the number of grid cells grows exponentially, and the determination of connected components becomes extremely costly because of the large number of neighboring cells. Therefore, WaveCluster is only efficient for low-dimensional data, although the authors do suggest two techniques to improve the time required for processing higher-dimensional data: first, performing component analysis to identify the most important features, reducing the number of features to a value f for which N > m^f, and second, using parallel processing to handle the increased load of higher-dimensional data.

WaveCluster makes several significant positive contributions. It is not affected by outliers and is not sensitive to the order of input. No a-priori knowledge about the number of clusters is needed, although an estimate of the needed resolution can prevent unnecessary work. WaveCluster's main advantage, aside from its speed at handling large datasets, is its ability to find clusters of arbitrary and complex shapes, including concave and nested clusters.

4.7 DENCLUE

DENCLUE (DENsity-based CLUstEring) [HK98] is a generalization of partitioning, locality-based, and hierarchical or grid-based clustering approaches. The algorithm models the overall point density analytically as the sum of the influence functions of the points; identifying the density-attractors identifies the clusters. DENCLUE can handle clusters of arbitrary shape using an equation based on the overall density function. The authors claim three major advantages for this method of higher-dimensional clustering:

- a firm mathematical basis for finding arbitrarily shaped clusters in high-dimensional data sets,
- good clustering properties in data sets with large amounts of noise, and
- significantly faster performance than existing algorithms.

The approach of DENCLUE is based on the concept that the influence of each data point on its neighborhood can be modeled mathematically. The mathematical function used is called an influence (or impact) function. This function is applied to each data point, and the density of the data space is the sum of the influence functions of all the data points. Since many data points do not contribute to the density at a given location, DENCLUE uses local density functions, which are defined by a distance function, in this case Euclidean distance, and consider only the data points which actually contribute. Local maxima of the density function, or density-attractors, identify clusters. These can be either center-defined clusters, similar to k-means clusters, or multi-center-defined clusters, that is, a series of center-defined clusters linked by a particular path; multi-center-defined clusters identify clusters of arbitrary shape, which can also be defined mathematically. The mathematical model requires two parameters, σ and ξ: σ is a parameter which describes a threshold for the influence of a data point in the data space, and ξ is a parameter which sets a threshold for determining whether a density-attractor is significant.
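The sketch below illustrates these ingredients with a Gaussian influence function, under the assumption that hill-climbing can be approximated by a mean-shift-style update (the original algorithm uses its own gradient-based procedure): the density at x is the sum of the influences of all points, a point is climbed uphill to its density-attractor, and the attractor is kept only if its density reaches ξ. Parameter values and data are illustrative.

import numpy as np

def density(x, data, sigma):
    # f(x) = sum_i exp(-||x - x_i||^2 / (2 sigma^2))   (Gaussian influence)
    return np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * sigma ** 2)).sum()

def density_attractor(x, data, sigma, tol=1e-4, max_iter=200):
    # Climb from x to a local maximum of the density function.
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * sigma ** 2))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

rng = np.random.default_rng(7)
data = np.vstack([rng.normal((0, 0), 0.3, size=(150, 2)),
                  rng.normal((3, 3), 0.3, size=(150, 2))])

sigma, xi = 0.5, 5.0
attractor = density_attractor(data[0], data, sigma)
if density(attractor, data, sigma) >= xi:           # significant density-attractor
    print("attractor:", np.round(attractor, 2))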

DENCLUE generalizes k-means clustering in the sense that the algorithm provides a globally optimal partitioning of the data set, whereas k-means methods generally provide a locally optimal partitioning. The generalization of locality-based clustering comes from the fact that DENCLUE can mimic locality-based clustering by using a square-wave influence function and mapping values from that function to the input parameters of a locality-based algorithm, for example DBSCAN (the mapping would be to the parameters Eps and MinPts). However, DENCLUE can handle far more complex influence functions, for example Gaussian functions. For hierarchical clustering, the algorithm can use different values of σ; a hierarchy of clusters can be created based on the values chosen, and these are not restricted to data-structure-dependent values.

DENCLUE first performs a pre-clustering step, which creates a map of the active portion of the data set, used to speed up the density function calculations. The second step is the clustering itself, including the identification of density-attractors and their corresponding points. Using a cell-based representation of the data allows the algorithm to work very efficiently. The time complexity analysis of the authors suggests that after the initial pre-clustering step, whose time complexity is O(D), where D is the active portion of the data set, the worst case for the remainder of the algorithm is O(D log D), the average case is O(log D), and for cases with a high level of noise the run time is even better. The time savings for DENCLUE, which successfully handles high-dimensional data and arbitrarily shaped objects, are significant. Although completely nested objects were not examined during the testing of the implementation, they should be identifiable by the algorithm if the parameters σ and ξ are not set to exclude the possibility. However, these parameters must be determined correctly, and although the paper makes suggestions for how they should be chosen, it does not determine mathematically or algorithmically how to choose them. Knowledge of the best choice for these parameters therefore depends on the type of data, and both a-priori knowledge of the data set and human input may be required.

4.8 CLIQUE

CLIQUE, named for Clustering In QUEst (the data mining research project at IBM Almaden) [AGGR98], is a density- and grid-based approach for high dimensional data sets that provides automatic sub-space clustering of high dimensional data. The authors identify three main goals they hope to accomplish with their algorithm:

- effective treatment of high dimensionality,
- interpretability of results, and
- scalability and usability.

The clustering model used for high dimensional data sets limits the search for clusters to subspaces of the high dimensional data space, instead of adding new dimensions that contain combinations of information contained in the original dimensions. A density-based approach is used for the actual clustering. In order to approximate the density of the data points, each dimension of the space is partitioned into equal-length intervals using a bottom-up approach. The volume of each partition is thus the same, and the density of a partition can be derived from the number of points inside it. These density figures are used to automatically identify appropriate subspaces. Clusters are identified in the subspace units by separating data points according to the density function and grouping connected high-density partitions within the subspace.
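The following is a small sketch of this bottom-up dense-unit search in two dimensions, under stated assumptions: every dimension is cut into ξ equal intervals, 1-dimensional units holding more than a τ fraction of the points are kept as dense, and candidate 2-dimensional units are built only from dense 1-dimensional ones. Parameter values, variable names, and data are illustrative.

import numpy as np
from itertools import product

rng = np.random.default_rng(8)
data = np.vstack([rng.normal((0.2, 0.8), 0.05, size=(300, 2)),
                  rng.uniform(0, 1, size=(200, 2))])
n, num_dims = data.shape

xi, tau = 10, 0.05                  # intervals per dimension, density threshold
unit = np.minimum((data * xi).astype(int), xi - 1)    # interval index per point and dimension

# Dense 1-dimensional units, recorded as (dimension, interval) pairs.
dense_1d = {(d, i) for d in range(num_dims) for i in range(xi)
            if np.sum(unit[:, d] == i) > tau * n}

# Candidate 2-dimensional units from pairs of dense 1-d units in different
# dimensions (the bottom-up step); keep those that are themselves dense.
dense_2d = set()
for (d0, i0), (d1, i1) in product(dense_1d, dense_1d):
    if d0 < d1:
        count = np.sum((unit[:, d0] == i0) & (unit[:, d1] == i1))
        if count > tau * n:
            dense_2d.add(((d0, i0), (d1, i1)))

print(len(dense_1d), "dense 1-d units,", len(dense_2d), "dense 2-d units")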

For simplification purposes, the clusters are constrained to be axis-parallel hyper-rectangles, and these clusters can be described by a DNF expression; a cluster can be portrayed compactly as the union of a group of overlapping rectangles. Two input parameters are used to partition the subspace and identify the dense units: ξ, the number of equal-length intervals into which each dimension is divided, and τ, the density threshold. The computational complexity and the quality of the clusters produced depend on these input parameters.

The CLIQUE algorithm has three phases:

- identification of subspaces that contain clusters,
- identification of clusters, and
- generation of a minimal description for the clusters.

The first phase involves a bottom-up algorithm to find dense units. It first makes a pass over the data to identify 1-dimensional dense units; dense units in parents are then determined based on the information available in the children. This algorithm can be improved by pruning the set of dense units to those that lie in interesting subspaces, using a method called MDL-based (minimal description length) pruning. Subspaces with large coverage of dense units are selected and the remainder are pruned: the subspaces are sorted into descending order of coverage, an optimal cut point is determined according to an encoding method, and the subspaces are pruned again. This pruning may eliminate some clusters which exist not in the identified horizontal subspaces but in vertical k-dimensional projections; this is one of the trade-offs that CLIQUE accepts. The second phase is a matter of finding the connected components in a graph that uses the dense units as vertices and has an edge if and only if two dense units share a common face; a depth-first search is used to find the connected components. The identification of clusters is dependent on the number of dense units, n, which has already been limited by the threshold parameter. The final phase takes the connected components identified in the second phase and generates a concise description of each cluster. In order to identify an appropriate rectangular cover for the cluster, the algorithm first uses a greedy method to cover the cluster with a number of maximal rectangles, and then discards the redundant rectangles to generate a minimal cover.

The time complexity of the algorithm is made up of three parts. For the identification of subspaces, the time complexity is O(c^k + mk) for a constant c, where k is the highest dimensionality of any dense unit and m is the number of input points; the algorithm makes k passes over the database. In the second phase, if the algorithm checks 2k neighbors to find connected units, then the number of accesses is 2kn. The cost of the final phase is O(n^2) for the greedy portion, where n is the number of dense units, and O(n^2) for the discard portion as well. The running time of the algorithm scales linearly with the size of the database, because the number of passes over the data does not change; as the dimensionality of the data increases, the runtime increases quadratically. CLIQUE makes no assumptions about the order of the data, and requires no a-priori knowledge of the database, but it does require the two input parameters from the user. It is reasonably successful at handling noise, but if τ, the density threshold, is set too low, some units with enough noise points in them will become dense and appear as clusters. By definition, clusters are represented minimally, using DNF expressions and minimal bounding rectangles, so the algorithm's emphasis is on finding clusters and not on the accuracy of their shapes.

4.9 CURE

CURE (Clustering Using Representatives) [GRS98] is a bottom-up hierarchical clustering algorithm, but instead of using a centroid-based approach or an all-points approach, it employs a method based on choosing a well-formed group of points to identify the distance between clusters. CURE begins by choosing a constant number, c, of well-scattered points from a cluster. These points are used to identify the shape and size of the cluster. The next step of the algorithm shrinks the selected points toward the centroid of the cluster by some predetermined fraction; varying the fraction between 0 and 1 helps CURE identify different types of clusters. Using the shrunken positions of these points to identify the clusters, the algorithm then finds the clusters with the closest pairs of identifying points; these are the clusters that are chosen to be merged as part of the hierarchical algorithm. Merging continues until the desired number of clusters, k, an input parameter, remain. A k-d tree [S90] is used to store the representative points of the clusters.

In order to handle very large data sets, CURE uses a random sample of the database. This is different from BIRCH, which uses pre-clustering of all the data points in order to handle larger datasets. Random sampling has two positive effects: first, the sample can be designed to fit in main memory, which eliminates significant I/O costs, and second, random sampling helps to filter outliers. The random samples must be selected such that the probability of missing clusters is low. The authors [GRS98] analytically derive sample sizes for which the risk is low, and show empirically that random samples still preserve accurate information about the geometry of the clusters. Sample sizes increase as the separation between clusters and the density of the clusters decrease. In order to speed up the clustering process when sample sizes increase, CURE partitions and partially clusters the data points in the partitions of the random sample. Again, this has similarities to BIRCH, but differs in that the partitions and partial clusters are still built from random samples. Instead of using a centroid to label the clusters, multiple representative points are used, and each data point is assigned to the cluster with the closest representative point. The use of multiple points enables the algorithm to identify arbitrarily shaped clusters. Empirical work with CURE found the following results:

- CURE can discover clusters with interesting shapes.
- CURE is not very sensitive to outliers, and the algorithm filters them well.
- Sampling and partitioning reduce input size without sacrificing cluster quality.
- The execution time of CURE is low in practice.
- The final labeling on disk is correct even when clusters are non-spherical.

The worst-case time complexity of CURE is O(n^2 log n), where n is the number of sampled points and not N, the size of the entire database. For lower dimensions the complexity is further reduced to O(n^2). This makes the time complexity of CURE no worse than that of BIRCH, even for very large datasets and high dimensionality. The computational complexity of CURE is quadratic with respect to the sample size, and is not related to the size of the dataset.
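A small sketch of CURE's cluster representation is given below under stated assumptions: c well-scattered points are chosen with a farthest-point heuristic and then shrunk toward the centroid by a fraction alpha. The hierarchical merging loop that repeatedly fuses the clusters with the closest representative points, and the k-d tree that stores the representatives, are omitted; names and data are illustrative.

import numpy as np

def representatives(cluster, c=4, alpha=0.5):
    centroid = cluster.mean(axis=0)
    # Start from the point farthest from the centroid, then repeatedly add the
    # point farthest from the representatives chosen so far (well scattered).
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster)):
        dists = np.min(np.linalg.norm(cluster[:, None, :] - np.array(reps)[None, :, :],
                                      axis=2), axis=1)
        reps.append(cluster[np.argmax(dists)])
    # Shrink the scattered points toward the centroid by the fraction alpha.
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)

cluster = np.random.default_rng(9).normal((5, 5), 1.0, size=(200, 2))
print(np.round(representatives(cluster), 2))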

5. Summary of algorithms

In the discussion of what is important in a spatial clustering algorithm we identified six factors necessary for effective clustering of large spatial datasets with high dimensionality. The discussion of the previous nine algorithms addressed how each matches these six requirements, and Figure 3 summarizes those discussions. It can be seen that although several of the algorithms meet some of the requirements, and some meet most, none of the algorithms as originally proposed meets all of them.

Figure 3. The algorithms as originally proposed, compared against the six requirements:

- CLARANS [NH94]: runtime Ω(kN^2), neither efficient nor scalable; does not handle higher dimensionality; finds only simple cluster shapes; sensitive to noise; independent of data input order; requires 2 parameters (maxneighbor, numlocal).
- DBSCAN [EKSX96]: runtime O(N log N); does not handle higher dimensionality; handles irregular but not nested shapes; insensitive to noise; independent of data input order; requires 2 parameters (Eps, MinPts).
- DBCLASD [XEKS98]: runtime 1.5-3 times that of DBSCAN, and the factor increases with database size; handles arbitrary but not nested shapes; insensitive to noise; input-order dependence only partially removed; requires no input parameters.
- STING [WYM97]: runtime O(K), where K is the number of cells at the bottom layer; does not handle higher dimensionality; clusters are approximate; insensitive to noise; independent of data input order.
- BIRCH [ZRL96]: runtime O(N) per scan; does not handle higher dimensionality; finds irregular shapes only through costly repeated scans; outliers can be removed in the optional phases; order dependent; several parameters required.
- WaveCluster [SCZ98]: O(N) for low dimensions only; does not handle higher dimensionality well; handles arbitrary and nested shapes; insensitive to noise; independent of data input order; no a-priori knowledge required.
- DENCLUE [HK98]: O(D log D), where D is the active portion of the data set; handles higher dimensionality; handles arbitrary shapes; insensitive to noise; independent of data input order; requires 2 parameters (σ, ξ).
- CLIQUE [AGGR98]: runtime quadratic in the number of dimensions; handles higher dimensionality; cluster shapes described only minimally; partially insensitive to noise; independent of data input order; requires 2 parameters (ξ, τ).
- CURE [GRS98]: O(n^2) for low dimensions and O(n^2 log n) for higher dimensions, where n is the sample size; handles higher dimensionality; handles irregular shapes; insensitive to noise; independent of data input order; requires 1 parameter (the desired number of clusters).

6. Present and Future

In the last two years, work on clustering for spatial databases has gone in two directions: enhancement and innovation. Several enhancements have been suggested for existing algorithms to allow them to handle larger databases and higher dimensional data. Hierarchical variants improve the effectiveness of the clustering. For example, [SCZ98] suggest a hierarchical variant of WaveCluster that uses a multi-resolution approach, starting with coarser grids and then merging the clusters, and [HK98] suggest a hierarchical version of DENCLUE, which creates a hierarchy of clusters by using smoother and smoother kernels, where the threshold for the influence of the data points is refined repeatedly to make tighter and tighter clusters.

However, there is still the problem of finding clustering algorithms that work effectively for high N, where N is the number of data points, high d, where d is the number of dimensions, and high noise levels. Very large databases with a large number of attributes for each data point (dimensionality) and a large amount of noise are still not handled well; some algorithms handle one or two of these, but not all three. Efficiency suffers in the presence of high dimensionality, and effectiveness also suffers in the presence of high dimensionality, especially when accompanied by large amounts of noise. Some attempts have been made to scale up clustering algorithms. These include sampling techniques, condensation techniques, indexing techniques, and grid-based techniques. CLARANS was one of the earliest algorithms to use data sampling; more recently, CURE used a well-chosen sample of points to define clusters. Condensation techniques include the sub-cluster grouping techniques of BIRCH and DENCLUE. Indexing is used in many of the algorithms discussed, including BIRCH, DBSCAN, WaveCluster, and DENCLUE. Grid-based techniques are used by DENCLUE, WaveCluster, STING, and CLIQUE. Nevertheless, most of these algorithms are still unable to handle high dimensional data effectively, or large datasets efficiently.

Recently, some algorithms have been enhanced, and a few others have been introduced, which suggest different methods and new tracks for clustering algorithms. These tracks include developing dynamic data mining triggers (STING+); improving the condensation techniques of CLIQUE to allow an adaptive interval size for the partitioning algorithm (MAFIA); generating partitioning algorithms that use enhanced parameters to improve the effectiveness of the clustering (CHAMELEON); using Delaunay triangulation, a recursive algorithm, and the proximity information inherent in the data structure to develop clusters and answer queries (AMOEBA); and using recursion, partitioning, and clustering to create a new algorithm designed to handle high-dimensional data (OptiGrid). These algorithms will be discussed briefly below.

6.1 STING+

STING+ [WYM99] builds on the algorithms presented in STING to create a system that supports user-defined triggers, that is, a system that works on dynamically evolving spatial data. Spatial data mining triggers have three advantages:

- From the user's perspective, queries do not need to be submitted repeatedly in order to monitor changes in the data; users need only specify a trigger and the desired pattern as the trigger condition. In addition, interesting patterns are detected immediately when trigger conditions are used.
- From the system's perspective, it is more efficient to check a trigger on an incremental basis than to process identical queries on the data multiple times.
- Data mining tasks on spatial data cannot be supported by traditional database triggers, since spatial data consists of both non-spatial and spatial attributes, and both may be used to make up the trigger condition.

In STING+, users can specify the type of trigger, the duration of the trigger, the region on which the trigger is defined, and the action to be performed by the trigger. The basic data structure of STING+ is similar to that of STING, and the statistics stored in the cells are also similar. In addition, each trigger is decomposed into sub-triggers whose values are stored in individual cells. The triggers and sub-triggers are evaluated incrementally to decrease the amount of time spent reprocessing data.

6.2 MAFIA

MAFIA (Merging of Adaptive Finite Intervals (And more than a CLIQUE)) [GHC99] is a modification of CLIQUE that runs faster and finds better quality clusters. The main changes are the elimination of the pruning technique for limiting the number of subspaces examined, and the implementation of an adaptive interval size which partitions each dimension depending on the data distribution in that dimension. The minimum number of bins for a dimension is calculated and placed in a histogram during the first pass over the data, and bins are merged if they are contiguous and have similar histogram values. The boundaries of the bins are not rigid as in CLIQUE, and thus the shape of the resulting clusters is significantly improved. The revised algorithm runs 44 times faster than the original version of CLIQUE, and handles both large size and high dimensionality well. The computational time complexity of the revised algorithm is O(c^k), where c is a constant and k is the number of distinct dimensions represented by the cluster subspaces in the dataset.

6.3 CHAMELEON

CHAMELEON [KHK99] combines a graph partitioning algorithm with a hierarchical clustering scheme that dynamically creates clusters. The first step of the algorithm partitions the data using a method based on the k-nearest-neighbor approach to graph partitioning; in the graph, the density of a region is stored as the weight of the connecting edge. The data is divided into a large number of small sub-clusters using a multi-level graph partitioning algorithm, which produces high quality partitions with a minimum number of edge cuts. The second step uses an agglomerative, or bottom-up, hierarchical clustering algorithm to combine the sub-clusters and find the real clusters. Sub-clusters are combined based on both the relative inter-connectivity of the clusters and their relative closeness; to determine these, the algorithm takes into account the internal characteristics of the clusters. The relative inter-connectivity is a function of the absolute inter-connectivity, which is defined as the sum of the weights of the edges connecting two clusters. The relative closeness of a pair of clusters is the average similarity between the


More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Acknowledgements First of all, my thanks go to my supervisor Dr. Osmar R. Za ane for his guidance and funding. Thanks to Jörg Sander who reviewed this

Acknowledgements First of all, my thanks go to my supervisor Dr. Osmar R. Za ane for his guidance and funding. Thanks to Jörg Sander who reviewed this Abstract Clustering means grouping similar objects into classes. In the result, objects within a same group should bear similarity to each other while objects in different groups are dissimilar to each

More information

Knowledge Discovery in Databases

Knowledge Discovery in Databases Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

More information

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Software Engineering Submitted

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Comparative Study of Subspace Clustering Algorithms

Comparative Study of Subspace Clustering Algorithms Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm Clustering is an unsupervised machine learning algorithm that divides a data into meaningful sub-groups,

More information

Unsupervised Learning Hierarchical Methods

Unsupervised Learning Hierarchical Methods Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach

More information

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)......

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)...... Data Mining i Topic: Clustering CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Cluster Analysis What is Cluster Analysis? Types

More information

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.5 Spring 2010 Instructor: Dr. Masoud Yaghini Introduction DBSCAN Algorithm OPTICS Algorithm DENCLUE Algorithm References Outline Introduction Introduction Density-based

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Lecture 7 Cluster Analysis: Part A

Lecture 7 Cluster Analysis: Part A Lecture 7 Cluster Analysis: Part A Zhou Shuigeng May 7, 2007 2007-6-23 Data Mining: Tech. & Appl. 1 Outline What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES 7.1. Abstract Hierarchical clustering methods have attracted much attention by giving the user a maximum amount of

More information

Review of Spatial Clustering Methods

Review of Spatial Clustering Methods ISSN 2320 2629 Volume 2, No.3, May - June 2013 Neethu C V et al., International Journal Journal of Information of Information Technology Technology Infrastructure, Infrastructure 2(3), May June 2013, 15-24

More information

GRID BASED CLUSTERING

GRID BASED CLUSTERING Cluster Analysis Grid Based Clustering STING CLIQUE 1 GRID BASED CLUSTERING Uses a grid data structure Quantizes space into a finite number of cells that form a grid structure Several interesting methods

More information

Efficient and Effective Clustering Methods for Spatial Data Mining. Raymond T. Ng, Jiawei Han

Efficient and Effective Clustering Methods for Spatial Data Mining. Raymond T. Ng, Jiawei Han Efficient and Effective Clustering Methods for Spatial Data Mining Raymond T. Ng, Jiawei Han 1 Overview Spatial Data Mining Clustering techniques CLARANS Spatial and Non-Spatial dominant CLARANS Observations

More information

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling To Appear in the IEEE Computer CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling George Karypis Eui-Hong (Sam) Han Vipin Kumar Department of Computer Science and Engineering University

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

DBSCAN. Presented by: Garrett Poppe

DBSCAN. Presented by: Garrett Poppe DBSCAN Presented by: Garrett Poppe A density-based algorithm for discovering clusters in large spatial databases with noise by Martin Ester, Hans-peter Kriegel, Jörg S, Xiaowei Xu Slides adapted from resources

More information

CS Data Mining Techniques Instructor: Abdullah Mueen

CS Data Mining Techniques Instructor: Abdullah Mueen CS 591.03 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 6: BASIC CLUSTERING Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical

More information

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

CS412 Homework #3 Answer Set

CS412 Homework #3 Answer Set CS41 Homework #3 Answer Set December 1, 006 Q1. (6 points) (1) (3 points) Suppose that a transaction datase DB is partitioned into DB 1,..., DB p. The outline of a distributed algorithm is as follows.

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

Clustering for Mining in Large Spatial Databases

Clustering for Mining in Large Spatial Databases Published in Special Issue on Data Mining, KI-Journal, ScienTec Publishing, Vol. 1, 1998 Clustering for Mining in Large Spatial Databases Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu In the

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

On Clustering Validation Techniques

On Clustering Validation Techniques On Clustering Validation Techniques Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis Department of Informatics, Athens University of Economics & Business, Patision 76, 0434, Athens, Greece (Hellas)

More information

Analysis and Extensions of Popular Clustering Algorithms

Analysis and Extensions of Popular Clustering Algorithms Analysis and Extensions of Popular Clustering Algorithms Renáta Iváncsy, Attila Babos, Csaba Legány Department of Automation and Applied Informatics and HAS-BUTE Control Research Group Budapest University

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analysis: Basic Concepts and Algorithms Data Warehousing and Mining Lecture 10 by Hossen Asiful Mustafa What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods HAN 17-ch10-443-496-9780123814791 2011/6/1 3:44 Page 443 #1 10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five

More information

DATA WAREHOUING UNIT I

DATA WAREHOUING UNIT I BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009

More information

DBRS: A Density-Based Spatial Clustering Method with Random Sampling. Xin Wang and Howard J. Hamilton Technical Report CS

DBRS: A Density-Based Spatial Clustering Method with Random Sampling. Xin Wang and Howard J. Hamilton Technical Report CS DBRS: A Density-Based Spatial Clustering Method with Random Sampling Xin Wang and Howard J. Hamilton Technical Report CS-2003-13 November, 2003 Copyright 2003, Xin Wang and Howard J. Hamilton Department

More information

! Introduction. ! Partitioning methods. ! Hierarchical methods. ! Model-based methods. ! Density-based methods. ! Scalability

! Introduction. ! Partitioning methods. ! Hierarchical methods. ! Model-based methods. ! Density-based methods. ! Scalability Preview Lecture Clustering! Introduction! Partitioning methods! Hierarchical methods! Model-based methods! Densit-based methods What is Clustering?! Cluster: a collection of data objects! Similar to one

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

Clustering Algorithms for High Dimensional Data Literature Review

Clustering Algorithms for High Dimensional Data Literature Review Clustering Algorithms for High Dimensional Data Literature Review S. Geetha #, K. Thurkai Muthuraj * # Department of Computer Applications, Mepco Schlenk Engineering College, Sivakasi, TamilNadu, India

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Clustering Techniques for Large Data Sets

Clustering Techniques for Large Data Sets Clustering Techniques for Large Data Sets From the Past to the Future Alexander Hinneburg, Daniel A. Keim University of Halle Introduction Application Example: Marketing Given: Large data base of customer

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

Clustering from Data Streams

Clustering from Data Streams Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

Improving Cluster Method Quality by Validity Indices

Improving Cluster Method Quality by Validity Indices Improving Cluster Method Quality by Validity Indices N. Hachani and H. Ounalli Faculty of Sciences of Bizerte, Tunisia narjes hachani@yahoo.fr Faculty of Sciences of Tunis, Tunisia habib.ounalli@fst.rnu.tn

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

K-Mean Clustering Algorithm Implemented To E-Banking

K-Mean Clustering Algorithm Implemented To E-Banking K-Mean Clustering Algorithm Implemented To E-Banking Kanika Bansal Banasthali University Anjali Bohra Banasthali University Abstract As the nations are connected to each other, so is the banking sector.

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

数据挖掘 Introduction to Data Mining

数据挖掘 Introduction to Data Mining 数据挖掘 Introduction to Data Mining Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 S8700113C 1 Introduction Last week: Association Analysis

More information

Clustering Large Dynamic Datasets Using Exemplar Points

Clustering Large Dynamic Datasets Using Exemplar Points Clustering Large Dynamic Datasets Using Exemplar Points William Sia, Mihai M. Lazarescu Department of Computer Science, Curtin University, GPO Box U1987, Perth 61, W.A. Email: {siaw, lazaresc}@cs.curtin.edu.au

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

A Technical Insight into Clustering Algorithms & Applications

A Technical Insight into Clustering Algorithms & Applications A Technical Insight into Clustering Algorithms & Applications Nandita Yambem 1, and Dr A.N.Nandakumar 2 1 Research Scholar,Department of CSE, Jain University,Bangalore, India 2 Professor,Department of

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information