Clustering Algorithms for Spatial Databases: A Survey


Erica Kolatch
Department of Computer Science, University of Maryland, College Park
CMSC 725, 3/25/01
kolatch@cs.umd.edu

1. Introduction

Spatial Database Systems (SDBS) are database systems designed to handle spatial data and the non-spatial information used to identify that data. SDBS are used for everything from geo-spatial data to bio-medical knowledge, and the number of such databases, and of their uses, is increasing rapidly. The amount of spatial data being collected is also increasing exponentially. The complexity of the data contained in these databases means that it is not possible for humans to completely analyze the data being collected. Data mining techniques have been used with relational databases to discover unknown information, searching for unexpected results and correlations. Extremely large databases require new techniques to analyze the data and discover these patterns. Traditional search algorithms can still answer questions about specific pieces of information, but they are no longer capable of finding previously unknown patterns in the data.

The remainder of the paper is organized as follows: Section 2 contains some basic definitions. Section 3 discusses the concept of spatial clustering. Section 4 explains and analyzes nine classic clustering algorithms. Section 5 synthesizes these algorithms in chart form. Section 6 looks at the present and the future and includes five additional clustering techniques. Section 7 concludes the paper.

2. Definitions

Spatial data describes information related to the space occupied by objects. The data consists of geometric information and can be either discrete or continuous. Discrete data might be a single point in multi-dimensional space; discrete spatial data differs from non-spatial data in that it has a distance attribute that is used to locate the data in space. Continuous data spans a region of space and might consist of medical images, map regions, or star fields [Sam94].

Spatial databases are database systems that manage spatial data. They are designed to handle both spatial information and the non-spatial attributes of that data. In order to provide efficient and effective access to spatial data it is necessary to develop indices. These indices are most successful when based on multi-dimensional trees; the structures proposed for them include quad trees, k-d trees, R-trees and R*-trees [Sam94].
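As a small, hedged illustration of the kind of multi-dimensional index these structures provide, the sketch below builds a k-d tree over synthetic two-dimensional points with SciPy and answers a range query and a nearest-neighbor query; the data, the query location, and the radius are invented for the example and are not taken from any particular SDBS.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(0, 100, size=(10_000, 2))   # 10,000 synthetic 2-d spatial objects

tree = cKDTree(points)                           # build the index once

# Range query: all objects within distance 5 of a query location.
query = np.array([50.0, 50.0])
hits = tree.query_ball_point(query, r=5.0)
print(len(hits), "objects within distance 5 of", query)

# Nearest-neighbor query: the 4 closest objects to the query location.
dist, idx = tree.query(query, k=4)
print("4 nearest neighbors at distances", np.round(dist, 2))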

Data mining, or knowledge discovery in databases (KDD), is the technique of analyzing data to discover previously unknown information. The goal is to reveal regularities and relationships that are non-trivial. This is accomplished through an analysis of the patterns that form in the data. Various algorithms have been developed to perform this analysis, but many of these techniques are not scalable to very large databases.

Spatial data mining differs from regular data mining in parallel with the differences between non-spatial data and spatial data. The attributes of a spatial object stored in a database may be affected by the attributes of the spatial neighbors of that object. In addition, spatial location, and implicit information about the location of an object, may be exactly the information that can be extracted through spatial data mining [Fay96].

In order to successfully explore the massive amounts of spatial data being collected it is necessary to develop database primitives to manipulate the data [EFKS00]. The indices developed for spatial databases are also necessary to provide effective search mechanisms. However, the very large size of spatial databases also requires additional techniques for manipulating and cleaning the data in order to prepare it for analysis. Three methods that have been proposed and developed to aid in the preparation of data are spatial characterization, spatial classification, and clustering.

Spatial characterization of an object is the description of the spatial and non-spatial attributes of the object that are typical of similar objects but not necessarily typical of the database as a whole [EFKS98]. To obtain a spatial characterization of an object it is necessary to look at both the properties of the object itself and the properties of its neighbors. The goal of spatial characterization is to discover a set of tuples in which a particular set of types appears with a frequency that is significantly different from the frequency in the database as a whole. However, if the neighborhood is very small, that is, there are very few targets, then spatial characterization may produce misleading results; the significant difference must therefore hold over a sufficiently large target neighborhood. It is interesting to note that an attribute may be significant in a limited neighborhood, but when the neighborhood is expanded the property may no longer be significant.

Spatial trend detection is the regular change of one or more non-spatial attributes while moving on the spatial plane from point x to point y. The regularity of the change is described by performing a regression analysis on the respective attribute values for the objects on the path. Algorithms for performing spatial characterization and spatial trend detection were developed and tested [EFKS98]. Spatial characterization costs increased significantly, O(n^3), as the size of the neighbor set increased, while trend detection was proportional to the number of neighbor operations. These rules provide groupings that can then be further mined for interesting results, although the results suggest that characterization would become prohibitively costly for very large sets.
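To make the regression step concrete, the following is a minimal, hypothetical sketch of spatial trend detection: it regresses a synthetic non-spatial attribute against distance from a chosen start object and reads the fitted slope as the trend. The data, the start point, and the attribute are invented for the example.

import numpy as np

rng = np.random.default_rng(1)
locations = rng.uniform(0, 10, size=(200, 2))     # object positions
start = np.array([5.0, 5.0])                      # object the trend starts from

# Synthetic attribute that decreases with distance from the start object,
# plus noise (e.g. an attribute falling off as one moves away from a center).
distance = np.linalg.norm(locations - start, axis=1)
attribute = 100.0 - 6.0 * distance + rng.normal(0, 3, size=distance.shape)

# Least-squares fit: attribute ~ slope * distance + intercept.
slope, intercept = np.polyfit(distance, attribute, deg=1)
print(f"trend: {slope:.2f} attribute units per unit of distance")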

However, the most interesting and well-developed method of manipulating and cleaning spatial data in order to prepare it for spatial data mining analysis is clustering.

3. Spatial Clustering

Clustering, as applied to large datasets, is the process of creating groups of objects organized on some similarity among the members. In spatial data sets, clustering permits a generalization of the spatial component that allows for successful data mining. There are a number of different methods proposed for performing clustering, but the three main divisions are partitional clustering, hierarchical clustering, and locality-based clustering. Partitional clustering develops a partition of the data such that objects in a cluster are more similar to each other than they are to objects in other clusters; the k-means and k-medoid methods are forms of partitional clustering. Hierarchical clustering performs a sequence of partitioning operations. These can be done bottom-up, performing repeated amalgamations of groups of data until some pre-defined threshold is reached, or top-down, recursively dividing the data until some pre-defined threshold is reached. Hierarchical clustering is frequently used in document and text analysis, and grid-based clusterings are also hierarchical. Locality-based clustering algorithms group objects based on local relationships, and therefore the entire database can be scanned in one pass. Some locality-based algorithms are density-based, while others assume a random distribution.

Although there are similarities between spatial and non-spatial clustering, large databases, and spatial databases in particular, have unique requirements that create special needs for clustering algorithms:

1. An obvious need, considering the large amount of data to be handled, is that algorithms be efficient and scalable.
2. Algorithms need to be able to identify irregular shapes, including those with lacunae or concave sections and nested shapes (see Figure 1).
3. The clustering mechanism should be insensitive to large amounts of noise.
4. Algorithms should not be sensitive to the order of input; that is, clustering results should be independent of data order.
5. No a-priori knowledge of the data or the number of clusters to be created should be required, and therefore no domain knowledge input should be required from the user.
6. Algorithms should handle data with large numbers of features, that is, higher dimensionality.

Figure 1

Over time, a number of clustering algorithms have been developed. Some of these are evolutionary, enhancements of previously developed work; others are revolutionary, introducing new concepts and methods. PAM (Partitioning Around Medoids) [KR90] uses k-medoid clustering to identify clusters. It works efficiently on small data sets, but is extremely costly for larger ones. This led to the development of CLARA. CLARA (Clustering Large Applications) [KR90] creates multiple samples of the data set and then applies PAM to each sample. CLARA chooses the best clustering as the output, basing quality on the similarity and dissimilarity of objects in the entire set, not just the samples.
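The sketch below is a toy illustration of the PAM/CLARA relationship under stated assumptions: an exhaustive k-medoid search (standing in for PAM, and only feasible because the samples are tiny) is run on several random samples, and the medoid set that scores best on the whole data set is kept, which is the CLARA idea. All function names, parameter values, and data are hypothetical.

from itertools import combinations
import numpy as np

def clustering_cost(points, medoids):
    # Total distance from every point to its nearest medoid.
    d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam_like(sample, k):
    # Exhaustive k-medoid search on a small sample (a stand-in for PAM).
    best_idx, best_cost = None, np.inf
    for idx in combinations(range(len(sample)), k):
        cost = clustering_cost(sample, sample[list(idx)])
        if cost < best_cost:
            best_idx, best_cost = idx, cost
    return sample[list(best_idx)]

def clara_like(points, k=3, num_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(num_samples):
        sample = points[rng.choice(len(points), sample_size, replace=False)]
        medoids = pam_like(sample, k)
        cost = clustering_cost(points, medoids)   # judge each candidate on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids

data = np.random.default_rng(2).normal(size=(500, 2))
print(clara_like(data, k=3))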

One of the first clustering algorithms specifically designed for spatial databases was CLARANS [NH94], which uses the k-medoid method of clustering. CLARANS was followed by DBSCAN [EKSX96], a locality-based algorithm relying on the density of objects for clustering. DBCLASD [XEKS98] is also a locality-based algorithm, but it allows for random distribution of the points. Other density- or locality-based algorithms include STING [WYM97], an enhancement of DBSCAN, WaveCluster [SCZ98], a method based on wavelets, and DENCLUE [HK98], which is a generalization of several locality-based algorithms. Three other algorithms, BIRCH [ZRL96], CURE [GRS98] and CLIQUE [AGGR98], are hybrid algorithms, making use of both hierarchical techniques and grouping of related items. The nine clustering algorithms for spatial databases mentioned above will be examined more closely in the next section. In particular, they will be compared to the six requirements for clustering of spatial data mentioned above. Figure 2 shows the relationship between the different types of spatial clustering algorithms under discussion.

Figure 2: taxonomy of the spatial clustering algorithms under discussion. The top-level branches are partitional (k-medoid: PAM/CLARA, CLARANS; k-means), hierarchical (bottom-up, top-down, and grid-based), and locality-based (density-based, and random distribution: DBCLASD) methods; the remaining algorithms (DBSCAN, STING, BIRCH, CURE, DENCLUE, WaveCluster, CLIQUE, STING+, MOSAIC) appear under these branches.

4. Clustering Algorithms

4.1 CLARANS

CLARANS (Clustering Large Applications based on RANdomized Search) [NH94] is a k-medoid algorithm. It stems from the work done on PAM and CLARA, and relies on a randomized search of a graph to find the medoids which represent the clusters. A medoid is the most centrally located data point of a group. The algorithm takes as input maxneighbor and numlocal. Maxneighbor is the maximum number of neighbors of a node that are to be examined; numlocal is the maximum number of local minima that will be collected. CLARANS begins by selecting a random node. It then checks a sample of the neighbors of the node, and if a better neighbor is found based on the cost differential of the two nodes, it moves to that neighbor and continues processing until the maxneighbor criterion is met. Otherwise, it declares the current node a local minimum and starts a new pass to search for other local minima. After a specified number of local minima (numlocal) have been collected, the algorithm returns the best of these local values as the medoids of the clusters.

The values for maxneighbor and numlocal are not necessarily intuitive, and [NH94] describes an experimental method for deriving the best values for these parameters. The lower the value of maxneighbor, the lower the quality of the clusters; the higher the value of maxneighbor, the closer the quality comes to that of PAM. For numlocal, runtime is proportional to the number of local minima found. Experimentation determined that quality was enhanced when numlocal was set to 2 rather than 1, but that values larger than 2 showed no increase in quality and were therefore not cost effective. Numlocal was therefore set to 2 for all experiments using CLARANS.

One of CLARANS's main drawbacks is its lack of efficiency. The inner loop of the algorithm requires an O(N) iteration through the data. Although the authors claim that CLARANS is linearly proportional to the number of points, the time consumed in each step of searching is Ω(kN^2), so the overall performance is at least quadratic. In addition, it may not find a real local minimum due to the searching and trimming activities controlled by the sampling methods. CLARANS also requires that all objects to be included in the clusters reside in main memory, which severely limits the size of the database that can be examined. Focusing techniques proposed in [EKX95] improve CLARANS's ability to deal with data objects that are not in main memory, by clustering a sample of the data set and focusing on relevant data points in order to generate distance and quality updates. Without such extra focusing techniques, CLARANS cannot handle large data sets. Although the algorithm is insensitive to the order of data input, CLARANS can only find simple object shapes, and cannot handle nested objects or non-convex shapes. It was not designed to handle high-dimensionality data. Because of its random sampling methods, the algorithm can be significantly distracted by large amounts of noise, leading it to identify local minima for noise instead of clusters. Although no a-priori knowledge of the number of clusters is required, both maxneighbor and numlocal must be provided to the algorithm.
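The following is a rough, self-contained sketch of the CLARANS search loop as described above, under the assumption that a node is a set of k medoids, a neighboring node differs in exactly one medoid, and the cost of a node is the total distance from every point to its nearest medoid. The data, k, and helper names are illustrative rather than the authors' implementation.

import numpy as np

def node_cost(points, medoid_idx):
    # Total distance from every point to its nearest medoid.
    d = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(points, k, numlocal=2, maxneighbor=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    best_node, best_cost = None, np.inf
    for _ in range(numlocal):                      # collect numlocal local minima
        current = list(rng.choice(n, k, replace=False))
        current_cost = node_cost(points, current)
        examined = 0
        while examined < maxneighbor:
            # Random neighbor: swap one medoid for a randomly chosen non-medoid.
            neighbor = current.copy()
            pos = rng.integers(k)
            non_medoids = np.setdiff1d(np.arange(n), current)
            neighbor[pos] = int(rng.choice(non_medoids))
            neighbor_cost = node_cost(points, neighbor)
            if neighbor_cost < current_cost:       # better neighbor: move and keep searching
                current, current_cost = neighbor, neighbor_cost
                examined = 0
            else:
                examined += 1
        if current_cost < best_cost:               # current node is a local minimum
            best_node, best_cost = current, current_cost
    return points[best_node], best_cost

data = np.random.default_rng(3).normal(size=(300, 2))
medoids, cost = clarans(data, k=3)
print("medoids:\n", np.round(medoids, 2), "\ncost:", round(cost, 1))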

4.2 DBSCAN

Unlike CLARANS, DBSCAN (Density Based Spatial Clustering of Applications with Noise) [EKSX96] is a locality-based algorithm relying on a density-based notion of clustering: within each cluster, the density of the points is significantly higher than the density of points outside the cluster. The algorithm uses two parameters, Eps and MinPts, to control the density of the cluster. The Eps-neighborhood of a point p is defined by N_Eps(p) = { q ∈ D | dist(p, q) ≤ Eps }. The distance function dist(p, q) determines the shape of the neighborhood. MinPts is the minimum number of points that must be contained in the Eps-neighborhood of a point of the cluster. Eps and MinPts must be determined before DBSCAN can be run. With MinPts set to 4 for all databases with 2D data, the system first computes the 4-dist graph for the database and the user then selects an Eps based on the first valley of the graph; 4-dist is the function which maps each point to the distance to its 4th nearest neighbor.

To find the clusters in the database using these two values, DBSCAN starts with an arbitrary point and retrieves all points that are density-reachable from it, using Eps and MinPts as controlling parameters. If the point is a core point, the procedure yields a cluster. If the point is on the border, DBSCAN goes on to the next point in the database. The algorithm may need to be called recursively with a higher value for MinPts if close clusters need to be merged because they are within the same Eps threshold.

DBSCAN is designed to handle the issue of noise and is successful in ignoring outliers, but although it can handle shapes that are hollow, there is no indication that it can handle shapes that are fully nested. The major drawback of DBSCAN is the significant input required from the user: even if MinPts is set globally at a specific number, it is still necessary to manually determine Eps for each run of DBSCAN. The algorithm can handle large amounts of data, and the order of processing does not affect the shape of the clusters. However, the time to calculate Eps is significant, and is not factored into the runtime, which is O(N log N) for the algorithm itself. In addition, the algorithm is not designed to handle higher dimensional data.
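A minimal sketch of DBSCAN as described above is given below, assuming a brute-force Eps-neighborhood query in place of the spatial index a real implementation would use: a point whose Eps-neighborhood contains at least MinPts points is a core point, clusters are grown by expanding from core points, and points that end up in no cluster are noise. Data and parameter values are synthetic.

import numpy as np

def region_query(points, i, eps):
    # Indices of all points within distance eps of point i (its Eps-neighborhood).
    return np.where(np.linalg.norm(points - points[i], axis=1) <= eps)[0]

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = -2, -1
    labels = np.full(len(points), UNVISITED)
    cluster = -1
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:         # not a core point: mark as noise for now
            labels[i] = NOISE
            continue
        cluster += 1                          # start a new cluster from this core point
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:            # border point reached from a core point
                labels[j] = cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is itself a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels                             # points still labelled -1 are noise

rng = np.random.default_rng(4)
data = np.vstack([rng.normal((0, 0), 0.3, size=(100, 2)),
                  rng.normal((4, 4), 0.3, size=(100, 2))])
print(np.unique(dbscan(data, eps=0.5, min_pts=4), return_counts=True))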

4.3 DBCLASD

DBCLASD (Distribution Based Clustering of Large Spatial Databases) [XEKS98] is another locality-based clustering algorithm, but unlike DBSCAN, it assumes that the points inside each cluster are uniformly distributed. The authors observe that the distance from a point to its nearest neighbors is smaller inside a cluster than outside that cluster. Each cluster has a probability distribution of the distances from points to their nearest neighbors, and this distribution is used to define the cluster. A grid-based representation is used to approximate the clusters as part of the probability calculation. The definition of a cluster C used in DBCLASD has three requirements:

1. NNDistSet(C), the nearest-neighbor distance set of cluster C, has the expected distribution with a required confidence level.
2. C is maximal, i.e., each extension of C by neighboring points does not fulfill condition (1).
3. C is connected, i.e., for each pair of points (a, b) of the cluster there is a path of occupied grid cells connecting a and b.

DBCLASD is an incremental algorithm. Points are processed based on the points previously seen, without regard for the points yet to come. This makes the clusters produced by DBCLASD dependent on input order. In order to ameliorate this dependency, the algorithm uses two techniques: first, unsuccessful candidates for a cluster are not discarded, but tried again later, and second, points already assigned to a cluster may switch to another cluster later. However, no experimental tests were done to show that these techniques were successful; multiple runs with differently ordered input would provide confidence in their positive effects, since the techniques act to slow the algorithm down. Although each point is only examined once as input, internally each point may be re-examined several times, which makes the internal loops of DBCLASD computationally expensive. The runtime of DBCLASD is between 1.5 and 3 times the runtime of DBSCAN, and the factor increases as the size of the database increases.

The major advantage of DBCLASD is that it requires no outside input, which makes it attractive for larger data sets and sets with larger numbers of attributes. Nevertheless, it is expensive when compared to later algorithms. It is unaffected by noise, since it relies on a probability-based distance factor. It also can handle clusters of arbitrary shape, although there is still no indication that it could handle nested shapes. Its main drawback is that it assumes uniformly distributed points in a cluster (the authors use landmine placement as an example); this uniformity is not common in spatial data sets, and thus the algorithm's effectiveness is limited. In addition, since its runtime slows as the size of the database increases, even though it requires no outside parameters, it is still of limited use for very large databases.

4.4 STING

STING (Statistical Information Grid-based method) [WYM97] exploits the clustering properties of index structures. It divides the spatial area into rectangular grid cells using, for example, longitude and latitude. This makes STING order independent, i.e., the clusters created in the next step of the algorithm are not dependent on the order in which the values are placed in the grid. A hierarchical structure is used to manipulate the grid: each cell at level i is partitioned into a fixed number k of cells at the next level, similar to spatial index structures. Since the default value chosen for k is 4, the hierarchical structure used in STING is similar to a quadtree structure [Sam90]. The algorithm stores parameters in each cell which are designed to aid in answering certain types of statistically based spatial queries. In addition to storing the number of objects or points in the cell, STING also stores some attribute-dependent values. STING assumes that the attributes have numerical values, and stores the following data:

- m, the mean of all the attribute values in the cell
- s, the standard deviation of all attribute values in the cell
- min, the minimum value of the attribute in the cell
- max, the maximum value of the attribute in the cell
- dist, the type of distribution followed by the attribute values in the cell

Parameters are calculated, and cells populated, in a bottom-up fashion. The value of dist, an enumeration of the type of distribution (for example normal, uniform, or exponential), can be assigned by the user for the base case; for higher level cells, STING follows a set of heuristics to populate the value of dist.
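Before looking at how queries use this structure, the following sketch illustrates the bottom-up parameter computation under simplifying assumptions: leaf cells on a regular grid keep count, sum, and sum of squares of an attribute (so that mean and standard deviation, as well as min and max, can be derived for any cell without revisiting the points), and each parent cell combines a 2x2 block of children. The dist heuristics are omitted, and grid size and data are illustrative.

import numpy as np

def leaf_cells(xy, attr, grid=8, extent=1.0):
    # Per-cell statistics for a grid x grid bottom layer over [0, extent)^2.
    stats = {"n":   np.zeros((grid, grid)),
             "sum": np.zeros((grid, grid)),
             "sq":  np.zeros((grid, grid)),
             "min": np.full((grid, grid), np.inf),
             "max": np.full((grid, grid), -np.inf)}
    cell = np.minimum((xy / extent * grid).astype(int), grid - 1)
    for (cx, cy), a in zip(cell, attr):
        stats["n"][cx, cy]   += 1
        stats["sum"][cx, cy] += a
        stats["sq"][cx, cy]  += a * a
        stats["min"][cx, cy] = min(stats["min"][cx, cy], a)
        stats["max"][cx, cy] = max(stats["max"][cx, cy], a)
    return stats

def parent_layer(child):
    # Combine each 2x2 block of child cells into one parent cell.
    def pool(a, how):
        blocks = a.reshape(a.shape[0] // 2, 2, a.shape[1] // 2, 2)
        return how(blocks, axis=(1, 3))
    return {"n":   pool(child["n"], np.sum),
            "sum": pool(child["sum"], np.sum),
            "sq":  pool(child["sq"], np.sum),
            "min": pool(child["min"], np.min),
            "max": pool(child["max"], np.max)}

rng = np.random.default_rng(5)
xy, attr = rng.uniform(0, 1, size=(2000, 2)), rng.normal(50, 10, size=2000)
layer = leaf_cells(xy, attr)
while layer["n"].shape[0] > 1:                    # climb to the root, level by level
    layer = parent_layer(layer)
mean = layer["sum"] / layer["n"]
std = np.sqrt(layer["sq"] / layer["n"] - mean ** 2)
print("root cell: n =", int(layer["n"][0, 0]),
      "mean =", round(float(mean[0, 0]), 2), "std =", round(float(std[0, 0]), 2))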

Clustering operations are performed using a top-down method, starting with the root. The relevant cells are determined using the statistical information, and only the paths from those cells down the tree are followed. Once the leaf cells are reached, the clusters are formed using a breadth-first search, by merging cells based on their proximity and on whether the average density of the area is greater than some specified threshold. This is similar to DBSCAN, but using cells instead of points; thus STING finds an approximation of the clusters found by DBSCAN. [WYM97] state that as the granularity of the grid approaches 0, the clusters become identical. However, the cost of building the grid then becomes increasingly expensive. In addition to the granularity of the grid, which reduces the quality of the clusters, STING also does not consider the spatial relationship between a cell and its siblings when constructing the parent cell [SCZ98]. This may also cause a degradation in the quality of the clusters.

The runtime complexity of STING is O(K), where K is the number of cells at the bottom layer; [WYM97] assume that K << N. However, the smaller the K, the more approximate the clusters; the lower the granularity, the higher the K, and the slower the algorithm will run. STING in its approximation mode (high granularity) is very fast. Tests showed that its execution rate was almost independent of the number of data points for both generation and query operations. However, because of the approximation, the quality of the clusters is not as good as that of other algorithms. Since STING uses density-based methods to form its clusters, its ability to handle noise is similar to DBSCAN's. Although it can handle large amounts of data and is not sensitive to noise, it cannot handle higher dimensional data without a serious degradation of performance.

4.5 BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [ZRL96] uses a hierarchical data structure called a CF-tree, or Clustering-Feature tree, to incrementally and dynamically cluster the data points as they are introduced. A clustering feature CF is a triple that stores the information maintained about a cluster: CF = (N, LS, SS), containing the number of data points in the cluster, N, the linear sum of the N data points, LS, and the square sum of the N data points, SS. A CF-tree is a height-balanced tree used to store the clustering features. As new data objects are inserted, the tree is built dynamically; similar to a spatial index, it is used to guide a new value into the correct cluster.

One of the main goals of BIRCH is to minimize I/O time, since large datasets will not usually fit into memory. BIRCH does this by performing the clustering in phases. In the pre-clustering phase, the entire database is scanned and an initial in-memory CF-tree is built, representing dense regions of points with compact summaries, or sub-clusters, in the leaf nodes. The pre-clustering algorithm is both incremental and approximate. Phase 2, which is optional, rescans the leaf node entries to build a smaller CF-tree; it can be used to remove outliers and make larger clusters from sub-clusters. Phase 3 attempts to compensate for the order-dependent input. It uses either an existing centroid-based clustering algorithm, or a modification of an existing algorithm, applied to the sub-clusters at the leaves as if these sub-clusters were single points. The algorithm chosen by BIRCH is applied directly to the sub-clusters and takes as input from the user either a desired number of clusters or a threshold for the desired size of the clusters as a diameter or radius. Phase 4, which is again optional, takes the centroids of the clusters found in the previous phase and uses them as seeds to create the final clusters; the remainder of the data points are redistributed to these seed-based clusters based on a nearness factor. Phase 4 can also be used to decrease the number of outliers. As indicated, the second and fourth phases of the algorithm can be used to refine the clusters obtained, but are not required.
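As a small sketch of the clustering feature just described, the class below keeps CF = (N, LS, SS), uses its additivity to merge summaries, and derives a sub-cluster's centroid and radius from the summary alone, without revisiting the points. The CF-tree bookkeeping itself (node splitting, insertion thresholds) is omitted, and the class and example values are illustrative.

import numpy as np

class CF:
    def __init__(self, n, ls, ss):
        self.n = n          # number of points summarized
        self.ls = ls        # linear sum of the points (a vector)
        self.ss = ss        # sum of squared norms of the points (a scalar)

    @classmethod
    def of(cls, point):
        p = np.asarray(point, dtype=float)
        return cls(1, p.copy(), float(p @ p))

    def merge(self, other):
        # CF additivity: the CF of a union is the component-wise sum.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the summarized points from the centroid.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

cluster = CF.of([1.0, 2.0])
for p in [[1.2, 2.1], [0.9, 1.8], [1.1, 2.3]]:
    cluster = cluster.merge(CF.of(p))
print(cluster.n, np.round(cluster.centroid(), 3), round(cluster.radius(), 3))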

BIRCH does a single initial scan of the dataset, so the computational complexity of the algorithm is O(N) for each scan of the data. This assumes N >> K, where K is the number of sub-clusters; as K approaches N, the complexity becomes O(N^2), and thus the thresholds used in Phase 1 must be chosen carefully. One of the major drawbacks of BIRCH is that the algorithm uses a centroid-based method to form the clusters once the initial scan is done, and this causes problems when clusters do not have uniform shapes or sizes. Items at the edge of a very large cluster may actually be closer to the center of a nearby small cluster, and thus will be redistributed to the smaller cluster even though they really belong in the large one. In addition, the centroid method tends to form circular clusters; repeated scans of the data can eliminate this problem and allow the algorithm to find arbitrarily shaped objects, but these repeated scans degrade performance significantly. Finally, the clustering is order dependent, and several parameters must be set manually by the user.

4.6 WaveCluster

WaveCluster [SCZ98] takes a grid-based approach to clustering. It maps the data onto a multi-dimensional grid and applies a wavelet transformation to the feature space rather than to the objects themselves. In the first step of the algorithm, the data points are assigned to units based on their feature values; the number and size of these units affect the time required for clustering and the quality of the output. The algorithm then identifies the dense areas in the transformed domain by searching for connected components. If the feature space is examined from a signal-processing perspective, a group of objects in the feature space forms an n-dimensional signal. Rapid change in the distribution of objects, i.e., the borders of clusters, corresponds to the high-frequency parts of the signal; low-frequency areas with high amplitude correspond to the interiors of clusters; and areas with low frequency and low amplitude are outside the clusters. With a high number of objects, that is, a large database, signal-processing techniques can be used to find areas of low and high frequency, and thus identify the clusters. Wavelet transformation breaks a signal into its different frequency sub-bands, creating a multi-resolution representation, and therefore provides for efficient identification of clusters. [SCZ98] identify four major benefits of using wavelet transformation to identify clusters:

- Unsupervised clustering. The filters emphasize regions where points cluster, but tend to suppress weaker information at the boundaries. Dense regions act as attractors for nearby points while acting as inhibitors for points that are further away. Therefore, clusters stand out automatically, and the connected components of the feature space are easier to identify.
- Effective removal of outliers. Low-pass filters used in the transformation automatically remove outliers.
- Multi-resolution. This property allows the detection of clusters at different levels of accuracy; the wavelet transform can be applied multiple times at different levels of resolution, which results in different granularities of clusters.
- Cost efficiency. Wavelet transformation is very fast, and thus makes this method of locating clusters cost effective.
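A rough sketch of this pipeline is shown below under simplifying assumptions: the points are quantized onto a grid with a 2-d histogram, one level of Haar-style low-pass averaging stands in for the full wavelet transform, the smoothed density is thresholded, and clusters are read off as connected components of the dense units. Grid size, threshold, and data are invented for the example.

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(6)
data = np.vstack([rng.normal((2, 2), 0.4, size=(400, 2)),
                  rng.normal((7, 7), 0.4, size=(400, 2)),
                  rng.uniform(0, 10, size=(100, 2))])       # two clusters plus scattered noise

# Step 1: assign points to grid units (a 64 x 64 grid over the feature space).
grid, _, _ = np.histogram2d(data[:, 0], data[:, 1], bins=64, range=[[0, 10], [0, 10]])

# Step 2: one level of Haar approximation, averaging non-overlapping 2x2 blocks.
approx = grid.reshape(32, 2, 32, 2).mean(axis=(1, 3))

# Step 3: keep dense units and label connected components as clusters.
dense = approx > 1.0
labels, num_clusters = ndimage.label(dense)
print("clusters found:", num_clusters)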

The computational complexity of WaveCluster depends on several factors, including the dimensionality and the number of units created in the first step of the algorithm. The first step runs in O(N), since each data point is assigned to a unit once. Performing the wavelet transform and locating connected components is O(K), where K = m^d, m is the number of units per dimension, and d is the dimensionality. For very large databases N >> K, so O(N) > O(K), and the overall time complexity is O(N). However, if additional dimensions are added, the number of grid cells grows exponentially, and the determination of connected components becomes extremely costly because of the large number of neighboring cells. Therefore, WaveCluster is only efficient for low-dimensional data, although the authors do suggest two techniques to improve the time required for processing higher-dimensional data: first, performing component analysis to identify the most important features, reducing the number of features to a value f for which N > m^f, and second, using parallel processing to handle the increased load of higher-dimensional data.

WaveCluster makes several significant positive contributions. It is not affected by outliers and is not sensitive to the order of input. No a-priori knowledge about the number of clusters is needed, although an estimate of the needed resolution can prevent unnecessary work. WaveCluster's main advantage, aside from its speed at handling large datasets, is its ability to find clusters of arbitrary and complex shapes, including concave and nested clusters.

4.7 DENCLUE

DENCLUE (DENsity-based CLUstEring) [HK98] is a generalization of partitioning, locality-based, and hierarchical or grid-based clustering approaches. The algorithm models the overall point density analytically as the sum of the influence functions of the points; identifying the density-attractors identifies the clusters. DENCLUE can handle clusters of arbitrary shape using an equation based on the overall density function. The authors claim three major advantages for this method of higher-dimensional clustering:

- a firm mathematical basis for finding arbitrarily shaped clusters in high-dimensional data sets,
- good clustering properties in data sets with large amounts of noise, and
- significantly faster performance than existing algorithms.

The approach of DENCLUE is based on the concept that the influence of each data point on its neighborhood can be modeled mathematically. The mathematical function used is called an influence (or impact) function. This function is applied to each data point, and the density of the data space is the sum of the influence functions of all the data points. Since many data points do not contribute to the density at a given location, DENCLUE uses local density functions, which are defined by a distance function, in this case Euclidean distance, and consider only the data points which actually contribute. Local maxima of the density function, or density-attractors, identify clusters. These can be either center-defined clusters, similar to k-means clusters, or multi-center-defined clusters, that is, a series of center-defined clusters linked by a particular path; multi-center-defined clusters identify clusters of arbitrary shape, which can also be defined mathematically. The mathematical model requires two parameters, σ and ξ: σ is a parameter which describes a threshold for the influence of a data point in the data space, and ξ is a parameter which sets a threshold for determining whether a density-attractor is significant.
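The sketch below illustrates these ingredients with a Gaussian influence function, under the assumption that hill-climbing can be approximated by a mean-shift-style update (the original algorithm uses its own gradient-based procedure): the density at x is the sum of the influences of all points, a point is climbed uphill to its density-attractor, and the attractor is kept only if its density reaches ξ. Parameter values and data are illustrative.

import numpy as np

def density(x, data, sigma):
    # f(x) = sum_i exp(-||x - x_i||^2 / (2 sigma^2))   (Gaussian influence)
    return np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * sigma ** 2)).sum()

def density_attractor(x, data, sigma, tol=1e-4, max_iter=200):
    # Climb from x to a local maximum of the density function.
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * sigma ** 2))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

rng = np.random.default_rng(7)
data = np.vstack([rng.normal((0, 0), 0.3, size=(150, 2)),
                  rng.normal((3, 3), 0.3, size=(150, 2))])

sigma, xi = 0.5, 5.0
attractor = density_attractor(data[0], data, sigma)
if density(attractor, data, sigma) >= xi:           # significant density-attractor
    print("attractor:", np.round(attractor, 2))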

DENCLUE generalizes k-means clustering in the sense that the algorithm provides a globally optimal partitioning of the data set, whereas k-means methods generally provide a locally optimal partitioning. The generalization of locality-based clustering comes from the fact that DENCLUE can mimic locality-based clustering by using a square-wave influence function and mapping values from that function to the input parameters of a locality-based algorithm, for example DBSCAN (the mapping would be to the parameters Eps and MinPts). However, DENCLUE can handle far more complex influence functions, for example Gaussian functions. For hierarchical clustering, the algorithm can use different values of σ; a hierarchy of clusters can be created based on the values chosen, and these are not restricted to data-structure-dependent values.

DENCLUE first performs a pre-clustering step, which creates a map of the active portion of the data set, used to speed up the density function calculations. The second step is the clustering itself, including the identification of density-attractors and their corresponding points. Using a cell-based representation of the data allows the algorithm to work very efficiently. The time complexity analysis of the authors suggests that after the initial pre-clustering step, whose time complexity is O(D), where D is the active portion of the data set, the worst case for the remainder of the algorithm is O(D log D), the average case is O(log D), and for cases with a high level of noise the run time is even better. The time savings for DENCLUE, which successfully handles high-dimensional data and arbitrarily shaped objects, are significant. Although completely nested objects were not examined during the testing of the implementation, they should be identifiable by the algorithm if the parameters σ and ξ are not set to exclude the possibility. However, these parameters must be determined correctly, and although the paper makes suggestions for how they should be chosen, it does not determine mathematically or algorithmically how to choose them. Knowledge of the best choice for these parameters therefore depends on the type of data, and both a-priori knowledge of the data set and human input may be required.

4.8 CLIQUE

CLIQUE, named for Clustering In QUEst (the data mining research project at IBM Almaden) [AGGR98], is a density- and grid-based approach for high dimensional data sets that provides automatic sub-space clustering of high dimensional data. The authors identify three main goals they hope to accomplish with their algorithm:

- effective treatment of high dimensionality,
- interpretability of results, and
- scalability and usability.

The clustering model used for high dimensional data sets limits the search for clusters to subspaces of the high dimensional data space, instead of adding new dimensions that contain combinations of information contained in the original dimensions. A density-based approach is used for the actual clustering. In order to approximate the density of the data points, each dimension of the space is partitioned into equal-length intervals using a bottom-up approach. The volume of each partition is thus the same, and the density of a partition can be derived from the number of points inside it. These density figures are used to automatically identify appropriate subspaces. Clusters are identified in the subspace units by separating data points according to the density function and grouping connected high-density partitions within the subspace.
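The following is a small sketch of this bottom-up dense-unit search in two dimensions, under stated assumptions: every dimension is cut into ξ equal intervals, 1-dimensional units holding more than a τ fraction of the points are kept as dense, and candidate 2-dimensional units are built only from dense 1-dimensional ones. Parameter values, variable names, and data are illustrative.

import numpy as np
from itertools import product

rng = np.random.default_rng(8)
data = np.vstack([rng.normal((0.2, 0.8), 0.05, size=(300, 2)),
                  rng.uniform(0, 1, size=(200, 2))])
n, num_dims = data.shape

xi, tau = 10, 0.05                  # intervals per dimension, density threshold
unit = np.minimum((data * xi).astype(int), xi - 1)    # interval index per point and dimension

# Dense 1-dimensional units, recorded as (dimension, interval) pairs.
dense_1d = {(d, i) for d in range(num_dims) for i in range(xi)
            if np.sum(unit[:, d] == i) > tau * n}

# Candidate 2-dimensional units from pairs of dense 1-d units in different
# dimensions (the bottom-up step); keep those that are themselves dense.
dense_2d = set()
for (d0, i0), (d1, i1) in product(dense_1d, dense_1d):
    if d0 < d1:
        count = np.sum((unit[:, d0] == i0) & (unit[:, d1] == i1))
        if count > tau * n:
            dense_2d.add(((d0, i0), (d1, i1)))

print(len(dense_1d), "dense 1-d units,", len(dense_2d), "dense 2-d units")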

For simplification purposes, the clusters are constrained to be axis-parallel hyper-rectangles, and these clusters can be described by a DNF expression; a cluster can be portrayed compactly as the union of a group of overlapping rectangles. Two input parameters are used to partition the subspace and identify the dense units: ξ, the number of equal-length intervals into which each dimension is divided, and τ, the density threshold. The computational complexity and the quality of the clusters produced depend on these input parameters.

The CLIQUE algorithm has three phases:

- identification of subspaces that contain clusters,
- identification of clusters, and
- generation of a minimal description for the clusters.

The first phase involves a bottom-up algorithm to find dense units. It first makes a pass over the data to identify 1-dimensional dense units; dense units in parents are then determined based on the information available in the children. This algorithm can be improved by pruning the set of dense units to those that lie in interesting subspaces, using a method called MDL-based (minimal description length) pruning. Subspaces with large coverage of dense units are selected and the remainder are pruned: the subspaces are sorted into descending order of coverage, an optimal cut point is determined according to an encoding method, and the subspaces are pruned again. This pruning may eliminate some clusters which exist not in the identified horizontal subspaces but in vertical k-dimensional projections; this is one of the trade-offs that CLIQUE accepts. The second phase is a matter of finding the connected components in a graph that uses the dense units as vertices and has an edge if and only if two dense units share a common face; a depth-first search is used to find the connected components. The identification of clusters is dependent on the number of dense units, n, which has already been limited by the threshold parameter. The final phase takes the connected components identified in the second phase and generates a concise description of each cluster. In order to identify an appropriate rectangular cover for the cluster, the algorithm first uses a greedy method to cover the cluster with a number of maximal rectangles, and then discards the redundant rectangles to generate a minimal cover.

The time complexity of the algorithm is made up of three parts. For the identification of subspaces, the time complexity is O(c^k + mk) for a constant c, where k is the highest dimensionality of any dense unit and m is the number of input points; the algorithm makes k passes over the database. In the second phase, if the algorithm checks 2k neighbors to find connected units, then the number of accesses is 2kn. The cost of the final phase is O(n^2) for the greedy portion, where n is the number of dense units, and O(n^2) for the discard portion as well. The running time of the algorithm scales linearly with the size of the database, because the number of passes over the data does not change; as the dimensionality of the data increases, the runtime increases quadratically. CLIQUE makes no assumptions about the order of the data, and requires no a-priori knowledge of the database, but it does require the two input parameters from the user. It is reasonably successful at handling noise, but if τ, the density threshold, is set too low, some units with enough noise points in them will become dense and appear as clusters. By definition, clusters are represented minimally, using DNF expressions and minimal bounding rectangles, so the algorithm's emphasis is on finding clusters and not on the accuracy of their shapes.

4.9 CURE

CURE (Clustering Using Representatives) [GRS98] is a bottom-up hierarchical clustering algorithm, but instead of using a centroid-based approach or an all-points approach, it employs a method based on choosing a well-formed group of points to identify the distance between clusters. CURE begins by choosing a constant number, c, of well-scattered points from a cluster. These points are used to identify the shape and size of the cluster. The next step of the algorithm shrinks the selected points toward the centroid of the cluster by some predetermined fraction; varying the fraction between 0 and 1 helps CURE identify different types of clusters. Using the shrunken positions of these points to identify the clusters, the algorithm then finds the clusters with the closest pairs of identifying points; these are the clusters that are chosen to be merged as part of the hierarchical algorithm. Merging continues until the desired number of clusters, k, an input parameter, remain. A k-d tree [S90] is used to store the representative points of the clusters.

In order to handle very large data sets, CURE uses a random sample of the database. This is different from BIRCH, which uses pre-clustering of all the data points in order to handle larger datasets. Random sampling has two positive effects: first, the sample can be designed to fit in main memory, which eliminates significant I/O costs, and second, random sampling helps to filter outliers. The random samples must be selected such that the probability of missing clusters is low. The authors [GRS98] analytically derive sample sizes for which the risk is low, and show empirically that random samples still preserve accurate information about the geometry of the clusters. Sample sizes increase as the separation between clusters and the density of the clusters decrease. In order to speed up the clustering process when sample sizes increase, CURE partitions and partially clusters the data points in the partitions of the random sample. Again, this has similarities to BIRCH, but differs in that the partitions and partial clusters are still built from random samples. Instead of using a centroid to label the clusters, multiple representative points are used, and each data point is assigned to the cluster with the closest representative point. The use of multiple points enables the algorithm to identify arbitrarily shaped clusters. Empirical work with CURE found the following results:

- CURE can discover clusters with interesting shapes.
- CURE is not very sensitive to outliers, and the algorithm filters them well.
- Sampling and partitioning reduce input size without sacrificing cluster quality.
- The execution time of CURE is low in practice.
- The final labeling on disk is correct even when clusters are non-spherical.

The worst-case time complexity of CURE is O(n^2 log n), where n is the number of sampled points and not N, the size of the entire database. For lower dimensions the complexity is further reduced to O(n^2). This makes the time complexity of CURE no worse than that of BIRCH, even for very large datasets and high dimensionality. The computational complexity of CURE is quadratic with respect to the sample size, and is not related to the size of the dataset.
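A small sketch of CURE's cluster representation is given below under stated assumptions: c well-scattered points are chosen with a farthest-point heuristic and then shrunk toward the centroid by a fraction alpha. The hierarchical merging loop that repeatedly fuses the clusters with the closest representative points, and the k-d tree that stores the representatives, are omitted; names and data are illustrative.

import numpy as np

def representatives(cluster, c=4, alpha=0.5):
    centroid = cluster.mean(axis=0)
    # Start from the point farthest from the centroid, then repeatedly add the
    # point farthest from the representatives chosen so far (well scattered).
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster)):
        dists = np.min(np.linalg.norm(cluster[:, None, :] - np.array(reps)[None, :, :],
                                      axis=2), axis=1)
        reps.append(cluster[np.argmax(dists)])
    # Shrink the scattered points toward the centroid by the fraction alpha.
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)

cluster = np.random.default_rng(9).normal((5, 5), 1.0, size=(200, 2))
print(np.round(representatives(cluster), 2))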

5. Summary of algorithms

In the discussion of what is important in a spatial clustering algorithm we identified six factors necessary for effective clustering of large spatial datasets with high dimensionality. The discussion of the previous nine algorithms addressed how each matches these six requirements, and Figure 3 summarizes those discussions. It can be seen that although several of the algorithms meet some of the requirements, and some meet most, none of the algorithms as originally proposed meets all of them.

Figure 3. The algorithms as originally proposed, compared against the six requirements:

- CLARANS [NH94]: runtime Ω(kN^2), neither efficient nor scalable; does not handle higher dimensionality; finds only simple cluster shapes; sensitive to noise; independent of data input order; requires 2 parameters (maxneighbor, numlocal).
- DBSCAN [EKSX96]: runtime O(N log N); does not handle higher dimensionality; handles irregular but not nested shapes; insensitive to noise; independent of data input order; requires 2 parameters (Eps, MinPts).
- DBCLASD [XEKS98]: runtime 1.5-3 times that of DBSCAN, and the factor increases with database size; handles arbitrary but not nested shapes; insensitive to noise; input-order dependence only partially removed; requires no input parameters.
- STING [WYM97]: runtime O(K), where K is the number of cells at the bottom layer; does not handle higher dimensionality; clusters are approximate; insensitive to noise; independent of data input order.
- BIRCH [ZRL96]: runtime O(N) per scan; does not handle higher dimensionality; finds irregular shapes only through costly repeated scans; outliers can be removed in the optional phases; order dependent; several parameters required.
- WaveCluster [SCZ98]: O(N) for low dimensions only; does not handle higher dimensionality well; handles arbitrary and nested shapes; insensitive to noise; independent of data input order; no a-priori knowledge required.
- DENCLUE [HK98]: O(D log D), where D is the active portion of the data set; handles higher dimensionality; handles arbitrary shapes; insensitive to noise; independent of data input order; requires 2 parameters (σ, ξ).
- CLIQUE [AGGR98]: runtime quadratic in the number of dimensions; handles higher dimensionality; cluster shapes described only minimally; partially insensitive to noise; independent of data input order; requires 2 parameters (ξ, τ).
- CURE [GRS98]: O(n^2) for low dimensions and O(n^2 log n) for higher dimensions, where n is the sample size; handles higher dimensionality; handles irregular shapes; insensitive to noise; independent of data input order; requires 1 parameter (the desired number of clusters).

6. Present and Future

In the last two years, work on clustering for spatial databases has gone in two directions: enhancement and innovation. Several enhancements have been suggested for existing algorithms to allow them to handle larger databases and higher dimensional data. Hierarchical variants improve the effectiveness of the clustering. For example, [SCZ98] suggest a hierarchical variant of WaveCluster that uses a multi-resolution approach, starting with coarser grids and then merging the clusters, and [HK98] suggest a hierarchical version of DENCLUE, which creates a hierarchy of clusters by using smoother and smoother kernels, where the threshold for the influence of the data points is refined repeatedly to make tighter and tighter clusters.

However, there is still the problem of finding clustering algorithms that work effectively for high N, where N is the number of data points, high d, where d is the number of dimensions, and high noise levels. Very large databases with a large number of attributes for each data point (dimensionality) and a large amount of noise are still not handled well; some algorithms handle one or two of these, but not all three. Efficiency suffers in the presence of high dimensionality, and effectiveness also suffers in the presence of high dimensionality, especially when accompanied by large amounts of noise. Some attempts have been made to scale up clustering algorithms. These include sampling techniques, condensation techniques, indexing techniques, and grid-based techniques. CLARANS was one of the earliest algorithms to use data sampling; more recently, CURE used a well-chosen sample of points to define clusters. Condensation techniques include the sub-cluster grouping techniques of BIRCH and DENCLUE. Indexing is used in many of the algorithms discussed, including BIRCH, DBSCAN, WaveCluster, and DENCLUE. Grid-based techniques are used by DENCLUE, WaveCluster, STING, and CLIQUE. Nevertheless, most of these algorithms are still unable to handle high dimensional data effectively, or large datasets efficiently.

Recently, some algorithms have been enhanced, and a few others have been introduced, which suggest different methods and new tracks for clustering algorithms. These tracks include developing dynamic data mining triggers (STING+); improving the condensation techniques of CLIQUE to allow an adaptive interval size for the partitioning algorithm (MAFIA); generating partitioning algorithms that use enhanced parameters to improve the effectiveness of the clustering (CHAMELEON); using Delaunay triangulation, a recursive algorithm, and the proximity information inherent in the data structure to develop clusters and answer queries (AMOEBA); and using recursion, partitioning, and clustering to create a new algorithm designed to handle high-dimensional data (OptiGrid). These algorithms will be discussed briefly below.

6.1 STING+

STING+ [WYM99] builds on the algorithms presented in STING to create a system that supports user-defined triggers, that is, a system that works on dynamically evolving spatial data. Spatial data mining triggers have three advantages:

- From the user's perspective, queries do not need to be submitted repeatedly in order to monitor changes in the data; users need only specify a trigger and the desired pattern as the trigger condition. In addition, interesting patterns are detected immediately when trigger conditions are used.
- From the system's perspective, it is more efficient to check a trigger on an incremental basis than to process identical queries on the data multiple times.
- Data mining tasks on spatial data cannot be supported by traditional database triggers, since spatial data consists of both non-spatial and spatial attributes, and both may be used to make up the trigger condition.

In STING+, users can specify the type of trigger, the duration of the trigger, the region on which the trigger is defined, and the action to be performed by the trigger. The basic data structure of STING+ is similar to that of STING, and the statistics stored in the cells are also similar. In addition, each trigger is decomposed into sub-triggers whose values are stored in individual cells. The triggers and sub-triggers are evaluated incrementally to decrease the amount of time spent reprocessing data.

6.2 MAFIA

MAFIA (Merging of Adaptive Finite Intervals (And more than a CLIQUE)) [GHC99] is a modification of CLIQUE that runs faster and finds better quality clusters. The main changes are the elimination of the pruning technique for limiting the number of subspaces examined, and the implementation of an adaptive interval size which partitions each dimension depending on the data distribution in that dimension. The minimum number of bins for a dimension is calculated and placed in a histogram during the first pass over the data, and bins are merged if they are contiguous and have similar histogram values. The boundaries of the bins are not rigid as in CLIQUE, and thus the shape of the resulting clusters is significantly improved. The revised algorithm runs 44 times faster than the original version of CLIQUE, and handles both large size and high dimensionality well. The computational time complexity of the revised algorithm is O(c^k), where c is a constant and k is the number of distinct dimensions represented by the cluster subspaces in the dataset.

6.3 CHAMELEON

CHAMELEON [KHK99] combines a graph partitioning algorithm with a hierarchical clustering scheme that dynamically creates clusters. The first step of the algorithm partitions the data using a method based on the k-nearest-neighbor approach to graph partitioning; in the graph, the density of a region is stored as the weight of the connecting edge. The data is divided into a large number of small sub-clusters using a multi-level graph partitioning algorithm, which produces high quality partitions with a minimum number of edge cuts. The second step uses an agglomerative, or bottom-up, hierarchical clustering algorithm to combine the sub-clusters and find the real clusters. Sub-clusters are combined based on both the relative inter-connectivity of the clusters and their relative closeness; to determine these, the algorithm takes into account the internal characteristics of the clusters. The relative inter-connectivity is a function of the absolute inter-connectivity, which is defined as the sum of the weights of the edges connecting two clusters. The relative closeness of a pair of clusters is the average similarity between the


More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Acknowledgements First of all, my thanks go to my supervisor Dr. Osmar R. Za ane for his guidance and funding. Thanks to Jörg Sander who reviewed this

Acknowledgements First of all, my thanks go to my supervisor Dr. Osmar R. Za ane for his guidance and funding. Thanks to Jörg Sander who reviewed this Abstract Clustering means grouping similar objects into classes. In the result, objects within a same group should bear similarity to each other while objects in different groups are dissimilar to each

More information

Knowledge Discovery in Databases

Knowledge Discovery in Databases Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

More information

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Software Engineering Submitted

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Comparative Study of Subspace Clustering Algorithms

Comparative Study of Subspace Clustering Algorithms Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm Clustering is an unsupervised machine learning algorithm that divides a data into meaningful sub-groups,

More information

Unsupervised Learning Hierarchical Methods

Unsupervised Learning Hierarchical Methods Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach

More information

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)......

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)...... Data Mining i Topic: Clustering CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Cluster Analysis What is Cluster Analysis? Types

More information

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.5 Spring 2010 Instructor: Dr. Masoud Yaghini Introduction DBSCAN Algorithm OPTICS Algorithm DENCLUE Algorithm References Outline Introduction Introduction Density-based

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Lecture 7 Cluster Analysis: Part A

Lecture 7 Cluster Analysis: Part A Lecture 7 Cluster Analysis: Part A Zhou Shuigeng May 7, 2007 2007-6-23 Data Mining: Tech. & Appl. 1 Outline What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES 7.1. Abstract Hierarchical clustering methods have attracted much attention by giving the user a maximum amount of

More information

Review of Spatial Clustering Methods

Review of Spatial Clustering Methods ISSN 2320 2629 Volume 2, No.3, May - June 2013 Neethu C V et al., International Journal Journal of Information of Information Technology Technology Infrastructure, Infrastructure 2(3), May June 2013, 15-24

More information

GRID BASED CLUSTERING

GRID BASED CLUSTERING Cluster Analysis Grid Based Clustering STING CLIQUE 1 GRID BASED CLUSTERING Uses a grid data structure Quantizes space into a finite number of cells that form a grid structure Several interesting methods

More information

Efficient and Effective Clustering Methods for Spatial Data Mining. Raymond T. Ng, Jiawei Han

Efficient and Effective Clustering Methods for Spatial Data Mining. Raymond T. Ng, Jiawei Han Efficient and Effective Clustering Methods for Spatial Data Mining Raymond T. Ng, Jiawei Han 1 Overview Spatial Data Mining Clustering techniques CLARANS Spatial and Non-Spatial dominant CLARANS Observations

More information

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling To Appear in the IEEE Computer CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling George Karypis Eui-Hong (Sam) Han Vipin Kumar Department of Computer Science and Engineering University

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

DBSCAN. Presented by: Garrett Poppe

DBSCAN. Presented by: Garrett Poppe DBSCAN Presented by: Garrett Poppe A density-based algorithm for discovering clusters in large spatial databases with noise by Martin Ester, Hans-peter Kriegel, Jörg S, Xiaowei Xu Slides adapted from resources

More information

CS Data Mining Techniques Instructor: Abdullah Mueen

CS Data Mining Techniques Instructor: Abdullah Mueen CS 591.03 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 6: BASIC CLUSTERING Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical

More information

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

CS412 Homework #3 Answer Set

CS412 Homework #3 Answer Set CS41 Homework #3 Answer Set December 1, 006 Q1. (6 points) (1) (3 points) Suppose that a transaction datase DB is partitioned into DB 1,..., DB p. The outline of a distributed algorithm is as follows.

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

Clustering for Mining in Large Spatial Databases

Clustering for Mining in Large Spatial Databases Published in Special Issue on Data Mining, KI-Journal, ScienTec Publishing, Vol. 1, 1998 Clustering for Mining in Large Spatial Databases Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu In the

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

On Clustering Validation Techniques

On Clustering Validation Techniques On Clustering Validation Techniques Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis Department of Informatics, Athens University of Economics & Business, Patision 76, 0434, Athens, Greece (Hellas)

More information

Analysis and Extensions of Popular Clustering Algorithms

Analysis and Extensions of Popular Clustering Algorithms Analysis and Extensions of Popular Clustering Algorithms Renáta Iváncsy, Attila Babos, Csaba Legány Department of Automation and Applied Informatics and HAS-BUTE Control Research Group Budapest University

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analysis: Basic Concepts and Algorithms Data Warehousing and Mining Lecture 10 by Hossen Asiful Mustafa What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods HAN 17-ch10-443-496-9780123814791 2011/6/1 3:44 Page 443 #1 10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five

More information

DATA WAREHOUING UNIT I

DATA WAREHOUING UNIT I BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009

More information

DBRS: A Density-Based Spatial Clustering Method with Random Sampling. Xin Wang and Howard J. Hamilton Technical Report CS

DBRS: A Density-Based Spatial Clustering Method with Random Sampling. Xin Wang and Howard J. Hamilton Technical Report CS DBRS: A Density-Based Spatial Clustering Method with Random Sampling Xin Wang and Howard J. Hamilton Technical Report CS-2003-13 November, 2003 Copyright 2003, Xin Wang and Howard J. Hamilton Department

More information

! Introduction. ! Partitioning methods. ! Hierarchical methods. ! Model-based methods. ! Density-based methods. ! Scalability

! Introduction. ! Partitioning methods. ! Hierarchical methods. ! Model-based methods. ! Density-based methods. ! Scalability Preview Lecture Clustering! Introduction! Partitioning methods! Hierarchical methods! Model-based methods! Densit-based methods What is Clustering?! Cluster: a collection of data objects! Similar to one

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

Clustering Algorithms for High Dimensional Data Literature Review

Clustering Algorithms for High Dimensional Data Literature Review Clustering Algorithms for High Dimensional Data Literature Review S. Geetha #, K. Thurkai Muthuraj * # Department of Computer Applications, Mepco Schlenk Engineering College, Sivakasi, TamilNadu, India

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Clustering Techniques for Large Data Sets

Clustering Techniques for Large Data Sets Clustering Techniques for Large Data Sets From the Past to the Future Alexander Hinneburg, Daniel A. Keim University of Halle Introduction Application Example: Marketing Given: Large data base of customer

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

Clustering from Data Streams

Clustering from Data Streams Clustering from Data Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Introduction 2 Clustering Micro Clustering 3 Clustering Time Series Growing the Structure Adapting

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

Improving Cluster Method Quality by Validity Indices

Improving Cluster Method Quality by Validity Indices Improving Cluster Method Quality by Validity Indices N. Hachani and H. Ounalli Faculty of Sciences of Bizerte, Tunisia narjes hachani@yahoo.fr Faculty of Sciences of Tunis, Tunisia habib.ounalli@fst.rnu.tn

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

K-Mean Clustering Algorithm Implemented To E-Banking

K-Mean Clustering Algorithm Implemented To E-Banking K-Mean Clustering Algorithm Implemented To E-Banking Kanika Bansal Banasthali University Anjali Bohra Banasthali University Abstract As the nations are connected to each other, so is the banking sector.

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

数据挖掘 Introduction to Data Mining

数据挖掘 Introduction to Data Mining 数据挖掘 Introduction to Data Mining Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 S8700113C 1 Introduction Last week: Association Analysis

More information

Clustering Large Dynamic Datasets Using Exemplar Points

Clustering Large Dynamic Datasets Using Exemplar Points Clustering Large Dynamic Datasets Using Exemplar Points William Sia, Mihai M. Lazarescu Department of Computer Science, Curtin University, GPO Box U1987, Perth 61, W.A. Email: {siaw, lazaresc}@cs.curtin.edu.au

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

A Technical Insight into Clustering Algorithms & Applications

A Technical Insight into Clustering Algorithms & Applications A Technical Insight into Clustering Algorithms & Applications Nandita Yambem 1, and Dr A.N.Nandakumar 2 1 Research Scholar,Department of CSE, Jain University,Bangalore, India 2 Professor,Department of

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information