Clustering in Data Mining
Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification, e.g., grouping customers as children, young, adult, or old based on age. When many parameters describe the profile of an object, it is difficult to segregate objects into ranges, and we need a clustering tool to help. Copyright @ gdeepak. com 2
Clustering Clustering is the process of grouping a set of data objects such that objects within a cluster have high similarity (intracluster cohesiveness) and high dissimilarity to objects in other clusters (intercluster distinctiveness).
Few points Different clustering algorithms may produce different clusterings. Applying different clustering algorithms to the same data may prove useful in that it may reveal previously unknown groupings within the data. Applications are very diverse: image patterns, biological patterns, search patterns, attack or hacking patterns, clustering documents by topic, clustering alphabets by handwriting, finding teams or cells on social networking sites, etc.
Supervised Vs Unsupervised learning In supervised learning, a few initial entries are classified by human intervention regarding their membership in a particular cluster; this is prevalent in machine learning. Clustering by observation is unsupervised learning because it implicitly looks for similarities within the objects.
Characteristics of good clustering algorithm Whether your algorithm works only for a few hundred data objects or scales to billions of objects; real-life data volumes are very large. Are you dealing only with numbers? Your algorithm should also handle images, documents, sequences and graphs. Your algorithm should not be limited to regular-shape clusters; complex and critical applications may have arbitrary-shape clusters as well.
Characteristics of good clustering algorithm Whether your algorithm depends upon domain expertise or handles clustering independently. Whether your algorithm is online or offline; based on this, it can either offer incremental updates or not. Does your algorithm discard or moderate inaccurate, noisy, erroneous data? How many dimensions can it deal with? Giving a semantic interpretation of the clusters as a post-processing step is good for the user.
Orthogonality in clustering Clusters may be at the same level, e.g., schools, colleges, universities. Clusters may be in some hierarchy, e.g., within colleges you have Engineering, Pharmacy, Law, etc. Clusters may be exclusive, meaning each object belongs to one cluster only. Clusters may not be exclusive, e.g., a few colleges may offer Engineering, MBA and Diploma programs as well.
Partitioning Technique Case 1: The number of clusters to be formed is also decided by the algorithm. Case 2: The number of clusters to be formed is given initially.
Centroid based K-means Algorithm C_i is the centroid of a cluster, which acts as its center point; it is the mean of all objects assigned to that cluster. The objective is to minimize the distance between the objects of a cluster and the representative object of that cluster: E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, c_i)². For each object in each cluster, the distance from the object to its cluster center is squared, and these distances are summed. This tries to make the resulting clusters as compact and as separate as possible.
K-means algorithm The problem is NP-hard for k clusters even in 2-D Euclidean space; one heuristic solution is the K-means algorithm. The number of clusters required is given, and the output is that many clusters with the objects contained in them. Randomly choose any k points from D as cluster centers. Each object is checked for its distance from the cluster centers to decide to which cluster it belongs, and that cluster's mean is recalculated. Repeat the above step until there is no change in the clusters.
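The steps above can be sketched in a few lines of Python. This is a minimal illustration on 2-D points; the function and parameter names are my own, not from the slides.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal Lloyd-style k-means on 2-D points, following the slide's steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: pick k random points as centers
    for _ in range(max_iters):
        # step 2: assign each object to its nearest center (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        # step 3: recompute each cluster mean (keep old center if a cluster is empty)
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:           # step 4: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated groups of three points each, the loop settles with one center per group after a few iterations.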
Issues in k-means algorithm It requires the value of k to be supplied. It is not good for irregular-shape clusters. It converges only to a local optimum and is dependent upon the initial selection of cluster centers. Complexity is O(nkt), where n is the number of objects, k is the number of clusters and t is the number of iterations. Outliers can disturb the cluster means. Different researchers have suggested different methods for selecting the initial k points, which also depend upon the domain and application area.
K-Medoid Algorithm Instead of taking the mean value of the cluster points, an actual point is chosen to act as the representative element of the cluster. Choose an initial set of medoids that may suitably represent each cluster. Then try each of the remaining objects to see whether it can replace any of the existing medoids; a replacement happens if it reduces the absolute error. PAM (Partitioning Around Medoids) is a popular implementation of the K-medoids algorithm.
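The swap idea can be illustrated with a tiny first-improvement sketch on 1-D data. This is a simplification of PAM (real PAM evaluates swaps more carefully); the names and the naive initialization are my own.

```python
def total_error(points, medoids):
    """Absolute error: sum of distances from each point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam_sketch(points, k):
    """Greedy PAM-style clustering on 1-D data: keep swapping a medoid for a
    non-medoid whenever the swap reduces the absolute error."""
    medoids = list(points[:k])               # naive initial medoids
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                candidate = [p if x == m else x for x in medoids]
                if total_error(points, candidate) < total_error(points, medoids):
                    medoids = candidate      # accept the error-reducing swap
                    improved = True
    return sorted(medoids)
```

On two separated groups the procedure ends with one medoid inside each group, though like PAM it only guarantees a local minimum of the error.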
Issues in K-Medoid algorithm Complexity is O(k(n−k)²), where k is the number of clusters and n is the total number of data objects. The algorithm becomes computationally intensive for large values and does not scale well. It is not as sensitive to outliers as the k-means algorithm. Another implementation that scales well for some applications is CLARA (Clustering LARge Applications). It uses sampling to choose an initial sample of points, and then PAM is applied on these sample points rather than on all the objects. If an object is one of the best k medoids but is not selected during sampling, CLARA will never find the best clustering.
Hierarchical Methods Bottom-up approach starts from single objects (agglomerative): need to merge at every iteration. Top-down approach starts from the set of all objects (divisive): need to split at every iteration.
Agglomerative Clustering Dendrograms are used. The method requires a termination criterion, or it will result in a final cluster containing all the objects. The distance measure is important for finding the similarity or dissimilarity of clusters. We cannot undo the merge operations, which may result in local optima. Complexity is O(n²), so it does not scale well for large volumes. AGNES (Agglomerative Nesting) is one of the implementations.
Divisive Clustering There are exponentially many ways in which a cluster can be partitioned into different clusters, so this is relatively more difficult and challenging than the merging method. Divisive methods do not backtrack on partitioning decisions. DIANA (Divisive Analysis) is one of the implementations.
Different Distance Measures Minimum distance: dist(K_i, K_j) = min dist(t_ip, t_jq), the smallest distance between an element in one cluster and an element in the other; called single link. Maximum distance: dist(K_i, K_j) = max dist(t_ip, t_jq), the largest distance between an element in one cluster and an element in the other; called complete link. Average distance: dist(K_i, K_j) = avg dist(t_ip, t_jq), the average distance between an element in one cluster and an element in the other.
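The three linkage measures above differ only in how they aggregate pairwise distances; a small sketch makes that explicit (function names are my own, and `dist` is any pairwise distance function passed by the caller):

```python
def single_link(c1, c2, dist):
    """Smallest distance between any element of c1 and any element of c2."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2, dist):
    """Largest distance between any element of c1 and any element of c2."""
    return max(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2, dist):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))
```

For example, with 1-D clusters [0, 1] and [4, 6] and absolute difference as the distance, single link gives 3, complete link gives 6, and average link gives 4.5.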
More distance measures Centroid: dist(K_i, K_j) = dist(C_i, C_j), the distance between the centroids of the two clusters. Medoid: dist(K_i, K_j) = dist(M_i, M_j), the distance between the medoids of the two clusters. The centroid can be calculated as C_m = (Σ_{i=1}^{N} t_ip) / N.
Multiphase clustering Here hierarchical clustering takes advantage of other techniques to make the algorithms less complex and more sensitive. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and Chameleon are two popular implementations.
BIRCH It is scalable and has the ability to undo clustering performed in intermediate steps. It uses the cluster feature concept and a cluster feature tree (CF-tree) to represent the cluster hierarchy. If a cluster contains n d-dimensional data objects, then its CF is a three-dimensional vector summarizing information about the cluster: CF = <n, LS, SS>, where LS is the linear sum of the n data points and SS is the square sum of the n data points. Cluster diameter D = √( Σ_i Σ_j (x_i − x_j)² / (n(n−1)) ).
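The reason CF = <n, LS, SS> is useful is that CFs are additive, and statistics such as the centroid and diameter can be computed from the CF alone (for 1-D data, Σ_i Σ_j (x_i − x_j)² = 2n·SS − 2·LS²). A sketch on 1-D points, with my own function names:

```python
import math

def make_cf(points):
    """Clustering feature <n, LS, SS> for a list of 1-D points."""
    return (len(points), sum(points), sum(x * x for x in points))

def merge_cf(cf1, cf2):
    """CFs are additive: merging two clusters just adds the components."""
    return tuple(a + b for a, b in zip(cf1, cf2))

def cf_centroid(cf):
    n, ls, _ = cf
    return ls / n

def cf_diameter(cf):
    """Cluster diameter from the CF alone, using the 1-D pairwise-sum identity."""
    n, ls, ss = cf
    return math.sqrt((2 * n * ss - 2 * ls * ls) / (n * (n - 1)))
```

For instance, merging the CF of [1, 2, 3] with the CF of [4] yields exactly the CF of [1, 2, 3, 4], without revisiting the raw points; that is what lets BIRCH summarize clusters in one scan.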
CF-Tree (figure)
CF Tree features It is a height-balanced tree storing features of the clusters. A non-leaf node has children and stores the sums of the CFs of its children. A CF tree has a branching factor (max number of children) and a threshold (max diameter of the subclusters stored at the leaves).
BIRCH algorithm Phase 1: It scans the database to build an initial in-memory CF-tree, which can be seen as a multilevel compression of the data that tries to preserve the data's inherent clustering structure. Phase 2: It applies another clustering algorithm to cluster the leaf nodes of the CF-tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.
Building CF tree For each point in the input, find the closest leaf entry. Add the point to that leaf entry and update its CF. If the entry diameter exceeds max_diameter, split the leaf; if required, split the parent as well, and so on.
Disadvantages of BIRCH Natural clusters are not formed, because we fix the size of leaf nodes. Due to the use of the radius and diameter parameters, clusters tend to be spherical. The insertion order of data points may change the clustering. However, complexity is O(n), and the algorithm scales well with the number of objects.
Chameleon Based on dynamic modelling of the graph; more adaptive and natural with respect to the number of clusters. It uses a k-nearest-neighbor graph to construct a sparse graph, where each vertex represents a data object and there exists an edge between two vertices if one object is among the k most similar objects to the other. Edge weights represent similarity between objects. It partitions the graph into a large number of small subgraphs, which may represent subclusters, such that the edge cut is minimized.
Illustration (figure)
Chameleon After partitioning, it uses agglomerative hierarchical clustering to merge subclusters based on their similarity. Similarity is calculated based on interconnectivity (connectivity of C1 and C2 relative to their internal connectivity) and relative closeness (closeness of C1 and C2 relative to their internal closeness). Its advantage lies in discovering arbitrarily shaped clusters of high quality; however, its worst-case complexity of O(n²) may not scale well for high-dimensional data.
Flaws in distance based algorithms It is very difficult to choose a good distance measure. Missing attribute values in some data objects are not allowed. Being based on local search, the optimization boundaries are not clear.
Probabilistic hierarchical clustering It uses probabilistic models to measure the distance between clusters. Data objects are considered samples of the data generation mechanism to be analyzed. It can handle missing data values and is easy to comprehend. Generally, Gaussian or Bernoulli distributions are used.
Drawbacks It outputs only one hierarchy for the chosen probabilistic model, while real-life data sets generally contain multiple hierarchies fitting the same data. Bayesian tree-structured models can handle this drawback, but they are very complex to handle.
Density Based Techniques These methods look for dense regions of data in the data space with sparse regions in between. The neighborhood concept is used to find the neighbors of a data object within a given radius. The density of a neighborhood can be measured by the number of objects in the given region.
Density Based Spatial Clustering of Applications with Noise (DBSCAN) It uses two parameters: є (max radius of the neighborhood) and MinPts (minimum density threshold of a region). An object is a core object if its є-neighborhood contains at least MinPts objects. The first step is to find all core objects in the given data set. An object is called directly density-reachable if it is within є distance of a core object.
Density terminology An object is density-reachable from another object if there is a chain of directly density-reachable points leading to the core object at the end. Two objects are density-connected with respect to є and MinPts if both are density-reachable from one common object. A cluster is defined as a maximal set of density-connected points.
DBSCAN Algorithm Mark all points as unvisited. Randomly select a point and mark it as visited. Check whether the є-neighborhood of this point contains at least MinPts objects. If yes, a cluster C is formed and all points in the є-neighborhood are added to a candidate set N; otherwise the point is designated as noise. Iteratively add the objects in N labeled as unvisited to C, and if their neighborhoods also contain at least MinPts objects, add those neighborhoods to N as well. Repeat until all points have been visited.
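The steps above can be condensed into a compact sketch. This version works on 1-D points for brevity (replace the neighborhood test for other data); `min_pts` counts the point itself, and all names are my own.

```python
def dbscan(points, eps, min_pts):
    """Simplified DBSCAN on 1-D points; returns one label per point (-1 = noise)."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    labels = [None] * len(points)            # None marks an unvisited point
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:             # not a core object
            labels[i] = -1                   # noise (may become a border point later)
            continue
        cluster += 1                         # start a new cluster C
        labels[i] = cluster
        queue = [j for j in seeds if j != i] # candidate set N
        while queue:
            j = queue.pop()
            if labels[j] == -1:              # previously noise: relabel as border
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:           # j is also a core object: expand from it
                queue.extend(nj)
    return labels
```

On [1, 2, 3, 10, 11, 12, 100] with є = 2 and MinPts = 2, the two dense runs become two clusters and the isolated point 100 is labelled noise.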
DBSCAN Complexity The general complexity is O(n²), but with the use of spatial indexing it can be improved to O(n log n). One drawback is that it is the user's responsibility to select є and MinPts, which is not good in general; slight variations in the values may lead to entirely different clusterings.
Ordering Points to Identify Cluster Structure (OPTICS) It produces a linear ordering of the data objects based on their density-wise closeness. This ordering can be produced for different parameter settings; the user can even try different settings and inspect the clustering structure visually. It uses two parameters.
OPTICS process The core distance of an object p is the smallest value є' such that the є'-neighborhood of p has at least MinPts objects; so it is the minimum distance threshold that makes p a core object. The reachability distance to object p from q is the minimum radius that makes p density-reachable from q; therefore the reachability distance from q to p is max{core-distance(q), distance(p, q)}. P may have multiple reachability distances with respect to different core objects; we are interested in the smallest reachability distance of p, because it gives the shortest path by which p is connected to a dense cluster.
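The two distances defined above can be written directly as helper functions. This sketch counts a point within its own neighborhood (a common convention, assumed here), and the names are my own:

```python
def core_distance(p, points, eps, min_pts, dist):
    """Smallest radius making p a core object, or None if p has fewer than
    min_pts neighbors within eps (p counts as its own neighbor)."""
    ds = sorted(dist(p, q) for q in points if dist(p, q) <= eps)
    return ds[min_pts - 1] if len(ds) >= min_pts else None

def reachability_distance(p, q, points, eps, min_pts, dist):
    """Reachability distance of p from q: max{core-distance(q), dist(p, q)},
    undefined (None) when q is not a core object."""
    cd = core_distance(q, points, eps, min_pts, dist)
    return None if cd is None else max(cd, dist(p, q))
```

For 1-D points [0, 1, 2, 10] with є = 3 and MinPts = 3, the core distance of 1 is 1 (its third-nearest neighbor), while the isolated point 10 has no core distance.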
OPTICS algorithm Start with any arbitrary object p from the data set. Determine its core distance by finding the є-neighborhood, and set its reachability distance to undefined. Object p is written to the output. If p is not a core object, move to the next object in the order-seeds list; if that list is empty, take the next object from the input database. If p is a core object, then for each object q in its neighborhood, update q's reachability distance from p and insert q into the order-seeds list. Continue until both the input list and the order-seeds list are empty.
DENCLUE (Density Based Clustering) Uses proven mathematical functions for density estimation. Its time complexity is better than that of other algorithms. Data sets with large amounts of noise are also accommodated; in other algorithms, results were dependent on the choice of є.
Key functions Kernel density estimation treats each observed object as contributing high probability density to its surrounding region. The probability density at a point depends on the distances from this point to the observed objects. DENCLUE uses a Gaussian kernel to estimate density based on the given set of objects to be clustered. A point is called a density attractor if it is a local maximum of the estimated density function.
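A 1-D Gaussian kernel density estimate of the kind DENCLUE builds on can be sketched as follows; the bandwidth parameter `h` is my own naming, and this is the generic estimator rather than DENCLUE's full machinery:

```python
import math

def gaussian_density(x, data, h=1.0):
    """Kernel density estimate at x: each observed object contributes a
    Gaussian bump of bandwidth h, and the bumps are averaged."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-((x - xi) ** 2) / (2 * h * h)) for xi in data)
```

Evaluating this near a tight group of observations gives a higher density than in an empty region between groups, which is exactly why local maxima of the estimate (density attractors) mark cluster centers.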
DENCLUE A cluster is a set of density attractors X and a set of input objects C such that each object in C is assigned to a density attractor in X, and there exists a path between every pair of density attractors along which the density stays above ξ. In this way we can find clusters of arbitrary shape.
Grid-based techniques The space under consideration is divided into cells irrespective of the actual data points, and these cells are then processed further. Complexity depends upon the number of cells.
STING (Statistical Information Grid) Space is divided in a hierarchical as well as rectangular manner; a cell at a higher level points to a number of cells at the next lower level. The mean, median and other statistics of a cell are precomputed.
Query algorithm The process starts from a selected layer of cells and proceeds top-down. For each cell, a confidence interval (estimated probability range) is calculated, which reflects the cell's relevance to the given query. At the next lower level, only relevant cells are examined after removing the irrelevant ones. This is repeated down to the lowest layer.
Pros and Cons It is easy to parallelize and supports incremental updates. Complexity is O(n), where n is the number of cells at the lowest layer. A disadvantage is that cluster boundaries are predefined.
Clustering in QUEst (CLIQUE) It exploits the monotonicity of dense cells with respect to dimensionality. Step 1: partition the d-dimensional data space into non-overlapping rectangular units and find the dense cells containing at least l (a density threshold) points, then join adjacent dense cells across dimensions. Iteration terminates when no candidates can be generated or no candidate cells are dense.
Clustering in QUEst (CLIQUE) Step 2: determine the maximal regions that cover each cluster of connected dense units. The disadvantage is that meaningful clustering depends on tuning the grid size, which is not easy to predict. The advantage is that it scales well with data size as well as dimensionality; it also does not presume any particular distribution and is insensitive to the order of inputs.
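The dense-cell search of step 1 can be sketched in 2-D by bucketing points into grid cells and keeping the cells that meet the density threshold; the names `cell_size` and `density_threshold` are my own, standing in for the grid size and l above.

```python
from collections import Counter

def dense_cells(points, cell_size, density_threshold):
    """Step-1-style pass in 2-D: bucket points into grid cells and keep the
    cells holding at least density_threshold points."""
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    return {cell for cell, n in counts.items() if n >= density_threshold}
```

For instance, with cell size 1 and threshold 2, three points clustered near the origin make cell (0, 0) dense, while a lone distant point leaves its cell sparse; joining adjacent dense cells would then form the candidate clusters.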
Other clustering techniques Clustering High Dimensional Data Clustering Graphs and network data
Questions, Comments and Suggestions
Question 1 Which is the popular implementation of the k-medoid algorithm?
Question 2 What is the full form of OPTICS?
Question 3 Chameleon is based on ______ of the graph.