Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)
Medha Vidyotma
April 24, 2018

1 Contd. Random Forest

For example, if there are 50 scholars who each measure the length of a rod, there will be 50 different measurements, and the noise in the data due to measurement errors gets cancelled out. Similarly, a combination of learning models (an ensemble of classifiers) increases classification accuracy. A random forest is a large collection of decorrelated decision trees: for every random subset of the attributes and the data points, a new tree is built. Since each of the n attributes and each of the m data points has an equal chance of either being included or not being included in a tree, there can be a total of 2^(n+m) - 1 trees in the forest. The forest predicts the class of a data point by taking a majority vote over its trees.
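As a concrete illustration (not from the lecture), the following is a minimal sketch of this voting idea using scikit-learn's RandomForestClassifier; the synthetic dataset and the parameter values are assumptions chosen only for the example.

# Minimal sketch of ensemble voting with scikit-learn's RandomForestClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed toy dataset: m = 200 data points, n = 10 attributes.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees sees a bootstrap sample of the rows and a random
# subset of the features at each split, which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)

# The predicted class of a test point is a vote over all trees.
print(forest.score(X_test, y_test))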
2 Clustering

The aim of clustering is to segregate groups with similar traits and assign them into clusters. Clustering is an example of unsupervised learning: it is used when data is available but without any labels, i.e., without any a priori information. For example, it can be used to understand customers better in order to market a product to them more effectively. Clustering is used for operations like 1) summarization, 2) compression, and 3) finding nearest neighbors.

Types of clustering are:

1. Hierarchical: where there is a hierarchical relationship between the clusters.
2. Partitional: where no hierarchical relationship exists between clusters.
3. Exclusive: where no data point belongs to more than one cluster.
4. Overlapping: where a data point may belong to more than one cluster.
5. Fuzzy: where the data point's probability of belonging to each cluster is evaluated.
6. Complete: where all data points are assigned to clusters.
7. Partial: where a few data points do not belong to any cluster. This happens due to outliers.

K-means is a partitional clustering technique that attempts to find a user-specified number of clusters (k), which are represented by their centroids. To run the k-means algorithm, you have to randomly initialize three points (see the figure below) called the cluster centroids.
I have three cluster centroids because I want to group my data into three clusters. K-means is an iterative algorithm and it repeats two steps:

1. Cluster assignment step
2. Move centroid step (see the figure below) [5]

This is repeated till the centroids converge. [5]
Closeness is measured by:

1. Euclidean distance

   d = \sqrt{\sum_i (q_i - p_i)^2}   (1)

2. Cosine similarity

   d = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \cdot \sqrt{\sum_i B_i^2}}   (2)

3. Correlation

4. Bregman divergence

   d = F(x) - F(y) - \langle \nabla F(y), x - y \rangle   (3)

In the cluster assignment step, the algorithm goes through each of the data points and, depending on which cluster centroid is closer (the red, the blue, or the green), assigns the data point to one of the three cluster centroids. [5] In the move centroid step, K-means moves the centroids to the average of the points in a cluster. In other words, the algorithm calculates the average of all the points in a cluster and moves the centroid to that average location. This process is repeated until there is no change in the clusters (or possibly until some other stopping condition is met). The k initial centroids are chosen randomly, or the user gives specific initial starting points.
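The following is a minimal sketch of these two steps in Python with NumPy (not code from the lecture); the random initialization, the convergence test, and the empty-cluster guard are illustrative assumptions.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: alternate cluster assignment and move-centroid steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: assign each point to the nearest
        # centroid using the Euclidean distance of Eq. (1).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid step: move each centroid to the average of the
        # points assigned to it (keep the old centroid if none were).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer change (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

Calling kmeans(X, k=3) on 2-D data reproduces the behaviour shown in the figures: the centroids trace a path from their initial locations to their final locations.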
The three lines in the figure show the path from each centroid's initial location to its final location. For choosing an appropriate value of k, run the experiment using different values of k and see which ones generate good results. The value of k can be decreased if some clusters are too small, and increased if the clusters are too broad.

3 Evaluation of K-means

The objective of k-means is to minimize the distance of points from their assigned centroids. Choose the clustering that minimizes the sum of squared error (SSE):

SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2

where c_j is the centroid of cluster C_j. The red lines on the graph indicate that the value of SSE can lie anywhere within the length of the line, since average values have been taken to plot the graph. A good clustering will have a small number of clusters k; however, increasing k always lowers the SSE, so there is a trade-off between k and the SSE. For choosing k:

1. Iterate over a range of values of k to find the optimum k (see the sketch below).
2. Pre-process the data with other algorithms.
3. Domain knowledge about the data may be very important for estimating the number of clusters that can exist.
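A minimal sketch of this search over k, reusing the kmeans function defined above; the data X and the range of k values are assumptions chosen so the "elbow" is visible near the true number of clusters.

import numpy as np

def sse(X, centroids, labels):
    """Sum of squared distances of points from their assigned centroids."""
    return float(np.sum((X - centroids[labels]) ** 2))

# Assumed data: 300 points in 2-D drawn around three true centers.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(loc=c, scale=0.5, size=(100, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])

# SSE decreases monotonically in k; look for the elbow where the
# improvement levels off (here it should be near k = 3).
for k in range(1, 8):
    centroids, labels = kmeans(X, k)
    print(k, sse(X, centroids, labels))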
Initialization of Centers:

1. The centers can be random points in space.
2. A data point can be selected as a center, and over the iterations the centroids jump from one data point to another.
3. A random point in a dense region can be chosen.
4. The centers can be spaced uniformly over the feature space.

Cluster Quality:

1. Depends on the relative sizes of the clusters and the inter-cluster distance.
2. The distance between members of a cluster and the cluster center.
3. The diameter of the smallest enclosing sphere. Ideally, the diameter of the smallest sphere should be large, because this means that all cluster spheres are of similar sizes.
4. The ability of the model to discover hidden patterns in the data.

Limitations of K-means

The clustering is heavily dependent upon where the centroids are placed initially, as we see from the figure: when the data were supposed to be clustered as per figure 1, they get clustered as per figure 2 because of unfortunate placement of the centroids. The model has problems when the data has:

1. Clusters of different sizes
2. Clusters of different densities
3. Non-globular cluster shapes
4. Outliers

Handling Empty Clusters: When a cluster becomes empty, which can occur in a continuously changing database, the concept needs to be dropped: if there are no items in a cluster, the model does not form that cluster.

Updating Centroids Incrementally: Every time a new data point is added, the model needs to be run again, which is computationally expensive (a cheaper running-mean update is sketched below).
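As an aside not spelled out in the lecture, one standard way to avoid a full re-run is a running-mean update of the affected centroid; this sketch assumes the centroid and the current number of points in its cluster are already known.

import numpy as np

def add_point(centroid, count, x):
    """Incrementally update a centroid (a running mean) when point x joins
    its cluster, instead of recomputing the mean over all points."""
    count += 1
    centroid = centroid + (x - centroid) / count
    return centroid, count

# Example: a centroid at (1, 1) built from 4 points absorbs the point (3, 3).
c, n = add_point(np.array([1.0, 1.0]), 4, np.array([3.0, 3.0]))
print(c, n)  # [1.4 1.4] 5 -> identical to the mean over all 5 points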
K-NN AND K-MEANS ARE DIFFERENT. K-NN is a supervised approach for classification, whereas K-means is an unsupervised clustering algorithm.

4 Other Approaches

K-Medoids: a data point is chosen as each cluster's center, and the algorithm minimizes a sum of pairwise dissimilarities.

Agglomerative clustering: repeatedly merges the two closest clusters until a single cluster remains (Single Link).

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Pros: can handle dynamic datasets and clusters of arbitrary shape.
Cons: sensitive to parameters.
Parameters: Eps (ε), minimum number of points (minpts).
Points are classified as core, border, or noise:

1. A point p is a core point if at least minpts points are within distance ε of it, including p itself (ε is the maximum radius of the neighborhood around p). Those points are said to be directly reachable from p. [5]
2. A point q is directly reachable from p if q is within distance ε from p and p is a core point.
3. A point q is reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi (all the points on the path must be core points, with the possible exception of q).
4. All points not reachable from any other point are outliers.

In this diagram [6], minpts = 4. Point A and the other red points are core points, because the area surrounding these points in an ε radius contains at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.
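A minimal sketch (not from the lecture) of these definitions in code, using the scikit-learn DBSCAN implementation documented in reference [4]; the toy data and the eps and min_samples values are illustrative assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

# Assumed toy data: two dense blobs plus one far-away noise point.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal([0, 0], 0.3, size=(50, 2)),
                    rng.normal([5, 5], 0.3, size=(50, 2)),
                    [[10.0, -10.0]]])

# eps plays the role of the neighborhood radius ε, min_samples of minpts.
labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)

# Clusters get labels 0, 1, ...; noise points like N get the label -1.
print(np.unique(labels))  # e.g. [-1  0  1]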
Incremental DBSCAN (Addition): when a new point is added,

1. A new cluster can be created.
2. The new point can be absorbed into an existing cluster.
3. Two existing clusters can be merged due to the new point.

Similarly, for Decremental DBSCAN (Deletion),

1. An existing cluster can be deleted due to the point deletion.
2. An existing cluster can be split into two clusters because of the deletion.
3. A cluster can shrink in size because of the point deletion.

All of these updates can be accomplished in O(1) time.

5 References

1. Tom M. Mitchell, Machine Learning, ch. 3.
2. Decision Tree 1: How it works, https://www.youtube.com/watch?v=ekd5gxppey0; T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: Analysis and implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, pp. 881-892 (2002).
3. https://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
4. http://scikit-learn.org/stable/modules/generated/sklearn.cluster.dbscan.html
5. http://bigdatamadesimple.com/possibly-the-simplest-way-to-explain-k-means-algorithm/
6. https://en.wikipedia.org/wiki/DBSCAN