11/2/2017 MIST.6060 Business Intelligence and Data Mining: Clustering



[Figure: An Example Clustering, a plot with axes X1 and X2]

Objective of Clustering

The objective of clustering is to group the data into clusters such that the records within a cluster are as similar as possible, while the records in different clusters are as dissimilar as possible. Similarity/dissimilarity is measured with a distance metric.

Distance Metrics

Two widely used distance metrics to measure the distance between two records $a = (a_1, a_2, \ldots, a_p)$ and $b = (b_1, b_2, \ldots, b_p)$ in a p-dimensional space are:

Euclidean distance:

$$d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_p - b_p)^2}$$

Manhattan (absolute) distance:

$$d(a, b) = |a_1 - b_1| + |a_2 - b_2| + \cdots + |a_p - b_p|$$

The distance metrics can be applied not only between records (data points), but also between clusters. The distance between two clusters is measured as the distance between the two cluster centers (centroids). The distance between a data point and a cluster is measured as the distance between the point and the cluster center (centroid).
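The two distance metrics and the centroid can be sketched in a few lines of Python. This is a minimal illustration, not part of the original handout: the function names and the sample records are hypothetical, and records are assumed to be plain tuples of numbers.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared attribute differences."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of the absolute attribute differences."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(abs(x - y) for x, y in zip(a, b))

def centroid(records):
    """Attribute-wise mean of a non-empty list of records."""
    return tuple(sum(column) / len(records) for column in zip(*records))

# Hypothetical two-dimensional records, chosen only for easy arithmetic.
a, b = (1.0, 2.0), (4.0, 6.0)
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7.0
print(centroid([a, b]))          # (2.5, 4.0)
```

The distance between a point and a cluster, as described above, would then be `euclidean_distance(point, centroid(cluster_records))`.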

The Single Linkage Clustering Algorithm: A Hierarchical Clustering Method

1. For a dataset with N data points, start with N clusters, each containing a single point.
2. Find the nearest pair of data points or clusters and merge them into one cluster. This means that either single points are added to existing clusters or two existing clusters are merged.
3. Repeat Step 2 until all data points are merged into a single cluster.
4. To get n clusters, work backward to sequentially split previously merged clusters by cutting the n - 1 joins that have the longest distances.

Note: Single linkage clustering is also known as nearest neighbor clustering. There are some similarities between this algorithm and the k-nearest neighbor algorithm for classification.

An Example for Single Linkage Clustering

The AgeIncomeGender dataset below has 10 records and three attributes.

ID   Age  Income  Gender
1    52   low     f
2    45   high    f
3    33   low     f
4    51   low     f
5    67   low     f
6    26   high    m
7    68   low     m
8    46   high    m
9    24   high    m
10   39   high    m

[Figure: Weka dendrogram for the AgeIncomeGender dataset]

The Weka output is a tree diagram, called a dendrogram, which illustrates how clusters are formed from the bottom up using single linkage clustering. First, records 1 and 4 are found to be closer than any other two clusters (i.e., records), so they are merged into a cluster. Then, clusters (records) 6 and 9 are the nearest pair among all remaining pairs, so they are merged into another cluster. This process continues until all records are merged into one cluster. It is hard to tell from this dendrogram which record(s) a cluster (leaf node) contains, since record IDs are not shown. We explain later how to identify the records within a cluster.

The height of a link represents a between-cluster distance (in a relative sense, since no height value is shown in the dendrogram). For example, records 1 and 4 are the closest pair, so they are merged first; records 6 and 9 are the second closest pair, so they are merged second, and so on. With this dendrogram, we can get a specified number of clusters by removing the links from the top down. To divide the entire dataset into two clusters, for instance, simply remove the top horizontal link.
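The bottom-up merging can be sketched in plain Python. This is an illustrative sketch, not Weka's implementation. It rests on several assumptions that the handout does not state: Income and Gender are coded 0/1, Age is min-max scaled to [0, 1], distances are Euclidean, and the distance between two clusters is the smallest distance between any pair of their members (the usual definition of single linkage). Stopping the merging once n clusters remain is equivalent to building the full tree and cutting the n - 1 longest joins.

```python
import math
from itertools import combinations

# AgeIncomeGender records in ID order; 0/1 coding is an assumption
# (Income: low=0, high=1; Gender: f=0, m=1).
RECORDS = [(52, 0, 0), (45, 1, 0), (33, 0, 0), (51, 0, 0), (67, 0, 0),
           (26, 1, 1), (68, 0, 1), (46, 1, 1), (24, 1, 1), (39, 1, 1)]

def scale_age(records):
    """Min-max scale Age to [0, 1] so it is comparable to the 0/1 attributes."""
    ages = [r[0] for r in records]
    lo, hi = min(ages), max(ages)
    return [((age - lo) / (hi - lo), income, gender)
            for age, income, gender in records]

def single_linkage(points, n_clusters):
    """Merge the nearest pair of clusters until n_clusters remain.

    Returns the final clusters and the merge history, both as 1-based IDs.
    """
    clusters = [{i} for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest members.
            d = min(math.dist(points[p], points[q])
                    for p in clusters[i] for q in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        merges.append((sorted(p + 1 for p in clusters[i]),
                       sorted(q + 1 for q in clusters[j]), d))
        clusters[i] |= clusters[j]
        del clusters[j]
    return [sorted(p + 1 for p in c) for c in clusters], merges

clusters, merges = single_linkage(scale_age(RECORDS), n_clusters=2)
print(merges[0][:2])     # ([1], [4])
print(merges[1][:2])     # ([6], [9])
print(sorted(clusters))  # [[1, 3, 4, 5, 7], [2, 6, 8, 9, 10]]
```

Under these assumptions the first two merges and the two-cluster split agree with the results described in this handout; a different scaling of Age can change the merge order.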

The k-means Clustering Algorithm: A Partitioning Method

1. Specify the number of clusters to be formed, k. Randomly select k centroids (or randomly partition the dataset into k initial clusters).
2. Assign each data point to the cluster whose centroid is nearest. Recompute the centroid for the clusters whose elements have changed.
3. Repeat Step 2 until no more reassignments take place.

Note: The parameter k here is the number of clusters, which is different from the k in the k-nearest neighbor method.

An Example for k-means Clustering

Run Weka on the AgeIncomeGender data with the default number of clusters, k = 2. The output provides some characteristics of the clusters. It appears that the first cluster (cluster 0) includes five individuals of relatively old age and low income, who are mainly females; the second cluster (cluster 1) includes five individuals with the opposite characteristics.
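The three steps can be sketched as follows. This is a minimal illustration rather than Weka's SimpleKMeans. It assumes 0/1 coding for Income and Gender, min-max scaling of Age, and squared Euclidean distance. To keep the run reproducible it also uses the first two records as initial centroids instead of a random choice, which is an assumption of this sketch; k-means results can depend on the starting centroids.

```python
# AgeIncomeGender records in ID order; 0/1 coding is an assumption
# (Income: low=0, high=1; Gender: f=0, m=1).
RECORDS = [(52, 0, 0), (45, 1, 0), (33, 0, 0), (51, 0, 0), (67, 0, 0),
           (26, 1, 1), (68, 0, 1), (46, 1, 1), (24, 1, 1), (39, 1, 1)]

def scale_age(records):
    """Min-max scale Age to [0, 1] so it is comparable to the 0/1 attributes."""
    ages = [r[0] for r in records]
    lo, hi = min(ages), max(ages)
    return [((age - lo) / (hi - lo), income, gender)
            for age, income, gender in records]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid updates until labels stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Step 2a: assign each point to the nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 3: stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2b: recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

points = scale_age(RECORDS)
labels, centroids = k_means(points, initial_centroids=points[:2])
print(labels)  # [0, 1, 0, 0, 0, 1, 0, 1, 1, 1]
```

With this starting point, cluster 0 holds records 1, 3, 4, 5 and 7 (all low income, four of five female), which is consistent with the cluster description above.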

Single Linkage Clustering in Weka

The AgeIncomeGender.arff file:

Age Income Gender
52,low,f
45,high,f
33,low,f
51,low,f
67,low,f
26,high,m
68,low,m
46,high,m
24,high,m
39,high,m
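The listing above shows only the attribute names and data rows. A complete ARFF file also needs a header with `@relation`, `@attribute` and `@data` lines. The sketch below generates one possible version of the file; the relation name and the attribute type declarations are assumptions, since the handout does not show the header.

```python
ROWS = [(52, "low", "f"), (45, "high", "f"), (33, "low", "f"),
        (51, "low", "f"), (67, "low", "f"), (26, "high", "m"),
        (68, "low", "m"), (46, "high", "m"), (24, "high", "m"),
        (39, "high", "m")]

def to_arff(rows, relation="AgeIncomeGender"):
    """Build ARFF text; the header declarations are assumed, not from the handout."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute Age numeric",
        "@attribute Income {low,high}",  # nominal attribute
        "@attribute Gender {f,m}",       # nominal attribute
        "",
        "@data",
    ]
    lines += [f"{age},{income},{gender}" for age, income, gender in rows]
    return "\n".join(lines) + "\n"

print(to_arff(ROWS))
```

Writing the returned text to `AgeIncomeGender.arff` would give a file that can be opened in Step 1 of the procedure.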


1. Open the AgeIncomeGender.arff file.
2. Click Cluster / Choose / HierarchicalClusterer.
3. Click the long horizontal box on the right side of the Choose button. A pop-up weka.gui.GenericObjectEditor window will appear. In the numClusters box, enter 1 (the purpose of this is to get a dendrogram that shows the entire bottom-up grouping process). Keep the default values for all the other parameters. Click OK.
4. Click Start to get the clustering results.


5. Note that each time the Start button is clicked, a new entry is written to the Result list panel in the lower left corner. Right-click the corresponding entry and then select Visualize tree. A new screen that shows the dendrogram will appear.
6. Change the number of clusters to 9, using the same procedure as in Step 3, and click Start again to get the result for nine clusters.


7. Right-click the corresponding entry and then select Visualize cluster assignment. A screen will appear that shows the cluster label (as a color) for each record and the value of the first attribute, Age. Note that the values on the x-axis range from 0 to 9; these should be read as record IDs 1 to 10, since the x-axis represents the record ID.
8. Change the y-axis to Cluster. The resulting plot indicates that records 1 and 4 are merged into cluster 0 (when the total number of clusters is 9).

9. Reset the number of clusters to 8 and repeat Steps 6, 7 and 8. The output indicates that records 6 and 9 are merged into cluster 4 (when the total number of clusters is 8).
10. Reset the number of clusters to 2 and repeat Steps 6, 7 and 8. The output indicates that records 1, 3, 4, 5 and 7 are the members of cluster 0, and records 2, 6, 8, 9 and 10 are the members of cluster 1.
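Reading cluster membership off the plot involves shifting the 0-based x-axis positions to 1-based record IDs. The small sketch below does that bookkeeping. The label list is transcribed by hand from the two-cluster membership stated in Step 10 (it is not exported from Weka), and the function name is hypothetical.

```python
from collections import defaultdict

def members_by_cluster(labels):
    """Map each cluster label to the 1-based record IDs assigned to it.

    labels[i] is the cluster of the record at 0-based position i,
    so the record ID is i + 1.
    """
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index + 1)
    return dict(groups)

# Cluster labels for positions 0..9, following Step 10.
labels = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(members_by_cluster(labels))
# {0: [1, 3, 4, 5, 7], 1: [2, 6, 8, 9, 10]}
```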


K-Means Clustering in Weka

1. Open the AgeIncomeGender.arff file.
2. Click Cluster / Choose / SimpleKMeans.
3. Click the long horizontal box on the right side of the Choose button. A pop-up weka.gui.GenericObjectEditor window will appear. In the displayStdDevs box, select True. Keep the default values for all the other parameters. In particular, the default number of clusters is 2. Click OK.
4. Click Start to get the clustering results.


5. Follow the same steps (Steps 7 and 8) as in the Single Linkage Clustering section. The output indicates that cluster 0 includes records 1, 3, 4, 5 and 7, and cluster 1 includes records 2, 6, 8, 9 and 10. The result of k-means clustering is the same as that of single linkage clustering.
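When comparing two clusterings, what matters is which records are grouped together, not which numeric label each group happens to receive. A hedged sketch of such a comparison is below; the function names are hypothetical, and the second label list is a made-up relabeling used only to show that swapped labels still count as the same result.

```python
def as_partition(labels):
    """Turn a list of cluster labels into a set of record-ID groups."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index + 1)
    return {frozenset(members) for members in groups.values()}

def same_partition(labels_a, labels_b):
    """True if both labelings group the records identically."""
    return as_partition(labels_a) == as_partition(labels_b)

# Membership stated in this handout for both methods with two clusters.
single_linkage_labels = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1]
# Hypothetical relabeling: the same groups with labels 0 and 1 swapped.
swapped_labels = [1, 0, 1, 1, 1, 0, 1, 0, 0, 0]

print(same_partition(single_linkage_labels, swapped_labels))  # True
```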

Hierarchical Clustering in Rattle

The AgeIncomeGender-recode.csv file:

Age,Income,Gender
52,0,0
45,1,0
33,0,0
51,0,0
67,0,0
26,1,1
68,0,1
46,1,1
24,1,1
39,1,1

1. Click Data. Click the File radio button. In the Filename box, find and open the AgeIncomeGender-recode.csv file. Click Execute. Deselect Partition. Select Input for all attributes. Click Execute again.


2. Click Cluster. Select the Hierarchical radio button. Click Execute. Then, in the Agglomerate box, select single. In the Number of clusters box, select 2. Click Execute again. Then, click the Stats button. It can be seen that the results using Rattle are different from those using Weka.

3. Click the Dendrogram button.
4. Click the Data Plot button.
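The handout does not explain why the Rattle and Weka results differ. One possible factor, offered here only as an illustration, is how Age is scaled relative to the 0/1 attributes. The sketch below recodes the original rows the same way as the CSV file (low=0, high=1, f=0, m=1) and runs single linkage twice, once on raw Age and once on min-max scaled Age. It uses an equivalent formulation of single linkage: process pairwise links from shortest to longest and join groups until k remain. Whether either tool computes distances this way is an assumption.

```python
import math
from itertools import combinations

ROWS = [(52, "low", "f"), (45, "high", "f"), (33, "low", "f"),
        (51, "low", "f"), (67, "low", "f"), (26, "high", "m"),
        (68, "low", "m"), (46, "high", "m"), (24, "high", "m"),
        (39, "high", "m")]
INCOME = {"low": 0, "high": 1}
GENDER = {"f": 0, "m": 1}

def recode(rows):
    """Recode the categorical attributes to 0/1, as in the CSV file."""
    return [(age, INCOME[income], GENDER[gender])
            for age, income, gender in rows]

def scale_age(records):
    """Min-max scale Age to [0, 1]; an assumption, not a documented step."""
    ages = [r[0] for r in records]
    lo, hi = min(ages), max(ages)
    return [((age - lo) / (hi - lo), income, gender)
            for age, income, gender in records]

def single_linkage_clusters(points, k):
    """Join groups along the shortest links until k groups remain."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    links = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    groups_left = len(points)
    for _, i, j in links:
        if groups_left == k:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i
            groups_left -= 1
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i + 1)  # 1-based record IDs
    return sorted(groups.values())

raw = recode(ROWS)
print(single_linkage_clusters(raw, 2))
# [[1, 2, 3, 4, 6, 8, 9, 10], [5, 7]]
print(single_linkage_clusters(scale_age(raw), 2))
# [[1, 3, 4, 5, 7], [2, 6, 8, 9, 10]]
```

On raw values, the 15-year age gap between record 1 and record 5 dominates, so the two oldest records split off. With Age scaled, Income and Gender carry more weight and the split matches the Weka memberships reported earlier.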


K-Means Clustering in Rattle

1. Same as Step 1 in the Hierarchical Clustering in Rattle section.
2. Click Cluster. Select the KMeans radio button. In the Number of clusters box, select 2. Click Execute. Then, click the Stats button. You will see the same results as those using Weka.

3. Click the Data button. From the data plot, you can verify (using the data in the AgeIncomeGender-recode.csv file) that one cluster (in black) contains records 1, 3, 4, 5 and 7, and the other cluster (in red) contains records 2, 6, 8, 9 and 10. This result is consistent with the one obtained using Weka.
