COMP33111: Tutorial and lab exercise 7

Guide answers for Part 1: Understanding clustering

1. Explain the main differences between classification and clustering.

The main differences should include: unsupervised vs. supervised learning; the number of classes being known in advance for classification; the interpretation of the outcomes; prediction vs. exploration; classification being easier to evaluate; etc.

2. An example dataset consists of five products whose amounts of sales in two regions are shown below. Cluster these products into two groups using the k-means algorithm, the Euclidean distance, and products A and E as initial cluster members.

   data point   product   region 1   region 2
   1            A         22         21
   2            B         19         20
   3            C         18         22
   4            D         1          3
   5            E         4          2

Consider the data as 2-dimensional vectors with attributes region 1 and region 2.

Step 1: centroids C1 = A(22, 21) and C2 = E(4, 2).

   product     C1      C2      cluster
   A(22, 21)   0       26.17   1
   B(19, 20)   3.16    23.43   1
   C(18, 22)   4.12    24.41   1
   D(1, 3)     27.66   3.16    2
   E(4, 2)     26.17   0       2

Step 2: new centroids C1(19.67, 21) and C2(2.5, 2.5).

Step 1: after one more iteration, there is no change in cluster membership, so the two clusters are {A, B, C} and {D, E}.

3. Cluster the data from the previous example using the k-means algorithm, the Manhattan distance, and products A and E as initial cluster members.

Use the same procedure as above; the table after the first step should be:

   product     C1    C2    cluster
   A(22, 21)   0     37    1
   B(19, 20)   4     33    1
   C(18, 22)   5     34    1
   D(1, 3)     39    4     2
   E(4, 2)     37    0     2
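The k-means runs in questions 2 and 3 can be checked with a short script. This is a minimal sketch, not part of the tutorial: the function and variable names (kmeans, euclidean, manhattan) are assumed for illustration.

```python
# Products from question 2 as 2-D vectors (region 1, region 2).
points = {"A": (22, 21), "B": (19, 20), "C": (18, 22),
          "D": (1, 3), "E": (4, 2)}

def euclidean(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def kmeans(points, centroids, dist):
    """k-means with given initial centroids; assumes no cluster goes empty."""
    while True:
        # Assignment step: each product joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for name, p in points.items():
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(name)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(points[n][d] for n in members) / len(members)
                  for d in range(2))
            for members in clusters]
        if new_centroids == centroids:   # converged: assignments are stable
            return clusters
        centroids = new_centroids

# Initial cluster members A and E, as in the questions.
print(kmeans(points, [points["A"], points["E"]], euclidean))
print(kmeans(points, [points["A"], points["E"]], manhattan))
# Both distances give the clusters {A, B, C} and {D, E}.
```

Both runs converge after the second iteration, matching the worked answers: the choice of distance does not change the final clustering on this dataset.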
4. Briefly describe the idea of agglomerative clustering. What is the difference between the single and complete linkage methods for measuring inter-cluster distances?

See slides 46, 48-52, Lecture 7 (Clustering).

5. Cluster the data from question 2 using agglomerative clustering with the single linkage method. The distance between points (i.e. products) should be calculated using the Euclidean distance. Compare the results.

Distances between products:

   product     A   B      C      D       E
   A(22, 21)   0   3.16   4.12   27.66   26.17
   B(19, 20)       0      2.24   24.76   23.43
   C(18, 22)              0      25.50   24.41
   D(1, 3)                       0       3.16
   E(4, 2)                               0

Step 1: initial clusters are {A}, {B}, {C}, {D}, {E}, with the distances as above.

Step 2: the minimal distance is between {B} and {C}, so merge them into {B, C}.

Step 3: re-calculate the inter-cluster distances (single linkage = MIN).

   cluster   A   B, C   D       E
   A         0   3.16   27.66   26.17
   B, C          0      24.76   23.43
   D                    0       3.16
   E                            0

Step 2: the minimal distance is now between {A} and {B, C}, so merge them into {A, B, C}. (Note: the distance between {D} and {E} is equally minimal, so we could alternatively merge those two first.)

Step 3: re-calculate the inter-cluster distances.

   cluster   A, B, C   D       E
   A, B, C   0         24.76   23.43
   D                   0       3.16
   E                           0

Step 2: the minimal distance is between {D} and {E}, so merge them into {D, E}.

Step 3: re-calculate the inter-cluster distances.

   cluster   A, B, C   D, E
   A, B, C   0         23.43
   D, E                0

Finally, merge the remaining two clusters into {A, B, C, D, E}. The result agrees with the k-means clustering from question 2. Resulting dendrogram:
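The merge sequence above can be reproduced with a short single-linkage sketch. This is illustrative code, not part of the tutorial; the helper names (single_link, merges) are assumed.

```python
# Products from question 2 as 2-D vectors (region 1, region 2).
points = {"A": (22, 21), "B": (19, 20), "C": (18, 22),
          "D": (1, 3), "E": (4, 2)}

def euclidean(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def single_link(c1, c2):
    # Single linkage: inter-cluster distance = MIN over all member pairs.
    return min(euclidean(points[a], points[b]) for a in c1 for b in c2)

# Start with one singleton cluster per product: {A} {B} {C} {D} {E}.
clusters = [frozenset([name]) for name in points]
merges = []
while len(clusters) > 1:
    # Find the closest pair of clusters and merge them.
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    merged = clusters[i] | clusters[j]
    merges.append(sorted(merged))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for m in merges:
    print(m)
```

The merge order matches the worked answer: {B, C} first, then {A, B, C}, then {D, E}, then the root {A, B, C, D, E}. (As noted above, {A} joining {B, C} and {D} joining {E} are tied at distance 3.16, so the middle two merges could come in either order.)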
[Dendrogram: leaves C, B, A, D, E. C and B merge first, then A joins to form {A, B, C}; D and E merge into {D, E}; the two clusters join at the root.]

Guide answers for Part 2: Clustering in WEKA

B1. The resulting dendrogram (WEKA Cobweb tree output; the tree indentation was lost in extraction):

leaf 2 [1]
leaf 3 [1]
node 4 [2]
leaf 5 [1]
node 4 [2]
leaf 6 [1]
leaf 7 [1]
node 9 [2]
leaf 10 [1]
node 9 [2]
leaf 11 [1]
leaf 12 [1]
leaf 14 [1]
leaf 15 [1]
leaf 16 [1]
leaf 18 [1]
leaf 19 [1]
leaf 20 [1]

(Instance numbers attached to the tree in the original figure: 0 1 8 17 2 3 7 9 12 18 19 20 4 10 11 13 5 6 14 15 16.)

Input data:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

Clustered data:

@relation weather_clustered
@attribute Instance_number numeric
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@attribute Cluster {cluster0,cluster1,cluster2,cluster3,cluster4,cluster5,cluster6,cluster7,cluster8,cluster9,cluster10,cluster11,cluster12,cluster13,cluster14,cluster15,cluster16,cluster17,cluster18,cluster19,cluster20}
@data
0,sunny,85,85,FALSE,no,cluster5
1,sunny,80,90,TRUE,no,cluster7
2,overcast,83,86,FALSE,yes,cluster10
3,rainy,70,96,FALSE,yes,cluster15
4,rainy,68,80,FALSE,yes,cluster14
5,rainy,65,70,TRUE,no,cluster2
6,overcast,64,65,TRUE,yes,cluster18
7,sunny,72,95,FALSE,no,cluster6
8,sunny,69,70,FALSE,yes,cluster16
9,rainy,75,80,FALSE,yes,cluster12
10,sunny,75,70,TRUE,yes,cluster19
11,overcast,72,90,TRUE,yes,cluster20
12,overcast,81,75,FALSE,yes,cluster11
13,rainy,71,91,TRUE,no,cluster3