Unsupervised Data Mining: Clustering
Izabela Moise, Evangelos Pournaras, Dirk Helbing
Outline
1. Supervised Data Mining
   - Classification
   - Regression
   - Outlier detection
   - Frequent pattern mining
2. Unsupervised Data Mining
   - Clustering
   - Feature Extraction
For each topic: definition, real use-cases, method, pros and cons.
Unsupervised Data Mining
- descriptive or undirected
- finds hidden structure and relations within the data
- determines whether classes or clusters exist in the data
- exploratory analysis
- all variables are treated in the same way
Overview
Clustering
- Main principles
  - Definition
  - Types of clustering
  - Applications
- Clustering techniques
  - Distance metrics
  - k-means clustering
Clustering
Definition
Clustering is a data mining function that partitions the data points into natural groups called clusters. The goal: points within a cluster are very similar, whereas points across clusters are as dissimilar as possible.
- Unsupervised: requires data, not labels
- Outcome: clusters
Types of Clustering
- Partitional: divides data points into non-overlapping clusters; each point is in exactly one subset
- Hierarchical: finds clusters using previously built clusters
  - Agglomerative: starts with single-element clusters and merges them
- Exclusive: a data point belongs to a single cluster
- Non-exclusive: a data point may belong to multiple clusters
- Fuzzy (probabilistic): a point belongs to every cluster with a weight between 0 and 1
Applications
Clustering is:
1. useful when you don't know what you're looking for
2. used as a stand-alone tool to gain insight into the data
3. used as a preprocessing step for other algorithms (outlier detection, data compression)
Example domains:
- Astronomy: aggregation of stars, galaxies, or super-galaxies
- Spatial data analysis: creating thematic maps in GIS by clustering feature spaces
- Image processing
- Weblogs: discovering groups of similar access patterns
- City planning: identifying groups of houses according to house type, value, and geographical location
- Land use: identification of areas of similar land use in an earth-observation database
- Earthquake studies: observed earthquake epicentres should be clustered along continental faults
- Summarisation: reducing the size of large data sets
- Marketing
- Google News: grouping related news articles into stories
What is a Cluster?
- a subset of objects which are similar
- the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it
- a connected region of a multidimensional space containing a relatively high density of objects
What Makes a Clustering Good?
A good clustering method produces high-quality clusters in which:
- intra-cluster similarity is high
- inter-cluster similarity is low
Quality depends on the similarity metric and its implementation, and on the method's ability to discover all or some of the hidden patterns.
Distance Metrics
1. Euclidean distance
2. Manhattan distance
3. Minkowski distance
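As a concrete reference, the three metrics can be sketched in a few lines of Python (function names are illustrative; for points given as equal-length coordinate sequences):

```python
import math

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # generalises both: p = 2 gives Euclidean, p = 1 gives Manhattan
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

Note that Minkowski distance subsumes the other two as special cases of the order parameter p.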
Calculating Cluster Distances
For clusters K_i and K_j, taken over all pairs of points x_p ∈ K_i and y_q ∈ K_j:
1. Single link: dist(K_i, K_j) = min_{p,q} dist(x_p, y_q)
2. Complete link: dist(K_i, K_j) = max_{p,q} dist(x_p, y_q)
3. Average distance: dist(K_i, K_j) = avg_{p,q} dist(x_p, y_q)
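A minimal sketch of the three definitions, for clusters given as lists of points and an arbitrary point-to-point metric `dist` (function names are illustrative):

```python
import math

def single_link(ki, kj, dist):
    # distance of the closest pair, one point from each cluster
    return min(dist(x, y) for x in ki for y in kj)

def complete_link(ki, kj, dist):
    # distance of the farthest pair
    return max(dist(x, y) for x in ki for y in kj)

def average_link(ki, kj, dist):
    # mean of all pairwise distances between the two clusters
    pairs = [dist(x, y) for x in ki for y in kj]
    return sum(pairs) / len(pairs)
```

Any of the metrics from the previous slide can be passed as `dist`, e.g. `math.dist` for Euclidean distance.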
Centroid vs. Medoid
Centroid: the middle of a cluster C
  centroid = (1/n) · Σ_{i=1}^{n} x_i, where n = |C|
  - does not have to be one of the data points in the cluster
Medoid: the central point of a cluster C
  - the data point that is "least dissimilar" from all of the other data points
  - has to be one of the data points in the cluster
Centroid distance: dist(K_i, K_j) = dist(centroid_i, centroid_j)
Medoid distance: dist(K_i, K_j) = dist(medoid_i, medoid_j)
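The distinction can be made concrete with a small sketch (clusters as lists of coordinate tuples; function names are illustrative):

```python
import math

def centroid(cluster):
    # coordinate-wise mean; need not be one of the cluster's data points
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def medoid(cluster):
    # the member that is "least dissimilar": smallest total distance to the others
    return min(cluster, key=lambda c: sum(math.dist(c, p) for p in cluster))
```

For the cluster {(0,0), (1,0), (5,0)} the centroid is (2, 0), which is not a member, while the medoid is the member (1, 0).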
k-means
- very popular algorithm for clustering
- object = n-dimensional vector
- user specifies k (the number of clusters)
Generic sketch:
(1) pick k random vectors as centroids
(2) assign each vector to the cluster of its closest centroid
(3) compute the centroid of each cluster
(4) repeat from (2) until the clusters converge or a fixed number of iterations is reached
k-means Algorithm
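The four-step sketch above can be written out as a minimal implementation (a sketch only; here the initial centroids are drawn from the data points, which is one common variant, and all names are illustrative):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    # (1) pick k random data points as the initial centroids
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # (2) assign each point to the cluster of its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # (3) recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # (4) stop when the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points, this recovers the groups regardless of which data points are sampled as the initial centroids.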
k-means in action¹
- The dataset. Input: k = 5.
- Randomly pick 5 positions as initial cluster centers (not necessarily data points).
- Each point finds the cluster center it is closest to (very much like 1-NN) and joins that cluster.
- Each cluster computes its new centroid, based on the points that belong to it.
- Repeat until convergence (the cluster centers no longer move).

¹ Introduction to Machine Learning, Xiaojin Zhu
Why Does k-means Converge?
- Whenever an assignment is changed, the sum of squared distances of the data points from their assigned cluster centers is reduced.
- Whenever a cluster center is moved, the sum of squared distances of the data points from their currently assigned cluster centers is reduced.
- If the assignments do not change in the assignment step, we have converged.
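The argument can be stated via the objective function that k-means never increases (a standard sketch):

```latex
J(C_1,\dots,C_k,\mu_1,\dots,\mu_k) \;=\; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
```

The assignment step minimizes J over the cluster memberships C_i for fixed centers μ_i, and the update step minimizes J over the μ_i for fixed memberships (the mean minimizes the sum of squared distances). Since J ≥ 0, J never increases, and there are only finitely many partitions of the data, the algorithm must terminate.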
k-means Convergence
1. assign each point to its nearest centroid
2. compute the centroid of each cluster
The algorithm terminates when neither (1) nor (2) changes the configuration.
Initial Centroids
- affect the final clusters (inter-cluster and intra-cluster distances)
- often chosen randomly, so the clusters vary from one run to another
One solution (farthest-first traversal):
1. pick a random point x₁ from the dataset
2. find the point x₂ farthest from x₁ in the dataset
3. find x₃ farthest from the closer of x₁ and x₂
4. pick k points like this, and use them as the starting centroids of the k clusters
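The four steps above can be sketched as follows (a sketch only; for simplicity the "random" first point is passed in as an index, and names are illustrative):

```python
import math

def farthest_first_centroids(points, k, first=0):
    # step 1: start from an arbitrary point (the slide picks it at random)
    centroids = [points[first]]
    # steps 2-4: repeatedly add the point farthest from its nearest chosen centroid
    while len(centroids) < k:
        next_pt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(next_pt)
    return centroids
```

This spreads the initial centroids across the dataset, but note it is sensitive to outliers, since the farthest point is often an outlier.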
k-means Properties
- unsupervised, non-deterministic, and iterative
- there are always k clusters
- there is always at least one point in each cluster
- clusters are non-hierarchical and do not overlap
Pros and Cons
Pros:
- fast, robust, and easy to understand
- relatively efficient
- best results when the data are well separated from each other
Cons:
✗ requires a priori specification of k
✗ unable to handle noisy data and outliers
  - the centroid is the average of the cluster members, so an outlier can dominate the computation
  - solution: k-medoids
✗ different initial partitions can result in different final clusters
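The outlier point above is easy to see in one dimension (a toy sketch; function names are illustrative):

```python
def mean(xs):
    # the centroid of 1-D data is just the arithmetic mean
    return sum(xs) / len(xs)

def medoid_1d(xs):
    # the member minimising the total absolute distance to all other members
    return min(xs, key=lambda m: sum(abs(m - x) for x in xs))

values = [1, 2, 3, 100]      # 100 is an outlier
# mean(values) == 26.5      -> dragged far from the bulk of the data
# medoid_1d(values) == 2    -> stays with the bulk, which is why
#                              k-medoids is more robust to outliers
```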