Unsupervised learning: Clustering & Dimensionality reduction. Theo Knijnenburg Jorma de Ronde

1 Unsupervised learning: Clustering & Dimensionality reduction Theo Knijnenburg Jorma de Ronde

2 Source of slides Marcel Reinders TU Delft Lodewyk Wessels NKI Bioalgorithms.info Jeffrey D. Ullman Stanford Panos Pardalos - UFL

3 Unsupervised learning Find structure in unlabeled data

4 Unsupervised learning Find structure in unlabeled data samples features healthy labels disease

5 Unsupervised learning Find structure in unlabeled data samples features healthy labels disease

6 Clustering Group samples in unlabeled data samples features

7 Apples and pears

8 Clustering apples and pears Samples (apples and pears) features

9 Clustering apples and pears Weight Color Shape Gene expression Country of origin Sugar levels Taste Taste when fried Price Calories Water content Samples (apples and pears)

10 Classification (labeled data) Weight Color Shape Gene expression Country of origin Sugar levels Taste Taste when fried Price Calories Water content Samples (apples and pears) apples pears

11 Classification (labeled data) Weight Color Shape Gene expression Country of origin Sugar levels Taste Taste when fried Price Calories Water content Samples (apples and pears) r Not rotten rotten nr

12 Clustering apples and pears Weight Color Shape Gene expression Country of origin Sugar levels Taste Taste when fried Price Calories Water content Samples (apples and pears)

13 Clustering microarray data Samples (patients) Gene expression

14 Clustering microarray data Samples (patients) Gene expression w1 healthy Week 2 Week 1 disease w2

15 How many clusters?

16 Why do clustering? Group samples that are close to each other Reduce the amount of data Construct categories or taxonomies in an automated way Statistical and visual description of your data Generate hypotheses

18 Overview of clustering Distance measures Hierarchical clustering Clustering synthetic data Clustering less synthetic data K-means clustering

19 Unsupervised learning Clusters and distance measures

20 What makes a cluster a cluster? Intuitively: group objects together that are similar to each other

21 Unsupervised clustering (loose) cluster definition: 1) Samples within a cluster resemble each other (within variance, σ_W(i)) 2) Clusters deviate from each other (between variance, σ_B(i))
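The loose definition above can be made concrete with a small sketch. It assumes one common way of quantifying the two terms, which the slide does not spell out: within variance as the mean squared distance of samples to their own cluster centroid, and between variance as the size-weighted mean squared distance of cluster centroids to the overall centroid. The function names and the toy data are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_between(clusters):
    """Return (within variance, between variance) for a list of clusters."""
    all_points = [p for cluster in clusters for p in cluster]
    n = len(all_points)
    overall = centroid(all_points)
    # Assumed definition: spread of samples around their own cluster centroid.
    within = sum(sq_dist(p, centroid(c)) for c in clusters for p in c) / n
    # Assumed definition: spread of cluster centroids around the overall centroid.
    between = sum(len(c) * sq_dist(centroid(c), overall) for c in clusters) / n
    return within, between

# Hypothetical toy data: two tight, well-separated clusters.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
within, between = within_between(clusters)
print(within, between)
```

A grouping that fits the definition well has a small within variance relative to its between variance.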

22 Unsupervised hierarchical clustering First step: Find the two samples that are closest to each other (according to some measure of distance)

23 Unsupervised clustering We find objects 4 and 2 to be the closest to each other, so they form the first cluster

24 Unsupervised clustering Next, we again look for the closest two objects, but we now consider objects 4 and 2 to be in the same cluster, so essentially one object. The next two objects closest to each other are objects 5 and 8

25 Unsupervised clustering Again, we look for the two objects closest to each other. An object can also be a cluster, so we look not only at the distances between two single samples, but also between a sample and a cluster and between two clusters. In this particular case object 3 and the cluster containing 4 and 2, i.e. c(4,2), are the closest two objects. c(4,2) and 3 now form a new cluster

26 Unsupervised clustering This process is repeated until a single, all-encompassing cluster is reached

27 Unsupervised clustering Finally, we have a hierarchical clustering of our data Since it is hierarchical, we need to set a cut-off if we want to look at clusters that do not contain all samples
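The merging procedure walked through on the preceding slides can be sketched in a few lines. This is a minimal illustration rather than the slides' own implementation: it assumes Euclidean distance and single linkage, and it replaces the cut-off on the tree with a simpler stopping rule (stop when `n_clusters` clusters remain). The function names and the toy points are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, points):
    """Closest distance between any member of c1 and any member of c2."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]  # every sample starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # the merged pair acts as one object
        del clusters[b]
    return clusters, merges

# Hypothetical toy data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, merges = agglomerate(points, n_clusters=1)
for left, right, height in merges:
    print(left, "+", right, "at distance", round(height, 3))
```

Running to `n_clusters=1` gives the full hierarchy; stopping earlier, for example `agglomerate(points, n_clusters=3)`, corresponds to cutting the tree so that three clusters remain.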

28 Hierarchical clustering At each step of this simple hierarchical clustering algorithm we need to know: The distance between every two samples The distance between each sample and cluster The distance between every two clusters In which ways can we define these distances? We need to construct a distance matrix

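29 Distance matrix [Figure: distance matrix over samples A, B, C, D, and the reduced matrix over A, B, C+D after C and D are merged; diagonal entries are 0]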

30 Distance between samples

Distance matrices for different distance measures
Euclidean
         Green  Blue  Yellow  Red
Green    0      0.1   0.5     1.0
Blue            0     0.5     1.0
Yellow                0       0.

31 Distance matrices for different distance measures [Figure: Euclidean and Pearson correlation distance matrices over the samples Green, Blue, Yellow, Red]
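To illustrate how the choice of distance measure changes the distance matrix, here is a small sketch comparing Euclidean distance with a Pearson-correlation-based distance. The conversion `1 - r` is an assumption (the slide does not say how correlation is turned into a distance), and the profile names and values are hypothetical, not the Green/Blue/Yellow/Red samples from the slide.

```python
import math

def euclidean(a, b):
    """Euclidean distance: sensitive to absolute differences in level."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """Assumed conversion 1 - r: 0 for perfectly correlated, 2 for anti-correlated."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

def distance_matrix(samples, dist):
    """Full symmetric matrix of pairwise distances."""
    return [[dist(a, b) for b in samples] for a in samples]

# Hypothetical expression profiles over three conditions.
profiles = {
    "low": [1.0, 2.0, 3.0],       # rising, low level
    "high": [11.0, 12.0, 13.0],   # same shape, much higher level
    "flipped": [3.0, 2.0, 1.0],   # opposite shape, same level as "low"
}
names = list(profiles)
samples = [profiles[name] for name in names]
for label, dist in [("Euclidean", euclidean), ("Pearson", pearson_distance)]:
    print(label)
    for name, row in zip(names, distance_matrix(samples, dist)):
        print(" ", name, [round(value, 2) for value in row])
```

Under Euclidean distance "low" is closer to "flipped" than to "high"; under the correlation-based distance the ordering reverses, because only the shape of the profile matters.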

32 Distance between clusters Single linkage: closest distance between two objects in two clusters Complete linkage: longest distance between two objects in two clusters Average linkage: average distance over all pairs of objects, one from each cluster
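The three linkage measures can be written directly from the pairwise distances between the members of two clusters. The sketch below assumes Euclidean distance and takes average linkage to be the mean over all cross-cluster pairs; the function names and the two toy clusters are hypothetical.

```python
import math
from itertools import product

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(c1, c2):
    """All distances between one object from c1 and one object from c2."""
    return [euclidean(p, q) for p, q in product(c1, c2)]

def single_linkage(c1, c2):
    return min(pairwise(c1, c2))  # closest pair

def complete_linkage(c1, c2):
    return max(pairwise(c1, c2))  # farthest pair

def average_linkage(c1, c2):
    distances = pairwise(c1, c2)
    return sum(distances) / len(distances)  # mean over all cross-cluster pairs

# Hypothetical toy clusters on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
```

Because the three measures can rank the same pair of clusters very differently, swapping one for another in a hierarchical clustering can change which clusters merge first.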

33 Using different linkage measures can have a dramatic effect on cluster formation

34 How many clusters?

35 Assessing the robustness of a clustering Even when we use random data, we can generate a hierarchical clustering. This clustering could even suggest some kind of structure in the data, whereas we know any apparent structure is derived from noise without any signal. We need to assess the robustness of our clustering

36 Bootstrapping Say we set a cut-off and identify 2 clusters We can use bootstrapping to test the stability of these clusters by taking a random sample from our data (with replacement) Now, generate a new clustering with the bootstrapped data, and repeat x times Calculate for each iteration whether two objects cluster together
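A minimal sketch of this bootstrap procedure follows. It makes several assumptions beyond the slide: clusters are formed by single-linkage merging that stops at a fixed distance cut-off, the distance is Euclidean, and stability is summarized as the fraction of bootstrap iterations in which two objects cluster together among the iterations where both were drawn. All names, the seed, and the toy data are hypothetical.

```python
import math
import random
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_with_cutoff(points, cutoff):
    """Single-linkage merging, stopped once the closest clusters exceed the cut-off."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        d, a, b = min(
            (min(euclidean(points[i], points[j])
                 for i in clusters[a] for j in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if d > cutoff:
            break
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

def bootstrap_coclustering(points, cutoff, n_iter=200, seed=0):
    """Fraction of bootstrap samples in which each pair of objects clusters together."""
    rng = random.Random(seed)
    n = len(points)
    together = {pair: 0 for pair in combinations(range(n), 2)}
    sampled = {pair: 0 for pair in combinations(range(n), 2)}
    for _ in range(n_iter):
        drawn = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        clusters = cluster_with_cutoff([points[i] for i in drawn], cutoff)
        label = {}
        for c, members in enumerate(clusters):
            for m in members:
                label[drawn[m]] = c  # map back to the original object index
        for i, j in combinations(sorted(label), 2):
            sampled[(i, j)] += 1
            if label[i] == label[j]:
                together[(i, j)] += 1
    return {pair: together[pair] / sampled[pair] for pair in sampled if sampled[pair]}

# Hypothetical toy data: two tight groups far apart.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
frequencies = bootstrap_coclustering(points, cutoff=3.0)
for pair, freq in sorted(frequencies.items()):
    print(pair, round(freq, 2))
```

Pairs with a frequency near 1 are stably clustered together and pairs near 0 are stably separated; frequencies in between indicate that the clustering is sensitive to which samples happen to be drawn.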

37 Bootstrap results for random data Yellow: clusters together often (i.e. in this case each object always clusters together with itself)

38 Data with structure Good support for nine clusters

39 How to select optimal number of clusters Many different methods exist. Visually, you could solve this by checking at which point the dendrogram branch heights start to level off (i.e. high similarity between objects = low branches)

40 K-means Clustering

41 Algorithm k-means 1. Decide on a value for k. 2. Initialize the k cluster centers (randomly, if necessary). 3. Decide the class memberships of the N objects by assigning them to the nearest cluster center. 4. Re-estimate the k cluster centers, by assuming the memberships found above are correct. 5. If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.
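The five steps above map closely onto code. The sketch below is a minimal illustration assuming Euclidean distance; to keep the run reproducible it takes the initial cluster centers as an argument instead of drawing them randomly. The function names and the toy data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, initial_centers, max_iter=100):
    """Return (assignment, centers) after running k-means to convergence."""
    centers = list(initial_centers)  # steps 1 and 2: k and the initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign every object to its nearest cluster center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
            for p in points
        ]
        # Step 5: stop when no object changed membership.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: re-estimate each center as the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(coords) / len(members) for coords in zip(*members))
    return assignment, centers

# Hypothetical toy data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centers = kmeans(points, initial_centers=[(1, 1), (8, 8)])
print(assignment)
print(centers)
```

With random initialization the result can depend on the starting centers, so in practice the algorithm is often run several times and the best result kept.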

42 K-means Clustering: Step 1 Algorithm: k-means, Distance Metric: Euclidean Distance [Figure: three cluster centers marked k]

43 K-means Clustering: Step 2 Algorithm: k-means, Distance Metric: Euclidean Distance [Figure: three cluster centers marked k]

44 K-means Clustering: Step 3 Algorithm: k-means, Distance Metric: Euclidean Distance [Figure: three cluster centers marked k]

45 K-means Clustering: Step 4 Algorithm: k-means, Distance Metric: Euclidean Distance [Figure: three cluster centers marked k]

46 K-means Clustering: Step 5 Algorithm: k-means, Distance Metric: Euclidean Distance [Figure: three cluster centers marked k; axes: expression in condition 1, expression in condition 2]

47 How can we tell the right number of clusters? In general, this is an unsolved problem. However, there are many approximate methods. In the next few slides we will see an example. For our example, we will use the familiar katydid/grasshopper dataset. However, in this case we are imagining that we do NOT know the class labels. We are only clustering on the X and Y axis values

48 When k = 1, the objective function is

49 When k = 2, the objective function is

50 When k = 3, the objective function is

51 We can plot the objective function values for k equals 1 to 6. The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as knee finding or elbow finding. [Figure: objective function versus k] Note that the results are not always as clear-cut as in this toy example
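Elbow finding can be sketched by running k-means for several values of k and recording the objective each time. Two assumptions are made here that the slides do not state: the objective function is taken to be the within-cluster sum of squared Euclidean distances, and the centers are initialized deterministically by farthest-first traversal so the run is reproducible. The data and all names are hypothetical, not the katydid/grasshopper dataset.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Assumed initialization: start at the first point, then keep adding the
    point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_objective(points, k, max_iter=100):
    """Run k-means and return the within-cluster sum of squared distances."""
    centers = farthest_first(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(coords) / len(members) for coords in zip(*members))
    return sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))

# Hypothetical toy data: two square-shaped groups.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
objective = {k: kmeans_objective(points, k) for k in range(1, 5)}
for k, value in objective.items():
    print(k, round(value, 3))
```

The objective drops sharply from k = 1 to k = 2 and only slightly afterwards, which is the elbow pattern described above.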
