CBioVikings. Richard Röttger. Copenhagen, February 2nd. Clustering of Biomedical Data

1 CBioVikings. Copenhagen, February 2nd. Richard Röttger

2 Who is talking?

3 Resources
Go to [...]
You will find:
- The dataset
- These slides
- An overview paper
- A small R script for a cluster analysis
- R tutorial
R. Röttger. Clustering of Biological Datasets in the Era of Big Data. Journal of Integrative Bioinformatics 13 (1), 300.

6 Clustering in Life Sciences
Long-standing problem in computer science: grouping or segmenting a collection of objects into subsets or clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters.
Applied in almost every scientific field, e.g.:
- Information retrieval
- Economics and marketing
- Astronomy
- Bioinformatics
In bioinformatics:
- Homology detection
- Gene expression studies
- Protein complex prediction

7 Complexity of Clustering

8 Complexity of Clustering
Most pressing issues:
- What tool to use?
- How to find the best clustering?
- How to tune the parameters of a tool?
- How to do this in a reliable and reproducible manner?

9 Graphical Analysis

10 First, let's have a look!
- Good way to gain an overview
- Histograms and scatterplots
- Can be misleading
- Hard to automate

11 Scatterplots
- How many clusters do you see?
- This is so-called overplotting.
- Only meaningful for bivariate data

12 Density Estimation
We have seen that we have a couple of problems:
- Overplotting
- A wrong bin size can easily hide interesting features
Now, let's consider a different approach:
- Assume that our dataset originates from some probability density function
- If we knew the type and specifics of this density function, we would have all the information we need for a clustering
- BUT: we do not have this information, and we do not want to make any assumptions (this is the so-called nonparametric density estimation)

13 Histogram as a Density Estimate
- Divide the sample space into a number of bins
- Approximate the density at the center of each bin by counting the data points that fall into it
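The bin-counting idea can be sketched in a few lines of Python. This is a minimal illustration, not code from the workshop: the function name `histogram_density`, the sample values, and the normalization by n times the bin width (so that the bin areas sum to 1) are assumptions made for the example.

```python
def histogram_density(data, origin, bin_width):
    """Return {bin_index: density} for a one-dimensional sample.

    The density of a bin is (points in the bin) / (n * bin_width),
    so the areas of all bins add up to 1.
    """
    n = len(data)
    counts = {}
    for x in data:
        # Index of the bin [origin + i*h, origin + (i+1)*h) that contains x.
        i = int((x - origin) // bin_width)
        counts[i] = counts.get(i, 0) + 1
    return {i: c / (n * bin_width) for i, c in sorted(counts.items())}


# Hypothetical sample, chosen only for illustration.
sample = [0.1, 0.2, 0.25, 0.9, 1.1, 1.15, 1.2, 2.6]

density = histogram_density(sample, origin=0.0, bin_width=0.5)
# Same data and bin width, but the bins start a quarter unit earlier.
shifted = histogram_density(sample, origin=-0.25, bin_width=0.5)

print(density)
print(shifted)
```

Comparing the two printed estimates shows how moving the starting position of the bins changes the result although the data stay the same.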

14 Drawbacks of a Histogram
- The density estimate depends on the starting position of the bins
- The discontinuities of the estimate are not due to the underlying density; they are an artifact of the chosen bin locations
- Curse of dimensionality: the number of bins grows exponentially with the number of dimensions
- In high dimensions, many examples are needed in order to have non-empty bins
Therefore:
- Unsuitable for high dimensions
- More sophisticated density estimators required

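15 Kernel Density Estimators; Parzen Windows
- We can estimate a density function by employing a kernel function $K$: $\hat{p}(x) = \frac{1}{N h^{d}} \sum_{n=1}^{N} K\left(\frac{x - x^{(n)}}{h}\right)$, with $N$ data points $x^{(n)}$ in $d$ dimensions and bandwidth $h$
- Notice how the Parzen window estimate resembles the histogram, with the exception that the bin locations are determined by the data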
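A Parzen window estimate in one dimension can be sketched as follows. The Gaussian kernel, the bandwidth of 0.3, and the sample values are assumptions picked for the example; the slide itself leaves the kernel open.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_density(x, data, h):
    """One-dimensional Parzen window estimate at x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


# Hypothetical sample, chosen only for illustration.
sample = [0.1, 0.2, 0.25, 0.9, 1.1, 1.15, 1.2, 2.6]

density_in_cluster = parzen_density(1.15, sample, h=0.3)
density_in_gap = parzen_density(2.0, sample, h=0.3)
print(density_in_cluster, density_in_gap)
```

Every data point contributes one kernel bump centered on itself, so no bin origin has to be chosen. The bandwidth `h` takes over the role of the bin width and still has to be tuned.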

16 Different Kernel Functions

17 Revisiting our Example

18 Pre-Processing

19 Preprocessing: Feature Extraction / Selection
Observation:
- Features might be correlated
- Features might be useless for a clustering
- Features might even blur the cluster structure
Feature selection:
- Utilizes only a subset of the available features
- Most methods are coupled with a mining tool to determine optimality
Feature extraction:
- Creates new features out of the existing features
- Seeks to create uncorrelated, better features
- Examples: PCA, PCoA

23 Preprocessing: Normalization
Feature1: [0,1]
Feature2: [1000,80000]
- Normalization: bring both features to [0,1] => Bad with outliers!
- Standardization: the values are scaled by the deviation from the mean: $x' = \frac{x - \mu}{\sigma}$, with mean $\mu$ and standard deviation $\sigma$ of the feature
- Generally: loss of scale and location!
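Both rescalings can be tried on a small feature with one outlier. This is a minimal sketch: the function names and the values in `feature2` are made up for illustration, and the standardization uses the sample standard deviation (division by n - 1), which is an assumption.

```python
import math


def min_max_normalize(values):
    """Rescale the values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


# Hypothetical feature in the range [1000, 80000] with a single outlier.
feature2 = [1000.0, 1200.0, 1100.0, 1300.0, 80000.0]

scaled = min_max_normalize(feature2)
z_scores = standardize(feature2)
print(scaled)
print(z_scores)
```

After min-max normalization the outlier sits at 1 and all regular values are squeezed below 0.01, which is the weakness the slide points out.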

25 PCA
- PCA is a very complex and large topic which could fill an entire lecture series
- Furthermore, there are many interpretations and different applications of a PCA 1
- Here, we limit ourselves to the usage of PCA in clustering:
  - Project the data to a lower-dimensional space
  - With the intention of simplifying the clustering
  - Hopefully provides a better means for visual inspection
- The task of a PCA is to perform a dimensionality reduction in such a way that most of the variance in the original data is preserved
1 see for example:

26 An Example

27 An Example

28 How does a PCA work?
- The PCA performs a basis transformation in which the first basis vector is the vector accounting for most of the variance in the dataset, the second for most of the remaining variance, and so on
- These basis vectors can be found by the eigenvalue decomposition of the covariance matrix $Q$ or the sample correlation matrix $R$
- The eigenvalues $\lambda_1, \dots, \lambda_d$ indicate the variance of the data along the eigenvectors $y_1, \dots, y_d$

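29 The Covariance Matrix
- The covariance is defined as $\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
- This is the observed covariance for $n$ observations $x_i, y_i$
- The covariance matrix is then defined as $Q = (q_{jk})$ with $q_{jk} = \operatorname{cov}(X_j, X_k)$ for the features $X_1, \dots, X_d$
- The covariance matrix generalizes the notion of variance to multiple dimensions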
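To make the link between the covariance matrix and the first principal component concrete, here is a small sketch in plain Python. It is not the R script from the workshop. The 1/(n - 1) normalization, the toy dataset, and the use of power iteration instead of a full eigenvalue decomposition are assumptions chosen to keep the example free of dependencies.

```python
import math


def covariance_matrix(rows):
    """Return (covariance matrix, centered rows) for an n x d dataset."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
           for j in range(d)]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Approximate the largest eigenvalue and its eigenvector by power iteration.

    Assumes that the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T Q v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[j] * sum(cov[j][k] * v[k] for k in range(d))
                     for j in range(d))
    return eigenvalue, v


# Hypothetical two-dimensional dataset with strongly correlated features.
data = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]

cov, centered = covariance_matrix(data)
eigenvalue, direction = first_principal_component(cov)

# Project the centered data onto the first principal component (2-D -> 1-D).
projected = [sum(c[j] * direction[j] for j in range(2)) for c in centered]
projected_variance = sum(p * p for p in projected) / (len(projected) - 1)
print(eigenvalue, direction, projected_variance)
```

The variance of the projected data equals the largest eigenvalue, which is the sense in which the first basis vector accounts for most of the variance.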

30 Example: PCA

31 Example: PCA

32 Image taken from Ricardo Gutierrez-Osuna's class on Pattern Analysis

33 Proximity Calculation

34 Different Proximity Measures
Similarity:
- Numerical measure of how alike two data objects are
- Is higher when objects are more alike
- Often falls in the range [0, 1]
Dissimilarity:
- Numerical measure of how different two data objects are
- Is lower when objects are more alike
- Minimum dissimilarity is often 0
- Upper limit varies
- Often called a distance if it fulfills the metric properties

35 One-mode / Two-mode
One-mode:
- In a one-mode dataset, the data is given as an $n \times n$ matrix $P = (p_{ij})$
- $p_{ij}$ relates each pair of objects $x_i$ and $x_j$ to each other
- Also often called a similarity/dissimilarity matrix
- Normally, a one-mode matrix is symmetric
- Called one-mode because columns and rows describe the same thing
Two-mode:
- A two-mode dataset normally comes as an $n \times d$ matrix
- Each object is in a row, with each property stored in a different column
- Sometimes, this mode is called the raw data
- A row is also sometimes called a feature vector

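36 Proximity Calculation: Continuous Data
Euclidean-type measures (Minkowski distance): $d_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$, where $p = 1$ gives the Manhattan distance and $p = 2$ the Euclidean distance
Image taken from wikipedia.com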
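The Minkowski family can be written as a single function with the exponent p as parameter. A minimal sketch; the function name and the two example points are assumptions.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical feature vectors.
x = (0.0, 0.0)
y = (3.0, 4.0)

manhattan = minkowski_distance(x, y, p=1)
euclidean = minkowski_distance(x, y, p=2)
print(manhattan, euclidean)
```

Larger values of p put more weight on the coordinate with the largest difference.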

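37 Proximity Calculation: Continuous Data
Correlation coefficient: $r(x, y) = \frac{\sum_{i=1}^{d} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{d} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{d} (y_i - \bar{y})^2}}$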
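A correlation-based proximity could look like the sketch below. It assumes the Pearson correlation coefficient, and the conversion to a dissimilarity via 1 - r is one common choice, not something stated on the slide. The profiles are invented.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two feature vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Dissimilarity in [0, 2]: 0 for perfectly correlated profiles."""
    return 1.0 - pearson_correlation(x, y)


# Hypothetical expression profiles: same shape, different scale.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]
profile_c = [8.0, 6.0, 4.0, 2.0]

print(correlation_distance(profile_a, profile_b))
print(correlation_distance(profile_a, profile_c))
```

Unlike the Euclidean distance, this measure ignores scale and location and compares only the shape of the two profiles.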

38 Similarity Measures for Binary Variables
- Most of the measures define the similarity based on the counts of matches and mismatches of two objects across the d variables
- Generally speaking, the counts a (both objects 1) and d (both objects 0) can be seen as matches, the counts b and c (one object 1, the other 0) as mismatches
- While b and c can be seen as equivalent, this is certainly not true for the matching states a and d

39 Similarity Measures for Binary Variables
- Matching coefficient: $s = \frac{a + d}{a + b + c + d}$
- Jaccard coefficient: $s = \frac{a}{a + b + c}$
- When the presence of a feature has the same explanatory power as its absence, the matching coefficient is applied, otherwise the Jaccard coefficient
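The two coefficients differ only in how they treat joint absences. The sketch below assumes the usual convention that a counts 1/1 pairs, b and c the two kinds of mismatches, and d counts 0/0 pairs; the two binary vectors are made up.

```python
def binary_counts(x, y):
    """Return the counts (a, b, c, d) for two binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both present
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # only in x
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # only in y
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # both absent
    return a, b, c, d


def matching_coefficient(x, y):
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard_coefficient(x, y):
    a, b, c, _ = binary_counts(x, y)
    return a / (a + b + c)


# Hypothetical presence/absence profiles of two objects.
obj1 = [1, 1, 0, 0, 0, 0, 1, 0]
obj2 = [1, 0, 0, 0, 0, 0, 1, 1]

print(binary_counts(obj1, obj2))
print(matching_coefficient(obj1, obj2), jaccard_coefficient(obj1, obj2))
```

The four shared absences raise the matching coefficient but do not enter the Jaccard coefficient at all.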

40 Similarity Measures for Categorical Data
- A straightforward way would be to treat each level of a categorical variable as its own binary variable and apply the known measures
- Let's say the variable eye-color ∈ {blue, brown, green, gray}
- It can be converted into the binary variables "has blue eyes", "has brown eyes", ...
- Problem: by default, many negative matches
- Therefore: often one counts on how many of the variables two objects agree
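The negative-match problem and the simple agreement count can both be seen in a few lines. A minimal sketch: the helper names, the extra variables (hair color, blood group), and the two example persons are hypothetical.

```python
def one_hot(value, levels):
    """Encode a categorical value as one binary variable per level."""
    return [1 if value == level else 0 for level in levels]


def agreement_similarity(x, y):
    """Fraction of categorical variables on which two objects agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


eye_colors = ["blue", "brown", "green", "gray"]

# Two different eye colors still share two joint absences (green, gray).
blue = one_hot("blue", eye_colors)
brown = one_hot("brown", eye_colors)
negative_matches = sum(1 for a, b in zip(blue, brown) if a == 0 and b == 0)

# Hypothetical objects described by (eye color, hair color, blood group).
person1 = ("blue", "blond", "A")
person2 = ("brown", "blond", "A")

print(negative_matches)
print(agreement_similarity(person1, person2))
```

With many levels per variable, the joint absences dominate any measure that counts them as matches, which is why the per-variable agreement is often preferred.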

41 Proximity Calculation: Specialized Functions
- These standard methods are often not sufficient for biological data, as we have neither categorical data nor an embedding in an n-dimensional space
- How to embed a sequence? A network? A protein structure?
- Specialized measures:
  - BLAST
  - Network edit distance
  - Protein structure alignments

42 Clustering

44 From a Criterion to an Algorithm
- Each clustering tool optimizes some inherent idea of a perfect clustering
- They are all only approximations!
- Possibilities to separate n objects into k clusters, N(n,k):
  - N(5,2) = 15
  - N(10,3) = 9330
  - N(50,4) ≈ 5.3 × 10^28
  - N(100,5) ≈ 6.6 × 10^67
- There are an estimated 10^(80±1) atoms in the observable universe
- It is important to know what exactly the clustering algorithm optimizes!
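The counts on this slide match the Stirling numbers of the second kind, the number of ways to partition n objects into k non-empty clusters. The sketch below computes them exactly with the standard recurrence; the function name `count_clusterings` is an assumption.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_clusterings(n, k):
    """Number of ways to split n objects into k non-empty clusters.

    Recurrence: the n-th object either joins one of the k clusters formed
    by the other n - 1 objects, or it forms a new cluster on its own.
    """
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * count_clusterings(n - 1, k) + count_clusterings(n - 1, k - 1)


for n, k in [(5, 2), (10, 3), (50, 4), (100, 5)]:
    print(n, k, count_clusterings(n, k))
```

Python integers have arbitrary precision, so the large counts are printed exactly. The growth makes clear why exhaustive search over all clusterings is not an option.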

45 Tool Selection: Overview
- k-means based
- Hierarchical
- Graph-based
- Density-based
- Others

46 Tool Selection: k-means based
- Most popular clustering tool
- Two-step iterative process:
  - Assign objects to the closest centers
  - Update these centers
- Good time complexity (almost linear)
- Minimizes the mean squared error of the objects to the cluster centers
- Works quite well in practice
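The two-step iteration can be sketched directly. This is a bare-bones illustration (often called Lloyd's algorithm), not a replacement for a library implementation: the initial centers are passed in explicitly, empty clusters simply keep their old center, and the toy points are invented.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_center(p, centers):
    """Index of the center with the smallest squared distance to p."""
    return min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))


def kmeans(points, initial_centers, max_iterations=100):
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(max_iterations):
        # Step 1: assign every object to its closest center.
        labels = [closest_center(p, centers) for p in points]
        # Step 2: move every center to the mean of its assigned objects.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return labels, centers


# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)
print(centers)
```

Each iteration touches every object once per center, which is where the near-linear running time comes from.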

48 Tool Selection: Problems with k-means
Sensitive to initialization: how do we choose the initial partitions?
- Run several iterations
- (Subset) furthest-first initialization
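Furthest-first initialization picks each new center as the object that is farthest from all centers chosen so far. A minimal sketch, assuming Euclidean distance and a fixed first center (in practice the first one is usually drawn at random, and the "subset" variant runs the same procedure on a sample of the data):

```python
import math


def furthest_first(points, k, first=0):
    """Choose k initial centers by the furthest-first traversal."""
    centers = [points[first]]
    while len(centers) < k:
        # Pick the point whose distance to its nearest chosen center is largest.
        candidate = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(candidate)
    return centers


# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

two_centers = furthest_first(points, k=2)
three_centers = furthest_first(points, k=3)
print(two_centers)
print(three_centers)
```

The procedure spreads the initial centers over the data, but it also tends to select outliers, which is one reason to run it on a subset.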

49 Tool Selection: Problems with k-means
k-means prefers hyperspherical clusters of approximately the same size
Image taken from wikipedia.com

50 How to find the best k?
- No easy answer to that
- Employ domain knowledge
- Use internal cluster validity indices
- Use the gap statistic

51 Tool Selection: Hierarchical
- Creates a hierarchical embedding of the clustering
- Two main branches:
  - Agglomerative
  - Divisive
Image: Brazma, Alvis, and Jaak Vilo. "Gene expression data analysis." FEBS letters (2000):

52 Tool Selection: Single Linkage
The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.

53 Tool Selection: Complete Linkage
The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters.

54 Tool Selection: Average Linkage
- The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters
- Compromise between single and complete linkage
- Strengths: less susceptible to noise and outliers
- Limitations: biased towards spherical clusters
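The three linkage criteria differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance; the two example clusters are invented.

```python
import math
from itertools import product


def pair_distances(cluster1, cluster2):
    """All distances between objects of two different clusters."""
    return [math.dist(p, q) for p, q in product(cluster1, cluster2)]


def single_linkage(cluster1, cluster2):
    return min(pair_distances(cluster1, cluster2))  # closest pair


def complete_linkage(cluster1, cluster2):
    return max(pair_distances(cluster1, cluster2))  # farthest pair


def average_linkage(cluster1, cluster2):
    distances = pair_distances(cluster1, cluster2)
    return sum(distances) / len(distances)  # mean over all pairs


# Hypothetical clusters.
left = [(0.0, 0.0), (0.0, 1.0)]
right = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(left, right))
print(complete_linkage(left, right))
print(average_linkage(left, right))
```

An agglomerative algorithm would repeatedly merge the two clusters with the smallest value of the chosen linkage.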

55 Tool Selection: Overview
- k-means based
- Hierarchical
- Graph-based
  - Represent the data as a graph
  - Identify densely connected areas in the graph
  - Examples: MCL, Transitivity Clustering, Affinity Propagation
  - Used for: network and complex analysis
- Density-based
Images: Vlasblom, James, and Shoshana J. Wodak. "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs." BMC bioinformatics 10.1 (2009): 1.

56 Tool Selection: Overview
- k-means based
- Hierarchical
- Graph-based
- Density-based
  - Separates high-density areas from low-density areas
  - Very efficient
  - Arbitrary cluster shape
  - Requires an embedding of the objects

57 Cluster Evaluation

58 Evaluate a Clustering
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
Algorithms for Clustering Data, Jain and Dubes
No puppets were harmed in the production of this lecture; generally, the usage of black magic is limited to a minimum at SDU.

59 Overview of Cluster Validation
Two different kinds of measures can be distinguished:
- External measures
  - Compare two clusterings
  - Use a gold standard to evaluate the quality of a clustering
- Internal measures
  - Only use the clustering itself as the basis for evaluation
  - Comparable to cluster criteria

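60 External Measures
We can look at each pair of points and define:
- TP if both points share a cluster in the clustering and in the gold standard
- FP if they share a cluster in the clustering but not in the gold standard
- FN if they share a cluster in the gold standard but not in the clustering
- TN if they share a cluster in neither
Or map each cluster $c_j$ to the gold standard cluster $k_i$ with the highest overlap and define for each object $a$:
- TP if $a \in k_i \wedge a \in c_j$
- FP if $a \notin k_i \wedge a \in c_j$
- FN if $a \in k_i \wedge a \notin c_j$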

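61 Measures
Now we can define:
- Rand Index (pair-wise): $R = \frac{TP + TN}{TP + FP + FN + TN}$, where TN counts the pairs that are separated in both the clustering and the gold standard
- Jaccard Index (pair-wise): $J = \frac{TP}{TP + FP + FN}$
- F-measure (mapping): $F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ with $\text{precision} = \frac{TP}{TP + FP}$ and $\text{recall} = \frac{TP}{TP + FN}$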
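The pair-wise measures can be computed by classifying every pair of objects. The sketch assumes the usual pair-counting convention: a pair is a true positive if both objects share a cluster in the clustering and in the gold standard. The two label vectors are hypothetical.

```python
from itertools import combinations


def pair_counts(clustering, gold):
    """Classify every pair of objects as TP, FP, FN or TN."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_in_clustering = clustering[i] == clustering[j]
        same_in_gold = gold[i] == gold[j]
        if same_in_clustering and same_in_gold:
            tp += 1
        elif same_in_clustering:
            fp += 1
        elif same_in_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(clustering, gold):
    tp, fp, fn, tn = pair_counts(clustering, gold)
    return (tp + tn) / (tp + fp + fn + tn)


def jaccard_index(clustering, gold):
    tp, fp, fn, _ = pair_counts(clustering, gold)
    return tp / (tp + fp + fn)


# Hypothetical cluster labels of six objects.
gold = [0, 0, 0, 1, 1, 1]
clustering = [0, 0, 1, 1, 1, 1]

print(pair_counts(clustering, gold))
print(rand_index(clustering, gold), jaccard_index(clustering, gold))
```

Only the grouping matters, not the label names: renaming the clusters leaves both indices unchanged.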

62 Internal Measures
- Do not have additional information about the ground truth at their disposal
- Similar to cluster criteria
- Normally based on:
  - Compactness: measures how closely related the objects in a cluster are
  - Separation: measures how distinct or well-separated a cluster is from other clusters

63 Internal Measures
Dunn Index:
- The Dunn Index assesses the clustering performance by relating the maximal cluster diameter to the minimal distance between clusters: $D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)}$
- This measure is prone to outliers, as it is based on minimal and maximal distances
Davies-Bouldin Index:
- The Davies-Bouldin Index DB is defined based on the average distances between objects and their cluster centroids
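A small sketch of the Dunn Index shows its sensitivity to a single object. It assumes Euclidean distance, the closest-pair distance between clusters, and the largest within-cluster distance as diameter; other variants exist. At least one cluster needs two or more distinct objects, otherwise the diameter is undefined or zero.

```python
import math
from itertools import combinations, product


def dunn_index(clusters):
    """Minimal between-cluster distance divided by maximal cluster diameter."""
    min_between = min(
        math.dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p, q in product(c1, c2)
    )
    max_diameter = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return min_between / max_diameter


# Hypothetical clustering with two compact, well-separated clusters.
compact = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (4.0, 0.0)]]
# The same clustering with one additional object between the clusters.
with_outlier = [[(0.0, 0.0), (0.0, 1.0), (1.5, 0.0)], [(3.0, 0.0), (4.0, 0.0)]]

print(dunn_index(compact))
print(dunn_index(with_outlier))
```

One extra object both shrinks the minimal between-cluster distance and enlarges the maximal diameter, so the index drops sharply.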

64 Silhouette Coefficient
Based on:
- Cohesion $a(x)$: average within-cluster distance of $x$
- Separation $b(x)$: average distance of $x$ to the closest other cluster
$s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}$
Takes values between -1 and 1
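The silhouette value of every object follows directly from cohesion and separation. A minimal sketch, assuming Euclidean distance, at least two clusters, and a value of 0 for objects in singleton clusters (a common convention, not stated on the slide). The data and labelings are invented.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value s(x) = (b - a) / max(a, b) for every object."""
    values = []
    for i, (p, label) in enumerate(zip(points, labels)):
        own = [math.dist(p, q)
               for j, (q, other) in enumerate(zip(points, labels))
               if other == label and j != i]
        if not own:
            values.append(0.0)  # assumption: singleton clusters score 0
            continue
        a = sum(own) / len(own)  # cohesion: mean distance within own cluster
        b = min(                 # separation: closest other cluster on average
            sum(dists) / len(dists)
            for dists in (
                [math.dist(p, q) for q, other in zip(points, labels) if other == m]
                for m in set(labels) if m != label
            )
        )
        values.append((b - a) / max(a, b))
    return values


# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

good = silhouette_values(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1, 0, 1])
print(sum(good) / len(good))
print(sum(bad) / len(bad))
```

Averaging the values over all objects gives one score per clustering, which can be compared across parameter settings such as different k.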

65 What to do?
- Such a method does not exist for all use cases
- How to proceed then? Is there a general rule we could follow?
- ClustEval: fully automates the clustering
  - We tested 13 clustering methods
  - On 24 datasets (12 real-world, 12 artificial)
  - 13 common validity measures
  - 1000 parameter sets per tool per dataset

66 Results

67 Results of ClustEval
- There is no general best performer among the tools
- Quite often, internal and external measures do not agree on the performance assessment
- When using only biomedical datasets, the Silhouette Value has the best agreement with external measures

68 Workshop Introduction

69 BreathOMICS data

70 What is it good for?
[Figure: averaged Y-graph with traces SHAM and CLI; annotated peak: Pentanone Monomer & Dimer; axis: time scale, individually normalized / a.u. Source: J.I. Baumbach, B&S Analytik, Dortmund, Germany]

71 Data Preprocessing
RAW, smoothed, de-noised

72 Peak Detection
- Local maxima search (LMS)
- Merged peak cluster localization (MPCL), Bader et al.
- Wavelet-based multi-scale peak detection, Bader et al.
- Watershed transformation (WST), Bunkowski et al.
- Peak model estimation (PME), Kopczynski et al.

73 BreathOMICS
Patients, Substances

74 Peak Alignment

76 Thank you for your Attention
Q & A
Contact: roettger@imada.sdu.dk
