Computing with large data sets

1 Computing with large data sets. Richard Bonneau, spring 2009. Lecture 8 (week 5): clustering 1

2 clustering
Clustering: a diverse set of methods for discovering groupings in unlabeled data. Because these methods do not rely on labeled data, they are often referred to as unsupervised learning methods. The next two lectures aim to (i) expose you to two simple approaches, (ii) provide one example of a method that aims to determine the number of clusters using resampling, and (iii) introduce concepts that might allow us to use external data to determine clusters (semi-supervised methods).

3 clustering: why
Reasons for clustering:
- classification
  - objects generated or controlled by similar processes
  - objects with similar attributes
- simplification
  - dimensionality reduction (clustering both experiments/conditions and objects)
  - as a means of navigating similarity even when one is unsure about the number or relevance of classes or partitions in the data
- creating populations of types for downstream analysis
  - signal averaging or statistical tests on classes, etc.

4 clustering: tough problems
Clustering considerations:
1. The large volume of work in this area (with many styles and approaches) makes following recent advances a full-time job.
2. Define "cluster" (cell cycle example).
3. Computational considerations.
    ## number of ways to make 5 equal sized clusters from 25 objects
    > choose(25,5) * choose(20,5) * choose(15,5) * choose(10, 5) * choose(5,5)
    [1] 6.233607e+14
    > ## what if the clusters are not equal sizes?
    > print( "a whole lot!")
   Clustering algorithms are often entwined with hard optimization problems --> no free lunch --> the problem/data defines the most efficient solution.
4. Is a given partition/clustering significant? What is the correct number of clusters? To what degree do the answers to these questions depend on the downstream analysis OR the questions being asked?
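The question about unequal cluster sizes can be made concrete with a small sketch. Assuming clusters are unlabeled and non-empty, the number of ways to split n objects into k clusters is a Stirling number of the second kind, which follows the standard recurrence S(n, k) = k S(n-1, k) + S(n-1, k-1). The helper name `stirling2` is hypothetical (it is not part of base R). Note also that the product of `choose()` terms on the slide counts labeled clusters; dividing by `factorial(5)` gives the count for unlabeled equal-sized clusters.

```r
# stirling2 is a hypothetical helper for illustration (not part of base R).
# It fills a table with S[i + 1, j + 1] = S(i, j) using the recurrence
# S(i, j) = j * S(i - 1, j) + S(i - 1, j - 1), with S(0, 0) = 1.
stirling2 <- function(n, k) {
  S <- matrix(0, nrow = n + 1, ncol = k + 1)
  S[1, 1] <- 1
  for (i in seq_len(n)) {
    for (j in seq_len(min(i, k))) {
      S[i + 1, j + 1] <- j * S[i, j + 1] + S[i, j]
    }
  }
  S[n + 1, k + 1]
}

# labeled, equal-sized clusters (the count computed on the slide)
equal_labeled <- prod(choose(c(25, 20, 15, 10, 5), 5))
# unlabeled, equal-sized clusters
equal_unlabeled <- equal_labeled / factorial(5)
# unlabeled clusters of any (non-zero) size
any_size <- stirling2(25, 5)

print(c(equal_labeled = equal_labeled,
        equal_unlabeled = equal_unlabeled,
        any_size = any_size))
```

Either way the counts are far too large to enumerate, which is why practical clustering algorithms rely on heuristics rather than exhaustive search.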

5 clustering: applications
Reasons for clustering:
- Computational biology: clusters are complexes, co-functional or co-regulated genes, linear pathways, etc.
- Prognosis and classification in the clinic.
- Image segmentation.
- Document classification, sorting, search.
- Signal processing and compression.

6 example 1: microarray clustering
Spores were germinated and allowed to grow for a cell cycle and then starved (initiating spore formation anew): a complete cycle for this organism.
"Nearly 5,000 genes were expressed in five distinct waves of transcription as the bacteria progressed from germination through sporulation, and we identified a specific set of functions represented within each wave" (Bergman et al., 2006).

7 distance/similarity metrics
Choosing the correct distance measure is critical. Let X = (X1, X2, ..., Xn) and Y = (Y1, Y2, ..., Yn), with components written x_i and y_i below.
Similarity measures:
- Pearson correlation
- Spearman correlation
- Mutual information
Distance measures:
- Euclidean distance: $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Euclidean squared: $d(X, Y) = \sum_{i=1}^{n} (x_i - y_i)^2$
- City block/Manhattan: $d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
- Chebychev: $d(X, Y) = \max_i |x_i - y_i|$

8 distance metrics: considerations
Choosing the correct distance measure is critical (same similarity and distance measures as on the previous slide). Questions to ask:
- Do we care about close distances (similarities) more than far distances?
- Do we want to pay attention only to the most deviant conditions? (Chebychev)
- Do we expect the relationship to be complex or highly non-linear? (MI, Spearman)
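As a rough illustration of how these measures differ on the same pair of vectors, the sketch below uses base R's `dist()` and `cor()`. The vectors `X` and `Y` are made-up values, not data from the lecture. Mutual information is left out because base R has no built-in function for it.

```r
# Made-up example vectors (not from the lecture data)
X <- c(1.0, 2.0, 3.0, 4.0, 10.0)
Y <- c(2.0, 4.0, 6.5, 8.0, 9.0)
xy <- rbind(X, Y)   # dist() works on the rows of a matrix

# Distance measures
d_euclid    <- as.numeric(dist(xy, method = "euclidean"))
d_euclid_sq <- d_euclid^2
d_manhattan <- as.numeric(dist(xy, method = "manhattan"))
d_chebychev <- as.numeric(dist(xy, method = "maximum"))   # max_i |x_i - y_i|

# Similarity measures
s_pearson  <- cor(X, Y, method = "pearson")
s_spearman <- cor(X, Y, method = "spearman")   # rank-based, so monotone trends score 1

print(c(euclidean = d_euclid, euclidean_sq = d_euclid_sq,
        manhattan = d_manhattan, chebychev = d_chebychev,
        pearson = s_pearson, spearman = s_spearman))
```

Here the Chebychev distance is driven entirely by the single largest coordinate difference, while Spearman only sees that both vectors increase together.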

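9 k-means clustering
Each cluster is defined primarily via its centroid (the mean over all cluster members for each dimension). The number of clusters, K, is defined at the start of the algorithm.
Centroids: $\{\mu_1, \mu_2, \ldots, \mu_K\}$. Assignments: $r_{i,k} \in \{0, 1\}$, with $r_{i,k} = 1$ if point $x_i$ is assigned to cluster $k$ and 0 otherwise.
0. Pick K random points and define them as the cluster centroids.
1. Add each point to the cluster with the closest centroid (update $r$).
2. Recalculate the means (update $\{\mu_1, \mu_2, \ldots, \mu_K\}$).
3. Repeat steps 1 and 2 until max.iter is reached OR the cost function, J, does not change by more than a threshold.
$$J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{i,k} \, \lVert x_i - \mu_k \rVert^2$$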
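The steps above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the algorithm behind `stats::kmeans` (which defaults to Hartigan-Wong). The function name `simple_kmeans` and its arguments `init`, `max_iter` and `tol` are hypothetical.

```r
# simple_kmeans: hypothetical teaching helper, not part of base R.
# init gives the row indices of the points used as starting centroids.
simple_kmeans <- function(x, K, init = sample(nrow(x), K),
                          max_iter = 20, tol = 1e-8) {
  x <- as.matrix(x)
  mu <- x[init, , drop = FALSE]        # step 0: K data points as centroids
  J_old <- Inf
  for (iter in seq_len(max_iter)) {
    # step 1: squared distance from every point to every centroid (N x K),
    # then assign each point to its closest centroid
    d2 <- sapply(seq_len(K), function(k) rowSums(sweep(x, 2, mu[k, ])^2))
    r <- max.col(-d2, ties.method = "first")
    # cost J for the current assignments and centroids
    J <- sum(d2[cbind(seq_len(nrow(x)), r)])
    # step 2: recompute each centroid as the mean of its members
    for (k in seq_len(K)) {
      if (any(r == k)) mu[k, ] <- colMeans(x[r == k, , drop = FALSE])
    }
    # step 3: stop once J changes by less than tol
    if (abs(J_old - J) < tol) break
    J_old <- J
  }
  list(cluster = r, centers = mu, cost = J, iter = iter)
}

# Usage on two far-apart toy groups (made-up data)
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0,   sd = 0.1), ncol = 2),
             matrix(rnorm(100, mean = 100, sd = 0.1), ncol = 2))
fit <- simple_kmeans(toy, K = 2, init = c(1, 51))
table(fit$cluster)
```

Because the result depends on the starting centroids, different values of `init` can converge to different partitions, which is the behavior the later slides demonstrate.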

10 k-means clustering
A very, very simple example: two normals with OK separation...
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")
    (cl <- kmeans(x, 2, iter.max = 20))
    plot(x, col = cl$cluster)
    points(cl$centers, col = 1:2, pch = 8, cex = 2)
[Figure: scatter plots of x[,2] against x[,1], colored by cluster assignment. Caption: "Amazing!"]

11 k-means clustering
With OK separation... but even this contrived example goes wrong once in a while with 10 iterations.
    ## NOTE: the object names (c1..c4) and row labels ("1".."4") are placeholders;
    ## the original names and labels were lost from the slide text.
    c1 <- cbind( rnorm(30, mean = 0.2, sd = 0.25), rnorm(30, mean = -1.5, sd = 0.25) )
    rownames(c1) <- rep("1", 30)
    c2 <- cbind( rnorm(50, mean = -2.0, sd = 0.25), rnorm(50, mean = -0.5, sd = 0.3) )
    rownames(c2) <- rep("2", 50)
    c3 <- cbind( rnorm(20, mean = 2.2, sd = 0.2), rnorm(20, mean = -0.1, sd = 0.25) )
    rownames(c3) <- rep("3", 20)
    c4 <- cbind( rnorm(30, mean = -1.0, sd = 0.15), rnorm(30, mean = 1.5, sd = 0.15) )
    rownames(c4) <- rep("4", 30)
    c.all <- rbind(c1, c2, c3, c4)
    c.all.k <- kmeans(c.all, 4, iter.max = 10)
    plot(c.all, type = "n")
    text(c.all, labels = rownames(c.all), col = c.all.k$cluster)
    points(c.all.k$centers, col = 1:max(c.all.k$cluster), pch = 8, cex = 2)
[Figure: two runs of the same code, plotted as c.all[,2] against c.all[,1]. One panel is captioned "Amazing", the other "not Amazing".]
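One common way to reduce the chance of a bad run, not shown on the slide, is to restart k-means from several random initializations and keep the solution with the lowest total within-cluster sum of squares. The sketch below uses the `nstart` argument of `stats::kmeans`. The data in `toy_data` are a made-up stand-in with the same group means as the slide, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# toy_data: hypothetical stand-in for the four well-separated groups
toy_data <- rbind(
  cbind(rnorm(30, mean =  0.2, sd = 0.25), rnorm(30, mean = -1.5, sd = 0.25)),
  cbind(rnorm(50, mean = -2.0, sd = 0.25), rnorm(50, mean = -0.5, sd = 0.30)),
  cbind(rnorm(20, mean =  2.2, sd = 0.20), rnorm(20, mean = -0.1, sd = 0.25)),
  cbind(rnorm(30, mean = -1.0, sd = 0.15), rnorm(30, mean =  1.5, sd = 0.15))
)

# Cost of 50 independent single-start runs: a spread in these values
# means some starts ended in a worse local optimum than others
costs <- replicate(50, kmeans(toy_data, 4, iter.max = 10)$tot.withinss)
print(summary(costs))

# 25 random starts in one call; kmeans keeps the best of them
best <- kmeans(toy_data, 4, iter.max = 10, nstart = 25)
print(best$tot.withinss)
```

Restarts make a poor partition less likely but do not guarantee the global optimum.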

12 k-means clustering
As classes overlap, the separation and the number of clusters become non-trivial for any clustering method, and misclassification occurs.
    ## NOTE: the object names (c1..c4) and row labels ("1".."4") are placeholders;
    ## the original names and labels were lost from the slide text.
    c1 <- cbind( rnorm(30, mean = 0.2, sd = 0.55), rnorm(30, mean = -1.5, sd = 0.45) )
    rownames(c1) <- rep("1", 30)
    c2 <- cbind( rnorm(50, mean = -2.0, sd = 0.55), rnorm(50, mean = -0.5, sd = 0.5) )
    rownames(c2) <- rep("2", 50)
    c3 <- cbind( rnorm(20, mean = 2.2, sd = 0.5), rnorm(20, mean = -0.1, sd = 0.55) )
    rownames(c3) <- rep("3", 20)
    c4 <- cbind( rnorm(30, mean = -1.0, sd = 0.55), rnorm(30, mean = 1.5, sd = 0.55) )
    rownames(c4) <- rep("4", 30)
    c.all <- rbind(c1, c2, c3, c4)
    c.all.k <- kmeans(c.all, 4, iter.max = 10)
    plot(c.all, type = "n")
    text(c.all, labels = rownames(c.all), col = c.all.k$cluster)
    points(c.all.k$centers, col = 1:max(c.all.k$cluster), pch = 8, cex = 2)
[Figure: c.all[,2] plotted against c.all[,1] for the overlapping classes, colored by k-means cluster.]

13 hierarchical clustering
Agglomerative: start with every object in its own group and repeatedly merge the closest objects or groups until everything is in one group.
Divisive: start with one sloppy cluster and make the best cuts one after another until all clusters have one member.
We'll try agglomerative.

14 agglomerative hierarchical clustering
0. Define all N objects as groups of size one.
1. Compute the distance matrix (we'll try Euclidean with our toy example).
2. Join the two closest groups, remembering the distance for that join.
3. Repeat until only one all-inclusive group is left.
4. Cut the tree.
Trick question: what is the time complexity?
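Steps 0 to 3 can be sketched directly. This is a deliberately naive illustration, not how `hclust` is implemented. It assumes the distance between two groups is the smallest pairwise distance between their members (single linkage), which is only one possible choice. The helper name `naive_single_linkage` is hypothetical. In this naive form every merge rescans all pairs of groups, so the running time grows roughly with the cube of N.

```r
# naive_single_linkage: hypothetical helper for illustration only.
# Returns the join distances in the order the merges happen.
naive_single_linkage <- function(x) {
  d <- as.matrix(dist(x))              # step 1: Euclidean distance matrix
  groups <- as.list(seq_len(nrow(d)))  # step 0: every object is its own group
  heights <- numeric(0)
  while (length(groups) > 1) {         # step 3: repeat until one group is left
    best <- c(1, 2)
    best_d <- Inf
    for (a in seq_len(length(groups) - 1)) {
      for (b in (a + 1):length(groups)) {
        # single linkage: smallest distance between members of a and b
        d_ab <- min(d[groups[[a]], groups[[b]]])
        if (d_ab < best_d) {
          best_d <- d_ab
          best <- c(a, b)
        }
      }
    }
    # step 2: join the two closest groups and remember the join distance
    groups[[best[1]]] <- c(groups[[best[1]]], groups[[best[2]]])
    groups[[best[2]]] <- NULL
    heights <- c(heights, best_d)
  }
  heights
}

# Compare with hclust on made-up data
set.seed(7)
toy <- matrix(rnorm(40), ncol = 2)
naive_heights  <- naive_single_linkage(toy)
hclust_heights <- hclust(dist(toy), method = "single")$height
print(all.equal(naive_heights, hclust_heights))
```

The join distances are what the dendrogram plots on its height axis, and step 4 (cutting the tree) amounts to choosing a threshold on them.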

15 agglomerative hierarchical clustering
What is the distance between two groups?
- Average linkage (mean of all distances between group i and group j)
- Single linkage (min distance)
- Complete linkage (max distance)
- Ward's method (biased towards similar-sized groups)
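To see how the linkage choice changes the tree, the sketch below runs `hclust` with four of its built-in `method` values on the same distance matrix. The data are made up, and `"ward.D2"` is used here as one of the Ward variants that `hclust` offers. Which variant the lecture had in mind is not stated.

```r
set.seed(2)  # arbitrary seed
# Made-up data: two groups of 20 points each
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
d <- dist(pts)

linkages <- c("average", "single", "complete", "ward.D2")
trees <- lapply(linkages, function(m) hclust(d, method = m))
names(trees) <- linkages

# Height of the final (all-inclusive) merge under each linkage
top_heights <- sapply(trees, function(tr) max(tr$height))
print(top_heights)

# Do the two-cluster cuts agree across linkages? (compare against "average")
cuts <- sapply(trees, cutree, k = 2)
print(apply(cuts, 2, function(col) all(col == cuts[, "average"])))
```

On cleanly separated groups like these the linkages tend to agree, and the differences show up mainly when groups overlap or are elongated.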

16 agglomerative hierarchical clustering
Dennis Shasha's book "Statistics is Easy" (link on wiki under week 4).
    tmp <- dist( c.all )
    str( tmp )
     Class 'dist' atomic [1:7140]
     ..- attr(*, "Size")= int
     ..- attr(*, "Diag")= logi FALSE
     ..- attr(*, "Upper")= logi FALSE
     ..- attr(*, "method")= chr "euclidean"
     ..- attr(*, "call")= language dist(x = c.all)
    rm(tmp)
    c.dist <- dist( c.all, method = "euclidean", diag = FALSE, upper = FALSE)
    c.hclust <- hclust( c.dist )
    str( c.hclust )
    plot( c.hclust )
[Figure: left, scatter plot of c.all[,2] against c.all[,1]; right, "Cluster Dendrogram" of c.dist with a Height axis, hclust (*, "complete").]

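17 number of clusters
K should be 4 in this case...
    ## h.cuts and cut.n are not defined on the slide; the next two lines are assumed definitions
    h.cuts <- seq(0, max(c.hclust$height), length.out = 50)
    cut.n <- numeric(length(h.cuts))
    for (i in seq_along(h.cuts)) {
      tmp.c <- cutree(c.hclust, h = h.cuts[i])
      cut.n[i] <- max(tmp.c)   # number of clusters when cutting at this height
    }
    plot(h.cuts, cut.n, type = "l")
    abline(4, 0, col = "blue", lwd = 3, lty = 2)   # horizontal line at K = 4
    text(h.cuts, cut.n, labels = cut.n, col = "darkgreen")
[Figure: number of clusters (cut.n) against cut height (h.cuts), annotated "K = 4 or 5?", shown next to the "Cluster Dendrogram" of c.dist, hclust (*, "complete").]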
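When the target number of clusters is already known, `cutree` can also be asked for it directly through its `k` argument instead of scanning heights. The sketch below is self-contained and uses made-up data with four groups. The helper `n_at_height` is a hypothetical convenience wrapper.

```r
set.seed(3)  # arbitrary seed
# Made-up stand-in data with four groups
toy <- rbind(
  cbind(rnorm(30, mean =  0.2, sd = 0.25), rnorm(30, mean = -1.5, sd = 0.25)),
  cbind(rnorm(50, mean = -2.0, sd = 0.25), rnorm(50, mean = -0.5, sd = 0.30)),
  cbind(rnorm(20, mean =  2.2, sd = 0.20), rnorm(20, mean = -0.1, sd = 0.25)),
  cbind(rnorm(30, mean = -1.0, sd = 0.15), rnorm(30, mean =  1.5, sd = 0.15))
)
hc <- hclust(dist(toy))   # hclust defaults to complete linkage

# Cut by number of clusters
labels_k4 <- cutree(hc, k = 4)
print(table(labels_k4))

# Cut by height: hypothetical helper returning the cluster count at height h
n_at_height <- function(tree, h) max(cutree(tree, h = h))
print(n_at_height(hc, h = max(hc$height) / 2))
```

Cutting by `k` avoids picking a height by eye, but it moves the real question to how K itself should be chosen.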

18 number of clusters: plenty of loose ends...
- Could we use a small number of labeled points to help decide?
- What about our gene annotations: could we find K to get the best average p-values for function-label enrichment within clusters?
- What if we hold back some of the data or use resampling to judge the robustness of the resulting clusters?
- Can we derive a principled balance between model complexity (number of clusters = degrees of freedom) and fit (e.g. aggregate distance from cluster centers)?
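As a first, informal look at the complexity-versus-fit question, one can plot the total within-cluster sum of squares against K and look for the point where adding clusters stops helping much (often called an elbow plot). This is only a heuristic sketch on made-up data, not a principled criterion and not a method taken from the lecture.

```r
set.seed(4)  # arbitrary seed
# Made-up stand-in data with four groups
toy <- rbind(
  cbind(rnorm(30, mean =  0.2, sd = 0.25), rnorm(30, mean = -1.5, sd = 0.25)),
  cbind(rnorm(50, mean = -2.0, sd = 0.25), rnorm(50, mean = -0.5, sd = 0.30)),
  cbind(rnorm(20, mean =  2.2, sd = 0.20), rnorm(20, mean = -0.1, sd = 0.25)),
  cbind(rnorm(30, mean = -1.0, sd = 0.15), rnorm(30, mean =  1.5, sd = 0.15))
)

K_values <- 1:8
# Fit: aggregate squared distance from cluster centers, best of 20 starts per K
wss <- sapply(K_values, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

plot(K_values, wss, type = "b",
     xlab = "number of clusters K",
     ylab = "total within-cluster sum of squares")
```

The fit term keeps improving as K grows, so on its own it would always favor more clusters. A principled balance needs an explicit penalty on complexity or a resampling-based check.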

19 reading for next time: more resampling methods
Other lectures on clustering are out there. Two reviews: (link also on wiki). Bio applications abound: "Cluster analysis and its applications to gene expression data", R Sharan, R Elkon, R Shamir.
Next: Partitioning Around Medoids, Principal Component Analysis, discussion of our confidence interval code in a clustering context; can we determine K or cluster validity with resampling?
