R/BioC Exercises & Answers: Unsupervised methods
Perry Moerland
April 20, 2010

- Information on how to log on to a PC in the exercise room and the UNIX server can be found here: Don't forget to enable X11 forwarding when starting up PuTTY.
- These exercises depend on Chapters 12 (Gentleman et al. 2005) and 13 (Pollard and van der Laan 2005). References to equations are with respect to these chapters.
- I would like to urge you to always look closely at the R code included, the data imported, and the values of the variables created. Also use the help function to look at all the different options that functions offer. Don't forget that the goal of these exercises is to make you think!

Introduction

Unsupervised learning methods aim at detecting structures in data. The term unsupervised refers to the fact that these methods do not use gene or sample annotations; only the (normalized) expression intensities or ratios are used. A primary purpose of such methods is to group similar data together (clustering) and to provide a visualization of the data in which structures can easily be recognized. These may be relations among genes, among samples, or even between genes and samples. The discovery of such structures can lead to the development of new hypotheses, e.g., the grouping of genes with similar expression profiles may indicate that they are co-regulated and possibly involved in the same biological process. If one looks at arrays/samples instead of genes, the separation of expression profiles of patient tissue samples may point to a possible refinement of disease taxonomy. On the other hand, unsupervised methods are often used to confirm known differences between genes or samples. If a clustering algorithm groups samples from two different tumor types into distinct clusters, this provides evidence that the tumor types indeed show clearly detectable differences in their global expression profiles.
- If gene or sample annotations (or classes) are available, the more powerful approach is to use them and apply techniques for supervised learning (tomorrow's lab).

As you might have noticed, the word similar was used several times above. This is a central concept in unsupervised learning, be it clustering or visualization. In clustering one wants to group similar objects together, whereas in visualization one wants to find a representation of a high-dimensional data set in two or three dimensions while losing as little
information as possible: objects that are similar in the original data space should also be similar in the low-dimensional space. See (Gentleman et al. 2005) for a more elaborate discussion.

Distance measures and clustering

We did a mini-experiment with four arrays (A, B, C, D) and 4 genes per array. This data set has been checked for low-quality arrays and properly normalized. The resulting log-ratios are given in the file /data/home/stud00/unsup/hcexample.txt.

Exercise 1: Copy this file to your home directory, start R, read in hcexample.txt and store it in a variable E for later use. Try to make a plot of the array profiles that looks more or less like the one below. You might want to look into some of the more detailed plotting commands in R: plot.new, plot.window, axis, etc.

> E <- read.table("hcexample.txt")
> plot.new()
> plot.window(xlim = c(1, 4), ylim = c(-5, 6))
> axis(1, at = 1:4, labels = c("1", "2", "3", "4"))
> axis(2, pos = 0.95)
> mtext("Genes", side = 1, line = 3, cex = 1.5)
> mtext("log2(cy5/cy3)", side = 2, line = 2, cex = 1.5)
> for (i in 1:ncol(E)) {
+     lines(E[, i], lty = i, col = i, lwd = 3)
+ }
> legend(3.3, 2.8, c("A", "B", "C", "D"), lty = 1:4, col = 1:4,
+     lwd = 3, y.intersp = 1, cex = 1.3)
[Figure: the four array profiles A, B, C, D plotted as log2(cy5/cy3) against the gene index]

There are many other ways to do this. One way to avoid the for statement is to replace it with a call to the function matplot, for example:

> matplot(E, type = "l", add = TRUE, col = 1:4, lty = 1:4, lwd = 3)

Cluster algorithms group similar data together. What is meant by the word similar is formally defined by the notion of a distance. In R, the bioDist package provides a collection of functions for calculating distance measures. We will have a look at two of them in more detail; first load the package:

> library(bioDist)

Now calculate the Euclidean (12.2, EUC) and the Pearson sample correlation (12.2, COR) distance between the array profiles.

> d.euc <- euc(t(E))
> d.euc
[distance matrix output omitted]
> d.cor <- cor.dist(t(E), abs = FALSE)
> d.cor
  A B C
B 2
C 0 2
D 2 0 2

Note that you had to transpose the data matrix E, since the bioDist package calculates pairwise distances for all rows of a matrix. (Some of the help pages are wrong and claim that distances are calculated column-wise.)

Exercise 2: Can you explain the resulting distance matrix when using the Pearson correlation distance?

The profiles of A and C, respectively B and D, are identical (correlation equals one) and therefore have correlation distance equal to zero. The profiles of A and B, respectively C and D, are anticorrelated (correlation equals minus one) and their correlation distance equals two.

Such a distance or dissimilarity matrix forms the basis for most clustering algorithms. In the microarray literature, an agglomerative (i.e., bottom-up) hierarchical approach such as implemented in the hclust function is popular. As explained in the lecture, hierarchical clustering tries to find a tree-like representation of the data in which clusters have subclusters, which have subclusters, and so on. The number of clusters depends on how much detail one looks at in the tree. Hierarchical clustering uses two types of distances:

- the distance between two individual data points (Euclidean, correlation, etc.);
- the distance between two clusters of data points, also called linkage (single, average, etc.).

Exercise 3: Draw (just with pencil on paper) the dendrograms for both the Euclidean and the correlation distance matrix generated above. Explain your results. Also write the R code using hclust for both cases when using average linkage; store the dendrograms in variables hc.euc and hc.cor, respectively. Plot the resulting dendrograms.

> par(mfrow = c(1, 2))
> hc.euc <- hclust(d.euc, method = "average")
> plot(hc.euc, main = "Dendrogram (Euclidean)", sub = "Average linkage",
+     hang = -1)
> hc.cor <- hclust(d.cor, method = "average")
> plot(hc.cor, main = "Dendrogram (Correlation)", sub = "Average linkage",
+     hang = -1)
> par(mfrow = c(1, 1))
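As an aside, the correlation-distance arithmetic of Exercise 2 (distance = 1 - r) is easy to double-check outside R. The Python sketch below is purely illustrative: the profile values are made up to have the same correlation structure as the toy arrays, and are not the actual hcexample.txt data.

```python
# Illustrative check of the Pearson correlation distance (1 - r), mirroring
# what bioDist's cor.dist computes. Profile values are invented for this sketch.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cor_dist(x, y):
    # Pearson sample correlation distance: 1 - r
    return 1.0 - pearson(x, y)

A = [1, -1, 1, -1]    # toy profile
B = [-1, 1, -1, 1]    # anticorrelated with A (r = -1)
C = [3, -3, 3, -3]    # same shape as A, larger amplitude (r = 1)

print(cor_dist(A, C))  # identical shape -> 0.0
print(cor_dist(A, B))  # anticorrelated -> 2.0
```

Note that C has a different amplitude than A yet still sits at correlation distance zero: the correlation distance only looks at the shape of a profile, which is exactly why the Euclidean and correlation dendrograms in Exercise 3 can differ.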
[Figure: dendrograms for the Euclidean (left) and correlation (right) distance matrices, average linkage]

Note that the height of the branches indeed corresponds to the distances calculated in the previous exercise. Clusters can be extracted from the resulting dendrograms by a horizontal cut:

> cutree(hc.euc, 2)
A B C D
1 1 2 2
> cutree(hc.cor, 2)
A B C D
1 2 1 2

These two vectors show to which cluster each array is attributed.

Clustering synthetic data

One peculiarity of clustering algorithms is that they always produce clusters. This happens regardless of whether there is actually any meaningful clustering structure present in the data. Let us now simulate some unstructured data (rnorm randomly generates from a normal distribution) and see what happens.

> n.genes <-
> n.samples <- 50
> descr <- paste("S", rep(c("0", ""), times = c(9, 41)), 1:50,
+     sep = "")
> set.seed(13)
> dataMatrix <- matrix(rnorm(n.genes * n.samples), nrow = n.genes)

Exercise 4: Use hclust with average, single, and complete linkage to cluster the samples and plot the resulting dendrograms. Which phenomenon do you see with single linkage? How might this make interpretation difficult?

The phenomenon you observe is called chaining: the sequential addition of single objects to an existing cluster. Due to this sequential addition, often no clear cluster structure is obtained.

> deuc <- euc(t(dataMatrix))
> havgeuc <- hclust(deuc, method = "average")
> plot(havgeuc, labels = descr, cex = 0.6)

[Figure: cluster dendrogram of the 50 samples, average linkage]

> hsineuc <- hclust(deuc, method = "single")
> plot(hsineuc, labels = descr, cex = 0.6)
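The chaining behaviour of single linkage can also be reproduced in a small, purely illustrative sketch outside R. The Python code below runs a naive agglomerative clustering on six 1-D points whose gaps grow only slowly; the points and the O(n^3) implementation are invented for this illustration.

```python
# Illustrative sketch of chaining: naive agglomerative clustering on 1-D
# points, recording the sizes of the two clusters merged at each step.
def agglomerate(points, linkage):
    clusters = [[p] for p in points]
    merges = []  # (size of first cluster, size of second cluster) per merge
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [abs(a - b) for a in clusters[i] for b in clusters[j]]
                # single linkage: minimum pairwise distance; complete: maximum
                d = min(ds) if linkage == "single" else max(ds)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((len(clusters[i]), len(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

pts = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0]  # gaps grow slowly: no clear groups
single = agglomerate(pts, "single")
print(single)  # every merge adds one point to the growing cluster
```

With single linkage the merge sizes come out as (1,1), (2,1), (3,1), (4,1), (5,1): one ever-growing cluster absorbs a single point at a time, which is exactly the chaining seen in the dendrogram above. Complete linkage on the same points starts by forming balanced pairs instead.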
[Figure: cluster dendrogram of the 50 samples, single linkage]

> hcomeuc <- hclust(deuc, method = "complete")
> plot(hcomeuc, labels = descr, cex = 0.6)

[Figure: cluster dendrogram of the 50 samples, complete linkage]

Some of the dendrograms generated above appear to find structure in the data. You might decide that the complete linkage dendrogram contains four clusters. Let us assume that you
stored the complete linkage dendrogram in variable hcomeuc (because of the randomness in this data set, results are very likely to be different from the ones shown here).

> cutree(hcomeuc, 4)
[output omitted]
> table(cutree(hcomeuc, 4))
[output omitted]

Is this structure real? Since the data generated was completely random, this is not very likely. A nice way of testing the stability of a clustering result is to slightly perturb the data numerous times and see whether the results are robust with respect to these small perturbations. This can be done by bootstrapping the individual genes, i.e., the rows of the data matrix (Pollard and van der Laan 2005). The idea behind the bootstrap is to create a new data matrix (of the same size as the original) by randomly selecting rows. Sampling is done with replacement, so some rows will be included multiple times and some rows will be omitted. Here is a simple example where we bootstrap row indices for a data matrix with 10 rows:

> sample(1:10, replace = TRUE)
[output omitted]

We can use the ClassDiscovery package to perform bootstrap resampling for clustering. In order to test the stability of k = 4 groups for hierarchical clustering with Euclidean distance and complete linkage using 200 bootstrap samples, we do the following:

> library(ClassDiscovery)
> bc <- BootstrapClusterTest(dataMatrix, cutHclust, k = 4, method = "complete",
+     metric = "euclid", ntimes = 200)
> image(bc, col = blueyellow(64))
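The row-resampling idea behind each bootstrap sample can be sketched in a few lines. The Python snippet below is purely illustrative (it mimics R's sample(1:10, replace = TRUE), not the internals of BootstrapClusterTest):

```python
# Illustrative sketch of one bootstrap sample of row indices: drawing with
# replacement means some rows repeat and some are left out entirely.
import random

random.seed(13)
n_rows = 10
# one bootstrap sample of 1-based row indices, like R's sample(1:10, replace = TRUE)
indices = random.choices(range(1, n_rows + 1), k=n_rows)
print(indices)
omitted = sorted(set(range(1, n_rows + 1)) - set(indices))
print(omitted)  # rows that do not appear in this particular sample
```

BootstrapClusterTest repeats this resampling many times, reclusters each resampled matrix, and counts how often every pair of arrays lands in the same cluster.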
The main result of BootstrapClusterTest is that, for each pair of columns (arrays in this case) in the original data, the number of times they belong to the same cluster of a bootstrap sample is counted. This is stored in bc@result and used to create the heatmap displayed in the figure. Two samples that often cluster together are shown as yellow, and two samples that often cluster apart are shown as blue. The figure clearly illustrates the absence of real structure in the data.

Exercise 5: Apply the bootstrap cluster test to the Euclidean dendrogram from the toy example data stored in variable E (see previous section). How many clusters would you expect? Is there stable structure in this data set?

> bc <- BootstrapClusterTest(E, cutHclust, k = 2, method = "average",
+     metric = "euclid", ntimes = 200)
> image(bc, col = blueyellow(64), labRow = colnames(E), labCol = colnames(E))
[Figure: bootstrap co-clustering heatmap for arrays A-D; {A, B} and {C, D} form yellow blocks]

You would expect two clusters {A, B} and {C, D}; therefore you had to choose k = 2 in the call to BootstrapClusterTest. There is indeed stable structure, indicated by the yellow squares telling you that in all bootstrap samples {A, B}, respectively {C, D}, cluster together.

Occasionally, you see papers comparing two known classes of samples that perform the following analysis:

- find a list of differentially expressed genes;
- cluster the data using only the differentially expressed genes;
- discover that you can successfully distinguish the known classes.

Is this really surprising?

Exercise 6: Let us try this on our simulated data stored in dataMatrix. Divide the 50 samples into two classes (the first 25 and the last 25). Next, perform t-tests (using the function rowttests from the genefilter package) to see how well each gene separates the two classes, and cluster the data using the top 10 genes (hint: use the function order) with Euclidean distance and complete linkage. Use bootstrapping to test the stability of the grouping found.

> classes <- as.factor(c(rep(0, 25), rep(1, 25)))
> library(genefilter)
> t.stats <- abs(rowttests(dataMatrix, classes)$statistic)
> index.top10 <- order(t.stats, decreasing = TRUE)[1:10]
> hc <- hclust(euc(t(dataMatrix[index.top10, ])), method = "complete")
> plot(hc, labels = descr, cex = 0.6)
[Figure: cluster dendrogram of the 50 samples using only the top 10 genes, complete linkage]

> bc <- BootstrapClusterTest(dataMatrix[index.top10, ], cutHclust,
+     k = 2, method = "complete", metric = "euclid", ntimes = 200)
> cols <- rep(c("red", "green"), times = c(25, 25))
> image(bc, col = blueyellow(64), RowSideColors = cols, ColSideColors = cols)
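The selection bias at work in Exercise 6 does not depend on R or on clustering: in pure-noise data, the features chosen precisely because they separate two arbitrary labels will, by construction, look separating. Here is an illustrative Python sketch of that effect (invented seed and sizes, and a simple mean-difference score instead of a t-statistic):

```python
# Illustrative sketch of selection bias: in random data, the 10 features
# with the largest class-mean difference look far more "differential" than
# an average feature, purely by chance.
import random

random.seed(1)
n_genes, n_samples = 1000, 50
data = [[random.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_genes)]
cls1, cls2 = range(25), range(25, 50)   # two arbitrary classes of 25 samples

def mean_diff(row):
    # absolute difference between the two class means (a crude t-like score)
    m1 = sum(row[j] for j in cls1) / 25
    m2 = sum(row[j] for j in cls2) / 25
    return abs(m1 - m2)

diffs = sorted((mean_diff(row) for row in data), reverse=True)
avg_all = sum(diffs) / n_genes
avg_top10 = sum(diffs[:10]) / 10
print(avg_top10 > 2 * avg_all)  # the selected "markers" separate far better than average
```

Clustering the samples on those 10 selected features then recovers the arbitrary labels rather well, which is exactly the (meaningless) stable structure seen in the bootstrap heatmap above.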
Figure 1: The black line depicts one single gene expression profile, and the vertical black line in the right-hand figure illustrates how such a profile is constructed: the gene expression levels of the same spot (= gene) are taken over multiple microarray experiments. Only ten arrays are considered in this simulated data set; therefore, the left-hand figure has ten experiments on the x-axis. The data set consists of nine different profiles; around each profile 50 noisy profiles were generated. Therefore, the simulated data set contains 9 * (50 + 1) = 459 different profiles. The log-ratio of these gene profiles is indicated on the y-axis.

You can also extract the p-values from the call to rowttests and select the 10 genes with the smallest p-values. The relatively stable clusters in this example illustrate that it is dangerous to filter on a specific contrast such as differential expression: just due to chance, this can lead to structure even in random data.

Clustering less synthetic data

Figure 1 shows the gene profiles of yet another synthetic data set, which is slightly closer to biological reality. The data can be found in /data/home/stud00/unsup/jq.txt.

Exercise 7: When clustering genes with the Euclidean distance, how many clusters would you expect? Write the code to verify your answer. You might want to explore the heatmap function for this purpose. A heat map is a false-color image with a dendrogram added to the left side and to the top (like in the previous section). In particular, explore the influence of the scale argument of heatmap. Also try to relate the clusters found to the gene expression profiles of Figure 1.

> jq.data <- read.table("jq.txt")
> heatmap(as.matrix(jq.data), Colv = NA, scale = "none", labRow = NA,
+     col = blueyellow(64))
[Figure: heatmap of the jq data set with column labels V1-V10]

> bc <- BootstrapClusterTest(t(jq.data), cutHclust, k = 9, method = "complete",
+     metric = "euclid", ntimes = 200)
> image(bc, col = blueyellow(64), labRow = NA, labCol = NA)
Given the way the data has been generated (see the legend of Figure 1), you would expect nine clusters, corresponding to each of the nine bands. In the heatmap one can clearly recognize the characteristic up-and-down patterns and the expression levels of each of the nine bands.

Exercise 8: Until now we often had to specify the number of clusters in advance. There are many methods for choosing the optimal number of clusters (Pollard and van der Laan 2005). A graphical procedure is to plot the heights of the branches of a dendrogram and then determine where the curve becomes flatter (remember the fusion graph from the lecture). This corresponds to the point where the data is not structured anymore. Can you do this for our synthetic data set using the function hclust? Then use the function locator to determine the optimal number of clusters in this data set.

> plot(rev(hclust(euc(as.matrix(jq.data)))$height),
+     main = "Height of the branches of the dendrogram")

[Figure: branch heights of the dendrogram plotted in decreasing order]

With the command

> locator(n = 1)

and then clicking in the generated plot near the hinge point, you can get a more or less accurate estimate of the number of clusters from the returned x-coordinate.

K-means clustering

Another often-used clustering algorithm is the K-means algorithm. From the lecture you might remember that the K-means algorithm tries to minimize the total within-scatter. We are going to study the behaviour of the algorithm in some more detail on a simple simulated data set.
Exercise 9:

- Use mvrnorm from the MASS package to create a data set consisting of three spherical clusters of 50 data points each. The means of the clusters are at (0,0), (10,10), and (20,20). The covariance matrices are diagonal with ones on the diagonal. Make a scatterplot of the resulting data set.

> library(MASS)
> c1 <- mvrnorm(n = 50, c(0, 0), Sigma = matrix(c(1, 0, 0, 1),
+     2, 2))
> c2 <- mvrnorm(n = 50, c(10, 10), Sigma = matrix(c(1, 0, 0, 1),
+     2, 2))
> c3 <- mvrnorm(n = 50, c(20, 20), Sigma = matrix(c(1, 0, 0, 1),
+     2, 2))
> triclust <- rbind(c1, c2, c3)
> plot(triclust)

[Figure: scatterplot of triclust showing three well-separated clusters]

- Run the function kmeans 50 times and keep track of the total within-scatter. What do you observe? Can you relate this to the clustering solutions found and give an explanation of this behaviour?

> res.kmeans <- vector("list", 50)
> kmeans.error <- vector("numeric", 50)
> for (i in 1:50) {
+     res.kmeans[[i]] <- kmeans(triclust, 3)
+     kmeans.error[i] <- sum(res.kmeans[[i]]$withinss)
+ }
> kmeans.error
[output omitted]

There are a number of clustering solutions with a much higher within-scatter. These correspond to so-called local minima that K-means can get stuck in. We can plot one of these sub-optimal solutions:

> indx <- which(kmeans.error > 5000)
> plot(triclust)
> points(res.kmeans[[indx[1]]]$centers, pch = "x", cex = 2, col = "red")

[Figure: scatterplot of triclust with the three sub-optimal cluster centers marked by red crosses]

This is a solution where two cluster centers are in a single cluster, and the third is positioned between the remaining two. This is indeed a local minimum: one of the cluster centers now sharing a cluster would need to move towards the cluster without a cluster center, and such a move would result in an initial increase of the within-scatter. Since
the K-means algorithm only decreases the within-scatter, it remains stuck in this local minimum.

- Does hierarchical clustering also suffer from this problem?

It does not, since it is a deterministic algorithm, independent of random initial conditions.

- Compute the total within-scatter for kmeans with 2, 3, 5, 10, 20, 50, 100, and 149 clusters, and then calculate the normalised within-scatter by dividing all within-scatter values by the within-scatter for two clusters. Repeat this 10 times and plot the average normalised within-scatter as a function of the number of clusters.

> niter <- 10
> K <- c(2, 3, 5, 10, 20, 50, 100, 149)
> ws <- vector("numeric", length(K))
> for (i in (1:length(K))) {
+     s <- 0
+     for (j in 1:niter) {
+         s <- s + sum(kmeans(triclust, K[i])$withinss)
+     }
+     ws[i] <- s/niter
+ }
> ws <- ws/ws[1]
> plot(ws, xlab = "number of clusters", ylab = "normalised within-scatter",
+     type = "l", xaxt = "n")
> axis(1, labels = K, at = 1:8)
[Figure: normalised within-scatter as a function of the number of clusters]

- What are the most prominent characteristics of this curve?

The within-scatter decreases monotonically as the number of clusters increases. The largest decrease occurs when we go from two to three clusters, revealing that there are probably three clusters in the data.

References

Gentleman, R., V. Carey, W. Huber, R. Irizarry, and S. Dudoit (Eds.) (2005). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer.

Gentleman, R., B. Ding, S. Dudoit, and J. Ibrahim (2005). Chapter 12: Distance measures in DNA microarray data analysis. See Gentleman, Carey, Huber, Irizarry, and Dudoit (2005).

Pollard, K. and M. van der Laan (2005). Chapter 13: Cluster analysis of genomic data. See Gentleman, Carey, Huber, Irizarry, and Dudoit (2005).

This document was generated with

> sessionInfo()
R version ( ) i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United Kingdom
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] MASS_7.3-4 genefilter_ ClassDiscovery_
[4] mclust_3.4.4 cluster_ PreProcess_
[7] oompaBase_ bioDist_ KernSmooth_
[10] Biobase_2.6.1

loaded via a namespace (and not attached):
[1] annotate_ AnnotationDbi_1.8.2 DBI_0.2-5
[4] RSQLite_0.8-4 splines_ survival_
[7] xtable_
More informationContents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results
Statistical Analysis of Microarray Data Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation of clustering results Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be
More informationShrinkage of logarithmic fold changes
Shrinkage of logarithmic fold changes Michael Love August 9, 2014 1 Comparing the posterior distribution for two genes First, we run a DE analysis on the Bottomly et al. dataset, once with shrunken LFCs
More informationConsensusClusterPlus (Tutorial)
ConsensusClusterPlus (Tutorial) Matthew D. Wilkerson April 30, 2018 1 Summary ConsensusClusterPlus is a tool for unsupervised class discovery. This document provides a tutorial of how to use ConsensusClusterPlus.
More informationGene expression & Clustering (Chapter 10)
Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching
More informationPackage pvclust. October 23, 2015
Version 2.0-0 Date 2015-10-23 Package pvclust October 23, 2015 Title Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling Author Ryota Suzuki , Hidetoshi Shimodaira
More information11/2/2017 MIST.6060 Business Intelligence and Data Mining 1. Clustering. Two widely used distance metrics to measure the distance between two records
11/2/2017 MIST.6060 Business Intelligence and Data Mining 1 An Example Clustering X 2 X 1 Objective of Clustering The objective of clustering is to group the data into clusters such that the records within
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationGS Analysis of Microarray Data
GS01 0163 Analysis of Microarray Data Keith Baggerly and Bradley Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org bmbroom@mdanderson.org
More informationClustering. Lecture 6, 1/24/03 ECS289A
Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed
More informationUniversity of California, Berkeley
University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2002 Paper 105 A New Partitioning Around Medoids Algorithm Mark J. van der Laan Katherine S. Pollard
More informationStep-by-Step Guide to Relatedness and Association Mapping Contents
Step-by-Step Guide to Relatedness and Association Mapping Contents OBJECTIVES... 2 INTRODUCTION... 2 RELATEDNESS MEASURES... 2 POPULATION STRUCTURE... 6 Q-K ASSOCIATION ANALYSIS... 10 K MATRIX COMPRESSION...
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More information10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2
161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under
More information3. Cluster analysis Overview
Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as
More informationSVM Classification in -Arrays
SVM Classification in -Arrays SVM classification and validation of cancer tissue samples using microarray expression data Furey et al, 2000 Special Topics in Bioinformatics, SS10 A. Regl, 7055213 What
More informationStatistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1
Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group
More informationMultivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)
Multivariate analyses in ecology Cluster (part 2) Ordination (part 1 & 2) 1 Exercise 9B - solut 2 Exercise 9B - solut 3 Exercise 9B - solut 4 Exercise 9B - solut 5 Multivariate analyses in ecology Cluster
More informationPackage Binarize. February 6, 2017
Type Package Title Binarization of One-Dimensional Data Version 1.2 Date 2017-01-29 Package Binarize February 6, 2017 Author Stefan Mundus, Christoph Müssel, Florian Schmid, Ludwig Lausser, Tamara J. Blätte,
More informationSimBindProfiles: Similar Binding Profiles, identifies common and unique regions in array genome tiling array data
SimBindProfiles: Similar Binding Profiles, identifies common and unique regions in array genome tiling array data Bettina Fischer October 30, 2018 Contents 1 Introduction 1 2 Reading data and normalisation
More informationExploring cdna Data. Achim Tresch, Andreas Buness, Wolfgang Huber, Tim Beißbarth
Exploring cdna Data Achim Tresch, Andreas Buness, Wolfgang Huber, Tim Beißbarth Practical DNA Microarray Analysis http://compdiag.molgen.mpg.de/ngfn/pma0nov.shtml The following exercise will guide you
More information3. Cluster analysis Overview
Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as
More informationCLUSTERING IN BIOINFORMATICS
CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationGS Analysis of Microarray Data
GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org
More informationClustering. Dick de Ridder 6/10/2018
Clustering Dick de Ridder 6/10/2018 In these exercises, you will continue to work with the Arabidopsis ST vs. HT RNAseq dataset. First, you will select a subset of the data and inspect it; then cluster
More informationExploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber
Exploring cdna Data Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber Practical DNA Microarray Analysis http://compdiag.molgen.mpg.de/ngfn/pma0nov.shtml The following exercise will guide you
More information5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction
Computational Methods for Data Analysis Massimo Poesio UNSUPERVISED LEARNING Clustering Unsupervised learning introduction 1 Supervised learning Training set: Unsupervised learning Training set: 2 Clustering
More informationUnsupervised learning, Clustering CS434
Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,
More informationContents. Introduction 2. PerturbationClustering 2. SubtypingOmicsData 5. References 8
PINSPlus: Clustering Algorithm for Data Integration and Disease Subtyping Hung Nguyen, Sangam Shrestha, and Tin Nguyen Department of Computer Science and Engineering University of Nevada, Reno, NV 89557
More informationTutorial script for whole-cell MALDI-TOF analysis
Tutorial script for whole-cell MALDI-TOF analysis Julien Textoris June 19, 2013 Contents 1 Required libraries 2 2 Data loading 2 3 Spectrum visualization and pre-processing 4 4 Analysis and comparison
More informationPlotting Segment Calls From SNP Assay
Plotting Segment Calls From SNP Assay Kevin R. Coombes 17 March 2011 Contents 1 Executive Summary 1 1.1 Introduction......................................... 1 1.1.1 Aims/Objectives..................................
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #14: Clustering Seoul National University 1 In This Lecture Learn the motivation, applications, and goal of clustering Understand the basic methods of clustering (bottom-up
More informationCorrelation. January 12, 2019
Correlation January 12, 2019 Contents Correlations The Scattterplot The Pearson correlation The computational raw-score formula Survey data Fun facts about r Sensitivity to outliers Spearman rank-order
More informationIntroduction: microarray quality assessment with arrayqualitymetrics
Introduction: microarray quality assessment with arrayqualitymetrics Audrey Kauffmann, Wolfgang Huber April 4, 2013 Contents 1 Basic use 3 1.1 Affymetrix data - before preprocessing.......................
More informationWhat to come. There will be a few more topics we will cover on supervised learning
Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationUNSUPERVISED LEARNING IN R. Introduction to hierarchical clustering
UNSUPERVISED LEARNING IN R Introduction to hierarchical clustering Hierarchical clustering Number of clusters is not known ahead of time Two kinds: bottom-up and top-down, this course bottom-up Hierarchical
More informationExploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber
Exploring cdna Data Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber Practical DNA Microarray Analysis, Heidelberg, March 2005 http://compdiag.molgen.mpg.de/ngfn/pma2005mar.shtml The following
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationClustering algorithms 6CCS3WSN-7CCSMWAL
Clustering algorithms 6CCS3WSN-7CCSMWAL Contents Introduction: Types of clustering Hierarchical clustering Spatial clustering (k means etc) Community detection (next week) What are we trying to cluster
More informationApplication of Hierarchical Clustering to Find Expression Modules in Cancer
Application of Hierarchical Clustering to Find Expression Modules in Cancer T. M. Murali August 18, 2008 Innovative Application of Hierarchical Clustering A module map showing conditional activity of expression
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning
More informationStats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms
Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California,
More information(1) where, l. denotes the number of nodes to which both i and j are connected, and k is. the number of connections of a node, with.
A simulated gene co-expression network to illustrate the use of the topological overlap matrix for module detection Steve Horvath, Mike Oldham Correspondence to shorvath@mednet.ucla.edu Abstract Here we
More informationPoints Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked
Plotting Menu: QCExpert Plotting Module graphs offers various tools for visualization of uni- and multivariate data. Settings and options in different types of graphs allow for modifications and customizations
More informationClustering. Supervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationSupplementary text S6 Comparison studies on simulated data
Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate
More informationCluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationHierarchical clustering
Hierarchical clustering Rebecca C. Steorts, Duke University STA 325, Chapter 10 ISL 1 / 63 Agenda K-means versus Hierarchical clustering Agglomerative vs divisive clustering Dendogram (tree) Hierarchical
More information