Simulation studies of module preservation: Simulation study of weak module preservation


Peter Langfelder and Steve Horvath
October 25, 2010

Contents

1 Overview
  1.a Setting up the R session
2 Data simulation
3 Module identification
4 Calculation of module preservation
5 Analysis of results

1 Overview

This tutorial presents a simulation study of module preservation in which we simulate a reference set with 20 modules of sizes around 200 profiles ("genes"), and a test set in which 10 of the 20 reference modules are preserved, while genes in the other 10 modules are simulated with independent random profiles (in the language of WGCNA, these genes are simulated as "grey").

Unlike in our other simulation studies, here genes in the preserved modules are simulated to be only very weakly co-expressed. In fact, we set up the parameters such that the standard module identification method in WGCNA does not find any modules; hence, cross-tabulation methods would by definition conclude that none of the modules are preserved. To give cross-tabulation methods a chance, we also employ Partitioning Around Medoids (PAM) with a fixed number of clusters to partition the test set into 20 clusters. We find that PAM is moderately successful in identifying the preserved modules. Lastly, we apply the function clusterRepro to the simulated data and find that the observed in-group proportion (IGP) is not very good at distinguishing preserved from non-preserved modules.

We encourage readers unfamiliar with any of the functions used in this tutorial to type help(functionName) in the active R session (replacing functionName with the actual name of the function) to get a detailed description of what the function does, what its input arguments mean, and what it returns.
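As a concrete picture of this design, the following base-R sketch lists approximate module sizes and shows which modules are left out of the test set. It reuses the proportions and the even-numbered leave-out pattern from the simulation code in Section 2. The names approxSizes and preservedModules are introduced here for illustration only, and both the rounding and the position of the "grey" proportion (taken to be the last entry) are assumptions; simulateDatExpr may assign sizes slightly differently.

```r
nGenes <- 5000
nModules <- 20

# Proportions for the 20 modules plus one final entry,
# assumed here to be the share of unassigned ("grey") genes
prop <- seq(from = 0.044, to = 0.037, length.out = nModules + 1)
approxSizes <- round(prop[1:nModules] * nGenes)

# Every second module is left out of the test set
leaveOut <- rep(FALSE, nModules)
leaveOut[c(1:(nModules / 2)) * 2] <- TRUE
preservedModules <- which(!leaveOut)

print(range(approxSizes))
print(preservedModules)
```

Under these assumptions the odd-numbered modules are the preserved ones, and all module sizes fall close to 200 genes.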
1.a Setting up the R session

After starting R we execute a few commands to set the working directory and load the requisite packages:

    # Display the current working directory
    getwd();
    # If necessary, change the path below to the directory where the data files are stored.
    # "." means current directory. On Windows use a forward slash / instead of the usual \.
    workingDir = ".";
    setwd(workingDir);
    # Load the packages WGCNA and cluster
    library(WGCNA);
    library(cluster);
    # The following setting is important, do not omit.
    options(stringsAsFactors = FALSE);

2 Data simulation

We simulate two data sets, each with 100 samples. First we set up simulation parameters such as module sizes. We choose the parameters such that in the reference data set the genes in each module are tightly co-expressed, but in the test set genes in each preserved module are only weakly co-expressed.

    nSamples = 100;
    nGenes = 5000;
    nModules = c(20, 20);
    prop = seq(from = 0.044, to = 0.037, length.out = nModules[1]+1);
    modProps = list(prop, prop);
    nSets = 2;
    # Here we set how tightly co-expressed the modules should be.
    minCor = c(0.3, 0.05);
    maxCor = c(1, 0.35);
    eigengenes = list();
    expr = list();
    simLabels = list();
    # Note: the second (test set) cut height is missing from the text as extracted.
    cutHeight = c(0.999, );

Next we simulate the data using the WGCNA function simulateDatExpr. We define the list leaveOut, which tells the simulation function which modules should be left out in each of the data sets. In this case, we leave out half of the modules in the second data set. The seed eigengenes are simulated as independent random vectors. The modules in the test (second) data set are simulated to be very loose.

    set.seed(1);
    leaveOut = list(rep(FALSE, nModules[1]), rep(FALSE, nModules[1]));
    # Leave out the even-numbered modules in the test set
    leaveOut[[2]][c(1:(nModules[1]/2))*2] = TRUE
    simOrder = list();
    for (set in 1:nSets)
    {
      eigengenes[[set]] = matrix(rnorm(nSamples * nModules[set]), nSamples, nModules[set])
      x = simulateDatExpr(eigengenes[[set]], nGenes, modProps[[set]],
                          minCor = minCor[set], maxCor = maxCor[set],
                          signed = TRUE, backgroundNoise = 1.0,
                          leaveOut = leaveOut[[set]]);
      simLabels[[set]] = x$allLabels
      simOrder[[set]] = x$labelOrder
      expr[[set]] = list(data = x$datExpr);
      colnames(expr[[set]]$data) = spaste("gene.", c(1:nGenes));
    }

3 Module identification

We now identify modules in each of the simulated data sets using the WGCNA function blockwiseModules.
    mods = list();
    # Soft-thresholding powers for network definition.
    power = c(6, 4);
    collectGarbage();
    labels = list();
    nn = if (interactive()) nSets else 1;
    for (set in 1:nn)
    {
      mods[[set]] = blockwiseModules(expr[[set]]$data, networkType = "signed hybrid",
                                     deepSplit = 1, detectCutHeight = cutHeight[set],
                                     TOMType = "none", power = power[set],
                                     numericLabels = TRUE, verbose = 4);
      labels[[set]] = matchLabels(mods[[set]]$colors, simLabels[[set]]);
      collectGarbage();
    }

We also run Partitioning Around Medoids (PAM) on the data.

    PAMlabels = matrix(0, nGenes, nSets)
    for (set in 1:nSets)
    {
      # Dissimilarity: negative correlations set to zero, then soft-thresholded
      cr = cor(expr[[set]]$data);
      cr[cr<0] = 0;
      adj = cr^power[set];
      dist = as.dist(1-adj);
      PAMlabels[, set] = pam(dist, nModules[set], cluster.only = TRUE);
      PAMlabels[, set] = matchLabels(PAMlabels[, set], simLabels[[set]]);
      collectGarbage();
    }

How did module identification do? We plot the gene dendrograms with the simulated and identified module colors.

    sizeGrWindow(10,7);
    #pdf(file = "Plots/preserved-moduleDetectionFailed-dendrograms.pdf", width = 10, height = 7)
    layout(matrix(c(1:5), 5, 1), heights = c(rep(c(0.8, 0.2), 2), 0.3));
    setNames = c("Reference data set", "Test data set");
    for (set in 1:nSets)
    {
      if (set==1)
      {
        colors = labels2colors(cbind(labels[[1]], simLabels[[set]]))
        names = c("Inferred", "Simulated");
      } else {
        colors = labels2colors(cbind(PAMlabels[, set], simLabels[[set]]))
        names = c("PAM", "Simulated");
      }
      plotDendroAndColors(mods[[set]]$dendrograms[[1]], colors, names, dendroLabels = FALSE,
                          hang = 0.03,
                          main = spaste(LETTERS[set], ". ", setNames[set],
                                        ": gene clustering tree and module colors"),
                          setLayout = FALSE, abHeight = cutHeight[set],
                          cex.colorLabels = 1.2, cex.main = 1.5, cex.lab = 1.2, cex.axis = 1.2);
    }

The result is shown in Figure 1. In the test data set, hierarchical clustering did not identify any modules, because we simulated the modules with very weak correlations.
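To get a feel for how weak this co-expression is, the following base-R sketch simulates one toy module whose genes correlate with a seed profile at levels drawn uniformly between 0.05 and 0.35 (the minCor and maxCor values used for the test set) and compares its pairwise correlations with those among pure-noise genes. The construction (seed plus independent noise), the helper names simulateToyModule and meanOffDiagonal, and the toy sizes are assumptions made for illustration; this is not a description of how simulateDatExpr works internally.

```r
set.seed(1)
nSamples <- 100
nToyGenes <- 200

# Hypothetical helper: genes with a prescribed population correlation to a seed profile.
# Column j equals targetCor[j] * seed + sqrt(1 - targetCor[j]^2) * (independent noise).
simulateToyModule <- function(seed, targetCor) {
  noise <- matrix(rnorm(length(seed) * length(targetCor)), nrow = length(seed))
  outer(seed, targetCor) + sweep(noise, 2, sqrt(1 - targetCor^2), `*`)
}

seedProfile <- rnorm(nSamples)
targetCor <- runif(nToyGenes, min = 0.05, max = 0.35)
weakModule <- simulateToyModule(seedProfile, targetCor)
noiseGenes <- matrix(rnorm(nSamples * nToyGenes), nrow = nSamples)

# Hypothetical helper: mean pairwise correlation, excluding the diagonal
meanOffDiagonal <- function(x) {
  cr <- cor(x)
  mean(cr[upper.tri(cr)])
}

withinWeak <- meanOffDiagonal(weakModule)
withinNoise <- meanOffDiagonal(noiseGenes)

# Standard deviation of a single sample correlation between independent normal profiles
corNoiseSD <- 1 / sqrt(nSamples - 1)

print(c(weakModule = withinWeak, noise = withinNoise, samplingSD = corNoiseSD))
```

In this toy construction the expected correlation between two module genes is the product of their seed correlations, on average about 0.2 x 0.2 = 0.04, which is well below the sampling noise of a single correlation coefficient at 100 samples (about 0.1). That is consistent with a clustering tree built from pairwise correlations showing no distinct branches.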

[Figure 1. Panel A: Reference data set, gene clustering tree (average-linkage hierarchical clustering) with Inferred and Simulated module colors. Panel B: Test data set, gene clustering tree with PAM and Simulated module colors. Panel C: PAM vs. simulated module colors.]

Figure 1: Module identification in the simulated data sets. In the reference set, hierarchical clustering (panel A) easily identifies the 20 modules as distinct branches. The simulated and identified module colors, shown below the dendrogram, show excellent agreement. In the test set (panel B), hierarchical clustering did not identify any recognizable branches. The simulated and PAM colors, shown below the clustering tree, also do not show any apparent relationship to the dendrogram. Panel C shows a comparison of simulated module colors and PAM cluster labels. It is very difficult to argue that any of the modules in the test set are preserved.
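The comparison of PAM clusters with simulated modules can be sketched on a small toy data set using only base R and the cluster package. The sketch uses the same dissimilarity as the tutorial code (negative correlations set to zero, then soft-thresholded) but replaces the WGCNA label matching with a plain cross-tabulation via table. The toy sizes, the weight 0.6 on the seed profile, and all variable names are assumptions for illustration.

```r
library(cluster)

set.seed(1)
nSamples <- 50
nPerModule <- 30
nToyModules <- 3

# One seed profile per toy module and the true module label of each gene
seeds <- matrix(rnorm(nSamples * nToyModules), nSamples, nToyModules)
trueLabels <- rep(seq_len(nToyModules), each = nPerModule)

# Each gene is a damped copy of its module seed plus independent noise
toyExpr <- 0.6 * seeds[, trueLabels] +
  matrix(rnorm(nSamples * length(trueLabels)), nrow = nSamples)

# Dissimilarity as in the tutorial: zero out negative correlations, soft-threshold
softPower <- 4
cr <- cor(toyExpr)
cr[cr < 0] <- 0
dissim <- as.dist(1 - cr^softPower)

# PAM with the number of clusters fixed to the number of simulated modules
pamLabels <- pam(dissim, k = nToyModules, cluster.only = TRUE)

# Cross-tabulate PAM clusters against the simulated modules
crossTab <- table(PAM = pamLabels, simulated = trueLabels)
print(crossTab)
```

PAM cluster numbers are arbitrary, so a good partition shows up as one dominant cell per row and column of the table rather than as a heavy diagonal.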

4 Calculation of module preservation

Here we run the main module preservation function modulePreservation. After the calculation we save the results; when re-analyzing previously calculated results, one can simply load them from disk, which saves a lot of time.

    names(expr) = c("Set1", "Set2");
    labelList = list(labels[[1]], PAMlabels[, 2]);
    names(labelList) = names(expr);
    mp = modulePreservation(expr, labelList, networkType = "signed", nPermutations = 200,
                            verbose = 3, maxGoldModuleSize = 1000);
    # Save the module preservation results as well as the PAM cluster labels
    save(mp, PAMlabels, file = "preserved-moduledetectionfailed-20modules.rdata");

If the module preservation results have been calculated previously, load the results from the disk:

    load(file = "preserved-moduledetectionfailed-20modules.rdata");

Calculation of IGP in clusterRepro

Here we apply clusterRepro to the test set. We calculate the eigengenes of the reference modules in the test set and use them as the centroids in the IGP calculation.

    # Need centroids for the new data set. Calculate module eigengenes.
    MEs = moduleEigengenes(expr[[2]]$data, labels[[1]])$eigengenes
    # Get rid of the grey eigengene
    MEs = MEs[, -1]
    doClusterRepro = TRUE
    if (doClusterRepro)
    {
      library(clusterRepro)
      rownames(MEs) = spaste("sample.", c(1:nSamples));
      rownames(expr[[2]]$data) = spaste("sample.", c(1:nSamples));
      set.seed(40);
      print(system.time( {
        cr = clusterRepro(as.matrix(MEs), expr[[2]]$data, 1000);
      } ));
      save(cr, file = "preserved-moduledetectionfailed-20modules-cr.rdata");
    }

If the clusterRepro results have been calculated previously, load the results from the disk:

    load(file = "preserved-moduledetectionfailed-20modules-cr.rdata");

5 Analysis of results

Here we look at how well each method did at identifying the 10 preserved modules in the hopelessly noisy test data.
Since the modules all have very similar sizes, we do not plot results as a function of module size; rather, in each plot we simply order the modules by their corresponding preservation statistic and look for a clean separation of preserved and non-preserved modules.

    # How well can one distinguish preserved from non-preserved modules?
    sizeGrWindow(10,8)
    #pdf(file = "Plots/preserved-moduleDetectionFailed-20Modules-preservationSuccess.pdf", w= 10, h = 8);
    # Red = preserved, black = left out (non-preserved)
    presColor = c("red", "black")[as.numeric(leaveOut[[2]])+1];
    # Set graphical parameters
    par(mfrow = c(3,2));
    par(mar = c(3.8, 3.8, 2, 0.5));
    par(mgp = c(2.3, 0.7, 0));
    cex.lab = 1.3; cex.axis = 1.3; cex.main = 1.4

    # Module preservation: Zsummary scores
    Zs = mp$preservation$Z[[1]][[2]]$Zsummary[
             order(as.numeric(rownames(mp$preservation$Z[[1]][[2]])))][-c(1:2)];
    order = order(-Zs);
    plot(Zs[order], col = presColor[order], cex.main = cex.main,
         xlab = "", ylab = "Preservation Zsummary", cex.lab = cex.lab, cex.axis = cex.axis,
         main = "A. Network-based preservation indices: Zsummary")

    # Module preservation: psummary statistics
    Zs = -mp$preservation$log.p[[1]][[2]]$log.psummary[
             order(as.numeric(rownames(mp$preservation$Z[[1]][[2]])))][-c(1:2)];
    order = order(-Zs);
    plot(Zs[order], col = presColor[order],
         xlab = "", ylab = "-log10(psummary)", cex.lab = cex.lab, cex.axis = cex.axis,
         cex.main = cex.main,
         main = "B. Network-based preservation indices: psummary")
    abline(h = -log10(0.05), col = "blue");
    abline(h = -log10(0.05/nModules[1]), col = "green");

    # Co-clustering
    cc = mp$accuracy$observed[[1]][[2]][-1, "coClustering"];
    order = order(-cc)
    plot(cc[order], col = presColor[order], cex.main = cex.main,
         xlab = "", ylab = "coclustering", cex.lab = cex.lab, cex.axis = cex.axis,
         main = "D. Cross-tabulation with results of PAM: Co-clustering")

    # Cross-tabulation: Fisher p-value
    # Assumption: tab is not defined in the text as extracted; here it is taken to be the
    # overlap table of the reference labels and the PAM labels of the test set.
    tab = overlapTable(labels[[1]], PAMlabels[, 2]);
    bestp = apply(tab$pTable[-1, ], 1, min);
    order = order(bestp)
    plot(-log10(pmin(rep(1, nModules[1]), bestp[order])), col = presColor[order],
         cex.main = cex.main,
         xlab = "", ylab = "-log10(overlap p-value)", cex.lab = cex.lab, cex.axis = cex.axis,
         main = "C. Cross-tabulation with results of PAM: overlap p-value")
    abline(h = -log10(0.05), col = "blue");
    abline(h = -log10(0.05/nModules[1]), col = "green");

    # clusterRepro: observed IGP
    p = cr$Actual.IGP;
    order = order(-p);
    plot(p[order], col = presColor[order],
         xlab = "", ylab = "IGP", cex.lab = cex.lab, cex.axis = cex.axis,
         cex.main = cex.main,
         main = "E. clusterRepro: IGP")

    # clusterRepro: permutation p-value
    p = cr$p;
    order = order(p);
    # Add 1e-4 so that zero p-values fit into the plot
    plot(-log10(p+1e-4)[order], col = presColor[order], cex.main = cex.main,
         xlab = "", ylab = "-log10(clusterRepro p-value)", cex.lab = cex.lab, cex.axis = cex.axis,
         main = "F. clusterRepro: permutation p-value")
    abline(h = -log10(0.05), col = "blue");
    abline(h = -log10(0.05/nModules[1]), col = "green");

    # If plotting into a pdf file, close it
    dev.off();

The resulting plots are shown in Figure 2. The figure shows that network preservation statistics are in this case successful in reliably separating preserved and non-preserved modules. On the other hand, cross-tabulation and clusterRepro have only limited success; based on Bonferroni-corrected p-values, most of the preserved modules are called non-preserved.

[Figure 2, six panels, each marking preserved and non-preserved modules. A. Network-based preservation indices: Zsummary (y-axis: Preservation Zsummary). B. Network-based preservation indices: psummary (y-axis: -log10(psummary)). C. Cross-tabulation with results of PAM: overlap p-value (y-axis: -log10(overlap p-value)). D. Cross-tabulation with results of PAM: Co-clustering (y-axis: coclustering). E. clusterRepro: IGP (y-axis: IGP). F. clusterRepro: permutation p-value (y-axis: -log10(clusterRepro p-value)).]

Figure 2: Success of several module preservation measures at distinguishing weakly preserved from non-preserved modules. In each plot, modules are ordered by the preservation statistic shown in the plot. Red denotes preserved and black non-preserved modules. In the p-value plots (the right column), the blue line denotes the threshold p = 0.05, and the green line denotes the Bonferroni-corrected threshold (0.05 divided by the number of modules, as in the plotting code). In the clusterRepro p-value plot, we added 10^-4 to all p-values so that zero p-values become 10^-4 and fit into the plot.
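The positions of the two threshold lines and the effect of the 10^-4 offset can be checked with a few lines of base R. The p-values in toyP are made up for illustration and are not results from this tutorial; the variable names are likewise hypothetical.

```r
nModules <- 20
alpha <- 0.05

# Heights of the blue (nominal) and green (Bonferroni) lines on the -log10 scale
nominalLine <- -log10(alpha)
bonferroniLine <- -log10(alpha / nModules)

# Made-up p-values, including an exact zero as a permutation test can produce
toyP <- c(0, 0.001, 0.004, 0.03, 0.2)
shifted <- -log10(toyP + 1e-4)

# Which toy modules would be called preserved under each threshold
calls <- data.frame(
  p = toyP,
  minusLog10 = shifted,
  nominal = toyP < alpha,
  bonferroni = toyP < alpha / nModules
)
print(c(nominal = nominalLine, bonferroni = bonferroniLine))
print(calls)
```

With 20 modules the corrected threshold is 0.05/20, so a module needs a p-value about 20 times smaller to be called preserved. This is why weakly preserved modules with nominally significant p-values can still fall below the green line.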
