Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice. 1. Data input and cleaning

Size: px

Start display at page:

Download "Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice. 1. Data input and cleaning"

Caren Murphy
5 years ago
Views:

1 Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice 1. Data input and cleaning Peter Langfelder and Steve Horvath February 13, 2016 Contents 1 Data input, cleaning and pre-processing 1 1.a Loading expression data b Rudimentary data cleaning and outlier removal c Loading clinical trait data Data input, cleaning and pre-processing 1.a Loading expression data The expression data is contained in two files that come with this tutorial, LiverFemale3600.csv and LiverMale3600.csv. After starting an R session, we check that the current directory is appropriately set, and load the requisite packages and the data. In addition to expression data, the data files contain extra information about the surveyed probes we do not need. # Display the current working directory getwd(); # If necessary, change the path below to the directory where the data files are stored. # "." means current directory. On Windows use a forward slash / instead of the usual \. workingdir = "."; setwd(workingdir); # Load the package library(wgcna); # The following setting is important, do not omit. options(stringsasfactors = FALSE); #Read in the female liver data set femdata = read.csv("liverfemale3600.csv"); # Read in the male liver data set maledata = read.csv("livermale3600.csv"); # Take a quick look at what is in the data sets (caution, longish output): dim(femdata) names(femdata) dim(maledata) names(maledata)

2 The data sets contain roughly 130 samples each. Note that each row corresponds to a gene and column to a sample or auxiliary information. We now remove the auxiliary data and put the expression data into a multi-set format suitable for consensus analysis. # We work with two sets: nsets = 2; # For easier labeling of plots, create a vector holding descriptive names of the two sets. setlabels = c("female liver", "Male liver") shortlabels = c("female", "Male") # Form multi-set expression data: columns starting from 9 contain actual expression data. multiexpr = vector(mode = "list", length = nsets) multiexpr[[1]] = list(data = as.data.frame(t(femdata[-c(1:8)]))); names(multiexpr[[1]]$data) = femdata$substancebxh; rownames(multiexpr[[1]]$data) = names(femdata)[-c(1:8)]; multiexpr[[2]] = list(data = as.data.frame(t(maledata[-c(1:8)]))); names(multiexpr[[2]]$data) = maledata$substancebxh; rownames(multiexpr[[2]]$data) = names(maledata)[-c(1:8)]; # Check that the data has the correct format for many functions operating on multiple sets: The variable exprsize contains useful information about the sizes of all of the data sets: >exprsize $nsets [1] 2 $ngenes [1] 3600 $nsamples [1] $structureok [1] TRUE 1.b Rudimentary data cleaning and outlier removal We first identify genes and samples with excessive numbers of missing samples. These can be identified using the function goodsamplesgenesms. # Check that all genes and samples have sufficiently low numbers of missing values. gsg = goodsamplesgenesms(multiexpr, verbose = 3); gsg$allok If the last statement returns TRUE, all genes and samples have passed the cuts. If it returns FALSE, the following code removes the offending samples and genes: if (!gsg$allok) # Print information about the removed genes: if (sum(!gsg$goodgenes) > 0) printflush(paste("removing genes:", paste(names(multiexpr[[1]]$data)[!gsg$goodgenes], collapse = ", "))) for (set in 1:exprSize$nSets)

3 if (sum(!gsg$goodsamples[[set]])) printflush(paste("in set", setlabels[set], "removing samples", paste(rownames(multiexpr[[set]]$data)[!gsg$goodsamples[[set]]], collapse = ", "))) # Remove the offending genes and samples multiexpr[[set]]$data = multiexpr[[set]]$data[gsg$goodsamples[[set]], gsg$goodgenes]; # Update exprsize We now cluster the samples on their Euclidean distance, separately in each set. sampletrees = list() sampletrees[[set]] = hclust(dist(multiexpr[[set]]$data), method = "average") The easiest way to see the two dendrograms at the same time is to plot both into a pdf file that can be viewed using standard pdf viewers. pdf(file = "Plots/SampleClustering.pdf", width = 12, height = 12); par(mfrow=c(2,1)) par(mar = c(0, 4, 2, 0)) plot(sampletrees[[set]], main = paste("sample clustering on all genes in", setlabels[set]), xlab="", sub="", cex = 0.7); dev.off(); By inspection, there seems to be one outlier in the female data set, and no obvious outliers in the male set. We now remove the female outlier using a semi-automatic code that only requires a choice of a height cut. We first re-plot the two sample trees with the cut lines included (see Fig. 1): # Choose the "base" cut height for the female data set baseheight = 16 # Adjust the cut height for the male data set for the number of samples cutheights = c(16, 16*exprSize$nSamples[2]/exprSize$nSamples[1]); # Re-plot the dendrograms including the cut lines pdf(file = "Plots/SampleClustering.pdf", width = 12, height = 12); par(mfrow=c(2,1)) par(mar = c(0, 4, 2, 0)) plot(sampletrees[[set]], main = paste("sample clustering on all genes in", setlabels[set]), xlab="", sub="", cex = 0.7); abline(h=cutheights[set], col = "red"); dev.off(); Now comes the actual outlier removal: # Find clusters cut by the line labels = cutreestatic(sampletrees[[set]], cutheight = cutheights[set]) # Keep the largest one (labeled by the number 1) keep = (labels==1) multiexpr[[set]]$data = multiexpr[[set]]$data[keep, ]

4 collectgarbage(); # Check the size of the leftover data exprsize 1.c Loading clinical trait data We now read in the trait data and match the samples for which they were measured to the expression samples. traitdata = read.csv("clinicaltraits.csv"); dim(traitdata) names(traitdata) # remove columns that hold information we do not need. alltraits = traitdata[, -c(31, 16)]; alltraits = alltraits[, c(2, 11:36) ]; # See how big the traits are and what are the trait and sample names dim(alltraits) names(alltraits) alltraits$mice # Form a multi-set structure that will hold the clinical traits. Traits = vector(mode="list", length = nsets); setsamples = rownames(multiexpr[[set]]$data); traitrows = match(setsamples, alltraits$mice); Traits[[set]] = list(data = alltraits[traitrows, -1]); rownames(traits[[set]]$data) = alltraits[traitrows, 1]; collectgarbage(); # Define data set dimensions ngenes = exprsize$ngenes; nsamples = exprsize$nsamples; We now have the expression data for both sets in the variable multiexpr, and the corresponding clinical traits in the variable Traits. The last step is to save the relevant data for use in the subsequent analysis: save(multiexpr, Traits, ngenes, nsamples, setlabels, shortlabels, exprsize, file = "Consensus-dataInput.RData");

5 Sample clustering on all genes in Female liver F2_238 F2_235 F2_279 F2_148 F2_219 F2_152 F2_275 F2_172 F2_176 F2_317 F2_9 F2_93 F2_251 F2_186 F2_281 F2_120 F2_232 F2_179 F2_159 F2_178 F2_123 F2_185 F2_174 F2_149 F2_153 F2_197 F2_5 F2_59 F2_343 F2_49 F2_50 F2_220 F2_233 F2_207 F2_234 F2_105 F2_160 F2_183 F2_353 F2_18 F2_91 F2_17 F2_40 F2_265 F2_282 F2_74 F2_236 F2_315 F2_199 F2_28 F2_4 F2_92 F2_22 F2_33 F2_198 F2_285 F2_29 F2_237 F2_30 F2_147 F2_171 F2_121 F2_286 F2_254 F2_55 F2_76 F2_35 F2_314 F2_84 F2_276 F2_318 F2_57 F2_266 F2_60 F2_116 F2_217 F2_239 F2_7 F2_216 F2_316 F2_56 F2_73 F2_39 F2_184 F2_313 F2_104 F2_252 F2_13 F2_158 F2_173 F2_151 F2_124 F2_146 F2_16 F2_294 F2_268 F2_170 F2_249 F2_41 F2_75 F2_257 F2_209 F2_115 F2_292 F2_94 F2_34 F2_27 F2_114 F2_218 F2_250 F2_230 F2_8 F2_85 F2_284 F2_256 F2_280 F2_10 F2_295 F2_231 F2_122 F2_210 F2_6 F2_208 F2_274 Height F2_182 F2_332 F2_303 F2_304 F2_110 F2_155 F2_302 F2_245 F2_107 F2_228 F2_201 F2_80 F2_81 F2_108 F2_243 F2_263 F2_291 F2_87 F2_330 F2_69 F2_181 F2_157 F2_169 F2_112 F2_195 F2_355 F2_43 F2_20 F2_119 F2_329 F2_165 F2_300 F2_154 F2_167 F2_247 F2_298 F2_72 F2_324 F2_42 F2_145 F2_156 F2_290 F2_111 F2_189 F2_327 F2_309 F2_323 F2_223 F2_127 F2_162 F2_194 F2_222 F2_47 F2_326 F2_320 F2_51 F2_144 F2_289 F2_2 F2_89 F2_141 F2_188 F2_142 F2_26 F2_311 F2_192 F2_296 F2_270 F2_310 F2_312 F2_357 F2_125 F2_126 F2_328 F2_163 F2_190 F2_200 F2_166 F2_191 F2_215 F2_271 F2_261 F2_109 F2_306 F2_272 F2_213 F2_86 F2_299 F2_88 F2_15 F2_66 F2_143 F2_278 F2_321 F2_187 F2_180 F2_78 F2_305 F2_287 F2_308 F2_37 F2_71 F2_68 F2_83 F2_46 F2_225 F2_241 F2_214 F2_224 F2_70 F2_48 F2_117 F2_79 F2_226 F2_264 F2_19 F2_227 F2_212 F2_3 F2_52 F2_54 F2_45 F2_63 F2_65 F2_242 F2_23 F2_244 F2_14 F2_164 F2_248 F2_24 F2_288 F2_307 F2_325 F2_221 Sample clustering on all genes in Male liver Height Figure 1: Clustering of samples in the two input sets, female and male liver expression data. The red line in the female dendrogram is the cut line for outlier detection; samples cut by this line are considered outliers. The corresponding cut line for the male samples lies outside of the dendrogram, indicating that no samples from the male set are considered outliers.

Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice

Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice 2.b Step-by-step network construction and module detection Peter Langfelder and Steve