Simulation studies of module preservation: Simulation study of weak module preservation

Peter Langfelder and Steve Horvath
October 25, 2010

Contents
1 Overview
1.a Setting up the R session
2 Data simulation
3 Module identification
4 Calculation of module preservation
5 Analysis of results

1 Overview

This tutorial presents a simulation study of module preservation in which we simulate a reference set with 20 modules of sizes around 200 profiles ("genes"), and a test set in which 10 of the 20 reference modules are preserved and the genes in the other 10 modules are simulated with independent random profiles (in the language of WGCNA, these genes are simulated "grey").

Unlike in our other simulation studies, here genes in the preserved modules are simulated to be only very weakly co-expressed. In fact, we set up the parameters such that the standard module identification method in WGCNA does not find any modules; hence, cross-tabulation methods would by definition conclude that none of the modules are preserved. To give cross-tabulation methods a chance, we also employ Partitioning Around Medoids (PAM) with a fixed number of clusters to partition the test set into 20 clusters. We find that PAM is moderately successful in identifying the preserved modules. Lastly, we apply the function clusterRepro to this simulated data and find that the observed IGP is not very good at distinguishing the preserved and non-preserved modules.

We encourage readers unfamiliar with any of the functions used in this tutorial to type, in the active R session, help(functionName) (replace functionName with the actual name of the function) to get a detailed description of what the function does, what the input arguments mean, and what the output is.
1.a Setting up the R session

After starting R we execute a few commands to set the working directory and load the requisite packages:

    # Display the current working directory
    getwd();
    # If necessary, change the path below to the directory where the data files are stored.
    # "." means current directory. On Windows use a forward slash / instead of the usual \.
    workingDir = ".";
    setwd(workingDir);
    # Load the packages WGCNA and cluster
    library(WGCNA);
    library(cluster);
    # The following setting is important, do not omit.
    options(stringsAsFactors = FALSE);

2 Data simulation

We simulate two data sets, each with 100 samples. First we set up the simulation parameters, such as the module sizes. We choose the parameters such that in the reference data set the genes in each module are tightly co-expressed, but in the test set the genes in each preserved module are only weakly co-expressed.

    nSamples = 100;
    nGenes = 5000;
    nModules = c(20,20);
    prop = seq(from = 0.044, to = 0.037, length.out = nModules[1]+1);
    modProps = list(prop, prop);
    nSets = 2;
    # Here we set how tightly co-expressed the modules should be.
    minCor = c(0.3, 0.05);
    maxCor = c(1, 0.35);
    eigengenes = list();
    expr = list();
    simLabels = list();
    # The second entry (cut height for the test set) is missing in this transcription.
    cutHeight = c(0.999, );

Next we simulate the data using the WGCNA function simulateDatExpr. We define the list leaveOut, which tells the simulation function which modules should be left out in each of the data sets. In this case, we leave out half of the modules (the even-numbered ones) in the second data set. The seed eigengenes are simulated as independent random vectors. The modules in the test (second) data set are simulated to be very loose.

    set.seed(1);
    leaveOut = list(rep(FALSE, nModules[1]), rep(FALSE, nModules[1]));
    leaveOut[[2]][c(1:(nModules[1]/2))*2] = TRUE
    simOrder = list();
    for (set in 1:nSets)
    {
      eigengenes[[set]] = matrix(rnorm(nSamples * nModules[set]), nSamples, nModules[set])
      x = simulateDatExpr(eigengenes[[set]], nGenes, modProps[[set]],
                          minCor = minCor[set], maxCor = maxCor[set],
                          signed = TRUE, backgroundNoise = 1.0,
                          leaveOut = leaveOut[[set]]);
      simLabels[[set]] = x$allLabels
      simOrder[[set]] = x$labelOrder
      expr[[set]] = list(data = x$datExpr);
      colnames(expr[[set]]$data) = spaste("gene.", c(1:nGenes));
    }

3 Module identification

We now identify modules in each of the simulated data sets using the WGCNA function blockwiseModules.
    mods = list();
    # Soft thresholding powers for network definition.
    power = c(6, 4);
    collectGarbage();
    labels = list();
    nn = if (interactive()) nSets else 1;
    for (set in 1:nn)
    {
      mods[[set]] = blockwiseModules(expr[[set]]$data, networkType = "signed hybrid",
                                     deepSplit = 1, detectCutHeight = cutHeight[set],
                                     TOMType = "none", power = power[set],
                                     numericLabels = TRUE, verbose = 4);
      labels[[set]] = matchLabels(mods[[set]]$colors, simLabels[[set]]);
      collectGarbage();
    }

We also run Partitioning Around Medoids (PAM) on the data.

    PAMlabels = matrix(0, nGenes, nSets)
    for (set in 1:nSets)
    {
      cr = cor(expr[[set]]$data);
      # Keep positive correlations only, then apply the soft threshold
      cr[cr<0] = 0;
      adj = cr^power[set];
      dist = as.dist(1-adj);
      PAMlabels[, set] = pam(dist, nModules[set], cluster.only = TRUE);
      PAMlabels[, set] = matchLabels(PAMlabels[, set], simLabels[[set]]);
      collectGarbage();
    }

How did module identification do? We plot the gene dendrograms with the simulated and identified module colors.

    sizeGrWindow(10,7);
    #pdf(file = "Plots/preserved-moduleDetectionFailed-dendrograms.pdf", width = 10, height = 7)
    layout(matrix(c(1:5), 5, 1), heights = c(rep(c(0.8, 0.2), 2), 0.3));
    setNames = c("Reference data set", "Test data set");
    for (set in 1:nSets)
    {
      if (set==1)
      {
        colors = labels2colors(cbind(labels[[1]], simLabels[[set]]))
        names = c("Inferred", "Simulated");
      } else {
        colors = labels2colors(cbind(PAMlabels[, set], simLabels[[set]]))
        names = c("PAM", "Simulated");
      }
      plotDendroAndColors(mods[[set]]$dendrograms[[1]], colors, names,
                          dendroLabels = FALSE, hang = 0.03,
                          main = spaste(LETTERS[set], ". ", setNames[set],
                                        ": gene clustering tree and module colors"),
                          setLayout = FALSE, abHeight = cutHeight[set],
                          cex.colorLabels = 1.2, cex.main = 1.5, cex.lab = 1.2, cex.axis = 1.2);
    }

The result is shown in Figure 1. In the test data set, hierarchical clustering did not identify any modules. That is because we have simulated the modules with very weak correlations.
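To get a feeling for how weak these correlations are, it helps to translate gene-to-eigengene correlations into gene-to-gene correlations. The following base R sketch assumes a simplified toy model in which each gene equals r times the seed eigengene plus independent noise; this is an illustration only and not necessarily how simulateDatExpr generates data internally. The helper simulateToyModule and all variable names ending in Toy are hypothetical. Under this toy model, two genes with eigengene correlations r1 and r2 have a mutual correlation of roughly r1 * r2.

```r
# Toy model (an assumption, not the WGCNA internals): each gene is
#   x = r * seed + sqrt(1 - r^2) * noise,
# so cor(gene, seed) is about r and cor(gene_i, gene_j) is about r_i * r_j.
simulateToyModule = function(seed, r)
{
  nSamples = length(seed);
  noise = matrix(rnorm(nSamples * length(r)), nSamples, length(r));
  # sweep() scales column j of the noise by sqrt(1 - r[j]^2)
  outer(seed, r) + sweep(noise, 2, sqrt(1 - r^2), "*");
}

# Average correlation over all distinct gene pairs
meanOffDiag = function(x)
{
  cr = cor(x);
  mean(cr[upper.tri(cr)]);
}

set.seed(1);
nSamplesToy = 2000;   # many samples so that sample correlations are stable
seedToy = as.numeric(scale(rnorm(nSamplesToy)));

# Correlation ranges taken from minCor and maxCor in the tutorial
rTight = seq(from = 0.3, to = 1, length.out = 50);
rLoose = seq(from = 0.05, to = 0.35, length.out = 50);

tightCor = meanOffDiag(simulateToyModule(seedToy, rTight));
looseCor = meanOffDiag(simulateToyModule(seedToy, rLoose));

# Raising a typical correlation to the soft thresholding powers used above
# (6 for the reference set, 4 for the test set) shows how little signal
# remains in the loose module.
print(c(tight = tightCor, loose = looseCor));
print(c(tightAdj = tightCor^6, looseAdj = looseCor^4));
```

In this toy setting the loose module's typical pairwise adjacency is orders of magnitude below that of the tight module, which is consistent with a dendrogram that shows no distinct branches.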

[Figure 1. Panels: A. Reference data set: gene clustering tree and module colors (color rows: Inferred, Simulated). B. Test data set: gene clustering tree and module colors (color rows: PAM, Simulated). C. PAM vs. simulated module colors.]

Figure 1: Module identification in the simulated data sets. In the reference set the hierarchical clustering (panel A) easily identifies the 20 modules as distinct branches. Simulated and identified module colors, shown below the dendrogram, show excellent agreement. In the test set (panel B) the hierarchical clustering did not identify any recognizable branches. The simulated and PAM colors, shown below the clustering tree, also do not show any apparent relationship to the dendrogram. Panel C shows a comparison of simulated module colors and PAM cluster labels. It is very difficult to argue that any of the modules in the test set are preserved.
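The PAM step can be tried in isolation on a small toy data set. The sketch below mirrors the recipe used in the tutorial (zero out negative correlations, raise to a power, cluster on 1 minus adjacency) but uses three strongly co-expressed toy modules so that the expected outcome is easy to see. All sizes, the noise level, and the variable names are illustrative assumptions; only pam from the cluster package and base R functions are used.

```r
library(cluster);   # provides pam()

set.seed(2);
nSamplesToy = 50;
nPerModule = 30;
nModulesToy = 3;
trueLabels = rep(1:nModulesToy, each = nPerModule);
nGenesToy = length(trueLabels);

# Each toy gene is its module's seed profile plus independent noise
seeds = matrix(rnorm(nSamplesToy * nModulesToy), nSamplesToy, nModulesToy);
noise = matrix(rnorm(nSamplesToy * nGenesToy), nSamplesToy, nGenesToy);
toyExpr = seeds[, trueLabels] + 0.5 * noise;

# Same dissimilarity recipe as in the tutorial: keep positive correlations,
# apply a soft threshold, and convert adjacency to a distance
cr = cor(toyExpr);
cr[cr < 0] = 0;
adj = cr^4;
dissim = as.dist(1 - adj);
pamLabels = pam(dissim, k = nModulesToy, cluster.only = TRUE);

# Cross-tabulate PAM clusters against the simulated modules
crossTab = table(PAM = pamLabels, simulated = trueLabels);
print(crossTab);

# Fraction of genes that fall into the dominant module of their PAM cluster
agreement = sum(apply(crossTab, 1, max)) / nGenesToy;
print(agreement);
```

Because PAM always returns exactly k clusters, it produces a partition even when the data contain no visible cluster structure. That makes it usable for cross-tabulation in the weakly co-expressed test set, but it also means the resulting clusters need not correspond to real modules.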

4 Calculation of module preservation

Here we run the main module preservation function modulePreservation. After the calculation we save the results; if previously calculated results are re-analyzed, one can simply read them from disk, which saves a lot of time.

    names(expr) = c("Set1", "Set2");
    labelList = list(labels[[1]], PAMlabels[, 2]);
    names(labelList) = names(expr);
    mp = modulePreservation(expr, labelList, networkType = "signed",
                            nPermutations = 200, verbose = 3,
                            maxGoldModuleSize = 1000);
    # Save the module preservation results as well as the PAM cluster labels
    save(mp, PAMlabels, file = "preserved-moduledetectionfailed-20modules.rdata");

If the module preservation results have been calculated previously, load the results from the disk:

    load(file = "preserved-moduledetectionfailed-20modules.rdata");

Calculation of IGP in clusterRepro

Here we apply clusterRepro to the test set. We calculate the eigengenes of the reference modules in the test set and use them as the centroids in the IGP calculation.

    # Need centroids for the new data set. Calculate module eigengenes.
    MEs = moduleEigengenes(expr[[2]]$data, labels[[1]])$eigengenes
    # Get rid of the grey eigengene
    MEs = MEs[, -1]
    doClusterRepro = TRUE
    if (doClusterRepro)
    {
      library(clusterRepro)
      # Centroids and data must carry matching row names
      rownames(MEs) = spaste("sample.", c(1:nSamples));
      rownames(expr[[2]]$data) = spaste("sample.", c(1:nSamples));
      set.seed(40);
      print(system.time( {
        cr = clusterRepro(as.matrix(MEs), expr[[2]]$data, 1000);
      } ));
      save(cr, file = "preserved-moduledetectionfailed-20modules-cr.rdata");
    }

If the clusterRepro results have been calculated previously, load the results from the disk:

    load(file = "preserved-moduledetectionfailed-20modules-cr.rdata");

5 Analysis of results

Here we look at how well each method did at identifying the 10 preserved modules in the hopelessly noisy test data.
Since the modules all have very similar sizes, we do not plot results as a function of module size; rather, in each plot we simply order the modules by their corresponding preservation statistic and look for a clean separation of preserved and non-preserved modules.

    # How well can one distinguish preserved from non-preserved modules?
    sizeGrWindow(10,8)
    #pdf(file = "Plots/preserved-moduleDetectionFailed-20Modules-preservationSuccess.pdf", w= 10, h = 8);
    # Preserved modules are red, non-preserved (left out) modules are black
    presColor = c("red", "black")[as.numeric(leaveOut[[2]])+1];
    # Set graphical parameters
    par(mfrow = c(3,2));
    par(mar = c(3.8, 3.8, 2, 0.5));
    par(mgp = c(2.3, 0.7, 0));
    cex.lab = 1.3; cex.axis = 1.3; cex.main = 1.4

    # Module preservation: Zsummary scores
    Zs = mp$preservation$Z[[1]][[2]]$Zsummary[
            order(as.numeric(rownames(mp$preservation$Z[[1]][[2]])))][-c(1:2)];
    order = order(-Zs);
    plot(Zs[order], col = presColor[order], cex.main = cex.main,
         xlab = "", ylab = "Preservation Zsummary", cex.lab = cex.lab, cex.axis = cex.axis,
         main = "A. Network-based preservation indices: Zsummary")

    # Module preservation: psummary statistics
    Zs = -mp$preservation$log.p[[1]][[2]]$log.psummary[
            order(as.numeric(rownames(mp$preservation$Z[[1]][[2]])))][-c(1:2)];
    order = order(-Zs);
    plot(Zs[order], col = presColor[order],
         xlab = "", ylab = "-log10(psummary)", cex.lab = cex.lab, cex.axis = cex.axis,
         cex.main = cex.main,
         main = "B. Network-based preservation indices: psummary")
    abline(h=-log10(0.05), col = "blue");
    abline(h=-log10(0.05/nModules[1]), col = "green");

    # Co-clustering
    cc = mp$accuracy$observed[[1]][[2]][-1, "coClustering"];
    order = order(-cc)
    plot(cc[order], col = presColor[order], cex.main = cex.main,
         xlab = "", ylab = "coclustering", cex.lab = cex.lab, cex.axis = cex.axis,
         main = "D. Cross-tabulation with results of PAM: Co-clustering")

    # Cross-tabulation: Fisher p-value
    # The definition of tab is not present in this transcription; the call to
    # the WGCNA function overlapTable below is an assumption.
    tab = overlapTable(labels[[1]], PAMlabels[, 2]);
    bestp = apply(tab$pTable[-1, ], 1, min);
    order = order(bestp)
    plot(-log10(pmin(rep(1, nModules[1]), bestp[order])), col = presColor[order],
         cex.main = cex.main,
         xlab = "", ylab = "-log10(overlap p-value)", cex.lab = cex.lab, cex.axis = cex.axis,
         main = "C. Cross-tabulation with results of PAM: overlap p-value")
    abline(h=-log10(0.05), col = "blue");
    abline(h=-log10(0.05/nModules[1]), col = "green");

    # clusterRepro: observed IGP
    p = cr$Actual.IGP;
    order = order(-p);
    plot(p[order], col = presColor[order],
         xlab = "", ylab = "IGP", cex.lab = cex.lab, cex.axis = cex.axis,
         cex.main = cex.main,
         main = "E. clusterRepro: IGP")

    # clusterRepro: permutation p-value
    p = cr$p;
    order = order(p);
    plot(-log10(p+1e-4)[order], col = presColor[order], cex.main = cex.main,
         xlab = "", ylab = "-log10(clusterRepro p-value)", cex.lab = cex.lab, cex.axis = cex.axis,
         main = "F. clusterRepro: permutation p-value")
    abline(h=-log10(0.05), col = "blue");
    abline(h=-log10(0.05/nModules[1]), col = "green");

    # If plotting into a pdf file, close it
    dev.off();

The resulting plots are shown in Figure 2. The figure shows that network preservation statistics are in this case successful in reliably separating preserved and non-preserved modules. On the other hand, cross-tabulation and clusterRepro have only limited success; based on Bonferroni-corrected p-values, most of the preserved modules are called non-preserved.

[Figure 2. Panels: A. Network-based preservation indices: Zsummary (y axis: Preservation Zsummary). B. Network-based preservation indices: psummary (y axis: -log10(psummary)). C. Cross-tabulation with results of PAM: overlap p-value (y axis: -log10(overlap p-value)). D. Cross-tabulation with results of PAM: Co-clustering (y axis: coclustering). E. clusterRepro: IGP (y axis: IGP). F. clusterRepro: permutation p-value (y axis: -log10(clusterRepro p-value)). Legend in each panel: non-preserved module, preserved module.]

Figure 2: Success of several module preservation measures at distinguishing weakly preserved from non-preserved modules. In each plot, modules are ordered by the preservation statistic shown in the plot. Red color denotes preserved and black non-preserved modules. In the p-value plots (the right column), the blue line denotes the threshold p = 0.05, and the green line denotes the Bonferroni-corrected threshold p = 0.05/20 = 0.0025. In the clusterRepro p-value plot, we added 10^-4 to all p-values so that zero p-values become 10^-4 and fit into the plot.
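The two horizontal reference lines in the p-value panels, and the kind of overlap p-value used in cross-tabulation, can be reproduced with base R alone. The sketch below uses made-up counts for a single reference module and a single PAM cluster (all numbers and names are hypothetical) and assumes a one-sided Fisher exact test on the resulting 2 x 2 table, which is similar in spirit to the overlap p-values plotted in panel C.

```r
# Hypothetical counts: one reference module, one PAM cluster, 5000 genes
nGenesToy = 5000;
inModule = 200;    # genes in the reference module
inCluster = 250;   # genes in the PAM cluster
overlap = 25;      # genes in both (about 10 would be expected by chance)

# 2 x 2 contingency table, filled column by column
counts = matrix(c(overlap,                       # in module,     in cluster
                  inCluster - overlap,           # not in module, in cluster
                  inModule - overlap,            # in module,     not in cluster
                  nGenesToy - inModule - inCluster + overlap),
                2, 2,
                dimnames = list(module = c("in", "out"), cluster = c("in", "out")));

# One-sided test: is the overlap larger than expected by chance?
overlapP = fisher.test(counts, alternative = "greater")$p.value;

# Nominal and Bonferroni-corrected thresholds for 20 modules
nModulesToy = 20;
thresholds = c(nominal = 0.05, Bonferroni = 0.05 / nModulesToy);

print(counts);
print(overlapP);
print(overlapP < thresholds);
# Heights of the reference lines on the -log10 scale used in the plots
print(-log10(thresholds));
```

With 20 modules tested, a module is called preserved at the Bonferroni level only if its p-value falls below 0.05/20, which is why weakly preserved modules with modest overlaps can pass the nominal threshold and still be called non-preserved.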

Preservation of protein-protein interaction networks Simple simulated example

Preservation of protein-protein interaction networks Simple simulated example Preservation of protein-protein interaction networks Simple simulated example Peter Langfelder and Steve Horvath May, 0 Contents Overview.a Setting up the R session............................................

More information

Supplementary text S6 Comparison studies on simulated data

Supplementary text S6 Comparison studies on simulated data Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate

More information

Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice. 1. Data input and cleaning

Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice. 1. Data input and cleaning Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice 1. Data input and cleaning Peter Langfelder and Steve Horvath February 13, 2016 Contents

More information

Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice

Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice Tutorial for the WGCNA package for R II. Consensus network analysis of liver expression data, female and male mice 2.b Step-by-step network construction and module detection Peter Langfelder and Steve

More information

Short tutorial on studying module preservation: Preservation of female mouse liver modules in male data

Short tutorial on studying module preservation: Preservation of female mouse liver modules in male data Short tutorial on studying module preservation: Preservation of female mouse liver modules in male data Peter Langfelder and Steve Horvath October 1, 0 Contents 1 Overview 1 1.a Setting up the R session............................................

More information

Tutorial for the WGCNA package for R: III. Using simulated data to evaluate different module detection methods and gene screening approaches

Tutorial for the WGCNA package for R: III. Using simulated data to evaluate different module detection methods and gene screening approaches Tutorial for the WGCNA package for R: III. Using simulated data to evaluate different module detection methods and gene screening approaches 8. Visualization of gene networks Steve Horvath and Peter Langfelder

More information

Supplemental Data. Cañas et al. Plant Cell (2017) /tpc

Supplemental Data. Cañas et al. Plant Cell (2017) /tpc Supplemental Method 1. WGCNA script. #Microarray and Trait data load getwd() workingdir = "C:/Users/..." setwd(workingdir) library(wgcna) library(flashclust) options(stringsasfactors = FALSE) femdata =

More information

Clustering using WGCNA

Clustering using WGCNA Clustering using WGCNA Overview: The WGCNA package (in R) uses functions that perform a correlation network analysis of large, high-dimensional data sets (RNAseq datasets). This unbiased approach clusters

More information

Meta-analysis of aging methylation data sets Validation success of various meta-analysis methods in selecting genes

Meta-analysis of aging methylation data sets Validation success of various meta-analysis methods in selecting genes Meta-analysis of aging methylation data sets Validation success of various meta-analysis methods in selecting genes Peter Langfelder and Steve Horvath June 27, 2012 Contents 1 Overview 1 2 Setting up the

More information

Meta-analysis of lung cancer expression data sets Validation success of various meta-analysis methods in selecting genes

Meta-analysis of lung cancer expression data sets Validation success of various meta-analysis methods in selecting genes Meta-analysis of lung cancer expression data sets Validation success of various meta-analysis methods in selecting genes Peter Langfelder and Steve Horvath June 26, 2012 Contents 1 Overview 1 2 Setting

More information

Package dynamictreecut

Package dynamictreecut Package dynamictreecut November 18, 2013 Version 1.60-2 Date 2013-11-16 Title Methods for detection of clusters in hierarchical clustering dendrograms. Author Peter Langfelder

More information

(1) where, l. denotes the number of nodes to which both i and j are connected, and k is. the number of connections of a node, with.

(1) where, l. denotes the number of nodes to which both i and j are connected, and k is. the number of connections of a node, with. A simulated gene co-expression network to illustrate the use of the topological overlap matrix for module detection Steve Horvath, Mike Oldham Correspondence to shorvath@mednet.ucla.edu Abstract Here we

More information

Package dynamictreecut

Package dynamictreecut Package dynamictreecut June 13, 2014 Version 1.62 Date 2014-05-07 Title Methods for detection of clusters in hierarchical clustering dendrograms. Author Peter Langfelder and

More information

Identification of consensus modules in Adenocarcinoma data

Identification of consensus modules in Adenocarcinoma data Identification of consensus modules in Adenocarcinoma data Peter Langfelder and Steve Horvath June 9, 01 Contents 1 Overview 1 Setting up the R session 1 Loading of data Scale-free topology analysis Identification

More information

Clustering. Dick de Ridder 6/10/2018

Clustering. Dick de Ridder 6/10/2018 Clustering Dick de Ridder 6/10/2018 In these exercises, you will continue to work with the Arabidopsis ST vs. HT RNAseq dataset. First, you will select a subset of the data and inspect it; then cluster

More information

General instructions:

General instructions: R Tutorial: Geometric Interpretation of Gene Co-Expression Network Analysis, Applied to Female Mouse Liver Microarray Data Jun Dong, Steve Horvath Correspondence: shorvath@mednet.ucla.edu, http://www.ph.ucla.edu/biostat/people/horvath.htm

More information

Introduction to R for Epidemiologists

Introduction to R for Epidemiologists Introduction to R for Epidemiologists Jenna Krall, PhD Thursday, January 29, 2015 Final project Epidemiological analysis of real data Must include: Summary statistics T-tests or chi-squared tests Regression

More information

Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber

Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber Exploring cdna Data Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber Practical DNA Microarray Analysis, Heidelberg, March 2005 http://compdiag.molgen.mpg.de/ngfn/pma2005mar.shtml The following

More information

Package MODA. January 8, 2019

Package MODA. January 8, 2019 Type Package Package MODA January 8, 2019 Title MODA: MOdule Differential Analysis for weighted gene co-expression network Version 1.8.0 Date 2016-12-16 Author Dong Li, James B. Brown, Luisa Orsini, Zhisong

More information

Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber

Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber Exploring cdna Data Achim Tresch, Andreas Buness, Tim Beißbarth, Wolfgang Huber Practical DNA Microarray Analysis http://compdiag.molgen.mpg.de/ngfn/pma0nov.shtml The following exercise will guide you

More information

Module 10. Data Visualization. Andrew Jaffe Instructor

Module 10. Data Visualization. Andrew Jaffe Instructor Module 10 Data Visualization Andrew Jaffe Instructor Basic Plots We covered some basic plots on Wednesday, but we are going to expand the ability to customize these basic graphics first. 2/37 But first...

More information

Exploring cdna Data. Achim Tresch, Andreas Buness, Wolfgang Huber, Tim Beißbarth

Exploring cdna Data. Achim Tresch, Andreas Buness, Wolfgang Huber, Tim Beißbarth Exploring cdna Data Achim Tresch, Andreas Buness, Wolfgang Huber, Tim Beißbarth Practical DNA Microarray Analysis http://compdiag.molgen.mpg.de/ngfn/pma0nov.shtml The following exercise will guide you

More information

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2 161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Intro to R Graphics Center for Social Science Computation and Research, 2010 Stephanie Lee, Dept of Sociology, University of Washington

Intro to R Graphics Center for Social Science Computation and Research, 2010 Stephanie Lee, Dept of Sociology, University of Washington Intro to R Graphics Center for Social Science Computation and Research, 2010 Stephanie Lee, Dept of Sociology, University of Washington Class Outline - The R Environment and Graphics Engine - Basic Graphs

More information

Package TROM. August 29, 2016

Package TROM. August 29, 2016 Type Package Title Transcriptome Overlap Measure Version 1.2 Date 2016-08-29 Package TROM August 29, 2016 Author Jingyi Jessica Li, Wei Vivian Li Maintainer Jingyi Jessica

More information

Microarray Technology (Affymetrix ) and Analysis. Practicals

Microarray Technology (Affymetrix ) and Analysis. Practicals Data Analysis and Modeling Methods Microarray Technology (Affymetrix ) and Analysis Practicals B. Haibe-Kains 1,2 and G. Bontempi 2 1 Unité Microarray, Institut Jules Bordet 2 Machine Learning Group, Université

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Tutorial script for whole-cell MALDI-TOF analysis

Tutorial script for whole-cell MALDI-TOF analysis Tutorial script for whole-cell MALDI-TOF analysis Julien Textoris June 19, 2013 Contents 1 Required libraries 2 2 Data loading 2 3 Spectrum visualization and pre-processing 4 4 Analysis and comparison

More information

Package allelematch. R topics documented: February 19, Type Package

Package allelematch. R topics documented: February 19, Type Package Type Package Package allelematch February 19, 2015 Title Identifying unique multilocus genotypes where genotyping error and missing data may be present Version 2.5 Date 2014-09-18 Author Paul Galpern

More information

Package PropClust. September 15, 2018

Package PropClust. September 15, 2018 Type Package Title Propensity Clustering and Decomposition Version 1.4-6 Date 2018-09-12 Package PropClust September 15, 2018 Author John Michael O Ranola, Kenneth Lange, Steve Horvath, Peter Langfelder

More information

Lab 1 Introduction to R

Lab 1 Introduction to R Lab 1 Introduction to R Date: August 23, 2011 Assignment and Report Due Date: August 30, 2011 Goal: The purpose of this lab is to get R running on your machines and to get you familiar with the basics

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information

Analyzing Genomic Data with NOJAH

Analyzing Genomic Data with NOJAH Analyzing Genomic Data with NOJAH TAB A) GENOME WIDE ANALYSIS Step 1: Select the example dataset or upload your own. Two example datasets are available. Genome-Wide TCGA-BRCA Expression datasets and CoMMpass

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Bradley Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org bmbroom@mdanderson.org

More information

Package ConsensusClusterPlus

Package ConsensusClusterPlus Type Package Package ConsensusClusterPlus October 1, 2018 Imports Biobase, ALL, graphics, stats, utils, cluster Title ConsensusClusterPlus Version 1.44.0 Date 2015-12-29 Author Matt Wilkerson ,

More information

UNSUPERVISED LEARNING IN R. Introduction to hierarchical clustering

UNSUPERVISED LEARNING IN R. Introduction to hierarchical clustering UNSUPERVISED LEARNING IN R Introduction to hierarchical clustering Hierarchical clustering Number of clusters is not known ahead of time Two kinds: bottom-up and top-down, this course bottom-up Hierarchical

More information

Eye Localization Using Color Information. Amit Chilgunde

Eye Localization Using Color Information. Amit Chilgunde Eye Localization Using Color Information Amit Chilgunde Department of Electrical and Computer Engineering National University of Singapore, Singapore ABSTRACT In this project, we propose localizing the

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Package hbm. February 20, 2015

Package hbm. February 20, 2015 Type Package Title Hierarchical Block Matrix Analysis Version 1.0 Date 2015-01-25 Author Maintainer Package hbm February 20, 2015 A package for building hierarchical block matrices from

More information

Computing with large data sets

Computing with large data sets Computing with large data sets Richard Bonneau, spring 2009 Lecture 8(week 5): clustering 1 clustering Clustering: a diverse methods for discovering groupings in unlabeled data Because these methods don

More information

Statistical Programming Camp: An Introduction to R

Statistical Programming Camp: An Introduction to R Statistical Programming Camp: An Introduction to R Handout 3: Data Manipulation and Summarizing Univariate Data Fox Chapters 1-3, 7-8 In this handout, we cover the following new materials: ˆ Using logical

More information

Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Florian Hahne, Wolfgang Huber. June 17, 2005

Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Florian Hahne, Wolfgang Huber. June 17, 2005 Exploring cdna Data Achim Tresch, Andreas Buness, Tim Beißbarth, Florian Hahne, Wolfgang Huber June 7, 00 The following exercise will guide you through the first steps of a spotted cdna microarray analysis.

More information

jackstraw: Statistical Inference using Latent Variables

jackstraw: Statistical Inference using Latent Variables jackstraw: Statistical Inference using Latent Variables Neo Christopher Chung August 7, 2018 1 Introduction This is a vignette for the jackstraw package, which performs association tests between variables

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

Chapter 6: Cluster Analysis

Chapter 6: Cluster Analysis Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each

More information

10701 Machine Learning. Clustering
