Package 'clusterSeq'

June 13, 2018

Type Package
Title Clustering of high-throughput sequencing data by identifying co-expression patterns
Version 1.4.0
Depends R (>= 3.0.0), methods, BiocParallel, baySeq, graphics, stats, utils
Imports BiocGenerics
Suggests BiocStyle
Date 2016-01-19
Author Thomas J. Hardcastle & Irene Papatheodorou
Maintainer Thomas J. Hardcastle <tjh48@cam.ac.uk>
Description Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples.
License GPL-3
LazyLoad yes
biocViews Sequencing, DifferentialExpression, MultipleComparison, Clustering, GeneExpression

R topics documented:

clusterSeq-package
associatePosteriors
cD.ratThymus
kCluster
makeClusters
makeClustersFF
plotCluster
ratThymus
wallace
Index

clusterSeq-package    Clustering of high-throughput sequencing data by identifying co-expression patterns

Description

Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples.

Details

The DESCRIPTION file: This package was not yet installed at build time.

Index: This package was not yet installed at build time.

Author(s)

Thomas J. Hardcastle & Irene Papatheodorou

Maintainer: Thomas J. Hardcastle <tjh48@cam.ac.uk>

Examples

library(clusterSeq)

# Load in the processed data of observed read counts at each gene for each sample.
data(ratThymus, package = "clusterSeq")

# Library scaling factors are acquired here using the getLibsizes
# function from the baySeq package.
libsizes <- getLibsizes(data = ratThymus)

# Adjust the data to remove zeros and rescale by the library scaling
# factors. Convert to log scale.
# t(ratThymus) / libsizes divides each sample (column of ratThymus) by its
# own scaling factor.
ratThymus[ratThymus == 0] <- 1
normrt <- log2(t(t(ratThymus) / libsizes) * mean(libsizes))

# run kCluster on reduced set.
normrt <- normrt[1:1000,]
kclust <- kCluster(normrt)

# make the clusters from these data.
mkclust <- makeClusters(kclust, normrt, threshold = 1)

# or using likelihood data from a Bayesian analysis of the data

# load in analysed countData object
data(cD.ratThymus, package = "clusterSeq")

# estimate likelihoods of dissimilarity on reduced set
am <- associatePosteriors(cD.ratThymus[1:1000,])

# make clusters from dissimilarity data
sx <- makeClusters(am, cD.ratThymus, threshold = 0.5)

# plot first six clusters
par(mfrow = c(2,3))
plotCluster(sx[1:6], cD.ratThymus)


associatePosteriors    Associates posterior likelihood to generate co-expression dissimilarities between genes

Description

This function aims to find pairwise dissimilarities between genes. It does this by comparing the posterior likelihoods of patterns of differential expression for each gene, and estimating the likelihood that the two genes are not equivalently expressed.

Usage

associatePosteriors(cD, maxsize = 250000, matrixFile = NULL)

Arguments

cD            A countData object containing posterior likelihoods of differential expression for each gene.
maxsize       The maximum size (in MB) to use when partitioning the data.
matrixFile    If given, a file to write the complete (gzipped) matrix of pairwise distances between genes. Defaults to NULL.

Details

In comparing two genes, we find all patterns of expression considered in the @groups slot of the cD (countData) object for which the expression of the two genes can be considered monotonic. We then subtract the sum of the posterior likelihoods of these patterns of expression from 1 to define a likelihood of dissimilarity between the two genes.

Value

A data.frame which for each gene defines its nearest neighbour of higher row index and the dissimilarity with that neighbour.

Author(s)

Thomas J. Hardcastle

See Also

makeClusters, makeClustersFF, kCluster

Examples

# load in analysed countData (baySeq package) object
library(baySeq)
library(clusterSeq)
data(cD.ratThymus, package = "clusterSeq")

# estimate likelihoods of dissimilarity on reduced set
am <- associatePosteriors(cD.ratThymus[1:1000,])


cD.ratThymus    Data from female rat thymus tissue taken from the Rat BodyMap project (Yu et al, 2014) and processed by baySeq.

Description

This data set is a countData object for 17230 genes from 16 samples of female rat thymus tissue. The tissues are extracted from four different age groups (2, 6, 21 and 104 weeks) with four replicates at each age. Posterior likelihoods for the 15 possible patterns of differential expression have been precalculated using the baySeq-package functions.

Usage

cD.ratThymus

Format

A countData object

Value

A countData object

Source

Illumina sequencing.

References

Yu Y. et al. A rat RNA-Seq transcriptomic BodyMap across 11 organs and 4 developmental stages. Nature Communications (2014)

See Also

ratThymus
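The figure of 15 patterns quoted above is consistent with the number of ways in which four age groups can be partitioned into sets of equivalently expressed groups (the Bell number for n = 4). The base-R sketch below enumerates those partitions. The helper enumerate_patterns and the column labels are hypothetical and are not part of clusterSeq or baySeq, and the sketch ignores any ordering of expression between groups.

```r
# Hypothetical helper (not part of clusterSeq or baySeq): enumerate every way
# of partitioning n labelled groups into blocks of equivalent expression.
# Each row labels the block of each group, e.g. 1 1 2 2 means the first two
# groups are equivalent to each other and differ from the last two.
enumerate_patterns <- function(n) {
  out <- list()
  extend <- function(prefix) {
    if (length(prefix) == n) {
      out[[length(out) + 1]] <<- prefix
      return(invisible(NULL))
    }
    # the next group may join any existing block or open one new block
    for (b in seq_len(max(prefix) + 1)) extend(c(prefix, b))
  }
  extend(1L)
  do.call(rbind, out)
}

patterns <- enumerate_patterns(4)
colnames(patterns) <- c("2wk", "6wk", "21wk", "104wk")  # illustrative labels
nrow(patterns)  # 15
```

The first row (1 1 1 1) corresponds to no differential expression across ages and the last row (1 2 3 4) to all four ages differing.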

kCluster    Constructs co-expression dissimilarities from k-means analyses.

Description

This function aims to find pairwise distances between genes. It does this by constructing k-means clusterings of the observed (log) expression for each gene, and for each pair of genes, finding the maximum value of k for which the centroids of the clusters are monotonic between the genes.

Usage

kCluster(cD, maxK = 100, matrixFile = NULL, replicates = NULL, algorithm = "Lloyd", B = 1000, sdm = 1)

Arguments

cD            A countData object containing the raw count data for each gene, or a matrix containing the logged and normalised values for each gene (rows) and sample (columns).
maxK          The maximum value of k for which k-means clustering will be performed. Defaults to 100.
matrixFile    If given, a file to write the complete (gzipped) matrix of pairwise distances between genes. Defaults to NULL.
replicates    If given, a factor or vector that can be cast to a factor that defines the replicate structure of the data. See Details.
algorithm     The algorithm to be used by the kmeans function.
B             Number of iterations of the bootstrapping algorithm used to establish clustering validity.
sdm           Thresholding parameter for validity; see Details.

Details

In comparing two genes, we find the maximum value of k for which separate k-means clusterings of the two genes lead to a monotonic relationship between the centroids of the clusters. For this value of k, the maximum difference between expression levels observed within a cluster of either gene is reported as a measure of the dissimilarity between the two genes.

There is a potential issue in that for genes non-differentially expressed across all samples (i.e., the appropriate value of k is 1), there will nevertheless exist clusterings for k > 1. For some arrangements of data, this leads to misattribution of non-differentially expressed genes. We identify these cases by adapting Tibshirani's gap statistic: we bootstrap uniformly distributed data on the same range as the observed data, calculate the dissimilarity score as above, and find those cases for which the gap between the bootstrapped mean dissimilarity and the observed dissimilarity for k = 1 exceeds that for k = 2 by more than some multiple (sdm) of the standard error of the bootstrapped dissimilarities of k = 2. These cases are forced to be treated as non-differentially expressed by discarding all dissimilarity data for k > 1.

If the replicates vector is given, or if the replicates slot of a countData given as the cD variable is complete, then the k-means clustering will be done on the median of the expression values of each replicate group. Dissimilarity calculations will still be made on the full data.
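The replicate handling described above can be illustrated in base R for a single gene. This is a minimal sketch on made-up log-expression values, not the package's implementation: the values, the choice of k = 2 and the variable names are assumptions for illustration. It takes the median of each replicate group and passes those medians to stats::kmeans with the "Lloyd" algorithm.

```r
# Toy log-expression values for one gene: 4 replicate groups x 4 replicates
# (made-up numbers for illustration only).
expr <- c(1.0, 1.2, 0.9, 1.1,   # group A
          1.1, 1.2, 1.3, 0.8,   # group B
          5.0, 5.2, 4.9, 5.1,   # group C
          5.3, 4.8, 5.0, 5.2)   # group D
replicates <- factor(rep(c("A", "B", "C", "D"), each = 4))

# Median expression within each replicate group
group_medians <- tapply(expr, replicates, median)

# k-means (k = 2) on the four group medians rather than on all 16 samples
set.seed(1)
km <- kmeans(matrix(group_medians, ncol = 1), centers = 2,
             algorithm = "Lloyd", nstart = 5)

# Cluster label of each replicate group, and the centroids in increasing order
group_cluster <- setNames(km$cluster, names(group_medians))
sorted_centroids <- sort(km$centers[, 1])
group_cluster
sorted_centroids
```

In this toy case groups A and B share one centroid and groups C and D the other. As the Details note, only the clustering step uses the medians; the dissimilarity itself is still computed from the full data.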

Value

A data.frame which for each gene defines its nearest neighbour of higher row index and the dissimilarity with that neighbour.

Author(s)

Thomas J. Hardcastle

See Also

makeClusters, makeClustersFF, associatePosteriors

Examples

library(clusterSeq)

# Load in the processed data of observed read counts at each gene for each sample.
data(ratThymus, package = "clusterSeq")

# Library scaling factors are acquired here using the getLibsizes
# function from the baySeq package.
libsizes <- getLibsizes(data = ratThymus)

# Adjust the data to remove zeros and rescale by the library scaling
# factors. Convert to log scale.
# t(ratThymus) / libsizes divides each sample (column of ratThymus) by its
# own scaling factor.
ratThymus[ratThymus == 0] <- 1
normrt <- log2(t(t(ratThymus) / libsizes) * mean(libsizes))

# run kCluster on reduced set.
normrt <- normrt[1:1000,]
kclust <- kCluster(normrt)
head(kclust)

# Alternatively, run on a count data object:

# load in analysed countData (baySeq package) object
library(baySeq)
data(cD.ratThymus, package = "clusterSeq")

# estimate likelihoods of dissimilarity on reduced set
kclust2 <- kCluster(cD.ratThymus[1:1000,])
head(kclust2)


makeClusters    Creates clusters from a co-expression minimal linkage data.frame.

Description

This function uses minimal linkage data to perform rapid clustering by singleton agglomeration (i.e., a gene will always cluster with its nearest neighbours provided the distance to those neighbours does not exceed some threshold). For alternative (but slower) clustering options, see the makeClustersFF function.

Usage

makeClusters(aM, cD, threshold = 0.5)
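One plausible reading of the agglomeration rule described above can be sketched in base R. This is an illustrative sketch, not the code used by makeClusters: the function name link_nearest_neighbours, its argument names and the toy input are all assumptions. Each gene is joined to its recorded nearest neighbour whenever their dissimilarity does not exceed the threshold, and the resulting connected groups are returned largest first.

```r
# Hypothetical helper, not part of clusterSeq. 'neighbour' holds, for each
# gene, the row index of its nearest neighbour (NA if none); 'dissim' holds
# the dissimilarity with that neighbour.
link_nearest_neighbours <- function(neighbour, dissim, threshold = 0.5) {
  neighbour <- as.integer(neighbour)
  n <- length(neighbour)
  parent <- seq_len(n)                      # each gene starts as its own group
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  for (i in seq_len(n)) {
    j <- neighbour[i]
    if (!is.na(j) && !is.na(dissim[i]) && dissim[i] <= threshold) {
      ri <- find_root(i)
      rj <- find_root(j)
      if (ri != rj) parent[ri] <- rj        # merge the two groups
    }
  }
  roots <- vapply(seq_len(n), find_root, integer(1))
  clusters <- unname(split(seq_len(n), roots))
  clusters[order(lengths(clusters), decreasing = TRUE)]
}

# Toy input: five genes, nearest neighbour of higher row index for each
toy_clusters <- link_nearest_neighbours(
  neighbour = c(2, 3, 5, 5, NA),
  dissim    = c(0.1, 0.9, 0.2, 0.3, NA),
  threshold = 0.5
)
toy_clusters
```

Here genes 3, 4 and 5 form one group and genes 1 and 2 another; the link between genes 2 and 3 is not made because its dissimilarity (0.9) exceeds the threshold.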

Arguments

aM           A data frame constructed by associatePosteriors or kCluster, defining for each gene the nearest neighbour of higher row index and the dissimilarity with that neighbour.
cD           The data given as input to associatePosteriors or kCluster that produced aM.
threshold    A threshold on the maximum dissimilarity at which two genes can cluster. Defaults to 0.5.

Value

An IntegerList object, each member of which defines a cluster of co-expressed genes. The object is ordered decreasingly by the size of each cluster.

Author(s)

Thomas J. Hardcastle

See Also

makeClustersFF, kCluster, associatePosteriors

Examples

library(clusterSeq)

# Load in the processed data of observed read counts at each gene for each sample.
data(ratThymus, package = "clusterSeq")

# Library scaling factors are acquired here using the getLibsizes
# function from the baySeq package.
libsizes <- getLibsizes(data = ratThymus)

# Adjust the data to remove zeros and rescale by the library scaling
# factors. Convert to log scale.
# t(ratThymus) / libsizes divides each sample (column of ratThymus) by its
# own scaling factor.
ratThymus[ratThymus == 0] <- 1
normrt <- log2(t(t(ratThymus) / libsizes) * mean(libsizes))

# run kCluster on reduced set.
normrt <- normrt[1:1000,]
kclust <- kCluster(normrt)

# make the clusters from these data.
mkclust <- makeClusters(kclust, normrt, threshold = 1)


makeClustersFF    Creates clusters from a file containing a full dissimilarity matrix.

Description

This function uses the complete pairwise dissimilarity scores to construct a hierarchical clustering of the genes.
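The general approach, hierarchical clustering of a full dissimilarity matrix followed by a cut at a fixed height, can be illustrated with stats::hclust and stats::cutree. The toy matrix and variable names below are assumptions for illustration, and the sketch works on an in-memory matrix, whereas makeClustersFF reads the dissimilarities from a file.

```r
# Toy dissimilarities between five genes placed at positions 0, 1, 2, 10, 11
pos <- c(0, 1, 2, 10, 11)
dissim <- as.matrix(dist(pos))

# Complete-linkage hierarchical clustering, cut at height 5
hc <- hclust(as.dist(dissim), method = "complete")
membership <- cutree(hc, h = 5)

# Gene indices per cluster, largest cluster first
hc_clusters <- unname(split(seq_along(membership), membership))
hc_clusters <- hc_clusters[order(lengths(hc_clusters), decreasing = TRUE)]
hc_clusters
```

With complete linkage the first three genes merge at height 2 and the last two at height 1, while the two groups would only merge at height 11, so cutting at 5 yields two clusters.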

Usage

makeClustersFF(file, method = "complete", cut.height = 5)

Arguments

file          Filename containing the dissimilarity data.
method        Method to use in hclust.
cut.height    Cut height to use in hclust.

Value

An IntegerList object containing the clusters derived from a cut hierarchical clustering.

Author(s)

Thomas J. Hardcastle

See Also

makeClusters, kCluster, associatePosteriors

Examples

library(clusterSeq)

# Load in the processed data of observed read counts at each gene for each sample.
data(ratThymus, package = "clusterSeq")

# Library scaling factors are acquired here using the getLibsizes
# function from the baySeq package.
libsizes <- getLibsizes(data = ratThymus)

# Adjust the data to remove zeros and rescale by the library scaling
# factors. Convert to log scale.
# t(ratThymus) / libsizes divides each sample (column of ratThymus) by its
# own scaling factor.
ratThymus[ratThymus == 0] <- 1
normrt <- log2(t(t(ratThymus) / libsizes) * mean(libsizes))

# run kCluster on reduced set. For speed, one thousand bootstraps are
# used, but higher values should be used in real analyses.
# Write full dissimilarity matrix to file "kclust.gz"
normrt <- normrt[1:1000,]
kclust <- kCluster(normrt, B = 1000, matrixFile = "kclust.gz")

# make the clusters from these data.
mkclustr <- makeClustersFF("kclust.gz")


plotCluster    Plots data from clusterings.

Description

Given clusterings and expression data, plots representative expression data for each clustering.

Usage

plotCluster(cluster, cD, sampleSize = 1000)

Arguments

cluster       A list object defining the clusters, produced by makeClusters or makeClustersFF.
cD            The data object used to produce the clusters.
sampleSize    The maximum number of genes that will be plotted.

Details

Expression data are normalised and rescaled before plotting.

Value

Plotting function.

Author(s)

Thomas J. Hardcastle

See Also

makeClusters, makeClustersFF

Examples

library(clusterSeq)

# load in analysed countData object
data(cD.ratThymus, package = "clusterSeq")

# estimate likelihoods of dissimilarity on reduced set
am <- associatePosteriors(cD.ratThymus[1:1000,])

# make clusters from dissimilarity data
sx <- makeClusters(am, cD.ratThymus, threshold = 0.5)

# plot first six clusters
par(mfrow = c(2,3))
plotCluster(sx[1:6], cD.ratThymus)


ratThymus    Data from female rat thymus tissue taken from the Rat BodyMap project (Yu et al, 2014).

Description

This data set is a matrix ('mobData') of raw count data acquired for 17230 genes from 16 samples of female rat thymus tissue. The tissues are extracted from four different age groups (2, 6, 21 and 104 weeks) with four replicates at each age. Gene annotation is given in the rownames of the matrix.

Usage

ratThymus

Format

A matrix of RNA-Seq counts in which each of the sixteen columns represents a sample, and each row a gene locus.

Value

A matrix

Source

Illumina sequencing.

References

Yu Y. et al. A rat RNA-Seq transcriptomic BodyMap across 11 organs and 4 developmental stages. Nature Communications (2014)

See Also

cD.ratThymus


wallace    Computes Wallace scores comparing two clustering methods.

Description

Given two clusterings A & B we can calculate the likelihood that two elements are in the same cluster in B given that they are in the same cluster in A, and vice versa.

Usage

wallace(v1, v2)

Arguments

v1    SimpleIntegerList object (output from makeClusters or makeClustersFF).
v2    SimpleIntegerList object (output from makeClusters or makeClustersFF).

Value

Vector of length 2 giving conditional likelihoods.

Author(s)

Thomas J. Hardcastle

Examples

library(clusterSeq)

# using likelihood data from a Bayesian analysis of the data

# load in analysed countData object
data(cD.ratThymus, package = "clusterSeq")

# estimate likelihoods of dissimilarity on reduced set
am <- associatePosteriors(cD.ratThymus[1:1000,])

# make clusters from dissimilarity data
sx <- makeClusters(am, cD.ratThymus[1:1000,], threshold = 0.5)

# or using k-means clustering on raw count data

# Load in the processed data of observed read counts at each gene for each sample.
data(ratThymus, package = "clusterSeq")

# Library scaling factors are acquired here using the getLibsizes
# function from the baySeq package.
libsizes <- getLibsizes(data = ratThymus)

# Adjust the data to remove zeros and rescale by the library scaling
# factors. Convert to log scale.
# t(ratThymus) / libsizes divides each sample (column of ratThymus) by its
# own scaling factor.
ratThymus[ratThymus == 0] <- 1
normrt <- log2(t(t(ratThymus) / libsizes) * mean(libsizes))

# run kCluster on reduced set.
normrt <- normrt[1:1000,]
kclust <- kCluster(normrt, replicates = cD.ratThymus@replicates)

# make the clusters from these data.
mkclust <- makeClusters(kclust, normrt, threshold = 1)

# compare clusterings
wallace(sx, mkclust)
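For reference, the two conditional quantities described for wallace can be computed from pair counts. The sketch below is a base-R illustration under stated assumptions: the helper wallace_sketch is hypothetical, and it takes plain membership vectors (one cluster label per element) rather than the IntegerList objects that wallace expects.

```r
# Hypothetical helper, not the package's wallace(): a and b give the cluster
# label of each element under clusterings A and B respectively.
wallace_sketch <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)
  # number of element pairs placed together in both A and B
  pairs_both <- sum(choose(as.vector(tab), 2))
  # number of element pairs placed together in A, and in B
  pairs_a <- sum(choose(as.vector(rowSums(tab)), 2))
  pairs_b <- sum(choose(as.vector(colSums(tab)), 2))
  c(B_given_A = pairs_both / pairs_a,
    A_given_B = pairs_both / pairs_b)
}

a <- c(1, 1, 1, 2, 2, 2)
b <- c(1, 1, 2, 2, 3, 3)
wallace_scores <- wallace_sketch(a, b)
wallace_scores
```

In this toy case 2 of the 6 pairs that share a cluster in A also share one in B (1/3), and 2 of the 3 pairs that share a cluster in B also share one in A (2/3). If a clustering consists only of singletons, the corresponding denominator is zero and the sketch returns NaN.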

Index

Topic datasets: cD.ratThymus, ratThymus
Topic manip: associatePosteriors, kCluster, makeClusters, makeClustersFF, wallace
Topic package: clusterSeq-package
Topic plot: plotCluster