Package 'suprahex' (August 13, 2018)


Type Package
Title suprahex: a supra-hexagonal map for analysing tabular omics data
Version
Date
Author Hai Fang and Julian Gough
Maintainer Hai Fang <hfang@well.ox.ac.uk>
Depends R (>= 3.3), hexbin
Imports ape, MASS, grDevices, graphics, stats, utils
Description A supra-hexagonal map is a giant hexagon on a 2-dimensional grid, seamlessly made up of smaller hexagons. It is designed to train, analyse and visualise high-dimensional tabular omics input data. suprahex is able to carry out gene clustering/meta-clustering and sample correlation, plus intuitive visualisations to facilitate exploratory analysis. More importantly, it allows additional data to be overlaid onto the trained map to explore relations between the input and additional data; with suprahex it is therefore also possible to carry out multilayer omics data comparisons. Newly added utilities are advanced heatmap visualisation and tree-based analysis of sample relationships. Uniquely to this package, users can very quickly make sense of any tabular omics data, both scientifically and artistically, especially in a sample-specific fashion and without loss of information on large numbers of genes.
URL
Collate sPipeline.r sHexGrid.r sTopology.r sInitial.r sTrainology.r sTrainSeq.r sTrainBatch.r sBMH.r sNeighDirect.r sNeighAny.r sHexDist.r sDistance.r sDmat.r sDmatMinima.r sDmatCluster.r sCompReorder.r sWriteData.r sMapOverlay.r visHexPattern.r visHexGrid.r visHexMapping.r visHexComp.r visColormap.r visColorbar.r visVp.r visHexMulComp.r visHexAnimate.r visCompReorder.r visDmatCluster.r visKernels.r visColoralpha.r visHeatmap.r visHeatmapAdv.r visTreeBootstrap.r visTreeBSclust.r visDmatHeatmap.r visHexBarplot.r sHexPolygon.r sHexGridVariant.r
License GPL-2
biocViews Software, Clustering, Visualization, GeneExpression
git_url
git_branch RELEASE_3_7
git_last_commit 9f28a1d
git_last_commit_date
Date/Publication
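For orientation, below is a minimal end-to-end sketch of the workflow described above (train a map, visualise it, partition it); it only uses functions documented in this manual, with a random matrix standing in for a real genes-by-samples omics table.

library(suprahex)
# a toy input: 100 "genes" x 10 "samples"
data <- matrix(rnorm(100*10, mean=0, sd=1), nrow=100, ncol=10)
colnames(data) <- paste0("s", 1:10)
sMap <- sPipeline(data=data)      # ab initio training with the default setup
visHexMulComp(sMap)               # one component plane per sample
sBase <- sDmatCluster(sMap)       # partition the map into gene meta-clusters
visDmatCluster(sMap, sBase)       # visualise the clusters/bases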

R topics documented:

Fang, Golub, sBMH, sCompReorder, sDistance, sDmat, sDmatCluster, sDmatMinima, sHexDist, sHexGrid, sHexGridVariant, sHexPolygon, sInitial, sMapOverlay, sNeighAny, sNeighDirect, sPipeline, sTopology, sTrainBatch, sTrainology, sTrainSeq, sWriteData, visColoralpha, visColorbar, visColormap, visCompReorder, visDmatCluster, visDmatHeatmap, visHeatmap, visHeatmapAdv, visHexAnimate, visHexBarplot, visHexComp, visHexGrid, visHexMapping, visHexMulComp, visHexPattern, visKernels, visTreeBootstrap, visTreeBSclust, visVp, Index

Fang                       Human embryo gene expression dataset from Fang et al. (2010)

Description

Human embryo dataset contains gene expression levels (5441 genes and 18 embryo samples) from Fang et al. (2010).

Usage

data(Fang)

Format

Fang: a gene expression matrix of 5441 genes x 18 samples, involving six successive stages, each with three replicates.
Fang.sampleinfo: a matrix containing the information of the 18 samples for the expression matrix Fang. The three columns correspond to the sample information: "Name", "Stage" and "Replicate".
Fang.geneinfo: a matrix containing the information of the 5441 genes for the expression matrix Fang. The three columns correspond to the gene information: "AffyID", "EntrezGene" and "Symbol".

References

Fang et al. (2010). Transcriptome analysis of early organogenesis in human embryos. Developmental Cell, 19(1).

Golub                      Leukemia gene expression dataset from Golub et al. (1999)

Description

Leukemia dataset (learning set) contains gene expression levels (3051 genes and 38 patient samples) from Golub et al. (1999). This dataset has been pre-processed: capping into a floor of 100 and a ceiling of 16000; filtering by exclusion of genes with max/min <= 5 or max - min <= 500, where max and min refer respectively to the maximum and minimum intensities for a particular gene across mRNA samples; base-2 logarithmic transformation.

Usage

data(Golub)

Format

Golub: a gene expression matrix of 3051 genes x 38 samples. These samples include 11 acute myeloid leukemia (AML) and 27 acute lymphoblastic leukemia (ALL), which can be further subtyped into 19 B-cell ALL and 8 T-cell ALL.

References

Golub et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, Vol. 286.
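A short sketch of how these bundled datasets are typically pulled into the workflow; the 500-gene subsample is purely illustrative, to keep the run fast.

data(Fang)
dim(Fang)                  # 5441 x 18
Fang.sampleinfo[1:3, ]     # "Name", "Stage", "Replicate"
# train a map on a random subset of genes, only to keep the example quick
sMap <- sPipeline(data=Fang[sample(nrow(Fang), 500), ])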

sBMH                       Function to identify the best-matching hexagons/rectangles for the input data

Description

sBMH is supposed to identify the best-matching hexagons/rectangles (BMH) for the input data.

Usage

sBMH(sMap, data, which_bmh = c("best", "worst", "all"))

Arguments

sMap           an object of class "sMap" or a codebook matrix
data           a data frame or matrix of input data
which_bmh      which BMH is requested. It can be a vector consisting of any integer values from [1, nhex]. Alternatively, it can also be one of the "best", "worst" and "all" choices. Here, "best" is equivalent to 1, "worst" to nhex, and "all" to seq(1, nhex)

Value

a list with following components:

bmh: the requested BMH matrix of dlen x length(which_bmh), where dlen is the total number of rows of the input data
qerr: the corresponding matrix of quantization errors (i.e., the distance between the input data and their BMH), with the same dimensions as "bmh" above
mqe: the mean quantization error for the "best" BMH
call: the call that produced this result

Note

"which_bmh" upon request can be a vector consisting of any integer values from [1, nhex]

See Also

sPipeline

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using the "uniform" method
sI <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 4) define trainology at the "rough" stage
sT_rough <- sTrainology(sMap=sI, data=data, stage="rough")
# 5) training at the "rough" stage
sM_rough <- sTrainBatch(sMap=sI, data=data, sTrain=sT_rough)
# 6) define trainology at the "finetune" stage
sT_finetune <- sTrainology(sMap=sI, data=data, stage="finetune")
# 7) training at the "finetune" stage
sM_finetune <- sTrainBatch(sMap=sM_rough, data=data, sTrain=sT_finetune)
# 8) find the best-matching hexagons/rectangles for the input data
response <- sBMH(sMap=sM_finetune, data=data, which_bmh="best")
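Since which_bmh also accepts an integer vector (see the Note above), a short follow-on sketch requesting the two closest hexagons per input row:

response2 <- sBMH(sMap=sM_finetune, data=data, which_bmh=c(1,2))
dim(response2$bmh)     # dlen x 2: the best and second-best matching hexagons
response2$mqe          # mean quantization error for the "best" BMH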

sCompReorder               Function to reorder component planes

Description

sCompReorder is supposed to reorder component planes for the input map/data. It returns an object of class "sReorder". It is realised by using a new map grid (with sheet shape consisting of a rectangular lattice) to train component plane vectors (either column-wise vectors of the codebook/data matrix or the covariance matrix thereof). As a result, similar component planes are placed closer to each other. It is highly recommended to use the trained map (i.e., the codebook matrix) as input if the data matrix is very big, to save computational costs.

Usage

sCompReorder(sMap, xdim = NULL, ydim = NULL, amplifier = NULL,
metric = c("none", "pearson", "spearman", "kendall", "euclidean",
"manhattan", "cos", "mi"), init = c("linear", "uniform", "sample"),
algorithm = c("sequential", "batch"), alphaType = c("invert",
"linear", "power"), neighKernel = c("gaussian", "bubble",
"cutgaussian", "ep", "gamma"))

Arguments

sMap           an object of class "sMap" or an input data frame/matrix
xdim           an integer specifying x-dimension of the grid
ydim           an integer specifying y-dimension of the grid
amplifier      an integer specifying the amplifier (3 by default) of the number of component planes. The product of the component number and the amplifier constitutes the number of rectangles in the sheet grid
metric         distance metric used to define the similarity between component planes. It can be "none", which means directly using column-wise vectors of the codebook/data matrix. Otherwise, first calculate the covariance matrix from the codebook/data matrix. The distance metric used for calculating the covariance matrix between component planes can be: "pearson" for Pearson correlation, "spearman" for Spearman rank correlation, "kendall" for Kendall tau rank correlation, "euclidean" for Euclidean distance, "manhattan" for cityblock distance, "cos" for cosine similarity, "mi" for mutual information. See sDistance for details
init           an initialisation method. It can be one of the "uniform", "sample" and "linear" initialisation methods
algorithm      the training algorithm. It can be one of the "sequential" and "batch" algorithms. By default, it uses the sequential algorithm. If the input data contain a large number of samples but not a great number of zero entries, it is reasonable to use the batch algorithm for its fast computations (probably also without compromising accuracy)
alphaType      the alpha type. It can be one of the "invert", "linear" and "power" alpha types
neighKernel    the training neighborhood kernel. It can be one of the "gaussian", "bubble", "cutgaussian", "ep" and "gamma" kernels

Value

an object of class "sReorder", a list with following components:

nhex: the total number of rectangles in the grid
xdim: x-dimension of the grid
ydim: y-dimension of the grid
uOrder: the unique order/placement for each component plane that is reordered to the "sheet"-shape grid with rectangular lattice
coord: a matrix of nhex x 2, with each row corresponding to the coordinates of each "uOrder" rectangle in the 2D map grid
call: the call that produced this result

Note

All component planes are uniquely placed within a "sheet"-shape rectangular grid. Each component plane mapped to the "sheet"-shape grid with rectangular lattice is determined iteratively, in an order from the best matched to the next compromised one. If multiple components are hit in the same rectangular lattice, the worse one is always sacrificed by moving to the next best one, till all components are placed somewhere exclusively on their own.

The size of the "sheet"-shape rectangular grid depends on the input arguments. How the input parameters are used to determine nhex is taken priority in the following order: "xdim & ydim" > "nhex" > "data". If both xdim and ydim are given, nhex = xdim x ydim. If only data is input, nhex = 5 x sqrt(dlen), where dlen is the number of rows of the input data. After nhex is determined, the xy-dimensions of the rectangular grid are then determined according to the square root of the two biggest eigenvalues of the input data.

See Also

sTopology, sPipeline, sBMH, sDistance, visCompReorder

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
colnames(data) <- paste(rep('s',10), seq(1:10), sep="")
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) reorder component planes in different ways
# 3a) directly using column-wise vectors of the codebook matrix
sReorder <- sCompReorder(sMap=sMap, amplifier=2, metric="none")
# 3b) according to the covariance matrix of pearson correlation of the codebook matrix
sReorder <- sCompReorder(sMap=sMap, amplifier=2, metric="pearson")
# 3c) according to the covariance matrix of pearson correlation of the input matrix
sReorder <- sCompReorder(sMap=data, amplifier=2, metric="pearson")
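The See Also entry visCompReorder pairs with this function; a hedged sketch of displaying the reordered planes (the argument names are assumed to mirror the calling convention used elsewhere in this manual):

visCompReorder(sMap=sMap, sReorder=sReorder)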

sDistance                  Function to compute the pairwise distance for a given data matrix

Description

sDistance is supposed to compute and return the distance matrix between the rows of a data matrix using a specified distance metric.

Usage

sDistance(data, metric = c("pearson", "spearman", "kendall",
"euclidean", "manhattan", "cos", "mi"))

Arguments

data           a data frame or matrix of input data
metric         distance metric used to calculate a symmetric distance matrix. See below for the options available

Value

dist: a symmetric distance matrix of nrow x nrow, where nrow is the number of rows of the input data matrix

Note

The following distance metrics are supported:

"pearson": Pearson correlation. Note that two curves that have identical shape, but different magnitude, will still have a correlation of 1
"spearman": Spearman rank correlation. As a nonparametric version of the Pearson correlation, it calculates the correlation between the ranks of the data values in the two vectors (more robust against outliers)
"kendall": Kendall tau rank correlation. Compared to Spearman rank correlation, it goes a step further by using only the relative ordering to calculate the correlation. For all pairs of data points (x_i, y_i) and (x_j, y_j), it calls a pair of points either concordant (Nc in total) if (x_i - x_j)*(y_i - y_j) > 0, or discordant (Nd in total) if (x_i - x_j)*(y_i - y_j) < 0. Finally, it calculates the gamma coefficient (Nc - Nd)/(Nc + Nd) as a measure of association, which is highly resistant to tied data
"euclidean": Euclidean distance. Unlike the correlation-based distance measures, it takes the magnitude into account (input data should be suitably normalised)
"manhattan": Cityblock distance. The distance between two vectors is the sum of the absolute values of their differences along any coordinate dimension
"cos": Cosine similarity. As an uncentered version of the Pearson correlation, it is a measure of similarity between two vectors of an inner product space, i.e., measuring the cosine of the angle between them (using a dot product and magnitude)
"mi": Mutual information (MI). MI provides a general measure of dependencies between variables, in particular positive, negative and nonlinear correlations. The calculation of MI is implemented via applying an adaptive partitioning method for deriving equal-probability bins (i.e., each bin contains approximately the same number of data points). The number of bins is heuristically determined (the lower bound): 1 + log2(n), where n is the length of the vector. Because MI increases with entropy, we normalise it to allow comparison of different pairwise similarities: 2*MI/[H(x) + H(y)], where H(x) and H(y) stand for the entropy of the vectors x and y, respectively

See Also

sDmatCluster

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) calculate the distance matrix using different metrics
# 2a) using the "pearson" metric
dist <- sDistance(data=data, metric="pearson")
# 2b) using the "cos" metric
# dist <- sDistance(data=data, metric="cos")
# 2c) using the "spearman" metric
# dist <- sDistance(data=data, metric="spearman")
# 2d) using the "kendall" metric
# dist <- sDistance(data=data, metric="kendall")
# 2e) using the "euclidean" metric
# dist <- sDistance(data=data, metric="euclidean")
# 2f) using the "manhattan" metric
# dist <- sDistance(data=data, metric="manhattan")
# 2g) using the "mi" metric
# dist <- sDistance(data=data, metric="mi")
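Because the return value is an ordinary symmetric matrix, it can be handed straight to base R clustering tools; a small sketch (base R only, not part of the suprahex API):

dist <- sDistance(data=data, metric="pearson")
hc <- hclust(as.dist(dist), method="average")   # base R hierarchical clustering
plot(hc)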

sDmat                      Function to calculate the distance matrix in high-dimensional input space but according to neighborhood relationships in 2D output space

Description

sDmat is supposed to calculate the distance (measured in high-dimensional input space) to neighbors (defined based on 2D output space) for each of the hexagons/rectangles.

Usage

sDmat(sMap, which_neigh = 1, distMeasure = c("median", "mean", "min",
"max"))

Arguments

sMap           an object of class "sMap"
which_neigh    which neighbors in 2D output space are used for the calculation. By default, it sets to "1" for direct neighbors, "2" for neighbors within no more than 2 steps, and so on
distMeasure    distance measure used to calculate distances in high-dimensional input space

Value

dmat: a vector with the length of nhex. It stores the distance of a hexagon/rectangle away from its output-space-defined neighbors in high-dimensional input space

Note

"which_neigh" is defined in 2D output space, but "distMeasure" is defined in high-dimensional input space.

See Also

sNeighAny

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) calculate "median" distances in INPUT space to different neighbors in 2D OUTPUT space
# 3a) using direct neighbors in 2D OUTPUT space
dmat <- sDmat(sMap=sMap, which_neigh=1, distMeasure="median")
# 3b) using no more than 2-topological neighbors in 2D OUTPUT space
# dmat <- sDmat(sMap=sMap, which_neigh=2, distMeasure="median")
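The returned vector has one entry per hexagon, so it can be summarised directly; a quick base-R check:

length(dmat) == sMap$nhex    # one distance per hexagon
summary(dmat)                # large values flag hexagons far (in input space) from their map neighbors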

sDmatCluster               Function to partition a grid map into clusters

Description

sDmatCluster is supposed to obtain clusters from a grid map. It returns an object of class "sBase".

Usage

sDmatCluster(sMap, which_neigh = 1, distMeasure = c("median", "mean",
"min", "max"), clusterLinkage = c("average", "complete", "single",
"bmh"), reindexSeed = c("hclust", "svd", "none"))

Arguments

sMap            an object of class "sMap"
which_neigh     which neighbors in 2D output space are used for the calculation. By default, it sets to "1" for direct neighbors, "2" for neighbors within no more than 2 steps, and so on
distMeasure     distance measure used to calculate distances in high-dimensional input space. It can be one of the "median", "mean", "min" and "max" measures
clusterLinkage  cluster linkage used to derive clusters. It can be "bmh", which accumulates a cluster just based on best-matching hexagons/rectangles but cannot ensure that each cluster is continuous. Instead, each cluster is continuous when using the region-growing algorithm with one of the "average", "complete" and "single" linkages
reindexSeed     the way to index seeds. It can be "hclust" for reindexing seeds according to hierarchical clustering of patterns seen in the seeds, "svd" for reindexing seeds according to SVD of patterns seen in the seeds, or "none" for seeds being simply increased by the hexagon indexes (i.e., always in an increasing order as hexagons radiate outwards)

Value

an object of class "sBase", a list with following components:

seeds: the vector storing the cluster seeds, i.e., a list of local minima (in 2D output space) of the distance matrix (in input space). They are represented by the indexes of the hexagons/rectangles
bases: the vector with the length of nhex storing the cluster memberships/bases, where nhex is the total number of hexagons/rectangles in the grid
call: the call that produced this result

Note

The first item in the return "seeds" is the first cluster, whose memberships are those in the return "bases" that equal 1. The same relationship holds for the second item, and so on.

See Also

sPipeline, sDmatMinima, sBMH, sNeighDirect, sDistance, visDmatCluster

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) partition the grid map into clusters based on different criteria
# 3a) based on the "bmh" criterion
# sBase <- sDmatCluster(sMap=sMap, which_neigh=1, distMeasure="median", clusterLinkage="bmh")
# 3b) using the region-growing algorithm with linkage "average"
sBase <- sDmatCluster(sMap=sMap, which_neigh=1, distMeasure="median", clusterLinkage="average")
# 4) visualise clusters/bases partitioned from the sMap
visDmatCluster(sMap, sBase)
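A short follow-on sketch inspecting the returned "sBase" object, using the seeds/bases relationship described in the Note above:

sBase$seeds           # hexagon indexes of the cluster seeds
table(sBase$bases)    # cluster sizes; cluster 1 corresponds to the first seed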

sDmatMinima                Function to identify local minima (in 2D output space) of the distance matrix (in high-dimensional input space)

Description

sDmatMinima is supposed to identify local minima of the distance matrix (resulting from sDmat). The criterion for being a local minimum is that the distance associated with a hexagon/rectangle is always smaller than that of its direct neighbors (i.e., 1-neighborhood).

Usage

sDmatMinima(sMap, which_neigh = 1, distMeasure = c("median", "mean",
"min", "max"))

Arguments

sMap           an object of class "sMap"
which_neigh    which neighbors in 2D output space are used for the calculation. By default, it sets to "1" for direct neighbors, "2" for neighbors within no more than 2 steps, and so on
distMeasure    distance measure used to calculate distances in high-dimensional input space. It can be one of the "median", "mean", "min" and "max" measures

Value

minima: a vector storing the local minima (represented by the indexes of the hexagons/rectangles)

Note

Do not get confused by "which_neigh" and the criterion for being a local minimum. Both of them deal with 2D output space. However, "which_neigh" is used to assist in the calculation of the distance matrix (so it can be 1-neighborhood or more); instead, the criterion for being a local minimum is only the 1-neighborhood in the strictest sense.

See Also

sDmat, sNeighAny

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) identify local minima of the distance matrix based on "median" distances and direct neighbors
minima <- sDmatMinima(sMap=sMap, which_neigh=1, distMeasure="median")

sHexDist                   Function to calculate distances between hexagons/rectangles in a 2D grid

Description

sHexDist is supposed to calculate Euclidean distances between each pair of hexagons/rectangles in a 2D grid of an input "sTopol" or "sMap" object. It returns a symmetric matrix containing pairwise distances.

Usage

sHexDist(sObj)

Arguments

sObj           an object of class "sTopol" or "sInit" or "sMap"

Value

dist: a symmetric matrix of nhex x nhex, containing pairwise distances, where nhex is the total number of hexagons/rectangles in the grid

Note

The returned matrix has rows/columns ordered in the same order as the "coord" matrix of the input object.

See Also

sTopology, sInitial

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using the "uniform" method
sI <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 4) calculate distances between hexagons/rectangles in a 2D grid based on different objects
# 4a) based on an object of class "sTopol"
dist <- sHexDist(sObj=sTopol)
# 4b) based on an object of class "sInit"
dist <- sHexDist(sObj=sI)
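Two quick base-R sanity checks on the returned matrix, reflecting the Value and Note sections above:

all(dim(dist) == sTopol$nhex)   # nhex x nhex
isSymmetric(dist)               # pairwise distances are symmetric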

sHexGrid                   Function to define a supra-hexagonal grid

Description

sHexGrid is supposed to define a supra-hexagonal map grid. A supra-hexagon is a giant hexagon, which seamlessly consists of smaller hexagons. Due to its symmetric nature, it can be uniquely determined by specifying the radius away from the grid centroid. This function takes as input the grid radius (or the number of hexagons in the grid, which will be adjusted to meet the definition of a supra-hexagon), and returns a list (see below) containing: the grid radius, the total number of hexagons in the grid, the 2D coordinates of the grid centroid, the step of each hexagon away from the grid centroid, and the 2D coordinates of all hexagons in the grid.

Usage

sHexGrid(r = NULL, nhex = NULL)

Arguments

r              an integer specifying the radius in a supra-hexagonal grid
nhex           the number of input hexagons in the grid

Value

an object of class "sHex", a list with following components:

r: the grid radius
nhex: the total number of hexagons in the grid. It may differ from the input value; actually it is always no less than the input one, to ensure that a supra-hexagonal grid is exactly formed
centroid: the 2D coordinates of the grid centroid
stepCentroid: a vector with the length of nhex. It stores how many steps a hexagon is away from the grid centroid (1 for the centroid itself). Starting with the centroid, it orders outwards. Also, for those hexagons of the same step, it orders from the rightmost, anticlockwise
angleCentroid: a vector with the length of nhex. It stores the angle of a hexagon in terms of the grid centroid (0 for the centroid itself). For those hexagons of the same step, it orders from the rightmost, anticlockwise
coord: a matrix of nhex x 2, with each row specifying the 2D coordinates of a hexagon in the grid. The order of rows is the same as "centroid" above
call: the call that produced this result

Note

The relationships among the returned values:

nhex = 1 + 6*r*(r-1)/2
centroid = coord[1, ]
stepCentroid[1] = 1
stepCentroid[2:nhex] = unlist(sapply(2:r, function(x) (c((1 + 6*x*(x-1)/2 - 6*(x-1) + 1):(1 + 6*x*(x-1)/2)) >= 1) * x))

See Also

sPipeline

Examples

# The supra-hexagonal grid is exactly determined by specifying the radius.
sHex <- sHexGrid(r=2)
# The grid is determined according to the number of input hexagons (after being adjusted).
# The returned res$nhex is always no less than the input one.
# It ensures that a supra-hexagonal grid is exactly formed.
sHex <- sHexGrid(nhex=12)
# Ignore the input nhex if r is also given
sHex <- sHexGrid(r=3, nhex=100)
# By default, r=3 if no parameters are specified
sHex <- sHexGrid()
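The nhex relationship in the Note can be checked directly; for r=3 it gives the default 19-hexagon grid:

sHex <- sHexGrid(r=3)
sHex$nhex               # 19
1 + 6*3*(3-1)/2         # 19: matches nhex = 1 + 6*r*(r-1)/2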

sHexGridVariant            Function to define a variant of a supra-hexagonal grid

Description

sHexGridVariant is supposed to define a variant of a supra-hexagonal map grid. In essence, it is a subset of the supra-hexagon.

Usage

sHexGridVariant(r = NULL, nhex = NULL, shape = c("suprahex",
"triangle", "diamond", "hourglass", "trefoil", "ladder", "butterfly",
"ring", "bridge"))

Arguments

r              an integer specifying the radius in a supra-hexagonal grid
nhex           the number of input hexagons in the grid
shape          the grid shape, either "suprahex" for the suprahex itself, or one of its variants (including "triangle" for the triangle-shaped variant, "diamond" for the diamond-shaped variant, "hourglass" for the hourglass-shaped variant, "trefoil" for the trefoil-shaped variant, "ladder" for the ladder-shaped variant, "butterfly" for the butterfly-shaped variant, "ring" for the ring-shaped variant, and "bridge" for the bridge-shaped variant)

Value

an object of class "sHex", a list with following components:

r: the grid radius
nhex: the total number of hexagons in the grid. It may differ from the input value; actually it is always no less than the input one, to ensure that a supra-hexagonal grid is exactly formed
centroid: the 2D coordinates of the grid centroid
stepCentroid: a vector with the length of nhex. It stores how many steps a hexagon is away from the grid centroid (1 for the centroid itself). Starting with the centroid, it orders outwards. Also, for those hexagons of the same step, it orders from the rightmost, anticlockwise
angleCentroid: a vector with the length of nhex. It stores the angle of a hexagon in terms of the grid centroid (0 for the centroid itself). For those hexagons of the same step, it orders from the rightmost, anticlockwise
coord: a matrix of nhex x 2, with each row specifying the 2D coordinates of a hexagon in the grid. The order of rows is the same as "centroid" above
call: the call that produced this result

See Also

sHexGrid

Examples

# For the "suprahex" shape itself
sHex <- sHexGridVariant(r=6, shape="suprahex")

## Not run:
library(ggplot2)
#geom_polygon(color="black", fill=NA)

# For the "suprahex" shape itself
sHex <- sHexGridVariant(r=6, shape="suprahex")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_suprahex <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +

coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="suprahex (r=6; xdim=ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "triangle" shape
sHex <- sHexGridVariant(r=6, shape="triangle")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_triangle <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="triangle (r=6; xdim=ydim=6)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "diamond" shape
sHex <- sHexGridVariant(r=6, shape="diamond")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_diamond <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="diamond (r=6; xdim=6, ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "hourglass" shape
sHex <- sHexGridVariant(r=6, shape="hourglass")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_hourglass <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="hourglass (r=6; xdim=6, ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "trefoil" shape
sHex <- sHexGridVariant(r=6, shape="trefoil")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_trefoil <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="trefoil (r=6; xdim=ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "ladder" shape
sHex <- sHexGridVariant(r=6, shape="ladder")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_ladder <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +

labs(title="ladder (r=6; xdim=11, ydim=6)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "butterfly" shape
sHex <- sHexGridVariant(r=6, shape="butterfly")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_butterfly <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="butterfly (r=6; xdim=ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "ring" shape
sHex <- sHexGridVariant(r=6, shape="ring")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_ring <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="ring (r=6; xdim=ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "bridge" shape
sHex <- sHexGridVariant(r=6, shape="bridge")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_bridge <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="bridge (r=6; xdim=11, ydim=6)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# combined visuals
library(gridExtra)
grid.arrange(grobs=list(gp_suprahex, gp_ring, gp_diamond, gp_trefoil,
gp_butterfly, gp_hourglass, gp_ladder, gp_bridge, gp_triangle),
layout_matrix=rbind(c(1,1,2,2,3),c(1,1,2,2,3),c(4,4,5,5,6),c(4,4,5,5,6),c(7,7,8,8,9)),
nrow=5, ncol=5)

## End(Not run)

sHexPolygon                Function to extract the polygon location per hexagon within a supra-hexagonal grid

Description

sHexPolygon is supposed to extract the polygon location per hexagon within a supra-hexagonal grid.

Usage

sHexPolygon(sObj, area.size = 1)

Arguments

sObj           an object of class "sMap" or "sInit" or "sTopol" or "sHex"
area.size      an integer or a vector specifying the area size of each hexagon

Value

a data frame of five columns ("x", "y", "index", "stepCentroid", "angleCentroid") storing the polygon location per hexagon

Note

None

See Also

sHexGridVariant, sPipeline

Examples

sObj <- sTopology(xdim=4, ydim=4, lattice="hexa", shape="suprahex")
df_polygon <- sHexPolygon(sObj, area.size=1)
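A one-line base-R look at the returned data frame and its five documented columns:

head(df_polygon)    # columns: x, y, index, stepCentroid, angleCentroid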

sInitial                   Function to initialise a sInit object given a topology and input data

Description

sInitial is supposed to initialise an object of class "sInit" given a topology and input data. As a matter of fact, it initialises the codebook matrix (in the input high-dimensional space). The returned object inherits the topology information (i.e., an "sTopol" object from sTopology), along with the initialised codebook matrix and the method used.

Usage

sInitial(data, sTopol, init = c("linear", "uniform", "sample"))

Arguments

data           a data frame or matrix of input data
sTopol         an object of class "sTopol" (see sTopology)
init           an initialisation method. It can be one of the "uniform", "sample" and "linear" initialisation methods

Value

an object of class "sInit", a list with following components:

nhex: the total number of hexagons/rectangles in the grid
xdim: x-dimension of the grid
ydim: y-dimension of the grid
r: the hypothetical radius of the grid
lattice: the grid lattice
shape: the grid shape
coord: a matrix of nhex x 2, with each row corresponding to the coordinates of a hexagon/rectangle in the 2D map grid
init: the initialisation method
codebook: a codebook matrix of nhex x ncol(data), with each row corresponding to a prototype vector in the input high-dimensional space
call: the call that produced this result

Note

The initialisation methods include:

"uniform": the codebook matrix is uniformly initialised via randomly taking any values within the interval [min, max] of each column of the input data
"sample": the codebook matrix is initialised via randomly sampling/selecting rows of the input data
"linear": the codebook matrix is linearly initialised along the first two greatest eigenvectors of the input data

See Also

sTopology

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using different methods
# 3a) using the "uniform" method
sI_uniform <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 3b) using the "sample" method
# sI_sample <- sInitial(data=data, sTopol=sTopol, init="sample")
# 3c) using the "linear" method
# sI_linear <- sInitial(data=data, sTopol=sTopol, init="linear")
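A short follow-on check that the initialised object inherits the topology and matches the input dimensionality, as described in the Value section:

dim(sI_uniform$codebook)          # nhex x ncol(data), here 61 x 10
sI_uniform$nhex == sTopol$nhex    # topology information is inherited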

sMapOverlay                Function to overlay additional data onto the trained map for viewing the distribution of that additional data

Description

sMapOverlay is supposed to overlay additional data onto the trained map for viewing the distribution of that additional data. It returns an object of class "sMap". It is realised by first estimating the hit histogram weighted by the neighborhood kernel, and then calculating the distribution of the additional data over the map (similarly weighted by the neighborhood kernel). The final overlaid distribution of the additional data is normalised by the hit histogram.

Usage

sMapOverlay(sMap, data, additional)

Arguments

sMap           an object of class "sMap"
data           a data frame or matrix of input data
additional     a numeric vector or numeric matrix used to overlay onto the trained map. It must have its length (if a vector) or its number of rows (if a matrix) equal to the number of rows in the input data

Value

an object of class "sMap", a list with following components:

nhex: the total number of hexagons/rectangles in the grid
xdim: x-dimension of the grid
ydim: y-dimension of the grid
r: the hypothetical radius of the grid
lattice: the grid lattice
shape: the grid shape
coord: a matrix of nhex x 2, with rows corresponding to the coordinates of all hexagons/rectangles in the 2D map grid
init: the initialisation method
neighKernel: the training neighborhood kernel
codebook: a codebook matrix of nhex x ncol(additional), with rows corresponding to overlaid vectors
hits: a vector of length nhex, each element being the number of input data vectors hitting that hexagon/rectangle
mqe: the mean quantization error for the "best" BMH
call: the call that produced this result

Note

Weighting by the neighborhood kernel avoids rigid overlaying that focuses only on the best-matching map nodes, as there may exist several closest best-matching nodes for an input data vector.

See Also

sPipeline, sBMH, sHexDist, visHexMulComp

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
colnames(data) <- paste(rep('s',10), seq(1:10), sep="")
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) overlay additional data onto the trained map
# here using the first two columns of the input "data" as "additional"
# codebook in "sOverlay" is the same as the first two columns of codebook in "sMap"
sOverlay <- sMapOverlay(sMap=sMap, data=data, additional=data[,1:2])
# 4) view the distribution of that additional data
visHexMulComp(sOverlay)
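A further hedged sketch overlaying a genuinely external annotation; the 0/1 vector below is purely illustrative (any numeric vector with one value per input row would do):

flag <- as.numeric(data[,1] > 0)   # illustrative per-gene annotation
sFlag <- sMapOverlay(sMap=sMap, data=data, additional=flag)
head(sFlag$codebook)               # per-hexagon distribution of the annotation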

sNeighAny                  Function to calculate any neighbors for each hexagon/rectangle in a grid

Description

sNeighAny is supposed to calculate any neighbors for each hexagon/rectangle in a regular 2D grid. It returns a matrix with rows for the self and columns for its any neighbors.

Usage

sNeighAny(sObj)

Arguments

sObj           an object of class "sTopol" or "sInit" or "sMap"

Value

aNeigh: a matrix of nhex x nhex, containing distance info in terms of any neighbors, where nhex is the total number of hexagons/rectangles in the grid

Note

The returned matrix has rows for the self and columns for its neighbors. Non-zero entries mean the distance away from the neighbors, and zeros the self-self relation. Its rows/columns are ordered in the same order as the "coord" matrix of the input object.

See Also

sNeighDirect

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using the "uniform" method
sI <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 4) calculate any neighbors based on different objects
# 4a) based on an object of class "sTopol"
aNeigh <- sNeighAny(sObj=sTopol)
# 4b) based on an object of class "sInit"
# aNeigh <- sNeighAny(sObj=sI)

sNeighDirect               Function to calculate direct neighbors for each hexagon/rectangle in a grid

Description

sNeighDirect is supposed to calculate direct neighbors for each hexagon/rectangle in a regular 2D grid. It returns a matrix with rows for the self and columns for its direct neighbors.

Usage

sNeighDirect(sObj)

Arguments

sObj           an object of class "sTopol" or "sInit" or "sMap"

Value

dNeigh: a matrix of nhex x nhex, containing presence/absence info in terms of direct neighbors, where nhex is the total number of hexagons/rectangles in the grid

Note

The returned matrix has rows for the self and columns for its direct neighbors. "1" means the presence of a direct neighbor, "0" its absence. Its rows/columns are ordered in the same order as the "coord" matrix of the input object.

See Also

sHexDist

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using the "uniform" method
sI <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 4) calculate direct neighbors based on different objects
# 4a) based on an object of class "sTopol"
dNeigh <- sNeighDirect(sObj=sTopol)
# 4b) based on an object of class "sInit"
# dNeigh <- sNeighDirect(sObj=sI)

sPipeline                  Function to set up the pipeline for completing ab initio training given the input data

Description

sPipeline is supposed to finish ab initio training for the input data. It returns an object of class "sMap".

Usage

sPipeline(data = NULL, xdim = NULL, ydim = NULL, nhex = NULL,
lattice = c("hexa", "rect"), shape = c("suprahex", "sheet",
"triangle", "diamond", "hourglass", "trefoil", "ladder", "butterfly",
"ring", "bridge"), scale = 5, init = c("linear", "uniform",
"sample"), algorithm = c("batch", "sequential"), alphaType =
c("invert", "linear", "power"), neighKernel = c("gaussian", "bubble",
"cutgaussian", "ep", "gamma"), finetuneSustain = FALSE,
verbose = TRUE)

Arguments

data           a data frame or matrix of input data
xdim           an integer specifying x-dimension of the grid
ydim           an integer specifying y-dimension of the grid
nhex           the number of hexagons/rectangles in the grid
lattice        the grid lattice, either "hexa" for a hexagon or "rect" for a rectangle
shape          the grid shape, either "suprahex" for a supra-hexagonal grid or "sheet" for a hexagonal/rectangular sheet. Also supported are the suprahex's variants (including "triangle" for the triangle-shaped variant, "diamond" for the diamond-shaped variant, "hourglass" for the hourglass-shaped variant, "trefoil" for the trefoil-shaped variant, "ladder" for the ladder-shaped variant, "butterfly" for the butterfly-shaped variant, "ring" for the ring-shaped variant, and "bridge" for the bridge-shaped variant)

scale          the scaling factor. Only used when automatically estimating the grid dimension from the input data matrix. By default, it is 5 (big map). Other suggested values: 1 for a small map, and 3 for a median map
init           an initialisation method. It can be one of the "uniform", "sample" and "linear" initialisation methods
algorithm      the training algorithm. It can be one of the "sequential" and "batch" algorithms. By default, it uses the batch algorithm, purely because of its fast computations (probably also without compromising accuracy). However, it is highly recommended not to use the batch algorithm if the input data contain lots of zeros; this is because the matrix multiplication used in the batch algorithm can be problematic in this context. If much computation resource is at hand, it is always safe to use the sequential algorithm
alphaType      the alpha type. It can be one of the "invert", "linear" and "power" alpha types
neighKernel    the training neighborhood kernel. It can be one of the "gaussian", "bubble", "cutgaussian", "ep" and "gamma" kernels
finetuneSustain  logical to indicate whether to sustain the "finetune" training. If TRUE, it will repeat the "finetune" stage until the mean quantization error starts to get worse. By default, it sets to FALSE
verbose        logical to indicate whether messages will be displayed on the screen. By default, it sets to TRUE

Value

an object of class "sMap", a list with following components:

nhex: the total number of hexagons/rectangles in the grid
xdim: x-dimension of the grid
ydim: y-dimension of the grid
r: the hypothetical radius of the grid
lattice: the grid lattice
shape: the grid shape
coord: a matrix of nhex x 2, with rows corresponding to the coordinates of all hexagons/rectangles in the 2D map grid
polygon: a data frame of three columns ("x", "y", "id") storing the polygon location per hexagon in the 2D map grid
init: the initialisation method
neighKernel: the training neighborhood kernel
codebook: a codebook matrix of nhex x ncol(data), with rows corresponding to prototype vectors in the input high-dimensional space
hits: a vector of length nhex, each element being the number of input data vectors hitting that hexagon/rectangle
mqe: the mean quantization error for the "best" BMH
call: the call that produced this result

Note

The pipeline sequentially consists of:
i) sTopology used to define the topology of a grid (with "suprahex" shape by default) according to the input data;
ii) sInitial used to initialise the codebook matrix given the pre-defined topology and the input data (by default using the "uniform" initialisation method);
iii) sTrainology and sTrainSeq used to get the grid map trained at both the "rough" and "finetune" stages. If instructed, sustain the "finetune" training until the mean quantization error starts to get worse;
iv) sBMH used to identify the best-matching hexagons/rectangles (BMH) for the input data; these response data are appended to the resulting object of class "sMap".

References

Hai Fang and Julian Gough. (2014) suprahex: an R/Bioconductor package for tabular omics data analysis using a supra-hexagonal map. Biochemical and Biophysical Research Communications, 443(1).

See Also

sTopology, sInitial, sTrainology, sTrainSeq, sTrainBatch, sBMH, visHexMulComp

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
colnames(data) <- paste(rep('s',10), seq(1:10), sep="")
# 2) get trained using the default setup but with different neighborhood kernels
# 2a) with the "gaussian" kernel
sMap <- sPipeline(data=data, neighKernel="gaussian")
# 2b) with the "bubble" kernel
# sMap <- sPipeline(data=data, neighKernel="bubble")
# 2c) with the "cutgaussian" kernel
# sMap <- sPipeline(data=data, neighKernel="cutgaussian")
# 2d) with the "ep" kernel
# sMap <- sPipeline(data=data, neighKernel="ep")
# 2e) with the "gamma" kernel
# sMap <- sPipeline(data=data, neighKernel="gamma")
# 3) visualise multiple component planes of a supra-hexagonal grid
visHexMulComp(sMap, colormap="jet", ncolors=20, zlim=c(-1,1),
gp=grid::gpar(cex=0.8))
# 4) get trained using the default setup but using the shape "trefoil"
sMap <- sPipeline(data=data, shape="trefoil",
algorithm=c("batch","sequential")[2])
visHexMulComp(sMap, colormap="jet", ncolors=20, zlim=c(-1,1),
gp=grid::gpar(cex=0.8))
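The returned list can also be inspected with plain base R; a quick sketch (base R only, not part of the suprahex API):

dim(sMap$codebook)    # nhex x ncol(data) prototype vectors
sMap$mqe              # mean quantization error after training
# a crude base-R view of the codebook (suprahex's visHeatmap offers a richer one)
stats::heatmap(sMap$codebook, Rowv=NA, Colv=NA)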

sTopology                  Function to define the topology of a map grid

Description

sTopology is supposed to define the topology of a 2D map grid. The topological shape can be either a supra-hexagonal grid or a hexagonal/rectangular sheet. It returns an object of class "sTopol", containing: the total number of hexagons/rectangles in the grid, the grid xy-dimensions, the grid lattice, the grid shape, and the 2D coordinates of all hexagons/rectangles in the grid. The 2D coordinates can be directly used to measure distances between any pair of lattice hexagons/rectangles.

Usage

sTopology(data = NULL, xdim = NULL, ydim = NULL, nhex = NULL,
lattice = c("hexa", "rect"), shape = c("suprahex", "sheet",
"triangle", "diamond", "hourglass", "trefoil", "ladder", "butterfly",
"ring", "bridge"), scale = 5)

Arguments

data           a data frame or matrix of input data
xdim           an integer specifying x-dimension of the grid
ydim           an integer specifying y-dimension of the grid
nhex           the number of hexagons/rectangles in the grid
lattice        the grid lattice, either "hexa" for a hexagon or "rect" for a rectangle
shape          the grid shape, either "suprahex" for a supra-hexagonal grid or "sheet" for a hexagonal/rectangular sheet. Also supported are the suprahex's variants (including "triangle" for the triangle-shaped variant, "diamond" for the diamond-shaped variant, "hourglass" for the hourglass-shaped variant, "trefoil" for the trefoil-shaped variant, "ladder" for the ladder-shaped variant, "butterfly" for the butterfly-shaped variant, "ring" for the ring-shaped variant, and "bridge" for the bridge-shaped variant)
scale          the scaling factor. Only used when automatically estimating the grid dimension from the input data matrix. By default, it is 5 (big map). Other suggested values: 1 for a small map, and 3 for a median map

Value

an object of class "sTopol", a list with following components:

nhex: the total number of hexagons/rectangles in the grid. It is not always the same as the input nhex (if any); see the "Note" below for the explanation
xdim: x-dimension of the grid
ydim: y-dimension of the grid
r: the hypothetical radius of the grid
lattice: the grid lattice
shape: the grid shape

coord: a matrix of nhex x 2, with each row corresponding to the coordinates of a hexagon/rectangle in the 2D map grid
call: the call that produced this result

Note

The output nhex depends on the input arguments and the grid shape. How the input parameters are used to determine nhex is taken priority in the following order: "xdim & ydim" > "nhex" > "data". If both xdim and ydim are given, nhex = xdim x ydim for the "sheet" shape, and r = (min(xdim, ydim) + 1)/2 for the "suprahex" shape. If only data is input, nhex = scale x sqrt(dlen), where dlen is the number of rows of the input data, and scale can be 5 (big map), 3 (median map) or 1 (normal map). With nhex in hand, it depends on the grid shape: for the "sheet" shape, the xy-dimensions of the sheet grid are determined according to the square root of the two biggest eigenvalues of the input data; for the "suprahex" shape, see sHexGrid for calculating the grid radius r. The xdim (and ydim) is related to r via xdim = 2*r - 1.

See Also

sHexGrid, visHexMapping

Examples

# For the "suprahex" shape
sTopol <- sTopology(xdim=3, ydim=3, lattice="hexa", shape="suprahex")
# Error: "The suprahex shape grid only allows for hexagonal lattice"
# sTopol <- sTopology(xdim=3, ydim=3, lattice="rect", shape="suprahex")
# For the "sheet" shape with hexagonal lattice
sTopol <- sTopology(xdim=3, ydim=3, lattice="hexa", shape="sheet")
# For the "sheet" shape with rectangular lattice
sTopol <- sTopology(xdim=3, ydim=3, lattice="rect", shape="sheet")
# By default, nhex=19 (i.e., r=3; xdim=ydim=5) for the "suprahex" shape
sTopol <- sTopology(shape="suprahex")
# By default, xdim=ydim=5 (i.e., nhex=25) for the "sheet" shape
sTopol <- sTopology(shape="sheet")
# Determine the topology of a supra-hexagonal grid based on input data
# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# sTopol <- sTopology(data=data, lattice="hexa", shape="trefoil")
# do visualisation
visHexMapping(sTopol, mappingType="indexes")
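The nhex rules in the Note can be verified directly; a short check combining them with the sHexGrid relationship:

sTopol <- sTopology(xdim=5, ydim=5, lattice="hexa", shape="suprahex")
sTopol$r        # 3: r = (min(xdim, ydim) + 1)/2
sTopol$nhex     # 19: nhex = 1 + 6*r*(r-1)/2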


Clustering analysis of gene expression data

Clustering analysis of gene expression data Clustering analysis of gene expression data Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3 rd edition (Chapter 9 in 2 nd edition) Human T cell expression data The matrix contains

More information

Resting state network estimation in individual subjects

Resting state network estimation in individual subjects Resting state network estimation in individual subjects Data 3T NIL(21,17,10), Havard-MGH(692) Young adult fmri BOLD Method Machine learning algorithm MLP DR LDA Network image Correlation Spatial Temporal

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network

Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network Software Recommended: NetSim Standard v11.0, Visual Studio 2015/2017, MATLAB 2016a Project Download Link: https://github.com/netsim-tetcos/wsn_som_optimization_v11.0/archive/master.zip

More information

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot Foundations of Machine Learning CentraleSupélec Fall 2017 12. Clustering Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning objectives

More information

How do microarrays work

How do microarrays work Lecture 3 (continued) Alvis Brazma European Bioinformatics Institute How do microarrays work condition mrna cdna hybridise to microarray condition Sample RNA extract labelled acid acid acid nucleic acid

More information

Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network

Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network Software Recommended: NetSim Standard v11.1 (32/64bit), Visual Studio 2015/2017, MATLAB (32/64 bit) Project Download Link:

More information

Package SC3. November 27, 2017

Package SC3. November 27, 2017 Type Package Title Single-Cell Consensus Clustering Version 1.7.1 Author Vladimir Kiselev Package SC3 November 27, 2017 Maintainer Vladimir Kiselev A tool for unsupervised

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Midterm Examination CS540-2: Introduction to Artificial Intelligence Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Package macorrplot. R topics documented: October 2, Title Visualize artificial correlation in microarray data. Version 1.50.

Package macorrplot. R topics documented: October 2, Title Visualize artificial correlation in microarray data. Version 1.50. Package macorrplot October 2, 2018 Title Visualize artificial correlation in microarray data Version 1.50.0 Author Alexander Ploner Description Graphically displays correlation

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu /2/8 Jure Leskovec, Stanford CS246: Mining Massive Datasets 2 Task: Given a large number (N in the millions or

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines Quiz Section Week 8 May 17, 2016 Machine learning and Support Vector Machines Another definition of supervised machine learning Given N training examples (objects) {(x 1,y 1 ), (x 2,y 2 ),, (x N,y N )}

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Distance Measures for Gene Expression (and other high-dimensional) Data

Distance Measures for Gene Expression (and other high-dimensional) Data Distance Measures for Gene Expression (and other high-dimensional) Data Utah State University Spring 04 STAT 5570: Statistical Bioinformatics Notes. References Chapter of Bioconductor Monograph (course

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

Package EMDomics. November 11, 2018

Package EMDomics. November 11, 2018 Package EMDomics November 11, 2018 Title Earth Mover's Distance for Differential Analysis of Genomics Data The EMDomics algorithm is used to perform a supervised multi-class analysis to measure the magnitude

More information

Clustering Jacques van Helden

Clustering Jacques van Helden Statistical Analysis of Microarray Data Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation

More information

Package lhs. R topics documented: January 4, 2018

Package lhs. R topics documented: January 4, 2018 Package lhs January 4, 2018 Version 0.16 Date 2017-12-23 Title Latin Hypercube Samples Author [aut, cre] Maintainer Depends R (>= 3.3.0) Suggests RUnit Provides a number of methods

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Performance Evaluation of Clustering Methods in Microarray Data

Performance Evaluation of Clustering Methods in Microarray Data American Journal of Bioinformatics Research 2016, 6(1): 19-25 DOI: 10.5923/j.bioinformatics.20160601.03 Performance Evaluation of Clustering Methods in Microarray Data Md. Siraj-Ud-Doulah *, Md. Bipul

More information

Package seg. February 15, 2013

Package seg. February 15, 2013 Package seg February 15, 2013 Version 0.2-4 Date 2013-01-21 Title A set of tools for residential segregation research Author Seong-Yun Hong, David O Sullivan Maintainer Seong-Yun Hong

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

The C Clustering Library

The C Clustering Library The C Clustering Library The University of Tokyo, Institute of Medical Science, Human Genome Center Michiel de Hoon, Seiya Imoto, Satoru Miyano 29 April 2018 The C Clustering Library for cdna microarray

More information

9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis

9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis Introduction ti to K-means Algorithm Wenan Li (Emil Li) Sep. 5, 9 Outline Introduction to Clustering Analsis K-means Algorithm Description Eample of K-means Algorithm Other Issues of K-means Algorithm

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

What to come. There will be a few more topics we will cover on supervised learning

What to come. There will be a few more topics we will cover on supervised learning Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results Statistical Analysis of Microarray Data Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation of clustering results Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be

More information

Fathom Dynamic Data TM Version 2 Specifications

Fathom Dynamic Data TM Version 2 Specifications Data Sources Fathom Dynamic Data TM Version 2 Specifications Use data from one of the many sample documents that come with Fathom. Enter your own data by typing into a case table. Paste data from other

More information

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Classification of Hyperspectral Breast Images for Cancer Detection. Sander Parawira December 4, 2009

Classification of Hyperspectral Breast Images for Cancer Detection. Sander Parawira December 4, 2009 1 Introduction Classification of Hyperspectral Breast Images for Cancer Detection Sander Parawira December 4, 2009 parawira@stanford.edu In 2009 approximately one out of eight women has breast cancer.

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\ Data Preprocessing Javier Béjar BY: $\ URL - Spring 2018 C CS - MAI 1/78 Introduction Data representation Unstructured datasets: Examples described by a flat set of attributes: attribute-value matrix Structured

More information

Data Clustering. Danushka Bollegala

Data Clustering. Danushka Bollegala Data Clustering Danushka Bollegala Outline Why cluster data? Clustering as unsupervised learning Clustering algorithms k-means, k-medoids agglomerative clustering Brown s clustering Spectral clustering

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Motion estimation for video compression

Motion estimation for video compression Motion estimation for video compression Blockmatching Search strategies for block matching Block comparison speedups Hierarchical blockmatching Sub-pixel accuracy Motion estimation no. 1 Block-matching

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

Hierarchical Clustering 4/5/17

Hierarchical Clustering 4/5/17 Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction

More information

Exploratory Analysis: Clustering

Exploratory Analysis: Clustering Exploratory Analysis: Clustering (some material taken or adapted from slides by Hinrich Schutze) Heejun Kim June 26, 2018 Clustering objective Grouping documents or instances into subsets or clusters Documents

More information

This research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight

This research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight This research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight local variation of one variable with respect to another.

More information

[7.3, EA], [9.1, CMB]

[7.3, EA], [9.1, CMB] K-means Clustering Ke Chen Reading: [7.3, EA], [9.1, CMB] Outline Introduction K-means Algorithm Example How K-means partitions? K-means Demo Relevant Issues Application: Cell Neulei Detection Summary

More information

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..

More information