Package 'suprahex' (August 13, 2018)


Type Package
Title suprahex: a supra-hexagonal map for analysing tabular omics data
Version
Date
Author Hai Fang and Julian Gough
Maintainer Hai Fang <hfang@well.ox.ac.uk>
Depends R (>= 3.3), hexbin
Imports ape, MASS, grDevices, graphics, stats, utils
Description A supra-hexagonal map is a giant hexagon on a 2-dimensional grid, seamlessly made up of smaller hexagons. It is designed to train, analyse and visualise high-dimensional tabular omics input data. suprahex is able to carry out gene clustering/meta-clustering and sample correlation, plus intuitive visualisations to facilitate exploratory analysis. More importantly, it allows additional data to be overlaid onto the trained map to explore relations between the input and additional data; with suprahex it is therefore also possible to carry out multilayer omics data comparisons. Newly added utilities are advanced heatmap visualisation and tree-based analysis of sample relationships. Uniquely to this package, users can very quickly make sense of any tabular omics data, both scientifically and artistically, especially in a sample-specific fashion and without loss of information on large numbers of genes.
URL
Collate sPipeline.r sHexGrid.r sTopology.r sInitial.r sTrainology.r sTrainSeq.r sTrainBatch.r sBMH.r sNeighDirect.r sNeighAny.r sHexDist.r sDistance.r sDmat.r sDmatMinima.r sDmatCluster.r sCompReorder.r sWriteData.r sMapOverlay.r visHexPattern.r visHexGrid.r visHexMapping.r visHexComp.r visColormap.r visColorbar.r visVp.r visHexMulComp.r visHexAnimate.r visCompReorder.r visDmatCluster.r visKernels.r visColoralpha.r visHeatmap.r visHeatmapAdv.r visTreeBootstrap.r visTreeBSclust.r visDmatHeatmap.r visHexBarplot.r sHexPolygon.r sHexGridVariant.r
License GPL-2
biocViews Software, Clustering, Visualization, GeneExpression
git_url
git_branch RELEASE_3_7
git_last_commit 9f28a1d
git_last_commit_date
Date/Publication
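For orientation, below is a minimal end-to-end sketch of the workflow described above (train a map, visualise it, partition it); it only uses functions documented in this manual, with a random matrix standing in for a real genes-by-samples omics table.

library(suprahex)
# a toy input: 100 "genes" x 10 "samples"
data <- matrix(rnorm(100*10, mean=0, sd=1), nrow=100, ncol=10)
colnames(data) <- paste0("s", 1:10)
sMap <- sPipeline(data=data)      # ab initio training with the default setup
visHexMulComp(sMap)               # one component plane per sample
sBase <- sDmatCluster(sMap)       # partition the map into gene meta-clusters
visDmatCluster(sMap, sBase)       # visualise the clusters/bases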

R topics documented:

Fang, Golub, sBMH, sCompReorder, sDistance, sDmat, sDmatCluster, sDmatMinima, sHexDist, sHexGrid, sHexGridVariant, sHexPolygon, sInitial, sMapOverlay, sNeighAny, sNeighDirect, sPipeline, sTopology, sTrainBatch, sTrainology, sTrainSeq, sWriteData, visColoralpha, visColorbar, visColormap, visCompReorder, visDmatCluster, visDmatHeatmap, visHeatmap, visHeatmapAdv, visHexAnimate, visHexBarplot, visHexComp, visHexGrid, visHexMapping, visHexMulComp, visHexPattern, visKernels, visTreeBootstrap, visTreeBSclust, visVp, Index

Fang                       Human embryo gene expression dataset from Fang et al. (2010)

Description

Human embryo dataset contains gene expression levels (5441 genes and 18 embryo samples) from Fang et al. (2010).

Usage

data(Fang)

Format

Fang: a gene expression matrix of 5441 genes x 18 samples, involving six successive stages, each with three replicates.
Fang.sampleinfo: a matrix containing the information of the 18 samples for the expression matrix Fang. The three columns correspond to the sample information: "Name", "Stage" and "Replicate".
Fang.geneinfo: a matrix containing the information of the 5441 genes for the expression matrix Fang. The three columns correspond to the gene information: "AffyID", "EntrezGene" and "Symbol".

References

Fang et al. (2010). Transcriptome analysis of early organogenesis in human embryos. Developmental Cell, 19(1).

Golub                      Leukemia gene expression dataset from Golub et al. (1999)

Description

Leukemia dataset (learning set) contains gene expression levels (3051 genes and 38 patient samples) from Golub et al. (1999). This dataset has been pre-processed: capping into a floor of 100 and a ceiling of 16000; filtering by exclusion of genes with max/min <= 5 or max - min <= 500, where max and min refer respectively to the maximum and minimum intensities for a particular gene across mRNA samples; base-2 logarithmic transformation.

Usage

data(Golub)

Format

Golub: a gene expression matrix of 3051 genes x 38 samples. These samples include 11 acute myeloid leukemia (AML) and 27 acute lymphoblastic leukemia (ALL), which can be further subtyped into 19 B-cell ALL and 8 T-cell ALL.

References

Golub et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, Vol. 286.
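A short sketch of how these bundled datasets are typically pulled into the workflow; the 500-gene subsample is purely illustrative, to keep the run fast.

data(Fang)
dim(Fang)                  # 5441 x 18
Fang.sampleinfo[1:3, ]     # "Name", "Stage", "Replicate"
# train a map on a random subset of genes, only to keep the example quick
sMap <- sPipeline(data=Fang[sample(nrow(Fang), 500), ])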

sBMH                       Function to identify the best-matching hexagons/rectangles for the input data

Description

sBMH is supposed to identify the best-matching hexagons/rectangles (BMH) for the input data.

Usage

sBMH(sMap, data, which_bmh = c("best", "worst", "all"))

Arguments

sMap           an object of class "sMap" or a codebook matrix
data           a data frame or matrix of input data
which_bmh      which BMH is requested. It can be a vector consisting of any integer values from [1, nhex]. Alternatively, it can also be one of the "best", "worst" and "all" choices. Here, "best" is equivalent to 1, "worst" to nhex, and "all" to seq(1, nhex)

Value

a list with following components:

bmh: the requested BMH matrix of dlen x length(which_bmh), where dlen is the total number of rows of the input data
qerr: the corresponding matrix of quantization errors (i.e., the distance between the input data and their BMH), with the same dimensions as "bmh" above
mqe: the mean quantization error for the "best" BMH
call: the call that produced this result

Note

"which_bmh" upon request can be a vector consisting of any integer values from [1, nhex]

See Also

sPipeline

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using the "uniform" method
sI <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 4) define trainology at the "rough" stage
sT_rough <- sTrainology(sMap=sI, data=data, stage="rough")
# 5) training at the "rough" stage
sM_rough <- sTrainBatch(sMap=sI, data=data, sTrain=sT_rough)
# 6) define trainology at the "finetune" stage
sT_finetune <- sTrainology(sMap=sI, data=data, stage="finetune")
# 7) training at the "finetune" stage
sM_finetune <- sTrainBatch(sMap=sM_rough, data=data, sTrain=sT_finetune)
# 8) find the best-matching hexagons/rectangles for the input data
response <- sBMH(sMap=sM_finetune, data=data, which_bmh="best")
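Since which_bmh also accepts an integer vector (see the Note above), a short follow-on sketch requesting the two closest hexagons per input row:

response2 <- sBMH(sMap=sM_finetune, data=data, which_bmh=c(1,2))
dim(response2$bmh)     # dlen x 2: the best and second-best matching hexagons
response2$mqe          # mean quantization error for the "best" BMH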

sCompReorder               Function to reorder component planes

Description

sCompReorder is supposed to reorder component planes for the input map/data. It returns an object of class "sReorder". It is realised by using a new map grid (with sheet shape consisting of a rectangular lattice) to train component plane vectors (either column-wise vectors of the codebook/data matrix or the covariance matrix thereof). As a result, similar component planes are placed closer to each other. It is highly recommended to use the trained map (i.e., the codebook matrix) as input if the data matrix is very big, to save computational costs.

Usage

sCompReorder(sMap, xdim = NULL, ydim = NULL, amplifier = NULL,
metric = c("none", "pearson", "spearman", "kendall", "euclidean",
"manhattan", "cos", "mi"), init = c("linear", "uniform", "sample"),
algorithm = c("sequential", "batch"), alphaType = c("invert",
"linear", "power"), neighKernel = c("gaussian", "bubble",
"cutgaussian", "ep", "gamma"))

Arguments

sMap           an object of class "sMap" or an input data frame/matrix
xdim           an integer specifying x-dimension of the grid
ydim           an integer specifying y-dimension of the grid
amplifier      an integer specifying the amplifier (3 by default) of the number of component planes. The product of the component number and the amplifier constitutes the number of rectangles in the sheet grid
metric         distance metric used to define the similarity between component planes. It can be "none", which means directly using column-wise vectors of the codebook/data matrix. Otherwise, first calculate the covariance matrix from the codebook/data matrix. The distance metric used for calculating the covariance matrix between component planes can be: "pearson" for Pearson correlation, "spearman" for Spearman rank correlation, "kendall" for Kendall tau rank correlation, "euclidean" for Euclidean distance, "manhattan" for cityblock distance, "cos" for cosine similarity, "mi" for mutual information. See sDistance for details
init           an initialisation method. It can be one of the "uniform", "sample" and "linear" initialisation methods
algorithm      the training algorithm. It can be one of the "sequential" and "batch" algorithms. By default, it uses the sequential algorithm. If the input data contain a large number of samples but not a great number of zero entries, it is reasonable to use the batch algorithm for its fast computations (probably also without compromising accuracy)
alphaType      the alpha type. It can be one of the "invert", "linear" and "power" alpha types
neighKernel    the training neighborhood kernel. It can be one of the "gaussian", "bubble", "cutgaussian", "ep" and "gamma" kernels

Value

an object of class "sReorder", a list with following components:

nhex: the total number of rectangles in the grid
xdim: x-dimension of the grid
ydim: y-dimension of the grid
uOrder: the unique order/placement for each component plane that is reordered to the "sheet"-shape grid with rectangular lattice
coord: a matrix of nhex x 2, with each row corresponding to the coordinates of each "uOrder" rectangle in the 2D map grid
call: the call that produced this result

Note

All component planes are uniquely placed within a "sheet"-shape rectangular grid. Each component plane mapped to the "sheet"-shape grid with rectangular lattice is determined iteratively, in an order from the best matched to the next compromised one. If multiple components are hit in the same rectangular lattice, the worse one is always sacrificed by moving to the next best one, till all components are placed somewhere exclusively on their own.

The size of the "sheet"-shape rectangular grid depends on the input arguments. How the input parameters are used to determine nhex is taken priority in the following order: "xdim & ydim" > "nhex" > "data". If both xdim and ydim are given, nhex = xdim x ydim. If only data is input, nhex = 5 x sqrt(dlen), where dlen is the number of rows of the input data. After nhex is determined, the xy-dimensions of the rectangular grid are then determined according to the square root of the two biggest eigenvalues of the input data.

See Also

sTopology, sPipeline, sBMH, sDistance, visCompReorder

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
colnames(data) <- paste(rep('s',10), seq(1:10), sep="")
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) reorder component planes in different ways
# 3a) directly using column-wise vectors of the codebook matrix
sReorder <- sCompReorder(sMap=sMap, amplifier=2, metric="none")
# 3b) according to the covariance matrix of pearson correlation of the codebook matrix
sReorder <- sCompReorder(sMap=sMap, amplifier=2, metric="pearson")
# 3c) according to the covariance matrix of pearson correlation of the input matrix
sReorder <- sCompReorder(sMap=data, amplifier=2, metric="pearson")
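The See Also entry visCompReorder pairs with this function; a hedged sketch of displaying the reordered planes (the argument names are assumed to mirror the calling convention used elsewhere in this manual):

visCompReorder(sMap=sMap, sReorder=sReorder)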

sDistance                  Function to compute the pairwise distance for a given data matrix

Description

sDistance is supposed to compute and return the distance matrix between the rows of a data matrix using a specified distance metric.

Usage

sDistance(data, metric = c("pearson", "spearman", "kendall",
"euclidean", "manhattan", "cos", "mi"))

Arguments

data           a data frame or matrix of input data
metric         distance metric used to calculate a symmetric distance matrix. See below for the options available

Value

dist: a symmetric distance matrix of nrow x nrow, where nrow is the number of rows of the input data matrix

Note

The following distance metrics are supported:

"pearson": Pearson correlation. Note that two curves that have identical shape, but different magnitude, will still have a correlation of 1
"spearman": Spearman rank correlation. As a nonparametric version of the Pearson correlation, it calculates the correlation between the ranks of the data values in the two vectors (more robust against outliers)
"kendall": Kendall tau rank correlation. Compared to Spearman rank correlation, it goes a step further by using only the relative ordering to calculate the correlation. For all pairs of data points (x_i, y_i) and (x_j, y_j), it calls a pair of points either concordant (Nc in total) if (x_i - x_j)*(y_i - y_j) > 0, or discordant (Nd in total) if (x_i - x_j)*(y_i - y_j) < 0. Finally, it calculates the gamma coefficient (Nc - Nd)/(Nc + Nd) as a measure of association, which is highly resistant to tied data
"euclidean": Euclidean distance. Unlike the correlation-based distance measures, it takes the magnitude into account (input data should be suitably normalised)
"manhattan": Cityblock distance. The distance between two vectors is the sum of the absolute values of their differences along any coordinate dimension
"cos": Cosine similarity. As an uncentered version of the Pearson correlation, it is a measure of similarity between two vectors of an inner product space, i.e., measuring the cosine of the angle between them (using a dot product and magnitude)
"mi": Mutual information (MI). MI provides a general measure of dependencies between variables, in particular positive, negative and nonlinear correlations. The calculation of MI is implemented via applying an adaptive partitioning method for deriving equal-probability bins (i.e., each bin contains approximately the same number of data points). The number of bins is heuristically determined (the lower bound): 1 + log2(n), where n is the length of the vector. Because MI increases with entropy, we normalise it to allow comparison of different pairwise similarities: 2*MI/[H(x) + H(y)], where H(x) and H(y) stand for the entropy of the vectors x and y, respectively

See Also

sDmatCluster

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) calculate the distance matrix using different metrics
# 2a) using the "pearson" metric
dist <- sDistance(data=data, metric="pearson")
# 2b) using the "cos" metric
# dist <- sDistance(data=data, metric="cos")
# 2c) using the "spearman" metric
# dist <- sDistance(data=data, metric="spearman")
# 2d) using the "kendall" metric
# dist <- sDistance(data=data, metric="kendall")
# 2e) using the "euclidean" metric
# dist <- sDistance(data=data, metric="euclidean")
# 2f) using the "manhattan" metric
# dist <- sDistance(data=data, metric="manhattan")
# 2g) using the "mi" metric
# dist <- sDistance(data=data, metric="mi")
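Because the return value is an ordinary symmetric matrix, it can be handed straight to base R clustering tools; a small sketch (base R only, not part of the suprahex API):

dist <- sDistance(data=data, metric="pearson")
hc <- hclust(as.dist(dist), method="average")   # base R hierarchical clustering
plot(hc)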

sDmat                      Function to calculate the distance matrix in high-dimensional input space but according to neighborhood relationships in 2D output space

Description

sDmat is supposed to calculate the distance (measured in high-dimensional input space) to neighbors (defined based on 2D output space) for each of the hexagons/rectangles.

Usage

sDmat(sMap, which_neigh = 1, distMeasure = c("median", "mean", "min",
"max"))

Arguments

sMap           an object of class "sMap"
which_neigh    which neighbors in 2D output space are used for the calculation. By default, it sets to "1" for direct neighbors, "2" for neighbors within no more than 2 steps, and so on
distMeasure    distance measure used to calculate distances in high-dimensional input space

Value

dmat: a vector with the length of nhex. It stores the distance of a hexagon/rectangle away from its output-space-defined neighbors in high-dimensional input space

Note

"which_neigh" is defined in 2D output space, but "distMeasure" is defined in high-dimensional input space.

See Also

sNeighAny

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) calculate "median" distances in INPUT space to different neighbors in 2D OUTPUT space
# 3a) using direct neighbors in 2D OUTPUT space
dmat <- sDmat(sMap=sMap, which_neigh=1, distMeasure="median")
# 3b) using no more than 2-topological neighbors in 2D OUTPUT space
# dmat <- sDmat(sMap=sMap, which_neigh=2, distMeasure="median")
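The returned vector has one entry per hexagon, so it can be summarised directly; a quick base-R check:

length(dmat) == sMap$nhex    # one distance per hexagon
summary(dmat)                # large values flag hexagons far (in input space) from their map neighbors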

sDmatCluster               Function to partition a grid map into clusters

Description

sDmatCluster is supposed to obtain clusters from a grid map. It returns an object of class "sBase".

Usage

sDmatCluster(sMap, which_neigh = 1, distMeasure = c("median", "mean",
"min", "max"), clusterLinkage = c("average", "complete", "single",
"bmh"), reindexSeed = c("hclust", "svd", "none"))

Arguments

sMap            an object of class "sMap"
which_neigh     which neighbors in 2D output space are used for the calculation. By default, it sets to "1" for direct neighbors, "2" for neighbors within no more than 2 steps, and so on
distMeasure     distance measure used to calculate distances in high-dimensional input space. It can be one of the "median", "mean", "min" and "max" measures
clusterLinkage  cluster linkage used to derive clusters. It can be "bmh", which accumulates a cluster just based on best-matching hexagons/rectangles but cannot ensure that each cluster is continuous. Instead, each cluster is continuous when using the region-growing algorithm with one of the "average", "complete" and "single" linkages
reindexSeed     the way to index seeds. It can be "hclust" for reindexing seeds according to hierarchical clustering of patterns seen in the seeds, "svd" for reindexing seeds according to SVD of patterns seen in the seeds, or "none" for seeds being simply increased by the hexagon indexes (i.e., always in an increasing order as hexagons radiate outwards)

Value

an object of class "sBase", a list with following components:

seeds: the vector storing the cluster seeds, i.e., a list of local minima (in 2D output space) of the distance matrix (in input space). They are represented by the indexes of the hexagons/rectangles
bases: the vector with the length of nhex storing the cluster memberships/bases, where nhex is the total number of hexagons/rectangles in the grid
call: the call that produced this result

Note

The first item in the return "seeds" is the first cluster, whose memberships are those in the return "bases" that equal 1. The same relationship holds for the second item, and so on.

See Also

sPipeline, sDmatMinima, sBMH, sNeighDirect, sDistance, visDmatCluster

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) partition the grid map into clusters based on different criteria
# 3a) based on the "bmh" criterion
# sBase <- sDmatCluster(sMap=sMap, which_neigh=1, distMeasure="median", clusterLinkage="bmh")
# 3b) using the region-growing algorithm with linkage "average"
sBase <- sDmatCluster(sMap=sMap, which_neigh=1, distMeasure="median", clusterLinkage="average")
# 4) visualise clusters/bases partitioned from the sMap
visDmatCluster(sMap, sBase)
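A short follow-on sketch inspecting the returned "sBase" object, using the seeds/bases relationship described in the Note above:

sBase$seeds           # hexagon indexes of the cluster seeds
table(sBase$bases)    # cluster sizes; cluster 1 corresponds to the first seed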

sDmatMinima                Function to identify local minima (in 2D output space) of the distance matrix (in high-dimensional input space)

Description

sDmatMinima is supposed to identify local minima of the distance matrix (resulting from sDmat). The criterion for being a local minimum is that the distance associated with a hexagon/rectangle is always smaller than that of its direct neighbors (i.e., 1-neighborhood).

Usage

sDmatMinima(sMap, which_neigh = 1, distMeasure = c("median", "mean",
"min", "max"))

Arguments

sMap           an object of class "sMap"
which_neigh    which neighbors in 2D output space are used for the calculation. By default, it sets to "1" for direct neighbors, "2" for neighbors within no more than 2 steps, and so on
distMeasure    distance measure used to calculate distances in high-dimensional input space. It can be one of the "median", "mean", "min" and "max" measures

Value

minima: a vector storing the local minima (represented by the indexes of the hexagons/rectangles)

Note

Do not get confused by "which_neigh" and the criterion for being a local minimum. Both of them deal with 2D output space. However, "which_neigh" is used to assist in the calculation of the distance matrix (so it can be 1-neighborhood or more); instead, the criterion for being a local minimum is only the 1-neighborhood in the strictest sense.

See Also

sDmat, sNeighAny

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) identify local minima of the distance matrix based on "median" distances and direct neighbors
minima <- sDmatMinima(sMap=sMap, which_neigh=1, distMeasure="median")

sHexDist                   Function to calculate distances between hexagons/rectangles in a 2D grid

Description

sHexDist is supposed to calculate Euclidean distances between each pair of hexagons/rectangles in a 2D grid of an input "sTopol" or "sMap" object. It returns a symmetric matrix containing pairwise distances.

Usage

sHexDist(sObj)

Arguments

sObj           an object of class "sTopol" or "sInit" or "sMap"

Value

dist: a symmetric matrix of nhex x nhex, containing pairwise distances, where nhex is the total number of hexagons/rectangles in the grid

Note

The returned matrix has rows/columns ordered in the same order as the "coord" matrix of the input object.

See Also

sTopology, sInitial

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using the "uniform" method
sI <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 4) calculate distances between hexagons/rectangles in a 2D grid based on different objects
# 4a) based on an object of class "sTopol"
dist <- sHexDist(sObj=sTopol)
# 4b) based on an object of class "sInit"
dist <- sHexDist(sObj=sI)
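Two quick base-R sanity checks on the returned matrix, reflecting the Value and Note sections above:

all(dim(dist) == sTopol$nhex)   # nhex x nhex
isSymmetric(dist)               # pairwise distances are symmetric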

sHexGrid                   Function to define a supra-hexagonal grid

Description

sHexGrid is supposed to define a supra-hexagonal map grid. A supra-hexagon is a giant hexagon, which seamlessly consists of smaller hexagons. Due to its symmetric nature, it can be uniquely determined by specifying the radius away from the grid centroid. This function takes as input the grid radius (or the number of hexagons in the grid, which will be adjusted to meet the definition of a supra-hexagon), and returns a list (see below) containing: the grid radius, the total number of hexagons in the grid, the 2D coordinates of the grid centroid, the step of each hexagon away from the grid centroid, and the 2D coordinates of all hexagons in the grid.

Usage

sHexGrid(r = NULL, nhex = NULL)

Arguments

r              an integer specifying the radius in a supra-hexagonal grid
nhex           the number of input hexagons in the grid

Value

an object of class "sHex", a list with following components:

r: the grid radius
nhex: the total number of hexagons in the grid. It may differ from the input value; actually it is always no less than the input one, to ensure that a supra-hexagonal grid is exactly formed
centroid: the 2D coordinates of the grid centroid
stepCentroid: a vector with the length of nhex. It stores how many steps a hexagon is away from the grid centroid (1 for the centroid itself). Starting with the centroid, it orders outwards. Also, for those hexagons of the same step, it orders from the rightmost, anticlockwise
angleCentroid: a vector with the length of nhex. It stores the angle of a hexagon in terms of the grid centroid (0 for the centroid itself). For those hexagons of the same step, it orders from the rightmost, anticlockwise
coord: a matrix of nhex x 2, with each row specifying the 2D coordinates of a hexagon in the grid. The order of rows is the same as "centroid" above
call: the call that produced this result

Note

The relationships among the returned values:

nhex = 1 + 6*r*(r-1)/2
centroid = coord[1, ]
stepCentroid[1] = 1
stepCentroid[2:nhex] = unlist(sapply(2:r, function(x) (c((1 + 6*x*(x-1)/2 - 6*(x-1) + 1):(1 + 6*x*(x-1)/2)) >= 1) * x))

See Also

sPipeline

Examples

# The supra-hexagonal grid is exactly determined by specifying the radius.
sHex <- sHexGrid(r=2)
# The grid is determined according to the number of input hexagons (after being adjusted).
# The returned res$nhex is always no less than the input one.
# It ensures that a supra-hexagonal grid is exactly formed.
sHex <- sHexGrid(nhex=12)
# Ignore the input nhex if r is also given
sHex <- sHexGrid(r=3, nhex=100)
# By default, r=3 if no parameters are specified
sHex <- sHexGrid()
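The nhex relationship in the Note can be checked directly; for r=3 it gives the default 19-hexagon grid:

sHex <- sHexGrid(r=3)
sHex$nhex               # 19
1 + 6*3*(3-1)/2         # 19: matches nhex = 1 + 6*r*(r-1)/2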

sHexGridVariant            Function to define a variant of a supra-hexagonal grid

Description

sHexGridVariant is supposed to define a variant of a supra-hexagonal map grid. In essence, it is a subset of the supra-hexagon.

Usage

sHexGridVariant(r = NULL, nhex = NULL, shape = c("suprahex",
"triangle", "diamond", "hourglass", "trefoil", "ladder", "butterfly",
"ring", "bridge"))

Arguments

r              an integer specifying the radius in a supra-hexagonal grid
nhex           the number of input hexagons in the grid
shape          the grid shape, either "suprahex" for the suprahex itself, or one of its variants (including "triangle" for the triangle-shaped variant, "diamond" for the diamond-shaped variant, "hourglass" for the hourglass-shaped variant, "trefoil" for the trefoil-shaped variant, "ladder" for the ladder-shaped variant, "butterfly" for the butterfly-shaped variant, "ring" for the ring-shaped variant, and "bridge" for the bridge-shaped variant)

Value

an object of class "sHex", a list with following components:

r: the grid radius
nhex: the total number of hexagons in the grid. It may differ from the input value; actually it is always no less than the input one, to ensure that a supra-hexagonal grid is exactly formed
centroid: the 2D coordinates of the grid centroid
stepCentroid: a vector with the length of nhex. It stores how many steps a hexagon is away from the grid centroid (1 for the centroid itself). Starting with the centroid, it orders outwards. Also, for those hexagons of the same step, it orders from the rightmost, anticlockwise
angleCentroid: a vector with the length of nhex. It stores the angle of a hexagon in terms of the grid centroid (0 for the centroid itself). For those hexagons of the same step, it orders from the rightmost, anticlockwise
coord: a matrix of nhex x 2, with each row specifying the 2D coordinates of a hexagon in the grid. The order of rows is the same as "centroid" above
call: the call that produced this result

See Also

sHexGrid

Examples

# For the "suprahex" shape itself
sHex <- sHexGridVariant(r=6, shape="suprahex")

## Not run:
library(ggplot2)
#geom_polygon(color="black", fill=NA)

# For the "suprahex" shape itself
sHex <- sHexGridVariant(r=6, shape="suprahex")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_suprahex <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +

coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="suprahex (r=6; xdim=ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "triangle" shape
sHex <- sHexGridVariant(r=6, shape="triangle")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_triangle <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="triangle (r=6; xdim=ydim=6)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "diamond" shape
sHex <- sHexGridVariant(r=6, shape="diamond")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_diamond <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="diamond (r=6; xdim=6, ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "hourglass" shape
sHex <- sHexGridVariant(r=6, shape="hourglass")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_hourglass <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="hourglass (r=6; xdim=6, ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "trefoil" shape
sHex <- sHexGridVariant(r=6, shape="trefoil")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_trefoil <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="trefoil (r=6; xdim=ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "ladder" shape
sHex <- sHexGridVariant(r=6, shape="ladder")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_ladder <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +

labs(title="ladder (r=6; xdim=11, ydim=6)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "butterfly" shape
sHex <- sHexGridVariant(r=6, shape="butterfly")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_butterfly <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="butterfly (r=6; xdim=ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "ring" shape
sHex <- sHexGridVariant(r=6, shape="ring")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_ring <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="ring (r=6; xdim=ydim=11)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# For the "bridge" shape
sHex <- sHexGridVariant(r=6, shape="bridge")
df_polygon <- sHexPolygon(sHex)
df_coord <- data.frame(sHex$coord, index=1:nrow(sHex$coord))
gp_bridge <- ggplot(data=df_polygon, aes(x,y,group=index)) +
geom_polygon(aes(fill=factor(stepCentroid%%2))) +
coord_fixed(ratio=1) + theme_void() + theme(legend.position="none") +
geom_text(data=df_coord, aes(x,y,label=index), color="white", size=3) +
labs(title="bridge (r=6; xdim=11, ydim=6)") +
theme(plot.title=element_text(hjust=0.5,size=8))

# combined visuals
library(gridExtra)
grid.arrange(grobs=list(gp_suprahex, gp_ring, gp_diamond, gp_trefoil,
gp_butterfly, gp_hourglass, gp_ladder, gp_bridge, gp_triangle),
layout_matrix=rbind(c(1,1,2,2,3),c(1,1,2,2,3),c(4,4,5,5,6),c(4,4,5,5,6),c(7,7,8,8,9)),
nrow=5, ncol=5)

## End(Not run)

sHexPolygon                Function to extract the polygon location per hexagon within a supra-hexagonal grid

Description

sHexPolygon is supposed to extract the polygon location per hexagon within a supra-hexagonal grid.

Usage

sHexPolygon(sObj, area.size = 1)

Arguments

sObj           an object of class "sMap" or "sInit" or "sTopol" or "sHex"
area.size      an integer or a vector specifying the area size of each hexagon

Value

a data frame of five columns ("x", "y", "index", "stepCentroid", "angleCentroid") storing the polygon location per hexagon

Note

None

See Also

sHexGridVariant, sPipeline

Examples

sObj <- sTopology(xdim=4, ydim=4, lattice="hexa", shape="suprahex")
df_polygon <- sHexPolygon(sObj, area.size=1)
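A one-line base-R look at the returned data frame and its five documented columns:

head(df_polygon)    # columns: x, y, index, stepCentroid, angleCentroid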

sInitial                   Function to initialise a sInit object given a topology and input data

Description

sInitial is supposed to initialise an object of class "sInit" given a topology and input data. As a matter of fact, it initialises the codebook matrix (in the input high-dimensional space). The returned object inherits the topology information (i.e., an "sTopol" object from sTopology), along with the initialised codebook matrix and the method used.

Usage

sInitial(data, sTopol, init = c("linear", "uniform", "sample"))

Arguments

data           a data frame or matrix of input data
sTopol         an object of class "sTopol" (see sTopology)
init           an initialisation method. It can be one of the "uniform", "sample" and "linear" initialisation methods

Value

an object of class "sInit", a list with following components:

nhex: the total number of hexagons/rectangles in the grid
xdim: x-dimension of the grid
ydim: y-dimension of the grid
r: the hypothetical radius of the grid
lattice: the grid lattice
shape: the grid shape
coord: a matrix of nhex x 2, with each row corresponding to the coordinates of a hexagon/rectangle in the 2D map grid
init: the initialisation method
codebook: a codebook matrix of nhex x ncol(data), with each row corresponding to a prototype vector in the input high-dimensional space
call: the call that produced this result

Note

The initialisation methods include:

"uniform": the codebook matrix is uniformly initialised via randomly taking any values within the interval [min, max] of each column of the input data
"sample": the codebook matrix is initialised via randomly sampling/selecting rows of the input data
"linear": the codebook matrix is linearly initialised along the first two greatest eigenvectors of the input data

See Also

sTopology

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using different methods
# 3a) using the "uniform" method
sI_uniform <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 3b) using the "sample" method
# sI_sample <- sInitial(data=data, sTopol=sTopol, init="sample")
# 3c) using the "linear" method
# sI_linear <- sInitial(data=data, sTopol=sTopol, init="linear")
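A short follow-on check that the initialised object inherits the topology and matches the input dimensionality, as described in the Value section:

dim(sI_uniform$codebook)          # nhex x ncol(data), here 61 x 10
sI_uniform$nhex == sTopol$nhex    # topology information is inherited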

sMapOverlay                Function to overlay additional data onto the trained map for viewing the distribution of that additional data

Description

sMapOverlay is supposed to overlay additional data onto the trained map for viewing the distribution of that additional data. It returns an object of class "sMap". It is realised by first estimating the hit histogram weighted by the neighborhood kernel, and then calculating the distribution of the additional data over the map (similarly weighted by the neighborhood kernel). The final overlaid distribution of the additional data is normalised by the hit histogram.

Usage

sMapOverlay(sMap, data, additional)

Arguments

sMap           an object of class "sMap"
data           a data frame or matrix of input data
additional     a numeric vector or numeric matrix used to overlay onto the trained map. It must have its length (if a vector) or its number of rows (if a matrix) equal to the number of rows in the input data

Value

an object of class "sMap", a list with following components:

nhex: the total number of hexagons/rectangles in the grid
xdim: x-dimension of the grid
ydim: y-dimension of the grid
r: the hypothetical radius of the grid
lattice: the grid lattice
shape: the grid shape
coord: a matrix of nhex x 2, with rows corresponding to the coordinates of all hexagons/rectangles in the 2D map grid
init: the initialisation method
neighKernel: the training neighborhood kernel
codebook: a codebook matrix of nhex x ncol(additional), with rows corresponding to overlaid vectors
hits: a vector of length nhex, each element being the number of input data vectors hitting that hexagon/rectangle
mqe: the mean quantization error for the "best" BMH
call: the call that produced this result

Note

Weighting by the neighborhood kernel avoids rigid overlaying that focuses only on the best-matching map nodes, as there may exist several closest best-matching nodes for an input data vector.

See Also

sPipeline, sBMH, sHexDist, visHexMulComp

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
colnames(data) <- paste(rep('s',10), seq(1:10), sep="")
# 2) get trained using the default setup
sMap <- sPipeline(data=data)
# 3) overlay additional data onto the trained map
# here using the first two columns of the input "data" as "additional"
# codebook in "sOverlay" is the same as the first two columns of codebook in "sMap"
sOverlay <- sMapOverlay(sMap=sMap, data=data, additional=data[,1:2])
# 4) view the distribution of that additional data
visHexMulComp(sOverlay)
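A further hedged sketch overlaying a genuinely external annotation; the 0/1 vector below is purely illustrative (any numeric vector with one value per input row would do):

flag <- as.numeric(data[,1] > 0)   # illustrative per-gene annotation
sFlag <- sMapOverlay(sMap=sMap, data=data, additional=flag)
head(sFlag$codebook)               # per-hexagon distribution of the annotation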

sNeighAny                  Function to calculate any neighbors for each hexagon/rectangle in a grid

Description

sNeighAny is supposed to calculate any neighbors for each hexagon/rectangle in a regular 2D grid. It returns a matrix with rows for the self and columns for its any neighbors.

Usage

sNeighAny(sObj)

Arguments

sObj           an object of class "sTopol" or "sInit" or "sMap"

Value

aNeigh: a matrix of nhex x nhex, containing distance info in terms of any neighbors, where nhex is the total number of hexagons/rectangles in the grid

Note

The returned matrix has rows for the self and columns for its neighbors. Non-zero entries mean the distance away from the neighbors, and zeros the self-self relation. Its rows/columns are ordered in the same order as the "coord" matrix of the input object.

See Also

sNeighDirect

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using the "uniform" method
sI <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 4) calculate any neighbors based on different objects
# 4a) based on an object of class "sTopol"
aNeigh <- sNeighAny(sObj=sTopol)
# 4b) based on an object of class "sInit"
# aNeigh <- sNeighAny(sObj=sI)

sNeighDirect               Function to calculate direct neighbors for each hexagon/rectangle in a grid

Description

sNeighDirect is supposed to calculate direct neighbors for each hexagon/rectangle in a regular 2D grid. It returns a matrix with rows for the self and columns for its direct neighbors.

Usage

sNeighDirect(sObj)

Arguments

sObj           an object of class "sTopol" or "sInit" or "sMap"

Value

dNeigh: a matrix of nhex x nhex, containing presence/absence info in terms of direct neighbors, where nhex is the total number of hexagons/rectangles in the grid

Note

The returned matrix has rows for the self and columns for its direct neighbors. "1" means the presence of a direct neighbor, "0" its absence. Its rows/columns are ordered in the same order as the "coord" matrix of the input object.

See Also

sHexDist

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# 3) initialise the codebook matrix using the "uniform" method
sI <- sInitial(data=data, sTopol=sTopol, init="uniform")
# 4) calculate direct neighbors based on different objects
# 4a) based on an object of class "sTopol"
dNeigh <- sNeighDirect(sObj=sTopol)
# 4b) based on an object of class "sInit"
# dNeigh <- sNeighDirect(sObj=sI)

sPipeline                  Function to set up the pipeline for completing ab initio training given the input data

Description

sPipeline is supposed to finish ab initio training for the input data. It returns an object of class "sMap".

Usage

sPipeline(data = NULL, xdim = NULL, ydim = NULL, nhex = NULL,
lattice = c("hexa", "rect"), shape = c("suprahex", "sheet",
"triangle", "diamond", "hourglass", "trefoil", "ladder", "butterfly",
"ring", "bridge"), scale = 5, init = c("linear", "uniform",
"sample"), algorithm = c("batch", "sequential"), alphaType =
c("invert", "linear", "power"), neighKernel = c("gaussian", "bubble",
"cutgaussian", "ep", "gamma"), finetuneSustain = FALSE,
verbose = TRUE)

Arguments

data           a data frame or matrix of input data
xdim           an integer specifying x-dimension of the grid
ydim           an integer specifying y-dimension of the grid
nhex           the number of hexagons/rectangles in the grid
lattice        the grid lattice, either "hexa" for a hexagon or "rect" for a rectangle
shape          the grid shape, either "suprahex" for a supra-hexagonal grid or "sheet" for a hexagonal/rectangular sheet. Also supported are the suprahex's variants (including "triangle" for the triangle-shaped variant, "diamond" for the diamond-shaped variant, "hourglass" for the hourglass-shaped variant, "trefoil" for the trefoil-shaped variant, "ladder" for the ladder-shaped variant, "butterfly" for the butterfly-shaped variant, "ring" for the ring-shaped variant, and "bridge" for the bridge-shaped variant)

scale          the scaling factor. Only used when automatically estimating the grid dimension from the input data matrix. By default, it is 5 (big map). Other suggested values: 1 for a small map, and 3 for a median map
init           an initialisation method. It can be one of the "uniform", "sample" and "linear" initialisation methods
algorithm      the training algorithm. It can be one of the "sequential" and "batch" algorithms. By default, it uses the batch algorithm, purely because of its fast computations (probably also without compromising accuracy). However, it is highly recommended not to use the batch algorithm if the input data contain lots of zeros; this is because the matrix multiplication used in the batch algorithm can be problematic in this context. If much computation resource is at hand, it is always safe to use the sequential algorithm
alphaType      the alpha type. It can be one of the "invert", "linear" and "power" alpha types
neighKernel    the training neighborhood kernel. It can be one of the "gaussian", "bubble", "cutgaussian", "ep" and "gamma" kernels
finetuneSustain  logical to indicate whether to sustain the "finetune" training. If TRUE, it will repeat the "finetune" stage until the mean quantization error starts to get worse. By default, it sets to FALSE
verbose        logical to indicate whether messages will be displayed on the screen. By default, it sets to TRUE

Value

an object of class "sMap", a list with following components:

nhex: the total number of hexagons/rectangles in the grid
xdim: x-dimension of the grid
ydim: y-dimension of the grid
r: the hypothetical radius of the grid
lattice: the grid lattice
shape: the grid shape
coord: a matrix of nhex x 2, with rows corresponding to the coordinates of all hexagons/rectangles in the 2D map grid
polygon: a data frame of three columns ("x", "y", "id") storing the polygon location per hexagon in the 2D map grid
init: the initialisation method
neighKernel: the training neighborhood kernel
codebook: a codebook matrix of nhex x ncol(data), with rows corresponding to prototype vectors in the input high-dimensional space
hits: a vector of length nhex, each element being the number of input data vectors hitting that hexagon/rectangle
mqe: the mean quantization error for the "best" BMH
call: the call that produced this result

Note

The pipeline sequentially consists of:
i) sTopology used to define the topology of a grid (with "suprahex" shape by default) according to the input data;
ii) sInitial used to initialise the codebook matrix given the pre-defined topology and the input data (by default using the "uniform" initialisation method);
iii) sTrainology and sTrainSeq used to get the grid map trained at both the "rough" and "finetune" stages. If instructed, sustain the "finetune" training until the mean quantization error starts to get worse;
iv) sBMH used to identify the best-matching hexagons/rectangles (BMH) for the input data; these response data are appended to the resulting object of class "sMap".

References

Hai Fang and Julian Gough. (2014) suprahex: an R/Bioconductor package for tabular omics data analysis using a supra-hexagonal map. Biochemical and Biophysical Research Communications, 443(1).

See Also

sTopology, sInitial, sTrainology, sTrainSeq, sTrainBatch, sBMH, visHexMulComp

Examples

# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
colnames(data) <- paste(rep('s',10), seq(1:10), sep="")
# 2) get trained using the default setup but with different neighborhood kernels
# 2a) with the "gaussian" kernel
sMap <- sPipeline(data=data, neighKernel="gaussian")
# 2b) with the "bubble" kernel
# sMap <- sPipeline(data=data, neighKernel="bubble")
# 2c) with the "cutgaussian" kernel
# sMap <- sPipeline(data=data, neighKernel="cutgaussian")
# 2d) with the "ep" kernel
# sMap <- sPipeline(data=data, neighKernel="ep")
# 2e) with the "gamma" kernel
# sMap <- sPipeline(data=data, neighKernel="gamma")
# 3) visualise multiple component planes of a supra-hexagonal grid
visHexMulComp(sMap, colormap="jet", ncolors=20, zlim=c(-1,1),
gp=grid::gpar(cex=0.8))
# 4) get trained using the default setup but using the shape "trefoil"
sMap <- sPipeline(data=data, shape="trefoil",
algorithm=c("batch","sequential")[2])
visHexMulComp(sMap, colormap="jet", ncolors=20, zlim=c(-1,1),
gp=grid::gpar(cex=0.8))
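The returned list can also be inspected with plain base R; a quick sketch (base R only, not part of the suprahex API):

dim(sMap$codebook)    # nhex x ncol(data) prototype vectors
sMap$mqe              # mean quantization error after training
# a crude base-R view of the codebook (suprahex's visHeatmap offers a richer one)
stats::heatmap(sMap$codebook, Rowv=NA, Colv=NA)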

sTopology                  Function to define the topology of a map grid

Description

sTopology is supposed to define the topology of a 2D map grid. The topological shape can be either a supra-hexagonal grid or a hexagonal/rectangular sheet. It returns an object of class "sTopol", containing: the total number of hexagons/rectangles in the grid, the grid xy-dimensions, the grid lattice, the grid shape, and the 2D coordinates of all hexagons/rectangles in the grid. The 2D coordinates can be directly used to measure distances between any pair of lattice hexagons/rectangles.

Usage

sTopology(data = NULL, xdim = NULL, ydim = NULL, nhex = NULL,
lattice = c("hexa", "rect"), shape = c("suprahex", "sheet",
"triangle", "diamond", "hourglass", "trefoil", "ladder", "butterfly",
"ring", "bridge"), scale = 5)

Arguments

data           a data frame or matrix of input data
xdim           an integer specifying x-dimension of the grid
ydim           an integer specifying y-dimension of the grid
nhex           the number of hexagons/rectangles in the grid
lattice        the grid lattice, either "hexa" for a hexagon or "rect" for a rectangle
shape          the grid shape, either "suprahex" for a supra-hexagonal grid or "sheet" for a hexagonal/rectangular sheet. Also supported are the suprahex's variants (including "triangle" for the triangle-shaped variant, "diamond" for the diamond-shaped variant, "hourglass" for the hourglass-shaped variant, "trefoil" for the trefoil-shaped variant, "ladder" for the ladder-shaped variant, "butterfly" for the butterfly-shaped variant, "ring" for the ring-shaped variant, and "bridge" for the bridge-shaped variant)
scale          the scaling factor. Only used when automatically estimating the grid dimension from the input data matrix. By default, it is 5 (big map). Other suggested values: 1 for a small map, and 3 for a median map

Value

an object of class "sTopol", a list with following components:

nhex: the total number of hexagons/rectangles in the grid. It is not always the same as the input nhex (if any); see the "Note" below for the explanation
xdim: x-dimension of the grid
ydim: y-dimension of the grid
r: the hypothetical radius of the grid
lattice: the grid lattice
shape: the grid shape

coord: a matrix of nhex x 2, with each row corresponding to the coordinates of a hexagon/rectangle in the 2D map grid
call: the call that produced this result

Note

The output nhex depends on the input arguments and the grid shape. How the input parameters are used to determine nhex is taken priority in the following order: "xdim & ydim" > "nhex" > "data". If both xdim and ydim are given, nhex = xdim x ydim for the "sheet" shape, and r = (min(xdim, ydim) + 1)/2 for the "suprahex" shape. If only data is input, nhex = scale x sqrt(dlen), where dlen is the number of rows of the input data, and scale can be 5 (big map), 3 (median map) or 1 (normal map). With nhex in hand, it depends on the grid shape: for the "sheet" shape, the xy-dimensions of the sheet grid are determined according to the square root of the two biggest eigenvalues of the input data; for the "suprahex" shape, see sHexGrid for calculating the grid radius r. The xdim (and ydim) is related to r via xdim = 2*r - 1.

See Also

sHexGrid, visHexMapping

Examples

# For the "suprahex" shape
sTopol <- sTopology(xdim=3, ydim=3, lattice="hexa", shape="suprahex")
# Error: "The suprahex shape grid only allows for hexagonal lattice"
# sTopol <- sTopology(xdim=3, ydim=3, lattice="rect", shape="suprahex")
# For the "sheet" shape with hexagonal lattice
sTopol <- sTopology(xdim=3, ydim=3, lattice="hexa", shape="sheet")
# For the "sheet" shape with rectangular lattice
sTopol <- sTopology(xdim=3, ydim=3, lattice="rect", shape="sheet")
# By default, nhex=19 (i.e., r=3; xdim=ydim=5) for the "suprahex" shape
sTopol <- sTopology(shape="suprahex")
# By default, xdim=ydim=5 (i.e., nhex=25) for the "sheet" shape
sTopol <- sTopology(shape="sheet")
# Determine the topology of a supra-hexagonal grid based on input data
# 1) generate an iid normal random matrix of 100x10
data <- matrix(rnorm(100*10,mean=0,sd=1), nrow=100, ncol=10)
# 2) from this input matrix, determine nhex=5*sqrt(nrow(data))=50,
# but it returns nhex=61, via "sHexGrid(nhex=50)", to make sure a supra-hexagonal grid
sTopol <- sTopology(data=data, lattice="hexa", shape="suprahex")
# sTopol <- sTopology(data=data, lattice="hexa", shape="trefoil")
# do visualisation
visHexMapping(sTopol, mappingType="indexes")
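The nhex rules in the Note can be verified directly; a short check combining them with the sHexGrid relationship:

sTopol <- sTopology(xdim=5, ydim=5, lattice="hexa", shape="suprahex")
sTopol$r        # 3: r = (min(xdim, ydim) + 1)/2
sTopol$nhex     # 19: nhex = 1 + 6*r*(r-1)/2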


Clustering analysis of gene expression data

Clustering analysis of gene expression data Clustering analysis of gene expression data Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3 rd edition (Chapter 9 in 2 nd edition) Human T cell expression data The matrix contains

More information

Resting state network estimation in individual subjects

Resting state network estimation in individual subjects Resting state network estimation in individual subjects Data 3T NIL(21,17,10), Havard-MGH(692) Young adult fmri BOLD Method Machine learning algorithm MLP DR LDA Network image Correlation Spatial Temporal

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network

Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network Software Recommended: NetSim Standard v11.0, Visual Studio 2015/2017, MATLAB 2016a Project Download Link: https://github.com/netsim-tetcos/wsn_som_optimization_v11.0/archive/master.zip

More information

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot Foundations of Machine Learning CentraleSupélec Fall 2017 12. Clustering Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning objectives

More information

How do microarrays work

How do microarrays work Lecture 3 (continued) Alvis Brazma European Bioinformatics Institute How do microarrays work condition mrna cdna hybridise to microarray condition Sample RNA extract labelled acid acid acid nucleic acid

More information

Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network

Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network Simulation of WSN in NetSim Clustering using Self-Organizing Map Neural Network Software Recommended: NetSim Standard v11.1 (32/64bit), Visual Studio 2015/2017, MATLAB (32/64 bit) Project Download Link:

More information

Package SC3. November 27, 2017

Package SC3. November 27, 2017 Type Package Title Single-Cell Consensus Clustering Version 1.7.1 Author Vladimir Kiselev Package SC3 November 27, 2017 Maintainer Vladimir Kiselev A tool for unsupervised

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Midterm Examination CS540-2: Introduction to Artificial Intelligence Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Package macorrplot. R topics documented: October 2, Title Visualize artificial correlation in microarray data. Version 1.50.

Package macorrplot. R topics documented: October 2, Title Visualize artificial correlation in microarray data. Version 1.50. Package macorrplot October 2, 2018 Title Visualize artificial correlation in microarray data Version 1.50.0 Author Alexander Ploner Description Graphically displays correlation

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu /2/8 Jure Leskovec, Stanford CS246: Mining Massive Datasets 2 Task: Given a large number (N in the millions or

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines Quiz Section Week 8 May 17, 2016 Machine learning and Support Vector Machines Another definition of supervised machine learning Given N training examples (objects) {(x 1,y 1 ), (x 2,y 2 ),, (x N,y N )}

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Distance Measures for Gene Expression (and other high-dimensional) Data

Distance Measures for Gene Expression (and other high-dimensional) Data Distance Measures for Gene Expression (and other high-dimensional) Data Utah State University Spring 04 STAT 5570: Statistical Bioinformatics Notes. References Chapter of Bioconductor Monograph (course

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

Package EMDomics. November 11, 2018

Package EMDomics. November 11, 2018 Package EMDomics November 11, 2018 Title Earth Mover's Distance for Differential Analysis of Genomics Data The EMDomics algorithm is used to perform a supervised multi-class analysis to measure the magnitude

More information

Clustering Jacques van Helden

Clustering Jacques van Helden Statistical Analysis of Microarray Data Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation

More information

Package lhs. R topics documented: January 4, 2018

Package lhs. R topics documented: January 4, 2018 Package lhs January 4, 2018 Version 0.16 Date 2017-12-23 Title Latin Hypercube Samples Author [aut, cre] Maintainer Depends R (>= 3.3.0) Suggests RUnit Provides a number of methods

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Performance Evaluation of Clustering Methods in Microarray Data

Performance Evaluation of Clustering Methods in Microarray Data American Journal of Bioinformatics Research 2016, 6(1): 19-25 DOI: 10.5923/j.bioinformatics.20160601.03 Performance Evaluation of Clustering Methods in Microarray Data Md. Siraj-Ud-Doulah *, Md. Bipul

More information

Package seg. February 15, 2013

Package seg. February 15, 2013 Package seg February 15, 2013 Version 0.2-4 Date 2013-01-21 Title A set of tools for residential segregation research Author Seong-Yun Hong, David O Sullivan Maintainer Seong-Yun Hong

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

The C Clustering Library

The C Clustering Library The C Clustering Library The University of Tokyo, Institute of Medical Science, Human Genome Center Michiel de Hoon, Seiya Imoto, Satoru Miyano 29 April 2018 The C Clustering Library for cdna microarray

More information

9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis

9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis Introduction ti to K-means Algorithm Wenan Li (Emil Li) Sep. 5, 9 Outline Introduction to Clustering Analsis K-means Algorithm Description Eample of K-means Algorithm Other Issues of K-means Algorithm

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

What to come. There will be a few more topics we will cover on supervised learning

What to come. There will be a few more topics we will cover on supervised learning Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results Statistical Analysis of Microarray Data Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation of clustering results Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be

More information

Fathom Dynamic Data TM Version 2 Specifications

Fathom Dynamic Data TM Version 2 Specifications Data Sources Fathom Dynamic Data TM Version 2 Specifications Use data from one of the many sample documents that come with Fathom. Enter your own data by typing into a case table. Paste data from other

More information

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Classification of Hyperspectral Breast Images for Cancer Detection. Sander Parawira December 4, 2009

Classification of Hyperspectral Breast Images for Cancer Detection. Sander Parawira December 4, 2009 1 Introduction Classification of Hyperspectral Breast Images for Cancer Detection Sander Parawira December 4, 2009 parawira@stanford.edu In 2009 approximately one out of eight women has breast cancer.

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\ Data Preprocessing Javier Béjar BY: $\ URL - Spring 2018 C CS - MAI 1/78 Introduction Data representation Unstructured datasets: Examples described by a flat set of attributes: attribute-value matrix Structured

More information

Data Clustering. Danushka Bollegala

Data Clustering. Danushka Bollegala Data Clustering Danushka Bollegala Outline Why cluster data? Clustering as unsupervised learning Clustering algorithms k-means, k-medoids agglomerative clustering Brown s clustering Spectral clustering

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Motion estimation for video compression

Motion estimation for video compression Motion estimation for video compression Blockmatching Search strategies for block matching Block comparison speedups Hierarchical blockmatching Sub-pixel accuracy Motion estimation no. 1 Block-matching

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

Hierarchical Clustering 4/5/17

Hierarchical Clustering 4/5/17 Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction

More information

Exploratory Analysis: Clustering

Exploratory Analysis: Clustering Exploratory Analysis: Clustering (some material taken or adapted from slides by Hinrich Schutze) Heejun Kim June 26, 2018 Clustering objective Grouping documents or instances into subsets or clusters Documents

More information

This research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight

This research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight This research aims to present a new way of visualizing multi-dimensional data using generalized scatterplots by sensitivity coefficients to highlight local variation of one variable with respect to another.

More information

[7.3, EA], [9.1, CMB]

[7.3, EA], [9.1, CMB] K-means Clustering Ke Chen Reading: [7.3, EA], [9.1, CMB] Outline Introduction K-means Algorithm Example How K-means partitions? K-means Demo Relevant Issues Application: Cell Neulei Detection Summary

More information

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..

More information