Grid-Layout Visualization Method in the Microarray Data Analysis Interactive Graphics Toolkit

Li Xiao, Oleg Shats, and Simon Sherman *
Nebraska Informatics Center for the Life Sciences, Eppley Institute for Research in Cancer and Allied Diseases, University of Nebraska Medical Center, Omaha, NE
{lxiao, oshats, ssherm}@unmc.edu
* Corresponding author

Abstract

The expression levels of thousands of genes in different tissues or cells under different conditions can be measured simultaneously by DNA microarray technology. A new grid-layout method for visualizing the results of hierarchical cluster analysis of DNA microarray data is proposed and incorporated into the Microarray Interactive Graphics Toolkit (MIGT). The grid-layout consists of a set of regular, two-dimensional grid units. Each unit represents a cluster or a group of gene clusters. The units are connected to adjacent ones by the neighborhood relation of the clusters in a hierarchical tree. Nodes lying near each other in the hierarchical tree are mapped onto nearby grid-layout units. The number of units may vary from a few dozen up to several thousand, depending on the number of nodes in the hierarchical tree. Colors are assigned to the units as RGB values according to the coordinates of the units, the inter-distances (the distances between clusters in the hierarchical tree), and the intra-distances (the distances between genes within one cluster). The closer the inter-distances, the more similar the colors of the units; the smaller the intra-distances, the warmer the color of the unit.

1. Introduction

DNA microarrays exploit the preferential binding of complementary, single-stranded molecular fragments of DNA that are attached at fixed locations (spots) on glass slides. There may be up to tens of thousands of spots on a slide, each representing a single gene. The microarray technology offers an opportunity to simultaneously screen the expression patterns of a large number of distinct genes.

An aim of microarray analysis is to identify genes differentially expressed in the target cells, as compared to the reference cells. The difference between expression profiles from cell samples can be quantitatively characterized. This allows researchers to track the effect of interventions or natural processes on gene expression levels, as well as to identify the functions of genes and the biochemical pathways they participate in. For the analysis of microarray data, cluster analysis methods [1-4] and methods for the identification of genes differentially expressed in the target cells, as compared to the reference cells [5], can be utilized. The cluster approach holds much promise for determining groups of genes with a similar function. A measure of correlation between gene expressions can be used to cluster together genes with similar expression.

Although both academic [7] and commercial [8] software for cluster analysis is already available, there is a great need for proper visualization of the results. Because the microarray data sets involved in clustering are very large, the dendrogram, which is usually used for presenting the result of hierarchical clustering, does not work well. Therefore, in this paper, we present a novel grid-layout method that is useful for visualizing the results of the hierarchical clustering of gene expressions obtained by DNA array experiments. The proposed method allows us to present the subgroups of genes in a hierarchical cluster tree as colored rectangular or hexagonal units, which become parts of a grid-layout map. In the hierarchical tree, the subgroups are divided by a cut-off value defined by the user.
The distances between these subgroups are represented by the colors assigned to the corresponding units, in such a way that neighboring subgroups have similar colors, expressed in the RGB format. Moreover, the RGB value assigned to the grid unit of a subgroup also reflects the scale of the distances between genes within that subgroup. The grid-layout method thus presents both the distances between the subgroups in a hierarchical tree and the distances within subgroups. Therefore, this method can simplify the visual presentation of hierarchical clustering of large microarray data sets.

The proposed grid-layout method is incorporated into the Microarray Interactive Graphics Toolkit (MIGT) that is under construction in our laboratory. A set of tools for preliminary analysis of microarray data, such as normalization, scatter plotting, and hierarchical clustering, is provided in the MIGT as well.

In the next section, the grid-layout method proposed for visualizing the results of hierarchical clustering of microarray data is described. Section 3 describes the MIGT software package, in which this grid-layout method and its prerequisite, hierarchical clustering for DNA microarray analysis, are implemented. Section 4 provides conclusions and directions for future work.

/03 $17.00 (C) 2003 IEEE

2. Grid-layout Visualization Method for the Hierarchical Clustering of Microarray Data

In the field of microarray data analysis, hierarchical trees (dendrograms) are usually utilized for visualizing and analyzing the results of hierarchical clustering [7, 8]. When there are more than 30 nodes in the original data, however, the corresponding dendrograms may look crowded, and visual analysis of the clustering results becomes very difficult. For example, in the TreeView software package developed at Stanford University [7], enlarged dendrograms can be viewed in a window with a vertical scroll bar. The user can observe the genes within one cluster or between neighboring clusters, but the inter-distances between all clusters are not well exhibited because of the large size of the original data. In this paper, we propose a method for analyzing the results of hierarchical clustering, which we call the grid-layout method.
An analogous method was initially used for visualizing the results of self-organizing maps [10, 11], and we propose to use it to visualize the results of the hierarchical clustering of microarray data. The grid-layout consists of a regular, two-dimensional grid of units. Each unit is a cluster or a group of clusters. The units are connected to adjacent ones by the neighborhood relation of the clusters in a hierarchical tree. Nodes lying near each other in the hierarchical tree are mapped onto nearby grid-layout units. The number of units can vary from a few dozen up to several thousand, depending on the number of nodes in the hierarchical tree. Colors are assigned to the units as RGB values. These assignments depend on the coordinates of the units, the inter-distances between clusters, and the intra-distances between genes within one cluster. The closer the inter-distances, the more similar the colors of the units; the smaller the intra-distances, the warmer the color of the unit. The grid-layout method consists of the three steps described below.

Step 1. Determining the number of the grid-layout units. There are two ways to cut a tree: (i) by giving the height at which the hierarchical tree will be cut, and (ii) by giving the maximum number nc of branches to be kept in the hierarchical tree, where nc is less than or equal to the number of grid-layout units nu. In this paper, we use the latter method. When the user predefines the number nc of clusters to be exhibited in the grid-layout, the number of units nu, which has to be greater than or equal to nc, can be found in the following way. The width of the grid-layout, gw, is estimated from nc by rounding toward minus infinity: $gw = \lfloor \sqrt{nc} \rfloor$. The height of the grid-layout, gl, is estimated from the ratio nc/gw by rounding toward infinity: $gl = \lceil nc / gw \rceil$. Then the number of units is calculated as $nu = gw \times gl$, and the number of remaining empty grids is $reg = nu - nc$.
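
The arithmetic of Step 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming that the width is the floor of the square root of nc and that the height is the ceiling of nc/gw; the function name `grid_dimensions` is hypothetical and is not part of the MIGT.

```python
import math

def grid_dimensions(nc):
    """Return (gw, gl, nu, reg) for nc clusters (hypothetical helper)."""
    if nc < 1:
        raise ValueError("nc must be a positive integer")
    gw = math.isqrt(nc)   # width: floor of the square root of nc
    gl = -(-nc // gw)     # height: ceiling of nc / gw, using integer arithmetic
    nu = gw * gl          # total number of grid-layout units
    reg = nu - nc         # units left empty
    return gw, gl, nu, reg

print(grid_dimensions(18))  # prints (4, 5, 20, 2)
```

Integer arithmetic is used for both roundings so that large values of nc are not affected by floating-point error.
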
For example, if the number of clusters, nc, is 18, then the width of the grid-layout, gw, is 4, the height of the grid-layout, gl, is 5, and the number of units, nu, is 20. The number of remaining empty grids, reg, is 2.

Step 2. Calculating coordinates of the units. The user can select the shape of the lattice, either rectangle R or hexagon H, and can also determine the size of a single rectangle R or a single hexagon H. The coordinate matrix of the grid-layout is calculated by multiplying the length of each R, or the width of each H, by the numbers of units along the height and the width of the grid-layout obtained in Step 1. The hierarchical dendrogram (tree) is split into the given number nc of subclusters, taken from left to right on the dendrogram. The ordered subclusters are placed on the grid-layout one by one, starting from the upper-left corner of the coordinate matrix and proceeding from top to bottom; placement moves to the next column when the current column is filled. The remaining empty grids, reg, are distributed over the grid-layout according to the larger inter-distances between neighboring subclusters.

Step 3. Assigning colors to the grid-layout units. Colors are assigned to the units in the RGB (Red, Green, Blue) format, in which each color is represented as a mixture of three components: red, green, and blue. Colors can also be specified in the HSB (Hue, Saturation, Brightness) format. As the hue varies from 0 to 1, the resulting color runs through red, yellow, green, cyan, blue, and magenta, and back to red. Saturation varies from 0 to 1: when the saturation is 0, the colors are unsaturated and are simply shades of gray; when the saturation is 1, the colors are fully saturated and contain no white component. The brightness increases as its value varies from 0 to 1.

To assign colors to neighboring grid-layout units, two types of information are taken into account. First, the distances between clusters (inter-cluster distances) and the distances between genes within one cluster (intra-cluster distances) are separately scaled and assigned to two index vectors, which we call the inter-vector and the intra-vector of the subclusters. According to the scaled values of the inter-vector, the hue values of the corresponding subclusters are arranged along the hue scale from 0 to nearly 1 (from red through yellow, green, cyan, and blue to magenta). In this way, neighboring subclusters have similar colors (hues), particularly when there are many subclusters. Second, according to the scaled values of the intra-vector, the saturation values of the corresponding subclusters are assigned on a scale from 0.3 to 1, following the rule that close intra-distances receive close saturation values, and the bigger the intra-distance, the bigger the saturation value. The saturation range of 0.3 to 1 is chosen to avoid the overrepresentation of darker colors. The brightness of all units is set to a fixed value above 0.5, also to avoid the overrepresentation of dark colors. After the HSB values of a subcluster are obtained, the Java function HSBtoRGB is used to convert them into the RGB value of the color with the indicated hue, saturation, and brightness.

3. Microarray Interactive Graphics Toolkit

In the MIGT, commonly used microarray data analysis methods, such as normalization, scatter plotting, and hierarchical clustering, are implemented. The MIGT consists of seven tabbed panes: Load&Edit, Scatter Plot, Normalization, Cluster, Visualization, Web, and About. The Load&Edit pane allows users to load and edit microarray data. The Scatter Plot pane provides users with a 3D scatter plot allowing interactive manipulation of gene expression data. The Normalization pane provides several tools to normalize microarray data. The Cluster pane allows users to perform hierarchical cluster analysis of gene expression data. The Visualization pane allows users to observe the results of the hierarchical clustering by two methods, dendrogram and grid-layout. The Web pane allows users to connect their application to databases on the Web and extract the corresponding biological information on the genes under observation. The About pane provides help information on the current version of the MIGT. In this paper, we describe the Cluster and Visualization panes in detail.

3.1. Cluster Pane

Hierarchical clustering is the most widely used method for the analysis of patterns of gene expression [1-4, 9]. By hierarchical clustering, gene expression is represented as a binary tree in which the most similar patterns are clustered in a hierarchy. There are two major types of hierarchical clustering algorithms: divisive algorithms, which recursively partition the data set until singleton sets are obtained, and agglomerative algorithms, which begin with singleton sets and merge them until the whole data set is obtained. The agglomerative methods, which are far more common, are incorporated in our MIGT package.
In agglomerative methods, the distances between pairs of the elements of a data set are calculated with a chosen distance metric and placed into a list of distance sets, $d_1, d_2, \ldots, d_n$. Then a cost function is used to find the pair of sets $\{d_i, d_j\}$ in the list that is cheapest to merge. Finally, the corresponding singleton sets $x_i$ and $x_j$ are removed from the data set and replaced with $x_i \cup x_j$. This process is repeated until only one set remains. Different distance metrics and cost functions used in agglomerative methods may produce different clusters. The main metrics and cost functions used in agglomerative clustering algorithms are described in detail by Everitt and Dunn [15].

The Cluster pane provides a hierarchical clustering procedure that consists of the following three phases.

Distance Phase: The task in this phase is to find the similarity or dissimilarity between every pair of genes in the microarray data set. The microarray data X is treated as m genes measured in n samples. The distance, calculated with the chosen distance metric, is the measure of similarity or dissimilarity. The distance vector produced in this phase contains the distances between all pairs of genes. Since there are m*(m-1)/2 pairs of genes in X, the size of the distance vector is m*(m-1)/2.

Grouping Phase: In this phase, pairs of genes that are in close proximity are linked together, so that the genes are grouped into a binary, hierarchical cluster tree. The distance vector generated in the Distance Phase is used to determine the proximity of genes to each other. As genes are paired into binary clusters, the newly formed clusters are grouped into larger clusters until a hierarchical tree is formed. The cluster information is returned in a vector with m-1 rows and 3 columns, where m is the number of genes in the samples. The first two columns of the vector contain the indices of the genes or clusters linked in pairs to form the binary tree, and the third column contains the distance between the linked pair.

Division Phase: In this phase, the hierarchical tree is divided into clusters according to a user-predefined threshold value. The genes, grouped as a hierarchical tree in the Grouping Phase, are divided into clusters. The resulting clusters are stored in a vector with two columns, in which the first column holds the gene names and the second column holds the ID of the group to which the gene in the same row belongs.

After all three phases are completed, relationships between subgroups in the hierarchical cluster can be viewed in the Visualization pane. In the MIGT, the data structures of the hierarchical clustering algorithms are compatible with MATLAB [13]. In each phase of clustering, the intermediate data can be opened and inspected directly. The data can be transferred into MATLAB for comprehensive statistical analysis, and data in MATLAB can be transferred into the MIGT for interactive visualization.

3.2. Visualization Pane

In the Visualization pane, two methods, dendrogram and grid-layout, are provided to observe the clustering results. The dendrogram method plots, as a graph, the hierarchical tree information obtained from the Cluster pane. The input parameter of the dendrogram method is the vector generated in the Grouping Phase of the Cluster pane. This vector has three columns and m-1 rows, where m is the number of genes in the microarray samples. The pair of genes in each row of the vector has the cheapest cost under the cost function of the hierarchical clustering. The order of the indices in the vector depends on both the order and the content of the leaf nodes of the dendrogram.
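
The three phases and the three-column structure of the Grouping Phase output can be illustrated with a small sketch. It assumes Euclidean distance and single linkage as the cost function (the text fixes neither choice), cuts the tree by a number of clusters nc rather than by a threshold, and uses a 0-based index convention in which the cluster formed at merge step k receives index m + k. All function names are hypothetical and are not the MIGT API.

```python
import math
from itertools import combinations

def distance_phase(X):
    """Distances between all m*(m-1)/2 pairs of rows (genes) of X."""
    return {(i, j): math.dist(X[i], X[j])
            for i, j in combinations(range(len(X)), 2)}

def grouping_phase(X):
    """Single-linkage agglomeration; returns m-1 rows [a, b, distance]."""
    m = len(X)
    d = distance_phase(X)
    clusters = {i: [i] for i in range(m)}  # cluster index -> member genes

    def cost(pair):
        # Single linkage: smallest gene-to-gene distance between two clusters.
        a, b = pair
        return min(d[min(g, h), max(g, h)]
                   for g in clusters[a] for h in clusters[b])

    tree = []
    for step in range(m - 1):
        a, b = min(combinations(sorted(clusters), 2), key=cost)
        tree.append([a, b, cost((a, b))])
        clusters[m + step] = clusters.pop(a) + clusters.pop(b)
    return tree

def division_phase(tree, m, nc):
    """Keep only the first m - nc merges; returns one group ID per gene."""
    members = {i: [i] for i in range(m)}
    for step, (a, b, _) in enumerate(tree[: m - nc]):
        members[m + step] = members.pop(a) + members.pop(b)
    group = [0] * m
    for gid, genes in enumerate(members.values(), start=1):
        for g in genes:
            group[g] = gid
    return group

# Five hypothetical genes, each measured in a single sample.
X = [[0], [1], [10], [12], [30]]
tree = grouping_phase(X)
for row in tree:
    print(row)
print(division_phase(tree, len(X), nc=3))
```

The exhaustive search over cluster pairs keeps the sketch short; it is far too slow for thousands of genes and only serves to show the shape of the data passed between the phases.
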
In the dendrogram, the numbers along the horizontal axis represent the indices of the genes in the microarray data set. The links between genes are represented as upside-down U-shaped lines, and the height of each U indicates the distance between the linked genes. Beside the plot of the dendrogram, there are two lists on the Visualization pane: the Gene list, which shows the names of the genes used for hierarchical clustering, and the Highlights list, which shows information on the highlighted genes. The user draws a rectangle with the mouse over the part of the dendrogram of interest, and the leaf nodes within the rectangle are highlighted when the mouse button is released. The gene information of the highlighted leaf nodes is then shown in the Highlights list on the Visualization pane.

The grid-layout method plots the hierarchical tree information on a grid-layout consisting of a two-dimensional grid of units. The input parameters of the grid-layout method are the vector generated by the Grouping Phase and the vector generated by the Division Phase in the Cluster pane. Two shapes of grid, rectangle and hexagon, are available. Two kinds of grid-layout are provided: one shows the labels of genes on the grids, and the other shows different colors on the grids. As described in Section 2, the subgroups of genes split by the Division Phase are placed on the grids one by one, based on their neighborhood in the hierarchical clustering. This neighborhood can also be seen from the order of the labels along the horizontal axis of the dendrogram. Because gene names are long, only the labels of genes can be shown on the grids. However, showing the labels of all the genes belonging to one grid would clutter the visualization. Therefore, only one gene label is shown in each grid. This label represents the gene that occupies the first place in the subgroup.
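
The placement rule and the two kinds of grid-layout can be sketched as follows. This is only an illustration under several assumptions: it ignores the distribution of empty grids, it takes the inter- and intra-distance indices as already scaled to the range 0 to 1, it stops the hue scale at magenta (5/6), it fixes the brightness at 0.8, and it uses Python's `colorsys.hsv_to_rgb` as a stand-in for the Java HSBtoRGB function mentioned in Section 2. The function names and gene labels are hypothetical.

```python
import colorsys

def place_subgroups(subgroups, gl):
    """Fill the grid column by column, top to bottom, from the upper-left corner.

    Returns {(row, column): label}; the label is the first gene of the subgroup.
    """
    return {(k % gl, k // gl): genes[0] for k, genes in enumerate(subgroups)}

def unit_color(inter_index, intra_index, brightness=0.8):
    """Convert scaled inter-/intra-distance indices in [0, 1] to an RGB triple."""
    hue = inter_index * 5.0 / 6.0         # assumed: red (0) up to magenta (5/6)
    saturation = 0.3 + 0.7 * intra_index  # scaled onto the range 0.3 to 1
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, brightness)
    return tuple(round(255 * c) for c in (r, g, b))

# Hypothetical subgroups in dendrogram order (left to right).
subgroups = [["g1", "g7"], ["g4"], ["g2", "g9", "g3"], ["g5"], ["g8"], ["g6"]]
labels = place_subgroups(subgroups, gl=3)
print(labels[(0, 0)], labels[(0, 1)])        # prints: g1 g5
print(unit_color(0.0, 1.0, brightness=1.0))  # prints: (255, 0, 0)
```

Keeping the label grid and the color function separate mirrors the two kinds of grid-layout: the same positions can be drawn either with the first gene label of each subgroup or with its color.
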
The inter-distances and intra-distances are obtained from the vector generated by the Grouping Phase. The inter-distance of two subgroups of genes in a hierarchical tree is the distance of the top, newly formed binary cluster created by grouping the genes of the first subgroup, minus the distance of the top, newly formed binary cluster created by grouping the genes of the second subgroup. The intra-distance of a subgroup is calculated as the average of the distances between the genes within that subgroup, as recorded in the vector of the Grouping Phase. The colors (RGB values) of the grids are assigned as described in Step 3 of the grid-layout method in Section 2. The genes in a grid of interest can be shown in the Highlights list by clicking on that grid.

The performance of the hierarchical cluster visualization in the MIGT was tested on the Lymphochip microarray data set [6], which has 42 samples of 2041 genes, with genes preferentially expressed in lymphoid cells. After hierarchical clustering of the data, we set up a grid-layout of 20*20 units for the cluster visualization. The grid-layout before assigning color is shown in Figure 1, and the corresponding grid-layout after assigning color is shown in Figure 2. Each label in the grids represents a group of genes. For example, there are nine genes in the first grid (upper-left corner). Due to the size of the grid, only one gene label, '345' (Clone name is ), is shown. Behind gene label '345' are the genes '2145' (Clone= ), '413' (Clone= ), '799' (Clone= ), '211' (Clone= ), '959' (Clone= ), '53' (Clone=17), '1411' (Clone=128088) and '25' (Clone=65). The HSB value of the first grid is (0.6, 0.6, 0.5), and the RGB value of this grid, which has an olive-green color, is (127, 127, 49). We found that the colors of the grids at the upper-right of the layout are red, and the intra-distances of these subgroups are smaller. The colors of the grids at the center of the layout are gray, and the intra-distances of these subgroups are relatively bigger.

Figure 1. The grid-layout before assigning color.

Figure 2. The grid-layout after assigning color.

4. Conclusion and future work

The abilities of visualization and interaction are important for the user to observe and mine microarray data. In this paper, we introduce the grid-layout visualization method for microarray data clustering, as well as the MIGT software designed for microarray data analysis. In the grid-layout method, grids are used to visualize the results of hierarchical clustering. In the MIGT, the grid-layout method is combined with an interactive observation method. The user can focus on gene information through interactive observations, while the grid-layout visualization allows the user to present the results of such observations in a simple way. In addition to the grid-layout visualization method, the basic procedures for microarray data analysis, such as normalization, scatter plotting, and hierarchical clustering, are incorporated in the MIGT.

In future work, more complicated interactive operations on the grid-layout, visualization, and low-dimensional focus should be considered. For instance, the cluster information in one or several grids selected by the user could be expanded into a new grid-layout showing the cluster information of the grids of interest. This capability will also be provided in the new version of the MIGT.

Acknowledgements

This work was partially supported by the NSF EPSCoR grant (# ). This publication was made possible by NIH Grant Number RR15635 from the COBRE Program of the National Center for Research Resources.

References

[1] D.A. Lashkari, J.L. DeRisi, J.H. McCusker, A.F. Namath, C. Gentile, S.Y. Hwang, P.O. Brown, and R.W. Davis, Yeast microarrays for genome wide parallel genetic and gene expression analysis, PNAS, National Academy of Sciences, 1997, 94:
[2] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, Cluster analysis and display of genome-wide expression patterns, PNAS, National Academy of Sciences, 1998, 95:
[3] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, PNAS, National Academy of Sciences, 1999, 96:
[4] R. Herwig, A.J. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O'Brien, Large-scale clustering of cDNA-fingerprinting data, Genome Research, Cold Spring Harbor Laboratory Press, 1999, 9:
[5] M.K. Kerr, M. Martin, and G.A. Churchill, Analysis of variance for gene expression microarray data, Journal of Computational Biology, M.A. Liebert, Inc., 2000, 6:
[6] A.A. Alizadeh, M. Eisen, R.E. Davis, C. Ma, I. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson Jr., L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, and L.M. Staudt, Distinct Types of Diffuse Large B-Cell Lymphoma Identified By Gene Expression Profiling, Nature, Nature Macmillan Publishers, 2000, 403:
[7] Cluster and TreeView, Stanford University,
[8] GeneSight, BioDiscovery Inc.,
[9] P.O. Brown and D. Botstein, Exploring the new world of the genome with DNA microarrays, Nature Genetics, Nature Macmillan Publishers, 1999, 21 (Suppl.),
[10] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela, Self organization of a massive document collection, IEEE Transactions on Neural Networks, IEEE Neural Networks Society, 2000, 11:
[11] T. Kohonen, Self-Organizing Maps, Third Extended Edition, Springer Series in Information Sciences, Vol. 30, Springer, Berlin, Heidelberg, New York, 2001, ISBN , ISSN X.
[12] A. Brazma, A. Robinson, G. Cameron, and M. Ashburner, One-stop shop for microarray data: is a universal, public DNA-microarray database a realistic goal?, Nature, Nature Macmillan Publishers, 2000, 403:
[13] MATLAB (Version 6),
[14] MAExplorer,
[15] B.S. Everitt and G. Dunn, Applied Multivariate Data Analysis, Oxford University Press, New York,


More information

Cluster analysis of 3D seismic data for oil and gas exploration

Cluster analysis of 3D seismic data for oil and gas exploration Data Mining VII: Data, Text and Web Mining and their Business Applications 63 Cluster analysis of 3D seismic data for oil and gas exploration D. R. S. Moraes, R. P. Espíndola, A. G. Evsukoff & N. F. F.

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Objective of clustering

Objective of clustering Objective of clustering Discover structures and patterns in high-dimensional data. Group data with similar patterns together. This reduces the complexity and facilitates interpretation. Expression level

More information

Measure of Distance. We wish to define the distance between two objects Distance metric between points:

Measure of Distance. We wish to define the distance between two objects Distance metric between points: Measure of Distance We wish to define the distance between two objects Distance metric between points: Euclidean distance (EUC) Manhattan distance (MAN) Pearson sample correlation (COR) Angle distance

More information
