Comparisons and validation of statistical clustering techniques for microarray gene expression data


Susmita Datta and Somnath Datta
Presented by: Jenni Dietrich
Assisted by: Jeffrey Kidd and Kristin Wheeler
Mentor: Dr. Takis Benos
26 June 2003

Outline
- Brief microarray overview
- Purpose of the paper
- Discuss clustering algorithms
- Experiment and results
- Conclusions

Microarrays
- Allow for monitoring of gene expression at the transcript level
- Slide with single-stranded DNA molecules attached at fixed positions (probes)
- Exploit the complementary binding of single-stranded DNA sequences
- Result in a large data set containing expression levels of thousands of genes
- Microarray experiments are often used to track the changes in gene expression
  - Over time
  - In the presence of various agents

Microarray Slide
- Gene expression profiles characterize the dynamic functioning of each gene in the genome
- Expression data can be represented as a matrix where the rows are genes and the columns are samples
- The values in the cells of the matrix represent the expression levels

[Figure: gene expression matrix, with experiments as columns; numerical values are encoded by color]

Experiment Design
- Decide on probes and genes
- Type of microarray
- Data normalization
- Data analysis
  - Identify differentially expressed genes
  - Cluster genes based on expression patterns

Clustering
- Goal of microarray data analysis
  - Identify changing levels of gene expression
  - Correlate the changes to identify sets of genes with similar profiles
- Clustering: group objects into subsets
- Clustering algorithms can be used to group genes that have similar expression patterns

Purpose of the Paper
- Currently, there are no clear guidelines for choosing a clustering algorithm to group genes based on their expression profiles
- This paper evaluated the performance of six different algorithms using a microarray data set on sporulation of budding yeast

Clustering Algorithms
- Hierarchical clustering with correlation, UPGMA (most commonly used algorithm)
- Clustering by K-means
- Diana
- Fanny
- Model-based clustering
- Hierarchical clustering with partial least squares

Algorithms differ in the measure of similarity used when grouping the objects and in the grouping technique. Some need and use previous knowledge about the suspected number of clusters.
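As a rough illustration of the gene expression matrix and of a correlation-based similarity between genes, consider the Python sketch below. The gene names, sample labels, and values are invented for illustration, and Pearson correlation is an assumption: the slides say only "correlation".

```python
from math import sqrt

# Hypothetical expression matrix: rows are genes, columns are samples.
genes = ["geneA", "geneB", "geneC"]
samples = ["t0", "t1", "t2"]
matrix = [
    [1.0, 2.0, 3.0],  # geneA rises over the samples
    [2.0, 4.0, 6.0],  # geneB rises in the same pattern, at twice the level
    [3.0, 2.0, 1.0],  # geneC falls over the samples
]


def profile(gene):
    """Return the expression profile (one row) of a gene."""
    return matrix[genes.index(gene)]


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


similar = pearson(profile("geneA"), profile("geneB"))
opposite = pearson(profile("geneA"), profile("geneC"))
```

With this measure, geneA and geneB count as similar even though their absolute levels differ, because correlation compares the shape of the profiles rather than their magnitude.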

Hierarchical clustering
- Produces a hierarchy of clusters rather than a predefined number of clusters (agglomerative approach)
- Initially, each observation is in its own cluster
- Subsequently, the two closest clusters are combined into a single cluster
- The distance between clusters is measured with the average method: it is the average of the distances between the points in one cluster and those in the other cluster

K-means Clustering
- Uses advance knowledge about the number of clusters to be formed (k clusters)
- Initially, all objects are randomly assigned to one of k clusters
- Objects are moved between clusters in an attempt to minimize the distance between each object and the center of its cluster

Diana
- Divisive clustering method
- All objects start in one cluster and are broken into smaller groups
- Genes with larger dissimilarity are put in different clusters
- Uses the standard Euclidean distance measure

Fanny
- Uses fuzzy logic and produces a probability vector for each observation
- A hard cluster is formed by assigning an observation to the group with the highest probability
- Uses the Manhattan distance measure: d = |x - u| + |y - v|, where (x, y) and (u, v) are two points
- Needs a predefined number of clusters (k)
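The average-linkage rule and the two distance measures named above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the paper: the function names and the toy data are invented, and Euclidean distance is used as the default only for simplicity (the slides pair hierarchical clustering with correlation).

```python
from itertools import combinations


def euclidean(a, b):
    """Standard Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_linkage(c1, c2, points, dist):
    """Average of all pairwise distances between two clusters of point indices."""
    total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, n_clusters, dist=euclidean):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points, dist),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical two-dimensional profiles: two tight pairs and one outlier.
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]]
clusters = agglomerate(profiles, 3)
```

Stopping at a chosen number of clusters is a convenience of this sketch; a full hierarchical method keeps merging and records every merge, which yields the hierarchy the slide refers to.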

Model-based Clustering
- Treats data as a mixture distribution
- Often based on a Gaussian distribution
- Describes each cluster using a probabilistic model
- No predefined number of clusters

Experiment
- Run each of the six clustering algorithms on the yeast sporulation data set
- Use three validation measures to compare the results
  - Average proportion of non-overlap measure
  - Average distance between means measure
  - Average distance measure

Results

[Figure: average proportion of non-overlap and average distance between means measures]

- Based on the average proportion of non-overlap and average distance between means measures:
  - Model-based clustering appears to be the worst
  - Hierarchical clustering with correlation and Fanny seem to be the best
- Based on the average distance measure:
  - Hierarchical clustering performed worst
- Overall, Diana performed consistently well on all three measures
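The slides name the validation measures without defining them. The Python sketch below shows one plausible reading of the average proportion of non-overlap. It assumes, and this is an assumption rather than something stated in the slides, that the clustering on the full data is compared with clusterings obtained after removing part of the data, such as one time point at a time. The function name and the label lists are hypothetical.

```python
def proportion_non_overlap(full_labels, reduced_labelings):
    """Average share of each gene's full-data cluster that is missing
    from the same gene's cluster in a reduced-data run.

    full_labels: cluster label per gene from the full data set.
    reduced_labelings: list of label lists, one per reduced-data run.
    """
    n_genes = len(full_labels)
    total = 0.0
    for reduced in reduced_labelings:
        for g in range(n_genes):
            # Genes sharing a cluster with gene g in each clustering.
            full_cluster = {h for h in range(n_genes) if full_labels[h] == full_labels[g]}
            reduced_cluster = {h for h in range(n_genes) if reduced[h] == reduced[g]}
            overlap = len(full_cluster & reduced_cluster)
            total += 1.0 - overlap / len(full_cluster)
    return total / (n_genes * len(reduced_labelings))


full = [0, 0, 1, 1]
# Same grouping as the full run; only the label names differ in the first run.
stable_score = proportion_non_overlap(full, [[1, 1, 0, 0], [0, 0, 1, 1]])
# A run that splits every original pair.
unstable_score = proportion_non_overlap(full, [[0, 1, 0, 1]])
```

Under this reading a smaller value indicates a more stable algorithm, since its clusters change less when part of the data is withheld.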

Average distance between means measure

Comparison of model profiles
- The smaller the distance from the model profile, the closer the results of that algorithm are to the results of the model.

[Figure: average temporal profiles by group category, with time (hours) on the horizontal axis. Group sizes: I, 52 genes; II, 62 genes; III, 47 genes; IV, 95 genes; V, 158 genes; VI, 61 genes; VII, 5 genes. Source: Chu et al. (1998) Science 282: 699-705]
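The comparison with model profiles can be illustrated with a short Python sketch. It assumes Euclidean distance between a cluster's average temporal profile and the model profile; the slides do not state the exact distance used, and the function names and numbers here are hypothetical.

```python
def average_profile(profiles):
    """Point-by-point mean of the temporal profiles of the genes in one cluster."""
    n = len(profiles)
    return [sum(p[t] for p in profiles) / n for t in range(len(profiles[0]))]


def distance_to_model(cluster_profiles, model_profile):
    """Euclidean distance between a cluster's average profile and a model profile."""
    avg = average_profile(cluster_profiles)
    return sum((a - m) ** 2 for a, m in zip(avg, model_profile)) ** 0.5


# Hypothetical cluster of two genes measured at two time points.
cluster = [[0.0, 2.0], [2.0, 4.0]]
avg = average_profile(cluster)
exact_match = distance_to_model(cluster, [1.0, 3.0])
far_off = distance_to_model([[0.0, 0.0]], [3.0, 4.0])
```

An algorithm whose clusters give small distances to the model profiles agrees more closely with the hand-picked groups.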

Guidelines for choosing a clustering algorithm
1. Produce a visual plot of the first two principal components to determine the method that gives the most separation between groups
2. Check for consistency of the method with temporal observations
3. Compare the average group temporal profiles with the model profiles produced from a known, hand-picked set of genes (training set)

Other aspects
- It may be important to inspect the computational stability as well as the computational time of an algorithm before making a choice

In conclusion
- The clustering algorithm directly affects the interpretation and analysis of the data
- Therefore, careful consideration of the algorithms is necessary before a choice is made
- For this data set, Diana performed consistently well when compared with the model profile and when looking at the three validation measures

References
- Datta, S, and S Datta. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19: 459-466
- Brazma, A, and J Vilo. (2000) Gene expression data analysis. FEBS Letters 480: 17-24
- Quackenbush, J. (2001) Computational Analysis of Microarray Data. Nature Reviews 2: 418-427
- Hastie, T, R Tibshirani, and J Friedman. (2001) The Elements of Statistical Learning, 453-480
- http://www.austinlinks.com/fuzzy/overview.html
- http://www.maths.lth.se/help/r/.r/library/cluster/html/fanny.html