Gene Clustering & Classification


BINF, Introduction to Computational Biology
Gene Clustering & Classification
Young-Rae Cho, Associate Professor, Department of Computer Science, Baylor University

Overview: Introduction to Gene Clustering; Partition-Based Clustering Methods; Hierarchical Clustering Methods; Validation of Clustering; Gene Classification & Sample Classification

What is Clustering?
A cluster is a group of data objects that are similar (or related) to one another within the same group and dissimilar (or unrelated) to the objects in different groups. Clustering (or cluster analysis) finds similarities between data objects and groups similar data objects into the same clusters. It is unsupervised learning: there are no pre-defined classes. Applications: a stand-alone method for data analysis, or a preprocessing step for other data analysis.

Measuring Quality of Clustering
High-quality clusters have high intra-class similarity (cohesiveness within clusters) and low inter-class similarity (separability between clusters). The quality of clustering depends on the clustering method (handling both cohesiveness and separability, and the ability to discover hidden patterns), on defining "similar enough" (the problem of determining a threshold), and on the data set (amount of data, complexity of the data type, high dimensionality).

Gene Clustering
Definition: grouping the genes which have similar features. Purpose: finding a group of genes that perform the same functions. Features: sequence similarity, motif inclusion, structure similarity, expression profile coherence, interaction evidence.

Microarray Experiment
Microarrays estimate gene expression levels under varying conditions / time points by measuring the amount of mRNA for each particular gene. Microarray experiment process: produce cDNA from mRNA (DNA is more stable); attach a phosphor to the cDNA to see when a particular gene is expressed (different-color phosphors are available to compare many samples at once); hybridize the cDNA over the microarray; scan the microarray with a phosphor-illuminating laser (illumination reveals transcribed genes); scan the microarray multiple times, once for each phosphor color.

Microarray Experiment (cont.)
[Figure: phosphors under laser illumination; source: www.affymetrix.com]

Expression Data
Microarray results: green = expressed in the control sample only; red = expressed in the experimental sample only; yellow = expressed in both samples; black = not expressed in either sample. The intensity of the colors is quantified into numeric values, so each gene has an expression value for each experimental condition or each time point.

[Table: expression values of each gene at time points X, Y, Z]

Expression Data (cont.)
Data conversion: n-dimensional mapping, then a pairwise distance matrix.

Clustering Expression Data
Cohesiveness and separability principles: [Figure: one grouping that violates both the cohesiveness and separability principles, and one that satisfies both]. Co-expression means coherent expression patterns of genes across experimental conditions or time points; a good clustering puts more co-expressed genes within the same clusters and fewer co-expressed genes in different clusters.

Distance / Similarity Measures
Minkowski distance: given two data points X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n} on n dimensions,

d(X, Y) = ( \sum_{i=1}^{n} |x_i - y_i|^p )^{1/p}

This is the Euclidean distance when p = 2 and the Manhattan distance when p = 1.

Pearson coefficient: evaluates the correlation between two data points on n dimensions,

r = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / ( (n - 1) s_x s_y )

where the numerator is the co-variance between X and Y and s_x, s_y are the individual standard deviations of X and Y. If r > 0, X and Y are positively correlated; if r = 0, X and Y are uncorrelated; if r < 0, X and Y are negatively correlated.

Overview: Introduction to Gene Clustering; Partition-Based Clustering Methods; Hierarchical Clustering Methods; Validation of Clustering; Gene Classification & Sample Classification
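As an illustration (not part of the lecture), the two measures above can be written directly in Python. The helper names `minkowski` and `pearson` are our own; the Pearson version divides by the square roots of the sums of squared deviations, which is algebraically equivalent to the (n-1) s_x s_y form since the (n-1) factors cancel.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance between two n-dimensional points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def pearson(x, y):
    """Pearson correlation coefficient between two n-dimensional points."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(minkowski(x, y, 2))  # Euclidean distance (p = 2)
print(minkowski(x, y, 1))  # Manhattan distance (p = 1)
print(pearson(x, y))       # ~1.0: perfectly positively correlated
```

Note that Pearson correlation captures the shape of a profile rather than its magnitude, which is why it is popular for comparing expression profiles.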

Partition-Based Methods
Main idea: constructing the best partition of the data with n objects into k clusters. Issue: finding a partition that optimizes the criterion of cluster quality: high intra-class similarity (cohesiveness) and low inter-class similarity (separability). Methods: the theoretical method enumerates exhaustively all possible partitions and selects the best one; heuristic methods include k-means and k-medoids.

k-means
Main idea: finding the best partition of n data objects into k clusters in which each object belongs to the cluster with the nearest mean. Process: (1) partition the data objects randomly into k clusters; (2) compute the mean point of the objects in each cluster as its centroid; (3) assign each object to the nearest centroid, generating k new clusters; (4) repeat (2) and (3) until there is no change of the objects in each cluster.

Example of k-means
Random partition into k clusters, then: compute cluster means, re-assign objects, compute cluster means, re-assign objects, and so on.

Strength and Weakness of k-means
Strength: relatively efficient, O(tkn) time for n objects, k clusters, and t iterations. Weakness: needs the number of clusters, k, to be specified in advance; sensitive to noise and outliers; not suitable for detecting clusters with non-convex shapes; sometimes falls into a local optimum instead of identifying the globally optimal clusters.
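The four-step k-means process above can be sketched in a few lines of Python. This is a minimal toy illustration (not the lecture's code); the random-partition initialization and the no-change stopping rule follow the process described on the slide.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: random initial partition, then iterate
    (compute centroids -> reassign to nearest centroid) until stable."""
    rng = random.Random(seed)
    # Step 1: partition the objects randomly into k clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(iters):
        # Step 2: the centroid is the mean point of each cluster.
        cents = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if not members:  # keep an empty cluster alive with a random point
                members = [rng.choice(points)]
            cents.append([sum(col) / len(members) for col in zip(*members)])
        # Step 3: assign each object to the nearest centroid.
        new = [min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))
               for p in points]
        if new == labels:  # step 4: stop when no object changes cluster
            break
        labels = new
    return labels

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(pts, 2))  # two tight groups get two different labels
```

The need to fix k up front and the dependence on the random initial partition (a local optimum) are exactly the weaknesses listed above.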

Overview: Introduction to Gene Clustering; Partition-Based Clustering Methods; Hierarchical Clustering Methods; Validation of Clustering; Gene Classification & Sample Classification

Hierarchical Methods
Main idea: decomposing the data objects into several levels of nested partitioning (a tree of clusters). [Figure: dendrogram over objects a, b, c, d, e; reading the steps in one direction gives the bottom-up approach (agglomerative algorithm), in the other direction the top-down approach (divisive algorithm)]

Agglomerative Algorithm
Process: start with all single-node clusters; iteratively merge the closest (the most similar) clusters; eventually, all nodes belong to one cluster.

Distance Measures between Clusters
Single-link distance: d(C_i, C_j) = min over x in C_i, y in C_j of d(x, y)
Complete-link distance: d(C_i, C_j) = max over x in C_i, y in C_j of d(x, y)
Average-link distance: d(C_i, C_j) = (1 / (n_i n_j)) \sum_{x in C_i} \sum_{y in C_j} d(x, y)
Centroid distance: d(C_i, C_j) = d(m_i, m_j), where m_i and m_j are the means of C_i and C_j
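The four inter-cluster distance measures can be written as small Python helpers (illustrative names, not from the slides), using Euclidean distance between points:

```python
import math

def single_link(A, B):
    """Minimum pairwise distance between the two clusters."""
    return min(math.dist(x, y) for x in A for y in B)

def complete_link(A, B):
    """Maximum pairwise distance between the two clusters."""
    return max(math.dist(x, y) for x in A for y in B)

def average_link(A, B):
    """Mean of all |A|*|B| pairwise distances."""
    return sum(math.dist(x, y) for x in A for y in B) / (len(A) * len(B))

def centroid_dist(A, B):
    """Distance between the two cluster means."""
    mA = [sum(c) / len(A) for c in zip(*A)]
    mB = [sum(c) / len(B) for c in zip(*B)]
    return math.dist(mA, mB)

A = [(0.0, 0.0), (0.0, 2.0)]
B = [(4.0, 0.0), (4.0, 2.0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_dist(A, B))
```

Single-link tends to produce elongated "chained" clusters, while complete-link favors compact ones; average-link and centroid distance fall in between.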

Comparison of Distance Measures
Single-link distance vs. complete-link distance vs. average-link distance vs. centroid distance.

Divisive Algorithm
Process: start with one single cluster containing all nodes; iteratively divide a cluster at its farthest (most dissimilar) parts; eventually, every cluster has a single node.

Strength and Weakness of Hierarchical Methods
Strength: does not require the number of clusters, k, in advance. Weakness: requires a stopping condition; sensitive to noise; not able to undo what was done previously; not scalable, taking at least O(n^2) time for n objects.

Overview: Introduction to Gene Clustering; Partition-Based Clustering Methods; Hierarchical Clustering Methods; Validation of Clustering; Gene Classification & Sample Classification

Cluster Validation
Definition: assessing the quality of clustering results. Why validate? To avoid finding clusters formed by chance, to compare clustering algorithms, and to choose clustering parameters. Methods: an external index is used when ground truth is available; an internal index when ground truth is unavailable.

Internal Index
Error measures, with m the representative (e.g., centroid) of an object x_i's cluster: absolute error = |x_i - m|; squared error = (x_i - m)^2. The sum of squared error (SSE) measures cohesiveness as the within-cluster sum of squared error (WSS) and separability as the between-cluster sum of squared error (BSS). Relationship between WSS and BSS?
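A sketch of the two internal-index quantities (toy code, not the lecture's). It also answers the slide's closing question: WSS + BSS equals the total sum of squares around the grand mean, a constant for the data set, so minimizing WSS is equivalent to maximizing BSS.

```python
def wss_bss(points, labels, k):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squared error."""
    n = len(points)
    grand = [sum(c) / n for c in zip(*points)]  # grand mean of all objects
    wss = bss = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        m = [sum(col) / len(members) for col in zip(*members)]  # cluster mean
        # Cohesiveness: squared error of each member to its cluster mean.
        wss += sum(sum((a - b) ** 2 for a, b in zip(p, m)) for p in members)
        # Separability: cluster size times squared distance to the grand mean.
        bss += len(members) * sum((a - b) ** 2 for a, b in zip(m, grand))
    return wss, bss

pts = [(0.0,), (2.0,), (10.0,), (12.0,)]
w, b = wss_bss(pts, [0, 0, 1, 1], 2)
print(w, b)  # prints 4.0 100.0; their sum is the total SS (104.0)
```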

External Index
Notations: N is the total number of data objects; C = {C_1, C_2, ..., C_n} is the set of clusters reported by a clustering algorithm; P = {P_1, P_2, ..., P_m} is the set of ground-truth clusters.
Incidence matrix (N x N): C_ij = 1 if the two data objects O_i and O_j belong to the same cluster in C, and C_ij = 0 otherwise; P_ij = 1 if O_i and O_j belong to the same ground-truth cluster in P, and P_ij = 0 otherwise.
Result categories:
SS: C_ij = 1 and P_ij = 1 (agree)
DD: C_ij = 0 and P_ij = 0 (agree)
SD: C_ij = 1 and P_ij = 0 (disagree)
DS: C_ij = 0 and P_ij = 1 (disagree)
Rand index: Rand = (SS + DD) / (SS + DD + SD + DS)
Jaccard index: Jaccard coefficient = SS / (SS + SD + DS)
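The pair counts and both indices can be computed directly from cluster labels (a toy sketch; `pair_counts` is our own helper name):

```python
def pair_counts(c_labels, p_labels):
    """Count SS, SD, DS, DD over all object pairs: S/D means the pair is in
    the Same/Different cluster in C (first letter) and in P (second letter)."""
    ss = sd = ds = dd = 0
    n = len(c_labels)
    for i in range(n):
        for j in range(i + 1, n):
            same_c = c_labels[i] == c_labels[j]
            same_p = p_labels[i] == p_labels[j]
            if same_c and same_p:
                ss += 1
            elif same_c:
                sd += 1
            elif same_p:
                ds += 1
            else:
                dd += 1
    return ss, sd, ds, dd

C = [0, 0, 1, 1, 1]   # clustering result
P = [0, 0, 0, 1, 1]   # ground truth
ss, sd, ds, dd = pair_counts(C, P)
rand = (ss + dd) / (ss + sd + ds + dd)
jaccard = ss / (ss + sd + ds)
print(rand, jaccard)  # Rand = 0.6, Jaccard ~ 0.33
```

The Jaccard index ignores the DD pairs, which usually dominate, so it is stricter than the Rand index.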

f-measure
Recall & precision compare an output cluster with a ground-truth cluster. Let X be an output cluster and Y a ground-truth cluster:
Recall (sensitivity, or true positive rate) = |X ∩ Y| / |Y|
Precision (positive predictive value) = |X ∩ Y| / |X|
The f-measure is the harmonic mean of recall and precision:
f-measure = 2 (Recall × Precision) / (Recall + Precision)

Statistical p-value
p-value of the hyper-geometric distribution: let N be the number of all data objects, X an output cluster, and Y a ground-truth cluster. The probability that at least k data objects in X are included in Y, where k = |X ∩ Y|, is

p = \sum_{i=k}^{|X|} [ C(|Y|, i) · C(N - |Y|, |X| - i) ] / C(N, |X|)

A low p-value indicates that it is less probable that the cluster X was produced by chance; -log(p) is usually used for clustering evaluation.
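Both evaluation measures above are short to implement with Python sets and `math.comb` (an illustrative sketch, not the lecture's code):

```python
from math import comb

def f_measure(X, Y):
    """F-measure of an output cluster X against a ground-truth cluster Y."""
    inter = len(X & Y)
    recall = inter / len(Y)
    precision = inter / len(X)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def hypergeom_pvalue(N, X, Y):
    """P(at least |X ∩ Y| of the |X| sampled objects fall in Y) under the
    hyper-geometric distribution over N objects."""
    k = len(X & Y)
    total = comb(N, len(X))
    return sum(comb(len(Y), i) * comb(N - len(Y), len(X) - i)
               for i in range(k, len(X) + 1)) / total

X = {1, 2, 3, 4}   # output cluster
Y = {3, 4, 5, 6}   # ground-truth cluster
print(f_measure(X, Y))            # 0.5 (recall = precision = 0.5)
print(hypergeom_pvalue(20, X, Y)) # ~0.162: this overlap is unremarkable
```

`math.comb` returns 0 when the lower argument exceeds the upper one, so the sum handles the boundary terms automatically.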

Overview: Introduction to Gene Clustering; Partition-Based Clustering Methods; Hierarchical Clustering Methods; Validation of Clustering; Gene Classification & Sample Classification

Supervised vs. Unsupervised Learning
Supervised learning is called classification: training data (observations, measurements, etc.) are given, and the training data include predefined class labels. Rules or models of the class labels are found from the training data, and new data are classified based on those rules or models. Unsupervised learning is called clustering: no training data are given, and new data are grouped without any training data.

Classification vs. Prediction
Classification: trains on class labels given as an attribute of a training data set, and predicts the class labels of a new data set based on the rules or models of the class labels of the training data set. Prediction: models continuous-valued functions for a data set, and predicts unknown or missing values in the data set.

Gene Classification
Definition: finding the functions of unknown genes by training on expression data of known (functionally characterized) genes. Expression data: expression levels (numeric values) for each gene across time points, i.e., time-series expression data for each gene (gene-time expression data).

Gene Classification (cont.)
Assumption: genes having the same functions are co-expressed (co-expression: coherent expression patterns of genes across time points). Process: (1) training step: finding coherent patterns of the time-series expression profiles for each class; (2) predicting step: matching the time-series expression profile of an unknown gene to the patterns.

[Figure: examples of time-series expression data]
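One simple way to implement the matching step, sketched under the assumption that each class is summarized by a mean time-series pattern and that Pearson correlation measures coherence (the lecture does not fix a particular matching rule; `classify_gene` is our own name):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def classify_gene(profile, class_patterns):
    """Assign the unknown gene to the functional class whose mean
    time-series pattern its profile is most correlated with."""
    return max(class_patterns, key=lambda c: pearson(profile, class_patterns[c]))

patterns = {
    "induced":   [1.0, 2.0, 4.0, 8.0],   # expression rising over time
    "repressed": [8.0, 4.0, 2.0, 1.0],   # expression falling over time
}
print(classify_gene([0.9, 2.2, 3.8, 7.5], patterns))  # "induced"
```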

Sample Classification
Definition: finding the class (disease) of unknown samples by training on expression data of known samples. Expression data: expression levels (numeric values) for each gene across samples at a certain time point (gene-sample expression data). Process: (1) informative gene selection in a training data set; (2) class prediction of unknown samples.

Informative Gene Selection
Purpose: eliminating irrelevant genes and selecting significant (informative) genes for class prediction. [Figure: ideal case of two genes whose expression cleanly separates the samples of two diseases]

Informative Gene Selection (cont.)
Statistical approach: selects the genes expressed most differently between the classes. It computes statistical information (mean, standard deviation) of the expression values of the samples in each class, measures the correlation of expression between the classes for each gene, and ranks the genes by the correlation metric.

[Figure: example of informative genes across samples]
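As an illustration of the statistical approach, the sketch below ranks genes by a commonly used correlation metric, the signal-to-noise statistic (mu_1 - mu_2) / (sigma_1 + sigma_2); the lecture does not name a specific metric, so this choice and the helper names are our assumptions:

```python
from statistics import mean, stdev

def snr(values, labels):
    """Signal-to-noise ratio (mu1 - mu2) / (sigma1 + sigma2) of one gene's
    expression values, split by the two class labels (1 and 2)."""
    a = [v for v, l in zip(values, labels) if l == 1]
    b = [v for v, l in zip(values, labels) if l == 2]
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def rank_genes(expr, labels, top):
    """Rank genes by |SNR| and return the indices of the `top` most
    discriminative (informative) genes."""
    scores = [abs(snr(row, labels)) for row in expr]
    return sorted(range(len(expr)), key=lambda g: -scores[g])[:top]

# Rows = genes, columns = samples; labels give each sample's class.
expr = [
    [1.0, 1.1, 0.9, 5.0, 5.2, 4.8],   # strongly class-dependent expression
    [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],   # flat across both classes
]
labels = [1, 1, 1, 2, 2, 2]
print(rank_genes(expr, labels, 1))  # [0]: only the first gene is informative
```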

Class Prediction
Weighted voting approach: for each unknown sample, each informative gene votes for either class 1 or class 2, based on whether its expression value in the sample is closer to the typical expression of class 1 or of class 2. The weighted vote of a gene g is computed as w(g) v(g), where the vote v(g) is derived from x(g), the expression value of g in the sample, and the weight w(g) reflects how differentially g is expressed between the two classes.

Advanced Topics
Sample classification in time-series data: classifying samples using time-series expression data for each gene (gene-sample-time expression data). Bi-clustering: simultaneous clustering of both genes and conditions (or time points).
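The slide does not spell out w(g) and v(g), so the sketch below is one concrete instantiation, following the well-known Golub-style weighted voting: weight w = (mu1 - mu2) / (s1 + s2), vote v = x - (mu1 + mu2) / 2, and the sign of the summed weighted votes picks the class. All names here are our own.

```python
from statistics import mean, stdev

def train_gene(values, labels):
    """Per-gene parameters for weighted voting: weight w = (mu1 - mu2)/(s1 + s2)
    and decision boundary b = (mu1 + mu2)/2, from the training samples."""
    a = [v for v, l in zip(values, labels) if l == 1]
    b = [v for v, l in zip(values, labels) if l == 2]
    w = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
    return w, (mean(a) + mean(b)) / 2

def predict(expr, labels, sample):
    """Each informative gene casts the vote w(g) * (x(g) - b(g)); the sign of
    the summed votes picks class 1 (positive) or class 2 (negative)."""
    total = 0.0
    for row, x in zip(expr, sample):
        w, b = train_gene(row, labels)
        total += w * (x - b)
    return 1 if total > 0 else 2

# Rows = informative genes, columns = training samples.
expr = [
    [1.0, 1.2, 0.8, 5.0, 5.3, 4.7],   # high in class 2
    [6.0, 6.2, 5.8, 2.0, 2.1, 1.9],   # high in class 1
]
labels = [1, 1, 1, 2, 2, 2]
print(predict(expr, labels, [1.1, 5.9]))  # → 1: resembles the class-1 samples
```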

Questions? Lecture slides are available on the course website, web.ecs.baylor.edu/faculty/cho/