High-throughput Data Analysis 2: Cluster Analysis


Overview
- Why clustering?
- Hierarchical clustering
- K-means clustering
- Issues with the above two
- Other methods
- Quality of clustering results

Introduction WHY DO CLUSTERING?

Why clustering?
- Group genes based on common features, e.g. common expression pattern, common phenotype, common time course
- Find common function of genes in a cluster (pathway?)
- Assign function to new genes ("guilt by association")
- Find intrinsic structure in data; hypothesis-free? Does not require (does not use) additional information. Advantage and disadvantage!

Basic Idea [scatter plot: expression in condition 1 vs. expression in condition 2, with points falling into Cluster A and Cluster B] How do you formalize this?

What to Cluster
- Genes: expression (e.g. across tissues, time, individuals, ...), phenotype
- Samples: expression, phenotype
- Individuals: expression, phenotype, genotype
- Species

Visualizing Samples [sample plots: before batch correction, after batch correction]

Bi-Clustering [heat map: genes × samples]

Supervised vs. Unsupervised
- Supervised learning: learn to predict an outcome based on examples (e.g. regression); needs examples (training data)
- Unsupervised learning: find intrinsic structure in data (e.g. clustering); does not need additional data/information

Clustering: Ingredients
- In real life, n >> 2 dimensions
- Group genes that are close in n-dimensional space
- Requires a measure of distance between genes (objects), e.g. Euclidean distance (more later)
- Find clusters of genes that are close to each other, but far from the rest

HIERARCHICAL CLUSTERING

Hierarchical Clustering [scatter plot: expression in condition 1 vs. condition 2, grouped into 2-3 nested clusters]

Joining Rules [scatter plots of expression in condition 1 vs. condition 2, one per rule]
- Average linkage (mean)
- Single linkage (min)
- Complete linkage (max)
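The joining rules can be sketched in plain Python. This is a minimal illustration, assuming Euclidean distance between points; the function names are illustrative, not from the lecture:

```python
import math

def euclid(x, y):
    # Euclidean distance between two points given as tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(cluster_a, cluster_b, rule="average"):
    # Distance between two clusters under a joining rule:
    #   single   -> minimum pairwise distance (min)
    #   complete -> maximum pairwise distance (max)
    #   average  -> mean pairwise distance (mean)
    pairs = [euclid(a, b) for a in cluster_a for b in cluster_b]
    if rule == "single":
        return min(pairs)
    if rule == "complete":
        return max(pairs)
    return sum(pairs) / len(pairs)

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(linkage_distance(A, B, "single"))    # min over all pairs: 3.0
print(linkage_distance(A, B, "complete"))  # max over all pairs: 6.0
print(linkage_distance(A, B, "average"))   # mean of {4, 6, 3, 5}: 4.5
```

Note that the choice of rule changes which clusters get joined first: single linkage tends to produce long "chained" clusters, complete linkage compact ones.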

Limitations of Hierarchical Clustering Linking cannot be reversed (updated)!

How many clusters? Where do you cut?
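"Cutting" the hierarchy can be sketched as stopping the merging once the desired number of clusters remains. A naive average-linkage sketch in plain Python (illustrative names; a real implementation would cache distances):

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(ca, cb):
    # mean pairwise distance between two clusters
    return sum(euclid(a, b) for a in ca for b in cb) / (len(ca) * len(cb))

def agglomerate(points, k):
    # Naive agglomerative clustering: start with singletons,
    # repeatedly merge the closest pair of clusters, and stop
    # ("cut the tree") when k clusters remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]          # j > i, so this index is still valid
    return clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
two = agglomerate(data, 2)
print(sorted(len(c) for c in two))  # [2, 3]: the two well-separated groups
```

The open question in the slide remains: nothing in the algorithm itself tells you the right k at which to cut.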

DISTANCE MEASURES

Distance Measures
- Euclidean distance (L2 norm)
- Pearson correlation
- Mutual information
- Manhattan distance (L1 norm)
- Other correlation measures
- many other dissimilarity/distance measures

Euclidean distance Distance in (Euclidean) space. Small if absolute values are similar. $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

Pearson Correlation Known from linear regression. Similar if patterns are similar. $p(x, y) = \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_j (x_j - \bar{x})^2 \sum_j (y_j - \bar{y})^2}}$, $d(x, y) = \frac{1 - p(x, y)}{2}$
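The contrast between the two measures can be made concrete with two expression profiles that have the same pattern but different absolute levels. A small sketch (illustrative names; the mapping of correlation to a [0, 1] dissimilarity via (1 - p)/2 is one common convention):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    # sample Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_distance(x, y):
    # map correlation from [-1, 1] onto a [0, 1] dissimilarity
    return (1 - pearson(x, y)) / 2

# two expression profiles with the same shape but different scale
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]
print(euclidean(g1, g2))         # large: absolute values differ
print(pearson_distance(g1, g2))  # ~0: the patterns are identical
```

Euclidean distance would put these two genes in different clusters; correlation-based distance groups them together, which is usually what one wants for expression patterns.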

Euclidean vs. Pearson [line plots of expression level over time]

Weighted Distances Assign weights to parameters: $d(x, y) = \sum_j w_j \, d_j(x_j, y_j)$. Not all parameters may be equally important. Setting all $w_j$ the same may not give all parameters equal influence!! Influence is determined by variance.
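The point about variance can be demonstrated numerically. A minimal sketch (illustrative names; squared differences as the per-parameter dissimilarity):

```python
def weighted_sq_distance(x, y, w):
    # d(x, y) = sum_j w_j * (x_j - y_j)^2  -- per-parameter weights
    return sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, y))

# parameter 1 varies on a scale of ~1, parameter 2 on a scale of ~10
x = [1.0, 100.0]
y = [2.0, 110.0]

equal_w = [1.0, 1.0]                         # nominally "equal" weights...
print(weighted_sq_distance(x, y, equal_w))   # 101.0: dominated by the high-variance parameter

inv_var_w = [1.0, 0.01]                      # down-weight by (rough) inverse variance
print(weighted_sq_distance(x, y, inv_var_w)) # 2.0: both parameters now contribute comparably
```

With equal weights the second parameter's large scale swamps the first; weighting by inverse variance gives each parameter comparable influence.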

Example: Time Course Determine weights by a supervised approach. [time-course plots: AhR-independent vs. AhR-controlled genes]

Tibshirani et al.: "Specifying an appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm."

K-MEANS

K-Means
- Fix the number of clusters a priori (k)
- Optimally split data into k clusters: minimize variance within clusters (minimizes Euclidean distance)

K-Means [scatter plots: expression in condition 1 vs. condition 2]
1. randomly assign genes to k clusters
2. compute centroids
3. re-assign genes to the closest centroid
4. repeat until stable
Repeat many times with random initial conditions.
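The steps above (Lloyd's algorithm) can be sketched in plain Python. This is a minimal illustration with invented names; real implementations add tolerance thresholds and iteration caps:

```python
import random

def dist2(p, q):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_restarts=10, seed=0):
    # Lloyd's algorithm: assign each point to its nearest centroid,
    # recompute centroids, repeat until assignments stop changing.
    # Restart from several random initialisations and keep the best
    # (lowest within-cluster sum of squares), since a single run can
    # land in a poor local optimum.
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        centroids = rng.sample(points, k)      # init centroids at random points
        assignment = None
        while True:
            new_assignment = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                              for p in points]
            if new_assignment == assignment:   # stable: done
                break
            assignment = new_assignment
            for c in range(k):                 # recompute centroids
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centroids[c] = tuple(sum(col) / len(members)
                                         for col in zip(*members))
        wss = sum(dist2(p, centroids[a]) for p, a in zip(points, assignment))
        if best is None or wss < best[0]:
            best = (wss, assignment, centroids)
    return best

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
wss, labels, cents = kmeans(data, 2)
print(labels)  # the two tight groups end up in different clusters
```

The restarts implement the "repeat many times with random initial conditions" advice from the slide.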

K-Means [scatter plots showing poor choices of k: too few clusters, too many clusters]

How to determine k? Try different k. Maximize between-cluster variance versus within-cluster variance. Within-cluster point scatter: $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$ will always decrease as K increases.
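The within-cluster point scatter, and its monotone decrease with K, can be checked directly. A small sketch (illustrative names; partitions chosen by hand for the demonstration):

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def within_scatter(points, labels):
    # W(C) = 1/2 * sum_k sum_{C(i)=k} sum_{C(i')=k} d(x_i, x_i')
    total = 0.0
    for p, li in zip(points, labels):
        for q, lj in zip(points, labels):
            if li == lj:
                total += euclid(p, q)
    return total / 2  # each within-cluster pair was counted twice

data = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
w1 = within_scatter(data, [0, 0, 0, 0, 0, 0])  # K = 1
w2 = within_scatter(data, [0, 0, 1, 1, 1, 1])  # K = 2
w3 = within_scatter(data, [0, 0, 1, 1, 2, 2])  # K = 3
print(w1 > w2 > w3)  # True: W(C) shrinks as K grows
```

Because W(C) always shrinks, its raw value cannot pick k by itself; that is the motivation for the gap statistic and the FOM on the following slides.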

Gap Statistic [plots of W(C) versus K for the real data and for random reference data]

Figure of Merit (FOM) Cross-validation: hide some data, train the method (i.e. identify clusters), test on the hidden data. Hide the data of condition (parameter) e. After clustering, quantify similarity based on e: $FOM(e) = \sqrt{\frac{1}{n} \sum_{k=1}^{K} \sum_{C(i)=k} \left( x_i(e) - \bar{x}_k(e) \right)^2}$

Figure of Merit (FOM) FOM will also always decrease with increasing K. Thus, need to normalize ("adjusted FOM"). For random data, FOM decreases with $\sqrt{\frac{n-K}{n}}$. Thus, $FOM_{adj}(e) = FOM(e) \Big/ \sqrt{\frac{n-K}{n}}$
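The FOM and its adjustment can be sketched directly from the formulas. A minimal illustration (invented names and toy data; in practice the clustering is computed on all conditions except e):

```python
import math

def fom(values_e, labels):
    # FOM(e): root-mean-square deviation of the left-out condition e
    # from each cluster's mean in that condition.
    n = len(values_e)
    total = 0.0
    for k in set(labels):
        members = [v for v, l in zip(values_e, labels) if l == k]
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return math.sqrt(total / n)

def adjusted_fom(values_e, labels):
    # normalise out the systematic decrease with K:
    # for random data, FOM falls like sqrt((n - K) / n)
    n, K = len(values_e), len(set(labels))
    return fom(values_e, labels) / math.sqrt((n - K) / n)

# hidden condition e for 6 genes, clustered (on the other conditions) into 2 groups
e = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = [0, 0, 0, 1, 1, 1]
print(fom(e, labels))
print(adjusted_fom(e, labels))
```

A clustering that predicts the hidden condition well yields a small (adjusted) FOM; summing over all left-out conditions gives the aggregate score used to compare values of k.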

Figure of Merit (FOM) [plot of FOM against k] best k?

Getting K
- Gap statistic: minimize within-cluster variance, compare against random
- FOM: cross-validation; minimize variance in training data, compare against random

Hierarchical vs. K-Means
Hierarchical:
- Variable number of clusters (does not directly imply a number of clusters)
- Cluster assignment fixed
- Any distance measure
- Clusters are deterministic
K-Means:
- Fixed number of clusters
- Cluster assignment dynamic (adaptive)
- Only Euclidean distance (but variations exist)
- Clusters are nondeterministic

OTHER METHODS

K-Medoids Like K-Means, but for arbitrary distance measures. Center defined by the most central object. Test each object in each cluster: computationally very expensive. [illustration: K-Means centroid vs. K-Medoids medoid]

Fuzzy C-Means
- Assign objects to many (or all) clusters with different certainty
- Membership values between 0 and 1
- Considers uncertainty in data & clustering
- Allows for multi-cluster membership (e.g. participating in several pathways)
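The soft membership values can be sketched with the standard fuzzy c-means membership update (illustrative names; m > 1 is the "fuzzifier", with larger m giving softer assignments):

```python
def dist(p, q):
    # Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def memberships(point, centers, m=2.0):
    # fuzzy c-means membership of one object in each cluster:
    # u_k = 1 / sum_j (d_k / d_j)^(2/(m-1))
    d = [dist(point, c) for c in centers]
    if any(di == 0 for di in d):                 # object sits exactly on a center
        return [1.0 if di == 0 else 0.0 for di in d]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d[k] / d[j]) ** exp for j in range(len(d)))
            for k in range(len(d))]

centers = [(0.0, 0.0), (10.0, 0.0)]
u = memberships((2.0, 0.0), centers)
print(u)        # leans toward the first cluster, with some membership in the second
print(sum(u))   # memberships always sum to 1
```

Unlike hard k-means labels, these fractional memberships let a gene belong partly to several clusters, matching the multi-pathway interpretation in the slide.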

Principal Component Analysis (PCA)
- Dimension reduction
- Project data onto a smaller number of dimensions
- Each dimension is a linear combination of the original dimensions
- Reduce high-dimensional data to a smaller number of relevant components (e.g. can be visualized in 3D)

Principal Component Analysis (PCA) [scatter plot: expression in condition 1 vs. condition 2]

Principal Component Analysis (PCA) Very useful for:
- visualizing high-dimensional data
- removing redundancy/dependency in data (note ICA)
- clustering
- detecting (and removing) batch effects or other confounding effects in data

Model-based Clustering Mixture modeling: make assumptions about the data distribution; fit a model (set of distributions) to the data. [density plot with two components, Cluster 1 and Cluster 2] e.g. Gaussian mixture model

Model-based Clustering Mixture modeling:
- Make assumptions about the data distribution
- Works well if the model (i.e. distribution) is known
- May give higher power (additional information used)
- May give spurious results if assumptions are incorrect

ASSESSING CLUSTER QUALITY

What matters?
- Good statistical separation
- Stability of results
- Agreement with external (independent) data
- Biological plausibility

Davies-Bouldin Index Minimize within versus between variance: $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S(Q_i) + S(Q_j)}{S(Q_i, Q_j)}$ where $S(Q_i)$ = average distance to the cluster center and $S(Q_i, Q_j)$ = distance between the centers of $Q_i$ and $Q_j$.
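The index can be sketched directly from its definition, assuming Euclidean distance and clusters given as point lists (illustrative names throughout):

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def avg_dist_to_center(cluster):
    # S(Q_i): average distance of the cluster's members to its center
    c = centroid(cluster)
    return sum(euclid(p, c) for p in cluster) / len(cluster)

def davies_bouldin(clusters):
    # DB = 1/k * sum_i max_{j != i} (S(Q_i) + S(Q_j)) / S(Q_i, Q_j),
    # with S(Q_i, Q_j) the distance between the two centers; lower is better.
    k = len(clusters)
    s = [avg_dist_to_center(c) for c in clusters]
    cents = [centroid(c) for c in clusters]
    total = 0.0
    for i in range(k):
        total += max((s[i] + s[j]) / euclid(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]   # compact, well separated
loose = [[(0, 0), (0, 5)], [(4, 4), (10, 11)]]     # spread out, close together
print(davies_bouldin(tight) < davies_bouldin(loose))  # True: tight clustering scores lower
```

Small within-cluster scatter and large between-center distances both push the index down, which is why minimizing it matches the "within versus between variance" goal.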

Silhouette a(i) = average distance to all other members of i's own cluster; b(i) = average distance to the members of the nearest neighboring cluster; $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$. The average s(i) should be close to 1.
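A small sketch of the silhouette from its definition, assuming Euclidean distance and no singleton clusters (illustrative names; toy data):

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette(points, labels):
    # mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all objects i:
    # a(i) = mean distance to the other members of i's own cluster,
    # b(i) = mean distance to the members of the nearest other cluster.
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        own = [euclid(p, q) for j, (q, lj) in enumerate(zip(points, labels))
               if lj == li and j != i]
        a = sum(own) / len(own)
        b = min(sum(euclid(p, q) for q, lj in zip(points, labels) if lj == lk)
                / sum(1 for l in labels if l == lk)
                for lk in set(labels) if lk != li)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

data = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette(data, [0, 0, 1, 1]))  # close to 1: tight, well-separated clusters
print(silhouette(data, [0, 1, 0, 1]))  # negative: objects sit closer to the other cluster
```

Values near 1 indicate well-separated clusters; values near 0 or below flag objects that sit between clusters or in the wrong one.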

Using external data E.g. expression data: GO enrichment, enrichment of known transcription factor target genes, enrichment of regulatory sequence motifs.

Biological plausibility Very problem-dependent: What is known about the process? Are genes known to be related grouped in one cluster (and vice versa)? When clustering samples: are conditions that are similar grouped together? Are similar cell types/tissues clustering together?

Further Reading
The Elements of Statistical Learning, Hastie et al. http://www-stat.stanford.edu/~tibs/elemstatlearn/
http://machaon.karanagai.com/validation_algorithms.html
http://en.wikipedia.org/wiki/Cluster_analysis