Clustering analysis of gene expression data

Clustering analysis of gene expression data. Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3rd edition (Chapter 9 in the 2nd edition).

Human T cell expression data
The matrix contains 47 expression samples from Lukk et al., Nature Biotechnology 2010.
All samples are from T cells in different individuals.
Only the top 3000 genes with the largest variability were used.
Each value is the log2 of a gene's expression level in a given sample, as measured by microarray technology.
[Figure: micrographs of T cells]

How do you interpret the expression data when you still have many genes and many samples?

Clustering to the rescue!

Clustering is a part of machine learning
Supervised learning: a machine learning technique whereby a system uses a set of training examples to learn how to correctly perform a task.
Example: a bunch of cancer expression samples, each annotated with cancer type/tissue. Goal: predict the cancer type based on the expression pattern.
Unsupervised learning (including clustering): a class of problems in which one seeks to determine how the data are organized, given only unlabeled examples.
Example: a bunch of breast cancer expression samples. Goal: identify several different (as yet unknown) subtypes with different likely outcomes.

What is clustering?
The goal of clustering is to group data points that are close (or similar) to each other.
Usually we need to identify such groups (or clusters) in an unsupervised manner.
Sometimes we take into account prior information (Bayesian methods).
We need to define a distance d_ij between objects i and j; in our case the objects could be either genes or samples.
This is easy in 2 dimensions but hard in 3000 dimensions, so we need to somehow reduce the dimensionality.
[Figure: scatter plot of points forming clusters]

How to define distance?
Correlation coefficient distance: d(X, Y) = 1 − ρ(X, Y) = 1 − Cov(X, Y) / sqrt(Var(X) · Var(Y))
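A minimal Matlab sketch of this distance on an illustrative genes-by-samples matrix X (the variable names and toy data are assumptions, not part of the course data); pdist with the 'correlation' option returns exactly 1 minus the Pearson correlation for every pair of rows.

X = randn(5, 47);                 % illustrative placeholder: 5 genes x 47 samples
D = pdist(X, 'correlation');      % vector of d(X,Y) = 1 - rho(X,Y) for every pair of rows
Dsq = squareform(D);              % the same distances as a symmetric matrix
d12 = 1 - corr(X(1,:)', X(2,:)'); % the same distance computed directly for genes 1 and 2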

Reminder: Principal Component Analysis

Multivariate statistics and Principal Component Analysis (PCA)
Start from a table of n observations in which p variables were measured; form the p x p symmetric matrix R of correlation coefficients.
PCA: diagonalize the matrix R.

Trick: rotate the coordinate axes
Suppose we have a population measured on p random variables X_1, ..., X_p. Note that these random variables represent the p axes of the Cartesian coordinate system in which the population resides.
Our goal is to develop a new set of p axes (linear combinations of the original p axes) in the directions of greatest variability. This is accomplished by rotating the axes.
Adapted from slides by Prof. S. Narasimhan, Computer Vision course at CMU.

Principal Component Analysis (PCA)
The p x p symmetric matrix R of correlation coefficients: R = Z'*Z / n, where Z is the matrix of standardized random variables; since R is a "square" of Z, all eigenvalues of R are non-negative.
Diagonal elements = 1, so tr(R) = p.
R can be diagonalized: R = V*D*V', where D is the diagonal matrix with d(1,1) the largest eigenvalue and d(p,p) the smallest one.
The meaning of V(i,k): the contribution of data type i to the k-th eigenvector.
tr(D) = p; the largest eigenvalue d(1,1) absorbs a fraction d(1,1)/p of all correlations, which can be ~100%.
Scores: Y = Z*V, an n x p matrix. The meaning of Y(n,k): the participation of sample n in the k-th eigenvector.
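As a reminder of the mechanics, here is a minimal Matlab sketch of PCA by diagonalizing the correlation matrix, following the slide's notation; the data matrix X is an illustrative placeholder, not the course data.

X = randn(100, 5);                 % placeholder: n observations of p variables
[n, p] = size(X);
Z = zscore(X, 1);                  % standardize each variable (the flag 1 divides by n)
R = (Z' * Z) / n;                  % p x p correlation matrix; diagonal elements = 1
[V, D] = eig(R);                   % columns of V = eigenvectors, diag(D) = eigenvalues
[d, order] = sort(diag(D), 'descend');
V = V(:, order);                   % d(1) is the largest eigenvalue; sum(d) = tr(R) = p
Y = Z * V;                         % scores: n x p matrix; Y(:,k) = sample participations in eigenvector k
explained = d(1) / p;              % fraction of all correlations absorbed by the first PC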

Back to clustering

Let's work with real cancer data!
Data from Wolberg, Street, and Mangasarian (1994).
Fine needle aspirates = a biopsy for breast cancer. Black dots are cell nuclei; irregular shapes/sizes may mean cancer.
212 cancer patients and 357 healthy (column 1), plus 30 other properties (see table).

Common types of clustering algorithms
Hierarchical: used if you don't know the number of clusters in advance.
Agglomerative: start with N clusters and merge them until one cluster remains.
Divisive: start with 1 cluster and break it up into N.
Non-hierarchical algorithms:
Principal Component Analysis (PCA): plot pairs of the top eigenvectors of the covariance matrix Cov(X_i, X_j) and use the visual information to group points.
K-means clustering: iteratively apply the following two steps: calculate the centroid (center of mass) of each cluster, then assign each point to the cluster with the nearest centroid (see the sketch below).
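A minimal sketch of the two alternating K-means steps in Matlab, on illustrative toy data (all names and numbers here are assumptions for the example); in practice the Statistics Toolbox function kmeans does the same thing.

X = [randn(50, 2); randn(50, 2) + 4];        % toy data: two well-separated groups of points
k = 2;
idx = randi(k, size(X, 1), 1);               % random initial cluster assignment
for iter = 1:100
    C = zeros(k, size(X, 2));
    for c = 1:k
        C(c, :) = mean(X(idx == c, :), 1);   % step 1: centroid (center of mass) of each cluster
    end
    [~, newidx] = min(pdist2(X, C), [], 2);  % step 2: assign each point to the nearest centroid
    if isequal(newidx, idx), break; end      % stop when assignments no longer change
    idx = newidx;
end
% equivalent built-in call: idx = kmeans(X, k);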

Credit: XKCD comics

UPGMA algorithm
A hierarchical agglomerative clustering algorithm. UPGMA = Unweighted Pair Group Method with Arithmetic mean.
Iterative algorithm:
Start with the pair with the smallest d(X, Y).
Cluster these two together and replace them with their arithmetic mean (X+Y)/2.
Recalculate all distances to this new cluster node.
Repeat until all nodes are merged.
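A minimal Matlab sketch of UPGMA using the standard linkage/dendrogram functions ('average' linkage is the UPGMA merging rule); the data matrix here is an illustrative placeholder.

X = randn(10, 4);                         % placeholder: 10 objects, 4 features each
D = pdist(X, 'euclidean');                % all pairwise distances
tree = linkage(D, 'average');             % UPGMA: repeatedly merge the closest pair and
                                          % recompute average distances to the merged node
dendrogram(tree);                         % nested clusters produced by the successive merges
groups = cluster(tree, 'maxclust', 3);    % optionally cut the tree into 3 flat clusters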

Output of UPGMA algorithm

Matlab demo

Human T cell expression data
The matrix contains 47 expression samples from Lukk et al., Nature Biotechnology 2010.
All samples are from T cells in different individuals.
Only the top 3000 genes with the largest variability were used.
Each value is the log2 of a gene's expression level in a given sample, as measured by microarray technology.
[Figure: micrographs of T cells]

Choices of distance metrics in clustergram(..., 'RowPDist', value, ..., 'ColumnPDist', value, ...)

Choices of hierarchical clustering algorithm in clustergram(..., 'Linkage', value, ...)

Clustering group exercise
Each group will analyze a cluster of genes identified in the T cell expression table.
Analyze the table of the top 100 genes by variance in the 47 samples.
Cluster them using:
Group 1 (UPGMA): 'Linkage', 'average', 'RowPDist', 'euclidean'
Group 2: 'Linkage', 'single', 'RowPDist', 'cityblock'
Group 3: 'Linkage', 'average', 'RowPDist', 'correlation'
Group 4: 'Linkage', 'single', 'RowPDist', 'euclidean'
Use clustergram(..., 'Standardize', 'Row', 'Linkage' as specified for your group, 'RowPDist' as specified for your group, 'RowLabels', gene_names1, 'ColumnLabels', array_names).

Matlab code
load expression_table.mat
gene_variation = std(exp_t')';               % variability of each gene across samples
[a, b] = sort(gene_variation, 'descend');    % sort genes by variability
ngenes = 100;
exp_t1 = exp_t(b(1:ngenes), :);              % keep the 100 most variable genes
gene_names1 = gene_names(b(1:ngenes));
%%% for group 1
CGobj1 = clustergram(exp_t1, 'Standardize', 'Row', 'RowLabels', gene_names1, 'ColumnLabels', array_names)
set(CGobj1, 'RowLabels', gene_names1, 'ColumnLabels', array_names, 'Linkage', 'average', 'RowPDist', 'euclidean');

Cluster analysis group exercise
Which biological functions are overrepresented in the different clusters?
Pick a cluster: select a node on the tree of rows, right-click, choose "Export group info" into the workspace, and name it gene_list.
Run the following two Matlab commands to display the genes:
g1 = gene_list.RowNodeNames;
for m = 1:length(g1); disp(g1{m}); end;

Search for shared biological functions
Copy the list of displayed genes.
Go to "Start Analysis" on https://david.ncifcrf.gov/tools.jsp.
Paste the gene list into the box in the left panel.
Select ENSEMBL_GENE_ID.
Select the "Gene list" radio button.
Pick "Functional Annotation Clustering": the gene list is broken into multiple groups (clusters), each with related biological functions.
Select the rectangular "Functional Annotation Clustering" button below to display the annotation results.
Write down the number of genes in the cluster and the top functions in the two most interesting clusters.

Before clustering

UPGMA hierarchical clustering, Euclidean distance

UPGMA hierarchical clustering, correlation distance

Credit: XKCD comics

Basic concepts of network analysis

Degree of a node = its number of neighbors.
Example from the figure: node 1 has degree K_1 = 4, node 2 has degree K_2 = 2.

Directed networks have in- and out-degrees.
Example from the figure: in-degree K_in = 2, out-degree K_out = 5.

How to find important nodes?
By their degree: hubs = important (degree centrality).
Google tweaked it in the 1990s: hubs pointed to by other hubs = more important.

How does the Google PageRank algorithm work?
There are too many hits for a typical query, so they need to be ranked.
One could rank the importance of webpages by K_in, but this is too democratic: it doesn't take into account the importance of the webpages sending the links, and it is easy to trick and artificially boost the rank.
Google's solution: simulate the behavior of many random surfers and count the number of hits for every webpage.
Popular pages send more surfers your way, so the Google weight is proportional to K_in weighted by popularity.

Mathematics of the Google
Google solves a self-consistent equation: G_i ∝ Σ_j T_ij G_j, i.e. it finds the principal eigenvector, where T_ij = A_ij / K_out(j).
Problem: surfers get trapped in infinite loops with one entry and no exit.
Model with random jumps: G_i ∝ (1 − α) Σ_j T_ij G_j + (α/N) Σ_j G_j, with α = 0.15, meaning that an average web surfer (circa 1995) jumped through about 1/α ≈ 6 webpages before going somewhere else.
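A minimal Matlab sketch of this iteration on a tiny made-up adjacency matrix (the matrix and all numbers are illustrative assumptions); repeatedly applying the random-jump equation converges to the principal eigenvector.

A = [0 1 1 0; 1 0 0 1; 1 1 0 0; 0 0 1 0];   % toy adjacency matrix: A(i,j) = 1 if page j links to page i
N = size(A, 1);
alpha = 0.15;                               % random-jump probability
Kout = sum(A, 1);                           % out-degree of each page (column sums)
T = A * diag(1 ./ Kout);                    % T(i,j) = A(i,j) / Kout(j)
G = ones(N, 1) / N;                         % start from a uniform distribution over pages
for iter = 1:100
    G = (1 - alpha) * (T * G) + alpha / N;  % follow links with prob. 1-alpha, jump randomly otherwise
end
% G now approximates the Google weight (principal eigenvector) of each page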

How to find important nodes?
By their degree: hubs = important (degree centrality).
By their connectivity: connectors = important (betweenness centrality).

Betweenness centrality: definition
Take a node i. There are (N − 1)(N − 2)/2 pairs of other nodes.
For each pair, find the shortest path on the network; if there is more than one shortest path, sample them equally.
Betweenness centrality C(i) = the probability that a randomly selected pair has its shortest path going through i.
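A sketch of how betweenness centrality could be computed with Matlab's graph tools, on a made-up adjacency matrix; the exact normalization to the probability defined above depends on how undirected node pairs are counted, so the last line is only a rough rescaling.

A = [0 1 1 0 0; 1 0 1 1 0; 1 1 0 0 0; 0 1 0 0 1; 0 0 0 1 0];  % toy undirected network
G = graph(A);                              % build the graph object
bc = centrality(G, 'betweenness');         % shortest-path counts through each node
N = numnodes(G);
bc_norm = bc / ((N - 1) * (N - 2) / 2);    % rough normalization by the number of node pairs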

Clustering coefficient: are my friends also friends with each other?

Clustering coefficient: what fraction of your neighbors are connected to each other?
For the entire network: C = mean(C_i) = Σ_i C_i / N.
(Credit: Network Science: Graph Theory, 2012)
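A minimal Matlab sketch of the local clustering coefficient C_i and its network average, computed from a toy adjacency matrix (the matrix is an illustrative assumption).

A = [0 1 1 1; 1 0 1 0; 1 1 0 0; 1 0 0 0];   % toy undirected network, no self-loops
N = size(A, 1);
C = zeros(N, 1);
for i = 1:N
    nb = find(A(i, :));                      % neighbors of node i
    k = numel(nb);
    if k >= 2
        links = sum(sum(A(nb, nb))) / 2;     % number of edges among the neighbors
        C(i) = 2 * links / (k * (k - 1));    % fraction of neighbor pairs that are connected
    end
end
C_network = mean(C);                         % C = mean(C_i) = sum_i C_i / N for the whole network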

Why does this matter for expression data analysis?

T cell expression data
The matrix contains 47 expression samples from Lukk et al., Nature Biotechnology 2010.
All samples are normal T cells from different individuals.
Only the top 3000 genes with the largest variability were used.
Each value is the log2 of a gene's expression level in a given sample, as measured by microarray technology.

Before clustering

UPGMA hierarchical clustering, Euclidean distance

Another way to analyze expression data: co-expression networks

How to construct a co-expression network?
Start with a matrix of log2 expression levels of N genes in K samples (conditions): for our T cell data, N = 3000 and K = 47.
For each of the N(N − 1)/2 pairs of genes i and j, calculate the correlation coefficient ρ_ij = σ_ij / (σ_i σ_j) of the gene levels across the K samples.
Put a threshold, e.g. ρ_ij > 0.85, or otherwise select the most correlated pairs of genes (~4500 in our case). Now you have a weighted network.
Identify densely interconnected functional modules in this network. Modules can be used to infer unknown functions of genes via the "guilt by association" principle.
[Figure: a co-expression network with functional modules highlighted]
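A minimal Matlab sketch of these steps; the exp_t placeholder line is only there so the sketch runs on its own (in the demo above exp_t comes from expression_table.mat), and 0.85 is the threshold quoted on the slide.

exp_t = randn(200, 47);                   % illustrative placeholder for the genes x samples matrix
rho = corrcoef(exp_t');                   % N x N correlation of genes across the K samples
Adj = rho > 0.85;                         % keep only the most correlated pairs of genes
Adj(1:size(Adj, 1)+1:end) = 0;            % drop the trivial self-correlations on the diagonal
n_edges = nnz(Adj) / 2;                   % number of edges in the resulting co-expression network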

Co-expression network analysis exercise
Start Gephi and open coexpression_network_random_start.gephi.
Run Layout → Fruchterman Reingold with Speed 10.0.
Run Average Degree, Network Diameter, and Modularity in the Statistics tab in the right panel.
Color nodes by modularity class: Appearance → Nodes → Attribute → Palette icon.
Size nodes first by degree: Appearance → Nodes → Attribute → Circles icon. Nodes in large, tightly connected clusters have large degree.
Then size nodes by betweenness centrality: Appearance → Nodes → Attribute → Circles icon. These are coordinator nodes connecting different clusters to each other, and are potentially biologically interesting.

Credit: XKCD comics