Application of Hierarchical Clustering to Find Expression Modules in Cancer

Similar documents
SEEK User Manual. Introduction

EECS730: Introduction to Bioinformatics

Drug versus Disease (DrugVsDisease) package

Gene Clustering & Classification

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

/ Computational Genomics. Normalization

Exploratory data analysis for microarrays

Expander Online Documentation

Package ibbig. R topics documented: December 24, 2018

Expander 7.2 Online Documentation

Package twilight. August 3, 2013

Supplementary Materials for. A gene ontology inferred from molecular networks

MATH3880 Introduction to Statistics and DNA MATH5880 Statistics and DNA Practical Session Monday, 16 November pm BRAGG Cluster

Biclustering Algorithms for Gene Expression Analysis

10601 Machine Learning. Model and feature selection

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Microarray Analysis Classification by SVM and PAM

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results

Clustering Jacques van Helden

STEM. Short Time-series Expression Miner (v1.1) User Manual

Introduction to GE Microarray data analysis Practical Course MolBio 2012

CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr.

Supporting Information

Package RmiR. R topics documented: September 26, 2018

Clustering. Lecture 6, 1/24/03 ECS289A

ECS 234: Data Analysis: Clustering ECS 234

Package sigqc. June 13, 2018

Lecture 5. Functional Analysis with Blast2GO Enriched functions. Kegg Pathway Analysis Functional Similarities B2G-Far. FatiGO Babelomics.

Unsupervised learning: Clustering & Dimensionality reduction. Theo Knijnenburg Jorma de Ronde

Cross-validation and the Bootstrap

DI TRANSFORM. The regressive analyses. identify relationships

GS Analysis of Microarray Data

CARMAweb users guide version Johannes Rainer

Supplementary text S6 Comparison studies on simulated data

Bayesian Pathway Analysis (BPA) Tutorial

ChromHMM: automating chromatin-state discovery and characterization

Step-by-Step Guide to Advanced Genetic Analysis

ROTS: Reproducibility Optimized Test Statistic

Cross-validation and the Bootstrap

Chapter 6: Linear Model Selection and Regularization

Microarray Excel Hands-on Workshop Handout

High throughput Data Analysis 2. Cluster Analysis

Cluster Analysis for Microarray Data

mirnet Tutorial Starting with expression data

Genomics - Problem Set 2 Part 1 due Friday, 1/26/2018 by 9:00am Part 2 due Friday, 2/2/2018 by 9:00am

Evaluating Classifiers

Fast or furious? - User analysis of SF Express Inc

STATISTICS (STAT) Statistics (STAT) 1

Identifying differentially expressed genes with siggenes

Package OLIN. September 30, 2018

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

CS313 Exercise 4 Cover Page Fall 2017

Chapter 3. Requirement Based System Test Case Prioritization of New and Regression Test Cases. 3.1 Introduction

Comparison of different preprocessing techniques and feature selection algorithms in cancer datasets

Differential Expression Analysis at PATRIC

On Demand Phenotype Ranking through Subspace Clustering

Tutorial - Analysis of Microarray Data. Microarray Core E Consortium for Functional Glycomics Funded by the NIGMS

The supclust Package

Affymetrix Microarrays

Genomics - Problem Set 2 Part 1 due Friday, 1/25/2019 by 9:00am Part 2 due Friday, 2/1/2019 by 9:00am

Triclustering in Gene Expression Data Analysis: A Selected Survey

Evolutionary origins of modularity

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

Gene Set Enrichment Analysis. GSEA User Guide

GRIND Gene Regulation INference using Dual thresholding. Ir. W.P.A. Ligtenberg Eindhoven University of Technology

Gene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients

Assignment 4 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran

Clustering analysis of gene expression data

Two mode Network. PAD 637, Lab 8 Spring 2013 Yoonie Lee

Chapter 2 Basic Structure of High-Dimensional Spaces

Package pcagopromoter

VIDAEXPERT: DATA ANALYSIS Here is the Statistics button.

Analysis of Algorithms

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Package nlnet. April 8, 2018

Classification and Regression

Supplementary Material

PSS718 - Data Mining

Contents. Preface to the Second Edition

Correlation Motif Vignette

Lecture 4: Advanced Data Structures

Package RobustRankAggreg

6-1 THE STANDARD NORMAL DISTRIBUTION

1. Select[] is used to select items meeting a specified criterion.

BCLUST -- A program to assess reliability of gene clusters from expression data by using consensus tree and bootstrap resampling method

Midterm Examination CS 540-2: Introduction to Artificial Intelligence

GenViewer Tutorial / Manual

Package ctc. R topics documented: August 2, Version Date Depends amap. Title Cluster and Tree Conversion.

Package subtype. February 20, 2015

Package OrderedList. December 31, 2017

The clusterrepro Package

Objective of clustering

10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski

Visual Representations for Machine Learning

Evaluating Classifiers

Screening Design Selection

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

Bayesian network: a toy example

More about liquid association

Class Prediction Methods Applied to Microarray Data for Classification

Transcription:

Application of Hierarchical Clustering to Find Expression Modules in Cancer T. M. Murali August 18, 2008

Innovative Application of Hierarchical Clustering A module map showing conditional activity of expression modules in cancer, Eran Segal, Nir Friedman, Daphne Koller and Aviv Regev, Nature Genetics 36, 1090 1098, 2004 Analyse gene expression data to find groups of genes expressed in concert between different cancers. Use hierarchical clustering innovatively.

Goals Move away from standard approach: determine genes that respond (based on cut-off) and study these genes further. Develop method that can analyse large numbers (1000s) of samples across multiple conditions.

Goals Move away from standard approach: determine genes that respond (based on cut-off) and study these genes further. Develop method that can analyse large numbers (1000s) of samples across multiple conditions. Patterns of co-expression across all conditions are not very interesting.

Goals Move away from standard approach: determine genes that respond (based on cut-off) and study these genes further. Develop method that can analyse large numbers (1000s) of samples across multiple conditions. Patterns of co-expression across all conditions are not very interesting. Compute gene modules: groups of genes that show concerted behaviour across multiple conditions. Specifically, Segal et al. associate with each gene module, a set of samples in which the module is up-regulated and a set of samples in which the module is down-regulated.

Key Steps

Key Steps Group genes into predefined gene sets, e.g., groups of genes with the same functional annotation. Convert gene-by-array matrix into gene-set-by-array matrix. Hierarchically cluster gene sets in this matrix. Identify interesting gene set clusters (nodes) in the tree. In each gene set cluster, remove genes not expressed consistently with the cluster.

Gene Expression Data Sets

Data Normalisation

Data Normalisation Needed because some arrays measure absolute value of gene expression and others measure relative values. Affymetrix microarrays: take logarithm to the base-2 and zero transform within data set. cdna microarrays:

Data Normalisation Needed because some arrays measure absolute value of gene expression and others measure relative values. Affymetrix microarrays: take logarithm to the base-2 and zero transform within data set. cdna microarrays: zero transform within data set.

Pre-defined Genes Sets

Computing Gene-Set-By-Array Matrix Goal is to construct a gene-set-by-array matrix. For each gene set-array pair, find an average expression value of that gene set in that array.

Computing Gene-Set-By-Array Matrix Goal is to construct a gene-set-by-array matrix. For each gene set-array pair, find an average expression value of that gene set in that array. A gene is induced (respectively, repressed in an array if its change in expression is 2 (respectively, 2). For each gene set-array pair, compute the fraction of genes induced or repressed. Use these values in the gene-set-by-array matrix.

Computing Significant Entries in the Gene-Set-By-Array Matrix Many entries in the gene-set-by-array matrix may not be statistically significant.

Computing Significant Entries in the Gene-Set-By-Array Matrix Many entries in the gene-set-by-array matrix may not be statistically significant. For a given array, fraction of induced genes in a gene set may be close to the fraction of induced genes in the array.

Computing Significant Entries in the Gene-Set-By-Array Matrix Many entries in the gene-set-by-array matrix may not be statistically significant. For a given array, fraction of induced genes in a gene set may be close to the fraction of induced genes in the array. Statistical test: for a given array, is the fraction of induced genes in a gene set much larger than the fraction of induced genes in the entire array?

Computing Significant Entries in the Gene-Set-By-Array Matrix Many entries in the gene-set-by-array matrix may not be statistically significant. For a given array, fraction of induced genes in a gene set may be close to the fraction of induced genes in the array. Statistical test: for a given array, is the fraction of induced genes in a gene set much larger than the fraction of induced genes in the entire array? Compute the p-value (statistical significance) of the fraction. Exercise. Do so for every gene-set-array pair. Use false discovery rate correction to account for multiple hypotheses testing. Replace insignificant entries by 0.

Hierarchical Clustering Start from a gene-set-by-array matrix containing fraction of induced/repressed genes. Fraction is negative if repressed. Apply bottom-up hierarchical clustering. Vector at internal node is

Hierarchical Clustering Start from a gene-set-by-array matrix containing fraction of induced/repressed genes. Fraction is negative if repressed. Apply bottom-up hierarchical clustering. Vector at internal node is average of vectors at descendant leaves.

Hierarchical Clustering Start from a gene-set-by-array matrix containing fraction of induced/repressed genes. Fraction is negative if repressed. Apply bottom-up hierarchical clustering. Vector at internal node is average of vectors at descendant leaves. Which nodes do we select as clusters in the tree?

Hierarchical Clustering Start from a gene-set-by-array matrix containing fraction of induced/repressed genes. Fraction is negative if repressed. Apply bottom-up hierarchical clustering. Vector at internal node is average of vectors at descendant leaves. Which nodes do we select as clusters in the tree? Associate each interior node with Pearson correlation between the two children. Cluster node whose Pearson correlation differs by more than 0.05 from the Pearson correlation of its parent.

Turning Clusters into Modules Each cluster is a gene set and a set of arrays.

Turning Clusters into Modules Each cluster is a gene set and a set of arrays. The gene set in a cluster is the union of descendant gene sets (leaves). The arrays in a cluster are only those that are induced or repressed in the cluster. Module Cluster minus genes whose expression is not consistent with the rest of the cluster.

Testing Consistency of a Gene with a Gene Set Let g be the gene and G be the gene set.

Testing Consistency of a Gene with a Gene Set Let g be the gene and G be the gene set. Let I (respectively, R) be the set of arrays in which G is significantly induced (respectively, repressed). For an array a in I (or R), let p a be the fraction of genes that are induced (or repressed) by two-fold or more in a.

Testing Consistency of a Gene with a Gene Set Let g be the gene and G be the gene set. Let I (respectively, R) be the set of arrays in which G is significantly induced (respectively, repressed). For an array a in I (or R), let p a be the fraction of genes that are induced (or repressed) by two-fold or more in a. Measure extent to which g s expression changed by more (or less) than two-fold in the arrays in I (or R):

Testing Consistency of a Gene with a Gene Set Let g be the gene and G be the gene set. Let I (respectively, R) be the set of arrays in which G is significantly induced (respectively, repressed). For an array a in I (or R), let p a be the fraction of genes that are induced (or repressed) by two-fold or more in a. Measure extent to which g s expression changed by more (or less) than two-fold in the arrays in I (or R): Score(g) = log(p a ) + log(p a ) a I g is induced in a a R g is repressed in a

Testing Consistency of a Gene with a Gene Set Let g be the gene and G be the gene set. Let I (respectively, R) be the set of arrays in which G is significantly induced (respectively, repressed). For an array a in I (or R), let p a be the fraction of genes that are induced (or repressed) by two-fold or more in a. Measure extent to which g s expression changed by more (or less) than two-fold in the arrays in I (or R): Score(g) = log(p a ) + log(p a ) a I g is induced in a a R g is repressed in a An array contributes to the score only if g is consistent with G in the array. Larger contribution from arrays with fewer induced genes. Compute statistical significance of this score.

Computing Statistical Significance of Score(g) Score(g) = log(p a ) + log(p a ) a I g is induced in a a R g is repressed in a

Computing Statistical Significance of Score(g) Score(g) = log(p a ) + log(p a ) a I g is induced in a a R g is repressed in a Null hypothesis: genes in each array are randomly permuted, i.e., the p a induced genes in an array a I are chosen randomly.

Computing Statistical Significance of Score(g) Score(g) = log(p a ) + log(p a ) a I g is induced in a a R g is repressed in a Null hypothesis: genes in each array are randomly permuted, i.e., the p a induced genes in an array a I are chosen randomly. Each element in Score(g) is an independent binary random variable. Random variable takes the value log(p a ) with probability p a and the value 0 with the probability 1 p a.

Computing Statistical Significance of Score(g) Score(g) = log(p a ) + log(p a ) a I g is induced in a a R g is repressed in a Null hypothesis: genes in each array are randomly permuted, i.e., the p a induced genes in an array a I are chosen randomly. Each element in Score(g) is an independent binary random variable. Random variable takes the value log(p a ) with probability p a and the value 0 with the probability 1 p a. Under the null hypothesis, Score(g) has mean a I R p a log p a and variance a I R p a(1 p a ) log 2 p a.

Computing Statistical Significance of Score(g) Score(g) = log(p a ) + log(p a ) a I g is induced in a a R g is repressed in a Null hypothesis: genes in each array are randomly permuted, i.e., the p a induced genes in an array a I are chosen randomly. Each element in Score(g) is an independent binary random variable. Random variable takes the value log(p a ) with probability p a and the value 0 with the probability 1 p a. Under the null hypothesis, Score(g) has mean a I R p a log p a and variance a I R p a(1 p a ) log 2 p a. Suppose we observe a score of t. What is the probability of achieving a score of t or higher under the null hypothesis? Exercise.

Further Analysis Statistical significance of computed modules using leave-one-out cross validation (read supplement). Compute enrichment of clinical annotations of the arrays in a module. Visualisation of modules. Literature-based analysis of modules

Bone Osteoblastic Module

Conclusions Used pre-defined gene sets to drive hierarchical clustering algorithm. Remove genes from a cluster of gene sets if the gene s expression profile deviates from the cluster. Automatically decide which arrays are part of a module. Natural segue into lectures on biclustering where we will automatically decide which arrays and which genes to include in a bicluster.

Software Exercise 1. Register for and download Genomica. 2. Use Genomica to compute a module map for the sample data set. 3. Download human Entrez Gene gene sets and gene expression data. 4. Run Genomica on these data sets. 5. Change parameters and run Genomica again. Are the results different?

Computational Exercises 1. In case of d min, show that the hierarchical clustering algorithm returns the minimum spanning tree. 2. How can we measure the useful biological knowledge that a cluster contains? 3. Given an array, the set of genes induced in that array, and a specific gene set, devise a statistical test to determine if the number of induced genes in the gene set is (much) larger than the number of induced genes in the entire array. 4. Under the null hypothesis, Score(g) has mean a I R p a log p a and variance a I R p a(1 p a ) log 2 p a. Suppose we observe a score of t. What is the probability of achieving a score of t or higher under the null hypothesis? 5. How will you modify Genomica to accept a new dataset without performing all computations from scratch?