An Efficient Optimal Leaf Ordering for Hierarchical Clustering in Microarray Gene Expression Data Analysis

Similar documents
Genomics - Problem Set 2 Part 1 due Friday, 1/26/2018 by 9:00am Part 2 due Friday, 2/2/2018 by 9:00am

Supervised Clustering of Yeast Gene Expression Data

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results

Clustering Jacques van Helden

Clustering. k-mean clustering. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Clustering, cont. Genome 373 Genomic Informatics Elhanan Borenstein. Some slides adapted from Jacques van Helden

Genomics - Problem Set 2 Part 1 due Friday, 1/25/2019 by 9:00am Part 2 due Friday, 2/1/2019 by 9:00am

Genetic Algorithm for Circuit Partitioning

Problem Definition. Clustering nonlinearly separable data:

Dimension reduction : PCA and Clustering

Clustering k-mean clustering

Microarray gene expression data association rules mining based on BSC-tree and FIS-tree

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Lesson 3. Prof. Enza Messina

Missing Data Estimation in Microarrays Using Multi-Organism Approach

Motivation for B-Trees

Gene expression & Clustering (Chapter 10)

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

Branch-and-bound: an example

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Business Club. Decision Trees

Data Preprocessing. Slides by: Shree Jaswal

Contents. Preface to the Second Edition

Triclustering in Gene Expression Data Analysis: A Selected Survey

Algorithms for Bioinformatics

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

Spatial and Temporal Aware, Trajectory Mobility Profile Based Location Management for Mobile Computing

Cluster analysis. Agnieszka Nowak - Brzezinska

Image Analysis - Lecture 5

A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS

Evaluation and comparison of gene clustering methods in microarray analysis

Hierarchical Clustering

Geometric data structures:

10701 Machine Learning. Clustering

CS229 Lecture notes. Raphael John Lamarre Townshend

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)

Genetic Algorithm for Dynamic Capacitated Minimum Spanning Tree

Gene Clustering & Classification

Spectral Graph Multisection Through Orthogonality. Huanyang Zheng and Jie Wu CIS Department, Temple University

Exact Combinatorial Algorithms for Graph Bisection

CLUSTERING IN BIOINFORMATICS

Study and Implementation of CHAMELEON algorithm for Gene Clustering

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

8/19/13. Computational problems. Introduction to Algorithm

amiri advanced databases '05

Dynamic Programming Course: A structure based flexible search method for motifs in RNA. By: Veksler, I., Ziv-Ukelson, M., Barash, D.

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH

Algorithms for Bounded-Error Correlation of High Dimensional Data in Microarray Experiments

Classification and Regression

Clustering. Lecture 6, 1/24/03 ECS289A

Data Structures and Algorithms

Clustering Techniques

Supervised vs unsupervised clustering

BCLUST -- A program to assess reliability of gene clusters from expression data by using consensus tree and bootstrap resampling method

2-3 Tree. Outline B-TREE. catch(...){ printf( "Assignment::SolveProblem() AAAA!"); } ADD SLIDES ON DISJOINT SETS

Gene selection through Switched Neural Networks

CS570: Introduction to Data Mining

ECG782: Multidimensional Digital Signal Processing

Chapter 8: Data Abstractions

Clustering part II 1

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

Building Intelligent Learning Database Systems

Cluster Analysis. Ying Shen, SSE, Tongji University

5/15/16. Computational Methods for Data Analysis. Massimo Poesio UNSUPERVISED LEARNING. Clustering. Unsupervised learning introduction

Data and File Structures Laboratory

Computational Geometry

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Well labeled paths and the volume of a polytope

DYNAMIC CORE BASED CLUSTERING OF GENE EXPRESSION DATA. Received April 2013; revised October 2013

Fast Methods with Sieve

GPU Accelerated PK-means Algorithm for Gene Clustering

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks

Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data

Efficient Generation of Linear Secret Sharing. Scheme Matrices from Threshold Access Trees

Clustering CS 550: Machine Learning

Structural and Syntactic Pattern Recognition

Lecture 4 Hierarchical clustering

Double Self-Organizing Maps to Cluster Gene Expression Data

Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves

Distance-based Methods: Drawbacks

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS301 - Data Structures Glossary By

ECS 234: Data Analysis: Clustering ECS 234

System of Systems Architecture Generation and Evaluation using Evolutionary Algorithms

Automatic Database Clustering Using Data Mining

Chapter DM:II. II. Cluster Analysis

Genetic Algorithm for Dynamic Capacitated Minimum Spanning Tree

Dynamic Spatial Partitioning for Real-Time Visibility Determination. Joshua Shagam Computer Science

Clustering Lecture 3: Hierarchical Methods

Introduction to Bioinformatics AS Laboratory Assignment 2

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree.

Generic Topology Mapping Strategies for Large-scale Parallel Architectures

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays.

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Rapid Detection of RowHammer Attacks using Dynamic Skewed Hash Tree

CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR)

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners

These notes present some properties of chordal graphs, a set of undirected graphs that are important for undirected graphical models.

Transcription:

An Efficient Optimal Leaf Ordering for Hierarchical Clustering in Microarray Gene Expression Data Analysis Jianting Zhang Le Gruenwald School of Computer Science The University of Oklahoma Norman, Oklahoma, 73019, USA {jianting, ggruenwald}@ou.edu 1

Introduction Related Work Outline The Proposed Ordering Method The Objective Function The Ordering Algorithm An Example Experiment Conclusions and Future Work Directions 2

Introduction DNA microarray technologies in recent years produce tremendous amount of gene expression data Clustering is one of the most popular methods to extract patterns hidden in the data and to understand functions of genomes Hierarchical clustering provides users an initial impression of the distribution of data and and stimulate thorough inspection of the whole data set 3

Introduction Hierarchical clustering trees are usually displayed with their leaves in linear order Since genes that are adjacent in a linear ordering are often hypothesized to be related in some manner, the ordering of the tree leaves is important for human perception. The clusters in a hierarchical clustering tree are unordered and there are 2 n-1 possible orderings for a balanced binary clustering tree with n genes. 4

Introduction Thus, additional criteria are needed to generate an ordering and algorithms are needed to generate an optimal ordering based on the criteria An example: Assume Gene 1 is more similar to Gene 4: 1 2 3 4 1 2 4 3 5

Introduction Motivation: Provide Efficient and Optimal Leaf Ordering method for Hierarchical Clustering In Microarray Gene Expression Data Analysis 6

Related Work Heuristic Local Ordering Methods: (Alizadeh, et al, 2000): Proposed simple methods of weighting genes, such as average expression level, time of maximal induction, or chromosomal position. It places the element with the lower average weight earlier in the final ordering (Alon, et al, 1999): each pair of sibling branches is ordered according to the proximity of their centroids to the centroid of their parent s sibling 7

Related Work Ordering Based Global Optimizations (Bar-Joseph, et al, 2001): For an ordering π of tree T, the objective function is defined as D π ( T) = n 1 i= 1 S( π, π i i+ 1) S is the gene similarity matrix The proposed ordering algorithm to maximize has O(n 4 ) time complexity and O(n 2 ) space complexity 8

Related Work Ordering Based Global Optimizations (Ding, 2002): New objective function: D π ( T ) = n 1 ( l 2 * n l l= 1 i= 1 S( π, π They argue that the objective function used in (Bar-Joseph, et al, 2001) ignores the similarity between large distance genes Provide an approximate algorithm with O(n 2 ) complexity to minimize their objective function i i+ l )) 9

Related Work U V W X Y 1 2 * [S(U,V)+S(V,W)+S(W,X)+S(X,Y)] 2 2 * [S(U,W)+S(V,X)+S(W,Y)] 3 2 * [S(U,X)+S(V,Y)] 4 2 * [S(U,Y)] 10

The Proposed Ordering Method Our Objective Function Linear, instead of quadratic as defined in (Ding, 2002) Can be rewritten as: D ( T ) Both of them summarize the weighted distances (weighted by similarity) of all possible edges in a similarity graph under ordering π. π i denotes the node (gene) at position i while π(i) denotes the position of the node (gene) i. π D π = n 1 ( T ) = S ( i, 1<= i< j<= n ( l * n l l= 1 i= 1 S( π, π i i+ l j)* π ( i) π ( j) )) 11

The Proposed Ordering Method D π ( T ) = S( i, i, j j)* π ( i) π ( j) Graph Linear Arrangement Problem! Ordering leaf nodes of a tree, not nodes of a regular graph (Bar-Yehuda, 2001): proposed to approximately solve the regular graph MinLA problem by imposing a global constraint through a Binary Decomposition Tree (BDT) 12

The Proposed Ordering Method We propose to use a binary clustering tree as the BDT and use the algorithm to solve the hierarchical binary clustering tree leaf ordering problem Since (Bar-Yehuda, 2001) can examine the 2 n-1 orderings in O(n 2 ) time, it will produce the optimal ordering for leaf ordering in O(n 2 ) time complexity Although the algorithm is approximate for node ordering in a regular graph, it is exact for our leaf ordering in a hierarchical clustering tree for genes 13

The Proposed Ordering Method 1-Orientation + R L 0-Orientation 10 8 7 9-5 11 4 6 1 3 2 0 L R Number of Possible Orderings 2 n-1 if a BDT is full and balanced Orientation tree (or_tree): The orientations at each intermediate node of the BDT form a tree that has the same structure as the BDT 14

The Proposed Ordering Method Recursively test the two possible orientations of a BDT and choose the better one. Use position implicitly in computing the cost which is very efficient. cost 0 =cost(left(0))+cost(right(0)) + V(t 2 ).right_cut(or_tree(t1)) + V(t 1 ).left_cut(or_tree(t2)) cost 1 =cost(left(1))+cost(right(1)) + V(t 1 ).right_cut(or_tree(t2)) + V(t 2 ).left_cut(or_tree(t1)) (1) 15

The Proposed Ordering Method left_cut(or_tree(t))=left_cut(left(or_tree(t)) +left_cut(right(or_tree(t)) - in_cut(t) right_cut(or_tree(t))=right_cut(left(or_tree(t)) +right_cut(right(or_tree(t)) -in_cut(t) (2) In_cut: summation of the weights of edges the beginning and ending nodes of which are within sub-tree t Left_cut: summation of the weights of edges the beginning node of which is not within sub-tree t while the ending node of which is within sub-tree t Right_cut: summation of the weights of edges the beginning node of which is within sub-tree t while the ending node of which is not within sub-tree t 16

The Proposed Ordering Method Algorithm of (Bar-Yehuda, 2001) 1. Pre-compute in_cuts for all the non-leaf nodes in the BDT 2. Set tree t to the root of the BDT. Do the following recursively 3. If t is an intermediate node of the BDT: a) Compute the costs and the left_cuts and the right_cuts under both the 0-orientation and the 1-orientation of its two sub-trees t 1 and t 2 according to formula (1) and formula (2) respectively b) Set the orientation of t to the one that has less cost as the winner orientation 4. If t is a leaf node of the BDT, the left_cut is the summation of the weights of the edges with the node as the ending node. The right_cut is the summation of the weights of the edges with the node as the beginning node. The cost of the node is the same as its left_cut. 17

The Proposed Ordering Method An Example Similarity Graph Clustering Tree (BDT) In_cuts rooted at T 0 In_cuts rooted at T 11 In_cuts rooted at T 12 18

The Proposed Ordering Method 0: left_cut=0.5, right_cut=0.60 1: left_cut=0.30+0.85+0.60=1.75, right_cut=0 T 11 : left_cut=0.50+1.75-0.60=1.65 Right_cut=0.60+0-0.60=0 Cost=0.50+1.75+1*(1.75-0.60)+1*(0.60-0.60)=3.4 19

The Proposed Ordering Method Cost=2.75 Winner T 12: 0-orientation: 1.6 (Winner) 1-Orientation: 1.65 Left_cut=0, right_cut=1.65 T 0 : 1-orientation 2.75+1.6+2*(1.65-1.65)+2*(1.65-1.65)=4.35 ([2,3],[1,0]) 20

The Proposed Ordering Method The total cost for T 0 under 0-orientation happens to be 4.35. By convention, we choose 0-orientation if the costs of both orientations are the same. The optimized ordering is ([0,1], [3,2]) For the optimized ordering ([0,1], [3,2]), cost=4.35, Best among the 4!=24 possible orderings The initial ordering is ([1,0], [2,3]), cost=5.05 21

Experiment Data Set Cell-regulated genes of the yeast saccharomyces cerevisia (http://www-2.cs.cmu.edu/~zivbj/) The 800 genes in it have been assigned to five classes termed G1, S, S/G2, G2/M and M/G1 based on domain knowledge in (Spellman, etc, 1998) and they can be used to visually examine the quality of an ordering Also used in (Bar-Joseph, et al, 2001) to demonstrate the effectiveness of its leaf ordering technique 22

Preprocessing: Experiment Remove genes with exceptional missing values and to compute the gene similarity matrix. The final number of genes in our study is 765 Only consider the gene pairs whose similarities are above a certain threshold (avg+1.5*std_dev): results in 38714 pairs of gene similarity and the average out-degree of the similarity graph is about 50 Use a graph-partition package called Metis to generate a clustering tree by recursive graph bisection and use it as our BDT 23

Experiment Computation Environment Dell Dimension 4100 866HZ personal computer with 512M memory running Windows 2000 professional It takes about 27 seconds to perform ordering optimization Performance Metric: Normalized index I = D i, j π ( T) S( i, j) 24

Experiment Results: Original Ordering (81.11), Optimized Ordering (69.69), The average of 1000 times random orderings (255.34) The optimized ordering is about 16.2% less than the original ordering. The average of the 1000 times random orderings is 3.14 times of the original ordering and 3.66 times of the optimized ordering Both the graph bisection based clustering method and the proposed leaf ordering method are effective 25

Experiments visually examination of 22 genes The optimized ordering is more compact. For example, G1 scatters in 6 segments in the original ordering while it is in 3 segments in the optimized ordering. 26

Conclusions Leaf ordering of hierarchical clusters is significant to presenting gene expression data. Previously proposed methods are either non-efficient or approximate in nature The proposed method is efficient in the sense that the time complexity is O(n 2 ) when the clustering tree is balanced. The proposed method optimal in the sense that it examines all possible 2 n-1 orderings without relying on approximation Preliminary Experiment show good results 27

Future Work Apply this graph theoretical leaf ordering optimization method to more gene expression data using clustering trees from a variety of hierarchical clustering methods as the BDTs for comparison purposes. Develop novel approaches to visually presenting the original and optimized cluster ordering to domain experts and examine the validity of our optimization objective function 28

Thanks! Questions? 29