Study and Implementation of CHAMELEON algorithm for Gene Clustering


Saurav Sahay

1. Motivation

The vast amount of genomic data gathered from microarray and other experiments makes it extremely difficult for researchers to interpret the data and draw conclusions about the functions of the discovered genes. Extensive biomedical literature databases can be mined to find the functional similarity between genes using various clustering techniques, yet very few information retrieval systems are tailored to this purpose.

2. Introduction

Clustering is an unsupervised learning and discovery process that organizes a set of objects into meaningful groups. These groups can be of any shape and size that captures the most natural form of association in the data. CHAMELEON is a clustering algorithm that combines graph partitioning and dynamic modeling in hierarchical clustering. It applies to all types of data for which a similarity function can be constructed. Two clusters are merged only if the inter-connectivity and closeness between them are high relative to the internal inter-connectivity and closeness of the items within each cluster. This criterion facilitates the discovery of natural and homogeneous clusters.

Gene clustering over biomedical literature proceeds as follows: genes are given as input to the system; the system searches online biomedical literature abstracts that contain information about these genes; it mines the abstracts to retrieve useful keywords that describe the functions of these genes; it performs statistical analysis on these keywords to determine their relevance; and it finally clusters the genes based on the functional keyword associations. The input to the clustering system is the (keyword x gene) matrix or the (gene x gene) matrix. Based on the clustering results, the genes can be classified into different functional relationships.
3. Description

CHAMELEON first uses a graph partitioning algorithm to divide the sparse graph of data objects into a large number of relatively small sub-clusters. It then uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters according to connectivity and closeness measures. The algorithm was motivated by observed weaknesses of two popular hierarchical clustering algorithms, CURE and ROCK. CURE and related schemes ignore information about the aggregate inter-connectivity of objects in two different clusters; they measure the similarity between two clusters by the similarity of the closest pair of representative points belonging to the different clusters. ROCK and related schemes ignore information about the closeness of two clusters while emphasizing their inter-connectivity; they consider only the aggregate inter-connectivity across pairs of clusters and ignore the value of the stronger edges across clusters.

[Figure: Overall framework of the CHAMELEON algorithm [1]]

CHAMELEON uses a k-nearest-neighbor graph to represent its objects. This graph captures the concept of neighborhood dynamically and results in more natural clusters: the neighborhood is defined narrowly in a dense region and more widely in a sparse region. CHAMELEON determines the similarity between each pair of clusters C_i and C_j from their relative inter-connectivity RI(C_i, C_j) and their relative closeness RC(C_i, C_j).

The relative inter-connectivity RI(C_i, C_j) between two clusters C_i and C_j is the absolute inter-connectivity between C_i and C_j, normalized with respect to the internal inter-connectivity of the two clusters:

    RI(C_i, C_j) = \frac{|EC_{\{C_i, C_j\}}|}{\frac{1}{2}\left(|EC_{C_i}| + |EC_{C_j}|\right)}

The edge-cut EC_{\{C_i, C_j\}} (from graph theory) is the sum of the weights of the edges that connect vertices in C_i to vertices in C_j. The min-cut bisector EC_{C_i} is the weighted sum of the edges that partition the cluster into two roughly equal parts.

The relative closeness RC(C_i, C_j) between a pair of clusters C_i and C_j is the absolute closeness between C_i and C_j, normalized with respect to the internal closeness of the two clusters:

    RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}

Here \bar{S}_{EC_{C_i}} and \bar{S}_{EC_{C_j}} are the average weights of the edges that belong to the min-cut bisectors of clusters C_i and C_j, respectively, and \bar{S}_{EC_{\{C_i, C_j\}}} is the average weight of the edges that connect vertices in C_i to vertices in C_j.

CHAMELEON's hierarchical clustering algorithm selects for merging the pair of clusters for which both RI(C_i, C_j) and RC(C_i, C_j) are high.

4. Methodology

The algorithm can be divided into three parts. The first part builds the k-nearest-neighbor (k-NN) graph from the similarity matrix; the second part partitions the graph to find initial sub-clusters; and the third part merges the partitions using the RI and RC values.

4.1. Input

The input to the first program was a Gene x Gene matrix file for 26 genes containing affinity values [2] (dot products of keyword relevance using z-score values) between the genes. This matrix was read by a C program that computed the k-NN graph; the value of k was passed as a parameter to the program.

4.2. Converting the matrix into a k-NN Graph

After the matrix was read, the individual genes were treated as vertices and the affinity values as weighted edges between them. The program kept, for each gene, the k edges with the highest affinity, producing a graph connecting each of the 26 genes to its k nearest neighbors.

4.3. Finding initial sub-clusters

CHAMELEON uses a graph partitioning algorithm to split the k-nearest-neighbor graph of the data set into a large number of partitions such that the edge-cut is minimized. The original algorithm used hMETIS, a software package for partitioning large hypergraphs, especially those arising in circuit design [3]. These algorithms are fine-tuned for very large graphs, and the 26-gene input was not suitable for hypergraph partitioning with hMETIS. Instead, the METIS package [4], which efficiently partitions irregular graphs using the same edge-cut minimization procedure, was used to partition the k-NN gene graph. The input to METIS was a formatted graph file containing the k-NN edges and their weights.
Since the program computes a multilevel k-way partitioning of the graph, a value of k was provided as input. This k (distinct from the k of the k-NN graph) denotes the initial number of sub-clusters; it should be larger than the expected number of natural clusters, which are later recovered by the merging part of the algorithm. The output of the program was as follows:

**********************************************************************
METIS 4.0.1 Copyright 1998, Regents of the University of Minnesota

Graph Information ---------------------------------------------------
Name: metis1.txt, #Vertices: 26, #Edges: 39, #Parts: 6

K-way Partitioning... -----------------------------------------------
6-way Edge-Cut: 535, Balance: 1.62

Timing Information --------------------------------------------------

I/O: 0.000   Partitioning: 0.000 (KMETIS time)   Total: 0.000
**********************************************************************

4.4. Finding Relative Interconnectivity and Relative Closeness

To find the RI and RC values between the different clusters, the METIS program was used to obtain the edge-cuts between clusters efficiently. The min-cut bisector of a cluster was approximated by adding up the weights of all the edges internal to that cluster. This was done under the assumption that the vertices of the cluster were all tightly connected to each other, so that all of its edges would need to be cut to partition the cluster into two roughly equal parts.

4.5. Hierarchical Merging of the clusters

Clusters are merged hierarchically if their RI and RC values are above user-specified thresholds. These thresholds control the granularity of the clusters and as such have no absolute values; good thresholds for a particular task are found by experiment. For our small gene set, the RI threshold was chosen to be 0.1 and the RC threshold 0.3.

5. Results

The CHAMELEON algorithm correctly placed all 26 genes in the right clusters on the second pass of the merging phase for particular threshold values. With a different set of thresholds, it merged two clusters incorrectly. This difference could be detected because we knew which sets of genes were functionally related.

The initial sub-clusters (the numbers are gene indices):

Cluster 1: 1 2 3 4 9
Cluster 2: 10 11 12 13 14 15 16
Cluster 3: 19 20 21
Cluster 4: 5 22 23 26
Cluster 5: 6 17 18
Cluster 6: 7 8 24 25

The non-zero RI and RC values for these sub-clusters (all other pairs had RI and RC values of 0):

RI(1,2) = 0.0093   RC(1,2) = 0.1474
RI(1,6) = 0.0141   RC(1,6) = 0.1683
RI(2,3) = 0.1234   RC(2,3) = 0.36
RI(4,6) = 0.1649   RC(4,6) = 0.825

RI(5,6) = 0.1913   RC(5,6) = 0.49

Based on the RI threshold of 0.1, clusters 2 and 3, and clusters 5 and 6, were merged, correctly identifying 2 of the 4 clusters of the 26 genes.

6. Conclusion and Future Work

The outcome of this short experiment shows promising results for applying the CHAMELEON algorithm to keyword-based clustering of large numbers of genes. The k-NN graph models the data very naturally, and the well-studied cut-based graph algorithms are well suited to partitioning it. Dynamic modeling is conceptually a sound method for any type of clustering: it inherently works to optimize the underlying clustering principle of minimizing inter-cluster similarity and maximizing intra-cluster similarity.

It was a cumbersome task to manually prepare so many formatted text files as input to the METIS graph partitioning system. A possibility for future work is to write one's own graph partitioning routines and integrate them with the other routines, so that a single matrix input file would yield the desired hierarchical clusters. Further experiments are also needed to find good threshold values for merging the gene sub-clusters.

7. References

[1] George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar. CHAMELEON: Hierarchical Clustering Using Dynamic Modeling.
[2] Ying Liu, Alex Pivoshenko, Jorge Civera, Venu Dasigi, Ashwin Ram, Brian Ciliax, Ray Dingledine, and Shamkant B. Navathe. Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships: A Comparative Study of Algorithms.
[3] George Karypis and Vipin Kumar. hMETIS: A Hypergraph Partitioning Package.
[4] George Karypis and Vipin Kumar. METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices.