Biclustering Bioinformatics Data Sets. A Possibilistic Approach


1 Biclustering Bioinformatics Data Sets: A Possibilistic Approach. Dept. of Computer and Information Sciences, University of Genova, Italy. EMFCSC, Erice, 20/4/2007.

2 Outline
1. Introduction
2. [...]
3. Possibilistic algorithm
4. [...]

3 BIOINFORMATICS DATA SETS: Data representation
Nowadays, in the post-genomic era, many bioinformatics data sets are available (most of them released in the public domain on the Internet). The information embedded in most of them has not yet been fully exploited, owing to the lack of accurate machine learning tools and/or their limited diffusion in the bioinformatics community.

4 Most bioinformatics data sets come from DNA microarray experiments and are normally given as a rectangular $m \times n$ matrix $X$, where each row represents a feature (e.g., a gene) and each column represents a data sample or condition (e.g., a patient):
$X = (x_{ij})_{m \times n}$ (1)
where the value $x_{ij}$ is the expression level of the i-th gene in the j-th condition. The analysis of microarray data sets can give valuable information on the biological relevance of genes and on the correlations between them [Madei, 2004].

5 BIOINFORMATICS DATA SETS: Major Machine Learning tasks
Clustering (unsupervised): given a set of samples, partition them into groups containing similar samples according to some similarity criterion (CLASS DISCOVERY).
Classification (supervised): find the classes of the test data set using the known classification of the training data set (CLASS PREDICTION).
Feature Selection (dimensionality reduction): select a subset of features responsible for creating the condition corresponding to the class (GENE SELECTION, BIOMARKER SELECTION).
Outlier Detection: detect data samples that are not good representatives of any of the classes, and disregard them while performing data analysis.


9 BIOINFORMATICS DATA SETS: Major challenges in Machine Learning
Noise in the data complicates the solution of machine learning tasks (robustness to noise is required).
The high dimensionality of the data makes a complete search computationally infeasible in most data mining problems (curse of dimensionality).
Some data values may be inaccurate or missing.
As a consequence, the available data may not be sufficient to obtain statistically significant conclusions.


13 The problem we focus on today: how can we identify genes with similar behavior with respect to different conditions? This is an instance of the biclustering problem (also known as co-clustering, two-way clustering, ...) [Cheng & Church, 2000; Hartigan, 1972; Kung et al., 2005; Turner et al., 2005].


15 BICLUSTERING is a methodology that clusters the feature set and the data points simultaneously. It finds clusters of samples possessing similar characteristics, together with the features that create these similarities. It answers the question: which characteristics make similar objects similar to one another?


18 BICLUSTERING: Surveys
S. Madeira, A.L. Oliveira, Biclustering Algorithms for Biological Data Analysis: A Survey.
A. Tanay, R. Sharan, R. Shamir, Biclustering Algorithms: A Survey.
D. Jiang, C. Tang, A. Zhang, Cluster Analysis for Gene Expression Data: A Survey.

19 BICLUSTERING: Applications
Biological and medical:
Microarray data analysis
Analysis of drug activity [Liu & Wang, 2003]
Analysis of nutritional data [Lazzeroni et al., 2000]

20 BICLUSTERING: Applications
Text mining [Dhillon, 2001, 2003]
Marketing [Gaul & Schader, 1996]
Others: electoral data [Hartigan, 1972], currency exchange [Lazzeroni et al., 2000]
Dimensionality reduction in databases [Agrawal et al., 1998]

21 BICLUSTERING: State of the art
Cheng & Church algorithm [2000]: the algorithm constructs one bicluster at a time using a statistical criterion, a low mean squared residue (the variance of the set of all elements in the bicluster, plus the mean row variance and the mean column variance). Once a bicluster has been created, its entries are replaced by random numbers, and the procedure is repeated iteratively.
Drawback: the masking procedure causes random interference, which affects the subsequent discovery of large biclusters [Yang et al., 2003].


23 BICLUSTERING: State of the art
Direct Clustering [Hartigan, 1972]
Flexible Overlapped Clusters (FLOC) [Yang et al., 2003] (probabilistic algorithm)
Bipartite graphs [Tanay et al., 2002]
Genetic algorithms [Mitra et al., 2006]
Simulated Annealing [Bryan et al., 2005]

24 POSSIBILISTIC BICLUSTERING
Joint work:
Maurizio Filippone, Francesco Masulli, Stefano Rovetta (DISI, Dept. of Computer and Information Science, University of Genova, Italy)
Sushmita Mitra, Haider Banka (Indian Statistical Institute, Kolkata, India)

25 POSSIBILISTIC BICLUSTERING
We propose a new approach to the biclustering problem based on the possibilistic clustering paradigm [Krishnapuram & Keller, 1993]. The Possibilistic Biclustering (PBC) algorithm finds one bicluster at a time, assigning to each element of the data matrix a membership to the bicluster. The membership model is of the fuzzy possibilistic type.

26 POSSIBILISTIC BICLUSTERING: Definitions
Let $x_{ij}$ be the expression level of the i-th gene in the j-th condition. A bicluster is defined as a subset of the $m \times n$ data matrix $X$, i.e., a bicluster is a pair $(g, c)$, where $g \subseteq \{1, \dots, m\}$ is a subset of genes and $c \subseteq \{1, \dots, n\}$ is a subset of conditions [Cheng & Church, 2000; Hartigan, 1972; Kung et al., 2005; Turner et al., 2005].
We are interested in the largest biclusters from DNA microarray data that do not exceed an assigned homogeneity constraint [Cheng & Church, 2000], as they can supply relevant biological information.

27 POSSIBILISTIC BICLUSTERING: Definitions
The size (or volume) $n$ of a bicluster is usually defined as the number of cells of the gene expression matrix $X$ belonging to it, that is, the product of the cardinalities $n_g = |g|$ and $n_c = |c|$:
$n = n_g \, n_c$ (2)
Normalized square residual:
$d_{ij}^2 = \frac{(x_{ij} + x_{IJ} - x_{iJ} - x_{Ij})^2}{n}$ (3)
where $x_{IJ}$, $x_{iJ}$ and $x_{Ij}$ are respectively the bicluster mean, the row mean and the column mean of $X$ for the selected genes and conditions.

28 POSSIBILISTIC BICLUSTERING: Definitions
Bicluster mean: $x_{IJ} = \frac{1}{n} \sum_{i \in g} \sum_{j \in c} x_{ij}$ (4)
Bicluster row mean: $x_{iJ} = \frac{1}{n_c} \sum_{j \in c} x_{ij}$ (5)
Bicluster column mean: $x_{Ij} = \frac{1}{n_g} \sum_{i \in g} x_{ij}$ (6)

29 POSSIBILISTIC BICLUSTERING: Definitions
Mean Square Residual [Cheng & Church, 2000]:
$G = \sum_{i \in g} \sum_{j \in c} d_{ij}^2$ (7)
$G$ measures the bicluster homogeneity, i.e., the difference between the actual value of an element $x_{ij}$ and its expected value as predicted from the corresponding row mean, column mean, and bicluster mean.
OUR AIM: to maximize the bicluster cardinality $n$ and at the same time minimize the residual $G$ (an NP-complete task [Peete, 2003]) using the possibilistic clustering paradigm.
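The definitions in eqs. (2) to (7) can be checked on a small example. The following is a minimal sketch in plain Python, not code from the talk: the function name `bicluster_residual` and the toy matrix are hypothetical, and the bicluster is given as two non-empty index lists.

```python
def bicluster_residual(X, genes, conds):
    """Return (n, G) for the crisp bicluster (genes, conds) of matrix X.

    X is a list of rows (genes), each a list of expression values across
    conditions; genes and conds are non-empty lists of row/column indices.
    """
    n_g, n_c = len(genes), len(conds)
    n = n_g * n_c                                    # eq. (2)
    # eq. (4): mean over all cells of the bicluster
    x_IJ = sum(X[i][j] for i in genes for j in conds) / n
    # eq. (5): mean of each selected row over the selected columns
    x_iJ = {i: sum(X[i][j] for j in conds) / n_c for i in genes}
    # eq. (6): mean of each selected column over the selected rows
    x_Ij = {j: sum(X[i][j] for i in genes) / n_g for j in conds}
    # eqs. (3) and (7): sum of the normalized square residuals
    G = sum((X[i][j] + x_IJ - x_iJ[i] - x_Ij[j]) ** 2 / n
            for i in genes for j in conds)
    return n, G


# Toy matrix (hypothetical): rows 0-2 follow an additive pattern on
# columns 0-2, while row 3 and column 3 do not.
X = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 1.0, 8.0, 2.0],
]
n_small, G_small = bicluster_residual(X, [0, 1, 2], [0, 1, 2])
n_full, G_full = bicluster_residual(X, [0, 1, 2, 3], [0, 1, 2, 3])
```

On the additive 3 x 3 block the residual vanishes up to rounding, while the full matrix has a strictly larger residual. This illustrates the trade-off between a large size $n$ and a low residual $G$.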

30 POSSIBILISTIC BICLUSTERING: Approaches to clustering Bioinformatics data sets
Data clustering is a routine step in biological data analysis and a basic tool in bioinformatics [Golub et al., 1999; Tamayo et al., 1999; Azuaje, 2003].
Main approaches:
Hierarchical Clustering [Eisen et al., 1998; Orengo et al., 2003]
Partitional (or Central) Clustering, including C-Means [Duda & Hart, 1973], Self-Organizing Map [Kohonen, 2001], Fuzzy C-Means [Bezdek, 1981], Deterministic Annealing [Rose et al., 1990], Alternating Cluster Estimation [Runkler, 1999], etc.

31 POSSIBILISTIC BICLUSTERING: Probabilistic constraint in central clustering
Let $X = \{x_1, \dots, x_r\}$ be a set of unlabeled data points, $Y = \{y_1, \dots, y_s\}$ a set of cluster centers (or prototypes) and $U = [u_{pq}]$ the fuzzy membership matrix, where $u_{pq}$ is the membership of point $x_q$ to cluster $p$. Central clustering algorithms often impose a probabilistic constraint on the memberships, according to which the membership values of a point in all the clusters must sum to one:
$\sum_{p=1}^{s} u_{pq} = 1 \quad \forall q$ (8)

32 POSSIBILISTIC BICLUSTERING: From Probabilistic to Possibilistic Clustering
Probabilistic constraint $\sum_{p=1}^{s} u_{pq} = 1$:
PROS: it is a competitive constraint that allows unsupervised learning algorithms to find the barycenters of the clusters.
CONS: (a) the membership to clusters cannot be interpreted as a degree of typicality; (b) it can make the result sensitive to outliers.

33 POSSIBILISTIC BICLUSTERING: Possibilistic Clustering
In the Possibilistic C-Means (PCM) algorithm [Krishnapuram & Keller, 1993] the constraints on the elements of $U$ are relaxed to:
$u_{pq} \in [0, 1] \quad \forall p, q$ (9)
$0 < \sum_{q=1}^{r} u_{pq} < r \quad \forall p$ (10)
$\sum_{p=1}^{s} u_{pq} > 0 \quad \forall q$ (11)
i.e., clusters cannot be empty and each pattern must be assigned to at least one cluster. PCM is a mode-seeking algorithm [Krishnapuram & Keller, 1993].

34 POSSIBILISTIC BICLUSTERING: Possibilistic Clustering
PCM objective function [Krishnapuram & Keller, 1996]:
$J_m(U, Y) = \sum_{p=1}^{s} \sum_{q=1}^{r} u_{pq} E_{pq} + \sum_{p=1}^{s} \beta_p \sum_{q=1}^{r} (u_{pq} \log u_{pq} - u_{pq})$ (12)
where $E_{pq} = \| x_q - y_p \|^2$ is the squared Euclidean distance and $\beta_p$ is a scale parameter depending on the average size of the p-th cluster.
Thanks to the penalty term, points with a high degree of typicality have high $u_{pq}$ values, while points that are not very representative have low $u_{pq}$ values in all the clusters. Note that without the penalty term the minimization would yield the trivial solution $u_{pq} = 0 \ \forall p, q$, since no probabilistic constraint is assumed.

35 POSSIBILISTIC BICLUSTERING: Possibilistic Clustering
The pair $(U, Y)$ minimizes $J_m$ under the possibilistic constraints (9) to (11) only if:
$u_{pq} = e^{-E_{pq} / \beta_p} \quad \forall p, q$ (13)
and
$y_p = \frac{\sum_{q=1}^{r} x_q u_{pq}}{\sum_{q=1}^{r} u_{pq}} \quad \forall p$ (14)
These two conditions can be solved numerically by a Picard iteration. PCM works as a membership refinement algorithm, in which the membership to a cluster is interpreted as a degree of typicality (the centroids can be initialized using, e.g., Fuzzy C-Means). PCM has a high outlier rejection capability, as it makes the memberships of outliers very low.
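As an illustration of how eqs. (13) and (14) alternate in a Picard iteration, here is a minimal sketch for one-dimensional data. It rests on assumptions not taken from the slides: the function name `pcm`, the stopping rule on the largest centroid shift, and the scales $\beta_p$ fixed by hand (the slides only say that they depend on the average cluster size). The initial centers are also given by hand rather than by Fuzzy C-Means.

```python
import math


def pcm(points, centers, betas, tol=1e-6, max_iter=200):
    """Picard iteration for Possibilistic C-Means on 1-D data (sketch).

    points  : list of floats x_q
    centers : initial prototypes y_p (given by hand in this sketch)
    betas   : one scale beta_p per cluster, fixed by hand (assumption)
    Returns (U, centers), where U[p][q] is the typicality of x_q for cluster p.
    """
    def memberships(cs):
        # eq. (13): u_pq = exp(-E_pq / beta_p), E_pq = squared distance
        return [[math.exp(-((x - y) ** 2) / beta) for x in points]
                for y, beta in zip(cs, betas)]

    centers = list(centers)
    for _ in range(max_iter):
        U = memberships(centers)
        # eq. (14): each prototype is the membership-weighted mean of the data
        new_centers = [sum(u * x for u, x in zip(row, points)) / sum(row)
                       for row in U]
        shift = max(abs(new - old) for new, old in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:       # assumed stopping rule: prototypes stopped moving
            break
    return memberships(centers), centers


# Two groups around 0 and 10, plus one outlier at 50 (hypothetical data).
points = [-0.2, 0.0, 0.1, 0.3, 9.8, 10.0, 10.1, 10.3, 50.0]
U, centers = pcm(points, centers=[1.0, 9.0], betas=[1.0, 1.0])
```

Each cluster is updated without reference to the other, and the outlier at 50 receives a membership close to zero in both clusters, so it does not pull either prototype toward itself.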

36 POSSIBILISTIC BICLUSTERING: Possibilistic Clustering
The PCM approach is equivalent to a set of $s$ independent estimation problems [Nasraoui, 1995]:
$(u_{pq}, y_p) = \arg\min_{u_{pq}, y_p} \left[ \sum_{q=1}^{r} u_{pq} E_{pq} + \beta_p \sum_{q=1}^{r} (u_{pq} \log u_{pq} - u_{pq}) \right] \quad \forall p$ (15)
which can be solved independently, one at a time, through a Picard iteration.

37 POSSIBILISTIC BICLUSTERING ALGORITHM (PBC): PBC Formulation
For each bicluster we assign two membership vectors, one for the rows and one for the columns, denoted respectively by $a$ and $b$. In a crisp set framework, row $i$ and column $j$ can either belong to the bicluster ($a_i = 1$ and $b_j = 1$) or not ($a_i = 0$ or $b_j = 0$). An element $x_{ij}$ of $X$ belongs to the bicluster if both $a_i = 1$ and $b_j = 1$, i.e., its membership $u_{ij}$ to the bicluster is:
$u_{ij} = \mathrm{and}(a_i, b_j)$ (16)
The cardinality of the bicluster is then defined as:
$n = \sum_i \sum_j u_{ij}$ (17)

38 POSSIBILISTIC BICLUSTERING ALGORITHM (PBC): PBC Formulation
Fuzzy set theory framework: we allow the memberships $u_{ij}$, $a_i$ and $b_j$ to take values in the interval $[0, 1]$. The membership $u_{ij}$ of an element $x_{ij}$ of $X$ to the bicluster can be obtained by aggregating the row and column memberships, using, e.g., the product t-norm:
$u_{ij} = a_i b_j$ (product) (18)
or the average:
$u_{ij} = \frac{a_i + b_j}{2}$ (average) (19)
The fuzzy cardinality of the bicluster is defined as the sum of the memberships $u_{ij}$ over all $i$ and $j$, as in eq. 17.

39 POSSIBILISTIC BICLUSTERING ALGORITHM (PBC): PBC Formulation
Generalization of the homogeneity measures (eqs. 3 to 7):
Fuzzy normalized square residual:
$d_{ij}^2 = \frac{(x_{ij} + x_{IJ} - x_{iJ} - x_{Ij})^2}{n}$ (20)
where the fuzzy bicluster mean, the fuzzy bicluster row mean and the fuzzy bicluster column mean are defined as:
$x_{IJ} = \frac{\sum_i \sum_j u_{ij} x_{ij}}{\sum_i \sum_j u_{ij}}, \quad x_{iJ} = \frac{\sum_j u_{ij} x_{ij}}{\sum_j u_{ij}}, \quad x_{Ij} = \frac{\sum_i u_{ij} x_{ij}}{\sum_i u_{ij}}$ (21)
and the fuzzy mean square residual as:
$G = \sum_i \sum_j u_{ij} d_{ij}^2$ (22)
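The fuzzy quantities of eqs. (17) and (19) to (22) can be computed directly from the two membership vectors. The sketch below is illustrative only: the name `fuzzy_residual` and the toy matrix are hypothetical, the average aggregator of eq. (19) is used, and the memberships are assumed not to be all zero so that every weighted mean is defined.

```python
def fuzzy_residual(X, a, b):
    """Return (n, d2, G) for row memberships a and column memberships b.

    X is a list of rows; a[i] and b[j] lie in [0, 1] and are assumed
    not to be all zero, so that no weighted mean divides by zero.
    """
    m, k = len(X), len(X[0])
    # eq. (19): average aggregator for the cell memberships
    u = [[(a[i] + b[j]) / 2 for j in range(k)] for i in range(m)]
    n = sum(map(sum, u))                                   # eq. (17)
    # eq. (21): membership-weighted bicluster, row, and column means
    x_IJ = sum(u[i][j] * X[i][j] for i in range(m) for j in range(k)) / n
    x_iJ = [sum(u[i][j] * X[i][j] for j in range(k)) / sum(u[i])
            for i in range(m)]
    x_Ij = [sum(u[i][j] * X[i][j] for i in range(m)) /
            sum(u[i][j] for i in range(m)) for j in range(k)]
    # eq. (20): fuzzy normalized square residual of each cell
    d2 = [[(X[i][j] + x_IJ - x_iJ[i] - x_Ij[j]) ** 2 / n for j in range(k)]
          for i in range(m)]
    # eq. (22): fuzzy mean square residual
    G = sum(u[i][j] * d2[i][j] for i in range(m) for j in range(k))
    return n, d2, G


X = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]   # additive pattern
ones = [1.0, 1.0, 1.0]
n, d2, G = fuzzy_residual(X, ones, ones)

X_noisy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 9.0]]
_, _, G_noisy = fuzzy_residual(X_noisy, ones, ones)

# Lowering one row membership reduces the fuzzy cardinality.
n_partial, _, _ = fuzzy_residual(X, [1.0, 1.0, 0.0], ones)
```

With all memberships equal to one, the fuzzy quantities reduce to the crisp definitions of eqs. (2) to (7) applied to the whole matrix. Note that under the average aggregator a row with $a_i = 0$ still contributes $b_j / 2$ to each of its cells.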

40 POSSIBILISTIC BICLUSTERING ALGORITHM (PBC): PBC Formulation
Possibilistic Biclustering Problem: maximize the bicluster cardinality $n$ and minimize the fuzzy residual $G$ under the fuzzy possibilistic paradigm. To this aim we make the following assumptions:
we treat one bicluster at a time;
the fuzzy memberships $a_i$ and $b_j$ are interpreted as typicality degrees of gene $i$ and condition $j$ with respect to the bicluster;
we compute the membership $u_{ij}$ using the average aggregator (eq. 19).

41 POSSIBILISTIC BICLUSTERING ALGORITHM (PBC): PBC Formulation
All those requirements are fulfilled by minimizing the following functional $J_B$ with respect to $a$ and $b$:
$J_B = \sum_{i,j} \frac{a_i + b_j}{2} \, d_{ij}^2 + \lambda \sum_i (a_i \ln a_i - a_i) + \mu \sum_j (b_j \ln b_j - b_j)$ (23)
The first term is the fuzzy mean square residual $G$, while the other two are penalization terms. The parameters $\lambda$ and $\mu$ control the size of the bicluster. Their values can be estimated by simple statistics over the training set, and then hand-tuned to incorporate possible a priori knowledge and to obtain the expected results.

42 POSSIBILISTIC BICLUSTERING ALGORITHM (PBC): PBC Formulation
Setting the derivatives of $J_B$ with respect to the memberships $a_i$ and $b_j$ to zero, we obtain:
$a_i = \exp\left( -\frac{\sum_j d_{ij}^2}{2\lambda} \right)$ (24)
$b_j = \exp\left( -\frac{\sum_i d_{ij}^2}{2\mu} \right)$ (25)
These necessary conditions for the minimization of $J_B$, together with the definition of the fuzzy normalized square residual $d_{ij}^2$ (eq. 20), can be used to find a numerical solution of the optimization problem (Picard iteration).
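The step from eq. (23) to eq. (24) can be written out as follows. This worked derivation assumes that the residuals $d_{ij}^2$ are held fixed while differentiating, which is how each step of the Picard iteration treats them; it is a sketch of the reasoning, not a transcription from the slides. The case of $b_j$ (eq. 25) is analogous, with $\mu$ in place of $\lambda$ and the sum taken over $i$.

```latex
\begin{align*}
\frac{\partial J_B}{\partial a_i}
  &= \frac{1}{2} \sum_j d_{ij}^2 + \lambda \left( \ln a_i + 1 - 1 \right)
   = \frac{1}{2} \sum_j d_{ij}^2 + \lambda \ln a_i = 0 \\
\Rightarrow \quad \ln a_i &= -\frac{\sum_j d_{ij}^2}{2\lambda}
\quad \Rightarrow \quad
a_i = \exp\left( -\frac{\sum_j d_{ij}^2}{2\lambda} \right)
\end{align*}
```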

43 POSSIBILISTIC BICLUSTERING ALGORITHM (PBC): PBC Formulation
Table: Possibilistic Biclustering (PBC) algorithm.
1. Initialize the memberships $a$ and $b$ and the threshold $\varepsilon$
2. Compute $d_{ij}^2 \ \forall i, j$ (eq. 20)
3. Update $a_i \ \forall i$ (eq. 24)
4. Update $b_j \ \forall j$ (eq. 25)
5. If $\| a' - a \| < \varepsilon$ and $\| b' - b \| < \varepsilon$ then stop
6. Else jump to step 2
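The six steps of the table can be sketched as a short loop. This is an illustrative reading of the table, not the authors' implementation. The function name `pbc`, the all-ones default initialization, the Euclidean norm in step 5, the `max_iter` cap and the toy matrix are all assumptions. Memberships are also assumed not to underflow to zero, since that would make the weighted means undefined.

```python
import math


def pbc(X, lam, mu, a=None, b=None, eps=1e-4, max_iter=500):
    """Sketch of the PBC loop; returns row and column memberships (a, b)."""
    m, k = len(X), len(X[0])
    # step 1: initialize memberships (all ones is an assumed default)
    a = [1.0] * m if a is None else list(a)
    b = [1.0] * k if b is None else list(b)
    for _ in range(max_iter):
        # step 2: fuzzy normalized square residuals, eqs. (19)-(21)
        u = [[(a[i] + b[j]) / 2 for j in range(k)] for i in range(m)]
        n = sum(map(sum, u))
        x_IJ = sum(u[i][j] * X[i][j] for i in range(m) for j in range(k)) / n
        x_iJ = [sum(u[i][j] * X[i][j] for j in range(k)) / sum(u[i])
                for i in range(m)]
        col = [sum(u[i][j] for i in range(m)) for j in range(k)]
        x_Ij = [sum(u[i][j] * X[i][j] for i in range(m)) / col[j]
                for j in range(k)]
        d2 = [[(X[i][j] + x_IJ - x_iJ[i] - x_Ij[j]) ** 2 / n
               for j in range(k)] for i in range(m)]
        # steps 3-4: membership updates, eqs. (24) and (25)
        new_a = [math.exp(-sum(d2[i]) / (2 * lam)) for i in range(m)]
        new_b = [math.exp(-sum(d2[i][j] for i in range(m)) / (2 * mu))
                 for j in range(k)]
        # step 5: stop when both membership vectors have stabilized
        da = math.sqrt(sum((p - q) ** 2 for p, q in zip(new_a, a)))
        db = math.sqrt(sum((p - q) ** 2 for p, q in zip(new_b, b)))
        a, b = new_a, new_b
        if da < eps and db < eps:
            break
    return a, b


# Rows 0-3 follow an additive pattern; row 4 does not (hypothetical data).
X = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
    [4.0, 5.0, 6.0, 7.0],
    [9.0, 0.0, 8.0, 1.0],
]
a, b = pbc(X, lam=1.0, mu=1.0)
```

On this toy matrix the row that breaks the additive pattern ends up with a lower membership than the four coherent rows. Larger values of `lam` and `mu` make the exponentials decay more slowly and therefore yield larger biclusters.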

44 POSSIBILISTIC BICLUSTERING ALGORITHM (PBC): PBC Formulation
The memberships can be initialized:
randomly;
using some a priori information about relevant genes and conditions;
using the results already obtained from another biclustering algorithm (in this case PBC works as a refinement algorithm).
$\varepsilon$ controls the convergence of the algorithm.
After convergence, the memberships $a$ and $b$ can be defuzzified by applying an $\alpha$-cut, i.e., by comparing them with a threshold.
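The defuzzification step can be sketched in a few lines. The membership values below are hypothetical, and treating a membership equal to the threshold as belonging to the bicluster is an assumption, since the slides only say that memberships are compared with a threshold.

```python
def alpha_cut(memberships, alpha=0.5):
    """Return the indices whose membership is at least alpha."""
    return [i for i, v in enumerate(memberships) if v >= alpha]


a = [0.96, 0.95, 0.97, 0.93, 0.08]   # hypothetical row (gene) memberships
b = [0.34, 0.65, 0.76, 0.51]         # hypothetical column (condition) memberships
genes = alpha_cut(a)                 # [0, 1, 2, 3]
conds = alpha_cut(b)                 # [1, 2, 3]
size = len(genes) * len(conds)       # 12, the crisp size of eq. (2)
```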

45 RESULTS: Yeast data set [Tavazoie et al., 1999] [Ball et al., 2000] [Aach et al., 2000]
2879 genes and 17 conditions.
$\alpha$-cut = 0.5 for the defuzzification of $a$ and $b$.
$\varepsilon$ = [...] (results averaged over 20 runs).
Figure: size of the biclusters vs. $\lambda$ and $\mu$.

46 RESULTS: Yeast data set
PBC is only slightly sensitive to the initialization of the memberships, but strongly sensitive to the parameters $\lambda$ and $\mu$.
PBC can find biclusters of a desired size just by tuning the parameters $\lambda$ and $\mu$ (results averaged over 20 runs).
Table (numeric values not preserved), columns: $\lambda$, $\mu$, $n_g$, $n_c$, $n$, $G$.

47 RESULTS: Yeast data set
Comparative study on Yeast data.
Table (numeric values not preserved), columns: Method, avg. $G$, avg. $n$, avg. $n_g$, avg. $n_c$, largest $n$.
Methods compared: DBF [Zhang et al., 2004]; FLOC [Yang et al., 2003]; Cheng-Church [2000]; Single-objective GA [Mitra & Banka, 2006]; Multi-objective GA [Mitra & Banka, 2006]; Possibilistic Biclustering.


50 RESULTS: Yeast data set
Figure: plot of a small and a large bicluster (x-axis: conditions; y-axis: expression values).

51 CONCLUSIONS
The Possibilistic Biclustering (PBC) algorithm extends the possibilistic clustering paradigm to the solution of the biclustering problem.
The membership $u_{ij}$ of an element $x_{ij}$ of $X$ to the bicluster is obtained by aggregating the memberships (typicalities) of its row (gene) and column (condition) with respect to the bicluster.
The quality (residual $G$) of the large biclusters obtained is better than that achieved by other biclustering methods.
Further studies:
biological validation of the obtained results;
automatic selection of the parameters $\lambda$ and $\mu$;
other aggregators for obtaining $u_{ij}$.




More information

Use of biclustering for missing value imputation in gene expression data

Use of biclustering for missing value imputation in gene expression data ORIGINAL RESEARCH Use of biclustering for missing value imputation in gene expression data K.O. Cheng, N.F. Law, W.C. Siu Department of Electronic and Information Engineering, The Hong Kong Polytechnic

More information

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 37 CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 4.1 INTRODUCTION Genes can belong to any genetic network and are also coordinated by many regulatory

More information

Mining Deterministic Biclusters in Gene Expression Data

Mining Deterministic Biclusters in Gene Expression Data Mining Deterministic Biclusters in Gene Expression Data Zonghong Zhang 1 Alvin Teo 1 BengChinOoi 1,2 Kian-Lee Tan 1,2 1 Department of Computer Science National University of Singapore 2 Singapore-MIT-Alliance

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 11, November 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Novel Intuitionistic Fuzzy C-Means Clustering for Linearly and Nonlinearly Separable Data

Novel Intuitionistic Fuzzy C-Means Clustering for Linearly and Nonlinearly Separable Data Novel Intuitionistic Fuzzy C-Means Clustering for Linearly and Nonlinearly Separable Data PRABHJOT KAUR DR. A. K. SONI DR. ANJANA GOSAIN Department of IT, MSIT Department of Computers University School

More information

Pattern Recognition Lecture Sequential Clustering

Pattern Recognition Lecture Sequential Clustering Pattern Recognition Lecture Prof. Dr. Marcin Grzegorzek Research Group for Pattern Recognition Institute for Vision and Graphics University of Siegen, Germany Pattern Recognition Chain patterns sensor

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Applying the Possibilistic C-Means Algorithm in Kernel-Induced Spaces

Applying the Possibilistic C-Means Algorithm in Kernel-Induced Spaces 1 Applying the Possibilistic C-Means Algorithm in Kernel-Induced Spaces Maurizio Filippone, Francesco Masulli, and Stefano Rovetta M. Filippone is with the Department of Computer Science of the University

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

Introduction to Computer Science

Introduction to Computer Science DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at

More information

e-ccc-biclustering: Related work on biclustering algorithms for time series gene expression data

e-ccc-biclustering: Related work on biclustering algorithms for time series gene expression data : Related work on biclustering algorithms for time series gene expression data Sara C. Madeira 1,2,3, Arlindo L. Oliveira 1,2 1 Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Lisbon, Portugal

More information

High throughput Data Analysis 2. Cluster Analysis

High throughput Data Analysis 2. Cluster Analysis High throughput Data Analysis 2 Cluster Analysis Overview Why clustering? Hierarchical clustering K means clustering Issues with above two Other methods Quality of clustering results Introduction WHY DO

More information

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results

Contents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results Statistical Analysis of Microarray Data Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation of clustering results Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be

More information

Clustering Jacques van Helden

Clustering Jacques van Helden Statistical Analysis of Microarray Data Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Microarray data analysis

Microarray data analysis Microarray data analysis Computational Biology IST Technical University of Lisbon Ana Teresa Freitas 016/017 Microarrays Rows represent genes Columns represent samples Many problems may be solved using

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University Chapter MINING HIGH-DIMENSIONAL DATA Wei Wang 1 and Jiong Yang 2 1. Department of Computer Science, University of North Carolina at Chapel Hill 2. Department of Electronic Engineering and Computer Science,

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Optimal Web Page Category for Web Personalization Using Biclustering Approach

Optimal Web Page Category for Web Personalization Using Biclustering Approach Optimal Web Page Category for Web Personalization Using Biclustering Approach P. S. Raja Department of Computer Science, Periyar University, Salem, Tamil Nadu 636011, India. psraja5@gmail.com Abstract

More information

RFCM: A Hybrid Clustering Algorithm Using Rough and Fuzzy Sets

RFCM: A Hybrid Clustering Algorithm Using Rough and Fuzzy Sets Fundamenta Informaticae 8 (27) 475 496 475 RFCM: A Hybrid Clustering Algorithm Using Rough and Fuzzy Sets Pradipta Maji and Sankar K. Pal Center for Soft Computing Research Indian Statistical Institute

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Unsupervised: no target value to predict

Unsupervised: no target value to predict Clustering Unsupervised: no target value to predict Differences between models/algorithms: Exclusive vs. overlapping Deterministic vs. probabilistic Hierarchical vs. flat Incremental vs. batch learning

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

Double Self-Organizing Maps to Cluster Gene Expression Data

Double Self-Organizing Maps to Cluster Gene Expression Data Double Self-Organizing Maps to Cluster Gene Expression Data Dali Wang, Habtom Ressom, Mohamad Musavi, Cristian Domnisoru University of Maine, Department of Electrical & Computer Engineering, Intelligent

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

The k-means Algorithm and Genetic Algorithm

The k-means Algorithm and Genetic Algorithm The k-means Algorithm and Genetic Algorithm k-means algorithm Genetic algorithm Rough set approach Fuzzy set approaches Chapter 8 2 The K-Means Algorithm The K-Means algorithm is a simple yet effective

More information

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis Application of fuzzy set theory in image analysis Nataša Sladoje Centre for Image Analysis Our topics for today Crisp vs fuzzy Fuzzy sets and fuzzy membership functions Fuzzy set operators Approximate

More information

Triclustering in Gene Expression Data Analysis: A Selected Survey

Triclustering in Gene Expression Data Analysis: A Selected Survey Triclustering in Gene Expression Data Analysis: A Selected Survey P. Mahanta, H. A. Ahmed Dept of Comp Sc and Engg Tezpur University Napaam -784028, India Email: priyakshi@tezu.ernet.in, hasin@tezu.ernet.in

More information

CHAPTER 5 CLUSTER VALIDATION TECHNIQUES

CHAPTER 5 CLUSTER VALIDATION TECHNIQUES 120 CHAPTER 5 CLUSTER VALIDATION TECHNIQUES 5.1 INTRODUCTION Prediction of correct number of clusters is a fundamental problem in unsupervised classification techniques. Many clustering techniques require

More information

Clustering gene expression data

Clustering gene expression data Clustering gene expression data 1 How Gene Expression Data Looks Entries of the Raw Data matrix: Ratio values Absolute values Row = gene s expression pattern Column = experiment/condition s profile genes

More information

Spectral Methods for Network Community Detection and Graph Partitioning

Spectral Methods for Network Community Detection and Graph Partitioning Spectral Methods for Network Community Detection and Graph Partitioning M. E. J. Newman Department of Physics, University of Michigan Presenters: Yunqi Guo Xueyin Yu Yuanqi Li 1 Outline: Community Detection

More information

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract

More information

TECHNIQUES FOR CLUSTERING GENE EXPRESSION DATA

TECHNIQUES FOR CLUSTERING GENE EXPRESSION DATA TECHNIQUES FOR CLUSTERING GENE EXPRESSION DATA G. Kerr, H.J. Ruskin, M. Crane, P. Doolan Biocomputation Research Lab, (Modelling and Scientific Computing Group, School of Computing) and National Institute

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

Reflexive Regular Equivalence for Bipartite Data

Reflexive Regular Equivalence for Bipartite Data Reflexive Regular Equivalence for Bipartite Data Aaron Gerow 1, Mingyang Zhou 2, Stan Matwin 1, and Feng Shi 3 1 Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada 2 Department of Computer

More information

COSC 6339 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2017.

COSC 6339 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2017. COSC 6339 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 217 Clustering Clustering is a technique for finding similarity groups in data, called

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

How do microarrays work

How do microarrays work Lecture 3 (continued) Alvis Brazma European Bioinformatics Institute How do microarrays work condition mrna cdna hybridise to microarray condition Sample RNA extract labelled acid acid acid nucleic acid

More information

Iteration Reduction K Means Clustering Algorithm

Iteration Reduction K Means Clustering Algorithm Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

A survey of kernel and spectral methods for clustering

A survey of kernel and spectral methods for clustering A survey of kernel and spectral methods for clustering Maurizio Filippone a Francesco Camastra b Francesco Masulli a Stefano Rovetta a a Department of Computer and Information Science, University of Genova,

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays.

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays. Comparisons and validation of statistical clustering techniques for microarray gene expression data Susmita Datta and Somnath Datta Presented by: Jenni Dietrich Assisted by: Jeffrey Kidd and Kristin Wheeler

More information

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2 161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015.

COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015. COSC 6397 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 215 Clustering Clustering is a technique for finding similarity groups in data, called

More information

Statistical Methods and Optimization in Data Mining

Statistical Methods and Optimization in Data Mining Statistical Methods and Optimization in Data Mining Eloísa Macedo 1, Adelaide Freitas 2 1 University of Aveiro, Aveiro, Portugal; macedo@ua.pt 2 University of Aveiro, Aveiro, Portugal; adelaide@ua.pt The

More information

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection CSE 255 Lecture 6 Data Mining and Predictive Analytics Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:

More information

Clustering. Supervised vs. Unsupervised Learning

Clustering. Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering IJIRST International Journal for Innovative Research in Science & Technology Volume 3 Issue 07 December 2016 ISSN (online): 2349-6010 Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

More information

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Binoda Nand Prasad*, Mohit Rathore**, Geeta Gupta***, Tarandeep Singh**** *Guru Gobind Singh Indraprastha University,

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information