Biclustering Bioinformatics Data Sets. A Possibilistic Approach

Biclustering Bioinformatics Data Sets: A Possibilistic Approach
Dept. of Computer and Information Sciences, University of Genova, Italy
EMFCSC, Erice, 20/4/2007

Outline
1 Introduction
2 Biclustering
3 Possibilistic Biclustering algorithm
4

BIOINFORMATICS DATA SETS
Data representation
- In the post-genomic era, many Bioinformatics data sets are available, most of them released into the public domain on the Internet.
- The information embedded in most of them has not yet been fully exploited, owing to the lack of accurate machine learning tools and/or their limited diffusion in the Bioinformatics community.

- Most Bioinformatics data sets come from DNA microarray experiments and are normally given as a rectangular $m \times n$ matrix $X$, where each row represents a feature (e.g., a gene) and each column represents a data sample or condition (e.g., a patient):
  $$X = (x_{ij})_{m \times n} \tag{1}$$
  where the value $x_{ij}$ is the expression of the $i$-th gene in the $j$-th condition.
- The analysis of microarray data sets can give valuable information on the biological relevance of genes and on the correlations between them [Madeira & Oliveira, 2004].
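
As a minimal sketch of this representation, following the indexing of eq. 1 ($i$ for genes, $j$ for conditions), the matrix can be stored row by row. The values and the helper name `expression` are invented for illustration and are not part of the original slides.

```python
# Toy expression matrix: one row per gene (i), one column per condition (j).
# All values are invented for illustration only.
X = [
    [1.0, 2.0, 3.0],   # gene 1
    [2.0, 3.0, 4.0],   # gene 2
    [9.0, 0.5, 7.0],   # gene 3
    [4.0, 5.0, 6.0],   # gene 4
]

m = len(X)      # number of genes (rows)
n = len(X[0])   # number of conditions (columns)


def expression(X, i, j):
    """Return x_ij, the expression of gene i in condition j (1-based, as in eq. 1)."""
    return X[i - 1][j - 1]


print(m, n, expression(X, 3, 2))
```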

BIOINFORMATICS DATA SETS
Major Machine Learning tasks
- Clustering (unsupervised): given a set of samples, partition them into groups containing similar samples according to some similarity criterion (CLASS DISCOVERY).
- Classification (supervised): find the classes of the test data set using the known classification of the training data set (CLASS PREDICTION).
- Feature selection (dimensionality reduction): select a subset of features responsible for creating the condition corresponding to the class (GENE SELECTION, BIOMARKER SELECTION).
- Outlier detection: detect data samples that are not good representatives of any class, and disregard them while performing data analysis.

BIOINFORMATICS DATA SETS
Major challenges in Machine Learning
- Noisy data complicate the solution of Machine Learning tasks (robustness to noise).
- The high dimensionality of the data makes complete search computationally infeasible in most data mining problems (curse of dimensionality).
- Some data values may be inaccurate or missing.
⇒ The available data may not be sufficient to obtain statistically significant conclusions.

Problem we shall focus on today:
How to identify genes with similar behavior with respect to different conditions?
An instance of the biclustering problem (also known as co-clustering, two-way clustering, ...) [Cheng & Church, 2000; Hartigan, 1972; Kung et al., 2005; Turner et al., 2005].

BICLUSTERING
- Biclustering is a methodology that clusters the feature set and the data points simultaneously.
- It finds clusters of samples possessing similar characteristics, together with the features creating these similarities.
- It answers the question: which characteristics make similar objects similar to one another?

BICLUSTERING
Surveys
- S. Madeira, A.L. Oliveira, Biclustering Algorithms for Biological Data Analysis: A Survey, 2004.
- A. Tanay, R. Sharan, R. Shamir, Biclustering Algorithms: A Survey, 2004.
- D. Jiang, C. Tang, A. Zhang, Cluster Analysis for Gene Expression Data: A Survey, 2004.

BICLUSTERING
Applications
Biological and medical:
- Microarray data analysis
- Analysis of drug activity [Liu & Wang, 2003]
- Analysis of nutritional data [Lazzeroni et al., 2000]

BICLUSTERING
Applications
- Text mining [Dhillon, 2001, 2003]
- Marketing [Gaul & Schader, 1996]
- Others: electoral data [Hartigan, 1972], currency exchange [Lazzeroni et al., 2000]
- Dimensionality reduction in databases [Agrawal et al., 1998]

BICLUSTERING
State of the art
Cheng & Church algorithm [2000]
- The algorithm constructs one bicluster at a time using a statistical criterion: a low mean squared residue (the variance of the set of all elements in the bicluster, plus the mean row variance and the mean column variance).
- Once a bicluster is created, its entries are replaced by random numbers, and the procedure is repeated iteratively.
- Drawback: the masking procedure results in a phenomenon of random interference, affecting the subsequent discovery of large biclusters [Yang et al., 2003].
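
The masking step can be sketched as follows. This only illustrates the replacement of a found bicluster by random numbers; it is not Cheng & Church's implementation. The function name `mask_bicluster`, the uniform distribution and its range are assumptions made for the example.

```python
import random


def mask_bicluster(X, genes, conds, low, high, rng):
    """Return a copy of X in which the cells of the bicluster (genes x conds)
    are replaced by uniform random numbers in [low, high] (assumed range)."""
    masked = [row[:] for row in X]          # copy, so X itself is left untouched
    for i in genes:
        for j in conds:
            masked[i][j] = rng.uniform(low, high)
    return masked


X = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 4.0],
     [9.0, 0.5, 7.0]]
rng = random.Random(0)                      # fixed seed so the sketch is repeatable
masked = mask_bicluster(X, genes=[0, 1], conds=[0, 1], low=0.0, high=10.0, rng=rng)
```

Cells outside the masked bicluster keep their values, but later searches see noise where the first bicluster was, which is the source of the interference noted above.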

BICLUSTERING
State of the art
- Direct Clustering [Hartigan, 1972]
- Flexible Overlapped Clusters (FLOC) [Yang et al., 2003] (probabilistic algorithm)
- Bipartite graphs [Tanay et al., 2002]
- Genetic algorithms [Mitra et al., 2006]
- Simulated annealing [Bryan et al., 2005]

POSSIBILISTIC BICLUSTERING
Joint work:
- Maurizio Filippone, Stefano Rovetta: DISI, Dept. of Computer and Information Science, University of Genova, Italy
- Sushmita Mitra, Haider Banka: Indian Statistical Institute, Kolkata, India

POSSIBILISTIC BICLUSTERING
- We propose a new approach to the biclustering problem using the possibilistic clustering paradigm [Krishnapuram & Keller, 1993].
- The PBC algorithm finds one bicluster at a time, assigning to each data matrix element a membership to the bicluster.
- The membership model is of the fuzzy possibilistic type.

POSSIBILISTIC BICLUSTERING
Definitions
- Let $x_{ij}$ be the expression level of the $i$-th gene in the $j$-th condition.
- A bicluster is defined as a subset of the $m \times n$ data matrix $X$, i.e., a bicluster is a pair $(g, c)$, where $g \subset \{1, \ldots, m\}$ is a subset of genes and $c \subset \{1, \ldots, n\}$ is a subset of conditions [Cheng & Church, 2000; Hartigan, 1972; Kung et al., 2005; Turner et al., 2005].
- We are interested in the largest biclusters from DNA microarray data that do not exceed an assigned homogeneity constraint [Cheng & Church, 2000], as they can supply relevant biological information.

POSSIBILISTIC BICLUSTERING
Definitions
- The size (or volume) $n$ of a bicluster is usually defined as the number of cells of the gene expression matrix $X$ belonging to it, that is, the product of the cardinalities $n_g = |g|$ and $n_c = |c|$:
  $$n = n_g \cdot n_c \tag{2}$$
- Normalized square residual:
  $$d_{ij}^2 = \frac{(x_{ij} + x_{IJ} - x_{iJ} - x_{Ij})^2}{n} \tag{3}$$
  where $x_{IJ}$, $x_{iJ}$ and $x_{Ij}$ are respectively the bicluster mean, the row mean and the column mean of $X$ for the selected genes and conditions.

POSSIBILISTIC BICLUSTERING
Definitions
- Bicluster mean:
  $$x_{IJ} = \frac{1}{n} \sum_{i \in g} \sum_{j \in c} x_{ij} \tag{4}$$
- Bicluster row mean:
  $$x_{iJ} = \frac{1}{n_c} \sum_{j \in c} x_{ij} \tag{5}$$
- Bicluster column mean:
  $$x_{Ij} = \frac{1}{n_g} \sum_{i \in g} x_{ij} \tag{6}$$

POSSIBILISTIC BICLUSTERING
Definitions
- Mean Square Residual [Cheng & Church, 2000]:
  $$G = \sum_{i \in g} \sum_{j \in c} d_{ij}^2 \tag{7}$$
- $G$ measures the bicluster homogeneity, i.e., the difference between the actual value of an element $x_{ij}$ and its expected value as predicted from the corresponding row mean, column mean, and bicluster mean.
- OUR AIM: maximizing the bicluster cardinality $n$ while minimizing the residual $G$ (an NP-complete task [Peeters, 2003]) using the Possibilistic Clustering Paradigm.
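
The definitions of eqs. 2 to 7 can be checked on a toy matrix. The sketch below is a plain-Python illustration; the function name `bicluster_residual` and the data are assumptions made for the example. A bicluster whose rows differ only by an additive shift should give $G = 0$, while adding a row that breaks the pattern should raise $G$.

```python
def bicluster_residual(X, g, c):
    """Return (d2, G) for the bicluster (g, c) of matrix X.

    g, c: lists of 0-based row (gene) and column (condition) indices.
    d2[(i, j)] is the normalized square residual of eq. 3; G is their sum (eq. 7).
    """
    n_g, n_c = len(g), len(c)
    n = n_g * n_c                                           # volume, eq. 2
    x_IJ = sum(X[i][j] for i in g for j in c) / n           # bicluster mean, eq. 4
    x_iJ = {i: sum(X[i][j] for j in c) / n_c for i in g}    # row means, eq. 5
    x_Ij = {j: sum(X[i][j] for i in g) / n_g for j in c}    # column means, eq. 6
    d2 = {(i, j): (X[i][j] + x_IJ - x_iJ[i] - x_Ij[j]) ** 2 / n
          for i in g for j in c}                            # eq. 3
    G = sum(d2.values())                                    # eq. 7
    return d2, G


X = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 4.0],
     [9.0, 0.5, 7.0],   # this row breaks the additive pattern of the others
     [4.0, 5.0, 6.0]]

_, G_additive = bicluster_residual(X, g=[0, 1, 3], c=[0, 1, 2])
_, G_all = bicluster_residual(X, g=[0, 1, 2, 3], c=[0, 1, 2])
print(G_additive, G_all)
```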

POSSIBILISTIC BICLUSTERING
Approaches to clustering Bioinformatics data sets
- Data clustering is a routine step in biological data analysis, and a basic tool in Bioinformatics [Golub et al., 1999; Tamayo et al., 1999; Azuaje, 2003].
- Main approaches:
  - Hierarchical Clustering [Eisen et al., 1998; Orengo et al., 2003]
  - Partitional (or Central) Clustering, including C-Means [Duda & Hart, 1973], Self Organizing Map [Kohonen, 2001], Fuzzy C-Means [Bezdek, 1981], Deterministic Annealing [Rose et al., 1990], Alternating Cluster Estimation [Runkler, 1999], etc.

POSSIBILISTIC BICLUSTERING
Probabilistic constraint in central clustering
- Let $X = \{x_1, \ldots, x_r\}$ be a set of unlabeled data points, $Y = \{y_1, \ldots, y_s\}$ a set of cluster centers (or prototypes) and $U = [u_{pq}]$ the fuzzy membership matrix, where $u_{pq}$ is the membership of point $x_q$ in cluster $p$.
- Often, central clustering algorithms impose a probabilistic constraint on memberships, according to which the sum of the membership values of a point over all the clusters must be equal to one:
  $$\sum_{p=1}^{s} u_{pq} = 1 \quad \forall q \tag{8}$$
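
To make the constraint concrete, the sketch below applies the standard Fuzzy C-Means membership update [Bezdek, 1981] to 1-D data and checks that each point's memberships sum to one across clusters. The fuzzifier `m = 2`, the data and the centers are illustrative assumptions, and the function assumes that no point coincides exactly with a center.

```python
def fcm_memberships(points, centers, m=2.0):
    """Fuzzy C-Means membership update for 1-D data.

    Returns U with U[p][q] = membership of point q in cluster p.
    Assumes no point coincides exactly with a center (all E[p][q] > 0).
    """
    E = [[(x - y) ** 2 for x in points] for y in centers]   # squared distances E[p][q]
    s, r = len(centers), len(points)
    U = [[0.0] * r for _ in range(s)]
    for q in range(r):
        for p in range(s):
            U[p][q] = 1.0 / sum((E[p][q] / E[k][q]) ** (1.0 / (m - 1.0))
                                for k in range(s))
    return U


points = [0.1, 0.4, 2.1, 2.6, 10.0]   # the last point is far from both centers
centers = [0.0, 2.5]
U = fcm_memberships(points, centers)

# Sum over clusters for each point: this is the probabilistic constraint.
column_sums = [sum(U[p][q] for p in range(len(centers)))
               for q in range(len(points))]
```

Because the memberships are forced to sum to one, the far-away point at 10.0 still receives a membership above 0.5 in the nearer cluster, even though it is typical of neither.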

POSSIBILISTIC BICLUSTERING
From Probabilistic to Possibilistic Clustering
Probabilistic constraint: $\sum_{p=1}^{s} u_{pq} = 1 \quad \forall q$
- PROS: a competitive constraint that allows unsupervised learning algorithms to find the barycenter of clusters.
- CONS: membership to clusters (a) is not interpretable as a degree of typicality and (b) can give sensitivity to outliers.
[Figure: panels (a) and (b)]

POSSIBILISTIC BICLUSTERING
Possibilistic Clustering
- In the Possibilistic C-Means (PCM) algorithm [Krishnapuram & Keller, 1993] the constraints on the elements of $U$ are relaxed to:
  $$u_{pq} \in [0, 1] \quad \forall p, q \tag{9}$$
  $$0 < \sum_{q=1}^{r} u_{pq} < r \quad \forall p \tag{10}$$
  $$\max_{p} u_{pq} > 0 \quad \forall q \tag{11}$$
  i.e., clusters cannot be empty and each pattern must be assigned to at least one cluster.
- ⇒ mode-seeking algorithm [Krishnapuram & Keller, 1993]

POSSIBILISTIC BICLUSTERING
Possibilistic Clustering
- PCM objective function [Krishnapuram & Keller, 1996]:
  $$J_m(U, Y) = \sum_{p=1}^{s} \sum_{q=1}^{r} u_{pq} E_{pq} + \sum_{p=1}^{s} \beta_p \sum_{q=1}^{r} (u_{pq} \log u_{pq} - u_{pq}) \tag{12}$$
  where $E_{pq} = \|x_q - y_p\|^2$ is the squared Euclidean distance and $\beta_p$ is a scale parameter depending on the average size of the $p$-th cluster.
- Thanks to the penalty term, points with a high degree of typicality have high $u_{pq}$ values, and points that are not very representative have low $u_{pq}$ values in all the clusters.
- Note that if $\beta_p \to 0 \;\forall p$ (no penalty term), the trivial solution $u_{pq} = 0 \;\forall p, q$ is obtained, as no probabilistic constraint is assumed.

POSSIBILISTIC BICLUSTERING
Possibilistic Clustering
- The pair $(U, Y)$ minimizes $J_m$ under the possibilistic constraints (9)-(11) only if:
  $$u_{pq} = e^{-E_{pq}/\beta_p} \quad \forall p, q \tag{13}$$
  and
  $$y_p = \frac{\sum_{q=1}^{r} x_q u_{pq}}{\sum_{q=1}^{r} u_{pq}} \quad \forall p \tag{14}$$
- ⇒ Picard iteration.
- PCM is a membership refinement algorithm: membership to clusters is a degree of cluster typicality (centroids are initialized using, e.g., Fuzzy C-Means).
- High outlier rejection capability, as PCM makes the memberships of outliers very low.
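
The Picard iteration of eqs. 13 and 14 can be sketched for 1-D data as follows. This is a simplified illustration, not the authors' code: the scales `betas` are kept fixed, the initial centers are given by hand rather than by Fuzzy C-Means, and the function name `pcm`, the data and the tolerance are assumptions.

```python
import math


def pcm(points, centers, betas, n_iter=100, tol=1e-9):
    """Picard iteration for Possibilistic C-Means on 1-D data.

    Alternates eq. 13 (memberships) and eq. 14 (centers) until the centers
    move by less than tol. betas[p] is kept fixed here for simplicity.
    """
    centers = list(centers)
    U = []
    for _ in range(n_iter):
        # eq. 13: u_pq = exp(-E_pq / beta_p), with E_pq the squared distance
        U = [[math.exp(-((x - y) ** 2) / beta) for x in points]
             for y, beta in zip(centers, betas)]
        # eq. 14: each center is the membership-weighted mean of the data
        new_centers = [sum(u * x for u, x in zip(row, points)) / sum(row)
                       for row in U]
        shift = max(abs(new - old) for new, old in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:
            break
    return U, centers


points = [0.1, 0.4, 2.1, 2.6, 10.0]   # 10.0 plays the role of an outlier
U, centers = pcm(points, centers=[0.0, 2.5], betas=[1.0, 1.0])
```

With no sum-to-one constraint, the point at 10.0 ends up with a membership close to zero in both clusters, which is the outlier rejection behavior described above.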

POSSIBILISTIC BICLUSTERING
Possibilistic Clustering
- The PCM approach is equivalent to a set of $s$ independent estimation problems [Nasraoui, 1995]:
  $$(u_{pq}, y_p) = \arg\min_{u_{pq},\, y_p} \left[ \sum_{q=1}^{r} u_{pq} E_{pq} + \beta_p \sum_{q=1}^{r} (u_{pq} \log u_{pq} - u_{pq}) \right] \quad \forall p \tag{15}$$
  that can be solved independently, one at a time, through a Picard iteration.

POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
PBC Formulation
- For each bicluster we assign two membership vectors, one for the rows and one for the columns, denoted respectively $a$ and $b$.
- In a crisp set framework, row $i$ and column $j$ can either belong to the bicluster ($a_i = 1$ and $b_j = 1$) or not ($a_i = 0$ or $b_j = 0$). An element $x_{ij}$ of $X$ belongs to the bicluster if both $a_i = 1$ and $b_j = 1$, i.e., its membership $u_{ij}$ to the bicluster is:
  $$u_{ij} = \mathrm{and}(a_i, b_j) \tag{16}$$
- The cardinality of the bicluster is then defined as:
  $$n = \sum_{i} \sum_{j} u_{ij} \tag{17}$$
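
A small illustration of the crisp case, with invented membership vectors: the cell memberships are the logical AND of the row and column memberships (eq. 16), and the cardinality (eq. 17) equals the number of selected rows times the number of selected columns.

```python
a = [1, 1, 0, 1]   # row (gene) memberships, invented for the example
b = [1, 0, 1]      # column (condition) memberships

# eq. 16: a cell belongs to the bicluster only if both its row and its column do
U = [[a_i & b_j for b_j in b] for a_i in a]

# eq. 17: cardinality = number of cells in the bicluster
n = sum(sum(row) for row in U)
print(n)   # 3 selected rows x 2 selected columns
```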

POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
PBC Formulation
- Fuzzy set theory framework: we allow the memberships $u_{ij}$, $a_i$ and $b_j$ to take values in the interval $[0, 1]$.
- The membership $u_{ij}$ of an element $x_{ij}$ of $X$ to the bicluster can be obtained by aggregating the row and column memberships, using, e.g., a fuzzy t-norm like the product:
  $$u_{ij} = a_i b_j \tag{18}$$
  or an averaging operator:
  $$u_{ij} = \frac{a_i + b_j}{2} \tag{19}$$
- The fuzzy cardinality of the bicluster is defined as the sum of the memberships $u_{ij}$ for all $i$ and $j$, as in eq. 17.
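
The two aggregators can be compared on invented membership vectors. The helper name `fuzzy_cardinality` is an assumption for this sketch.

```python
a = [0.9, 0.8, 0.1]   # row memberships, invented for the example
b = [1.0, 0.2]        # column memberships

U_product = [[a_i * b_j for b_j in b] for a_i in a]          # eq. 18
U_average = [[(a_i + b_j) / 2 for b_j in b] for a_i in a]    # eq. 19


def fuzzy_cardinality(U):
    """Sum of all cell memberships (eq. 17 applied to fuzzy values)."""
    return sum(sum(row) for row in U)


n_product = fuzzy_cardinality(U_product)
n_average = fuzzy_cardinality(U_average)
```

With crisp inputs the product reduces to the AND of eq. 16, whereas the average does not: $a_i = 1$, $b_j = 0$ gives $u_{ij} = 0.5$. The average therefore keeps a cell partly in the bicluster as long as either its row or its column has a high membership.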

POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
PBC Formulation
Generalization of the homogeneity measures (eqs. 3 to 7):
- Fuzzy normalized square residual:
  $$d_{ij}^2 = \frac{(x_{ij} + x_{IJ} - x_{iJ} - x_{Ij})^2}{n} \tag{20}$$
- where the fuzzy bicluster mean, the fuzzy bicluster row mean and the fuzzy bicluster column mean are defined as:
  $$x_{IJ} = \frac{\sum_i \sum_j u_{ij} x_{ij}}{\sum_i \sum_j u_{ij}}, \quad x_{iJ} = \frac{\sum_j u_{ij} x_{ij}}{\sum_j u_{ij}}, \quad x_{Ij} = \frac{\sum_i u_{ij} x_{ij}}{\sum_i u_{ij}} \tag{21}$$
- Fuzzy mean square residual:
  $$G = \sum_i \sum_j u_{ij} d_{ij}^2 \tag{22}$$
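
A plain-Python sketch of eqs. 20 to 22 follows. It assumes the average aggregator of eq. 19 (the product of eq. 18 could be substituted in the first line of the function) and strictly positive cell memberships, so that no weighted mean divides by zero. The function name `fuzzy_residual` and the data are illustrative.

```python
def fuzzy_residual(X, a, b):
    """Return (d2, G, n) for row memberships a and column memberships b,
    using the average aggregator u_ij = (a_i + b_j) / 2 (eq. 19)."""
    rows, cols = range(len(a)), range(len(b))
    U = [[(a[i] + b[j]) / 2 for j in cols] for i in rows]
    n = sum(map(sum, U))                                            # fuzzy cardinality
    x_IJ = sum(U[i][j] * X[i][j] for i in rows for j in cols) / n   # eq. 21, bicluster mean
    x_iJ = [sum(U[i][j] * X[i][j] for j in cols) / sum(U[i])
            for i in rows]                                          # eq. 21, row means
    x_Ij = [sum(U[i][j] * X[i][j] for i in rows) / sum(U[i][j] for i in rows)
            for j in cols]                                          # eq. 21, column means
    d2 = [[(X[i][j] + x_IJ - x_iJ[i] - x_Ij[j]) ** 2 / n for j in cols]
          for i in rows]                                            # eq. 20
    G = sum(U[i][j] * d2[i][j] for i in rows for j in cols)         # eq. 22
    return d2, G, n


X = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 4.0],
     [9.0, 0.5, 7.0],   # this row breaks the additive pattern of the others
     [4.0, 5.0, 6.0]]

_, G_full, n_full = fuzzy_residual(X, a=[1.0, 1.0, 1.0, 1.0], b=[1.0, 1.0, 1.0])
_, G_soft, n_soft = fuzzy_residual(X, a=[1.0, 1.0, 0.1, 1.0], b=[1.0, 1.0, 1.0])
```

With all memberships equal to one the function reduces to the crisp definitions. Lowering the membership of the deviating row lowers both the fuzzy cardinality and the fuzzy residual in this example.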

POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
PBC Formulation
- Possibilistic Biclustering Problem: maximizing the bicluster cardinality $n$ and minimizing the fuzzy residual $G$ under the fuzzy possibilistic paradigm.
- To this aim we make the following assumptions:
  - we treat one bicluster at a time;
  - the fuzzy memberships $a_i$ and $b_j$ are interpreted as typicality degrees of gene $i$ and condition $j$ with respect to the bicluster;
  - we compute the membership $u_{ij}$ using the average aggregator (eq. 19).

POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
PBC Formulation
- All those requirements are fulfilled by minimizing the following functional $J_B$ with respect to $a$ and $b$:
  $$J_B = \sum_{i,j} \left( \frac{a_i + b_j}{2} \right) d_{ij}^2 + \lambda \sum_{i} (a_i \ln a_i - a_i) + \mu \sum_{j} (b_j \ln b_j - b_j) \tag{23}$$
- The first term is the fuzzy mean square residual $G$, while the other two are penalization terms.
- The parameters λ and µ control the size of the bicluster. Their values can be estimated by simple statistics over the training set, and then hand-tuned to incorporate possible a priori knowledge and to obtain the expected results.

POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
PBC Formulation
- Setting the derivatives of $J_B$ with respect to the memberships $a_i$ and $b_j$ to zero, we obtain:
  $$a_i = \exp\left( -\frac{\sum_j d_{ij}^2}{2\lambda} \right) \tag{24}$$
  $$b_j = \exp\left( -\frac{\sum_i d_{ij}^2}{2\mu} \right) \tag{25}$$
- These necessary conditions for the minimization of $J_B$, together with the definition of the fuzzy normalized square residual $d_{ij}^2$ (eq. 20), can be used to find a numerical solution of the optimization problem (Picard iteration).
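
The step from the functional $J_B$ to the update for $a_i$ can be spelled out as follows. The derivation treats $d_{ij}^2$ as fixed during the differentiation, an assumption in keeping with the Picard iteration, where $d_{ij}^2$ is recomputed separately at each step. The case of $b_j$ is symmetric, with µ in place of λ and the sum taken over $i$.

```latex
\begin{align*}
\frac{\partial J_B}{\partial a_i}
  &= \sum_j \frac{d_{ij}^2}{2} + \lambda \left( \ln a_i + 1 - 1 \right)
   = \frac{1}{2} \sum_j d_{ij}^2 + \lambda \ln a_i = 0 \\
\Rightarrow \quad \ln a_i &= -\frac{\sum_j d_{ij}^2}{2\lambda}
\quad \Rightarrow \quad
a_i = \exp\left( -\frac{\sum_j d_{ij}^2}{2\lambda} \right)
\end{align*}
```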

POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
PBC Formulation
Table: Possibilistic Biclustering (PBC) algorithm.
1 Initialize the memberships $a$ and $b$ and the threshold ε
2 Compute $d_{ij}^2 \;\forall i, j$ (eq. 20)
3 Update $a_i \;\forall i$ (eq. 24)
4 Update $b_j \;\forall j$ (eq. 25)
5 If $\|a' - a\| < \varepsilon$ and $\|b' - b\| < \varepsilon$ then stop ($a'$, $b'$ denote the updated memberships)
6 Else jump to step 2
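
The six steps can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fuzzy_d2` and `pbc`, the Euclidean norm in the stopping test, the `max_iter` safeguard, the toy data and the values of `lam` and `mu` are all choices made for the example.

```python
import math


def fuzzy_d2(X, a, b):
    """Fuzzy normalized square residuals d_ij^2 (eq. 20), average aggregator (eq. 19)."""
    rows, cols = range(len(a)), range(len(b))
    U = [[(a[i] + b[j]) / 2 for j in cols] for i in rows]
    n = sum(map(sum, U))
    x_IJ = sum(U[i][j] * X[i][j] for i in rows for j in cols) / n
    x_iJ = [sum(U[i][j] * X[i][j] for j in cols) / sum(U[i]) for i in rows]
    x_Ij = [sum(U[i][j] * X[i][j] for i in rows) / sum(U[i][j] for i in rows)
            for j in cols]
    return [[(X[i][j] + x_IJ - x_iJ[i] - x_Ij[j]) ** 2 / n for j in cols]
            for i in rows]


def pbc(X, a, b, lam, mu, eps=1e-2, max_iter=1000):
    """Picard iteration over steps 2-6; returns the final memberships (a, b)."""
    for _ in range(max_iter):
        d2 = fuzzy_d2(X, a, b)                                       # step 2
        new_a = [math.exp(-sum(row) / (2 * lam)) for row in d2]      # step 3, eq. 24
        new_b = [math.exp(-sum(d2[i][j] for i in range(len(a))) / (2 * mu))
                 for j in range(len(b))]                             # step 4, eq. 25
        da = math.sqrt(sum((p - q) ** 2 for p, q in zip(new_a, a)))
        db = math.sqrt(sum((p - q) ** 2 for p, q in zip(new_b, b)))
        a, b = new_a, new_b
        if da < eps and db < eps:                                    # step 5
            break
    return a, b


X = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 4.0],
     [9.0, 0.5, 7.0],   # this row breaks the additive pattern of the others
     [4.0, 5.0, 6.0]]
a, b = pbc(X, a=[1.0] * 4, b=[1.0] * 3, lam=0.5, mu=10.0)   # step 1: start from all ones
```

On this toy matrix the membership of the deviating third row drops well below that of the three additive rows. Note that `mu` is chosen much larger than `lam` here, because each column sum in eq. 25 accumulates the residuals of all rows.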

POSSIBILISTIC BICLUSTERING ALGORITHM (PBC)
PBC Formulation
- The memberships can be initialized:
  - randomly;
  - using some a priori information about relevant genes and conditions;
  - using the results already obtained from another biclustering algorithm (in this case PBC works as a refinement algorithm).
- ε controls the convergence of the algorithm.
- After convergence, the memberships $a$ and $b$ can be defuzzified by applying an α-cut, i.e., by comparing them with a threshold.
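
The defuzzification step can be illustrated as follows. The membership values are invented, the function name `alpha_cut` is hypothetical, and the use of a strict inequality is an assumption, since the slide does not say whether values equal to the threshold are kept.

```python
def alpha_cut(memberships, alpha=0.5):
    """Return the 0-based indices whose membership exceeds alpha (strict cut assumed)."""
    return [k for k, value in enumerate(memberships) if value > alpha]


a = [0.87, 0.91, 0.04, 0.88]   # illustrative row memberships after convergence
b = [0.93, 0.92, 1.00]         # illustrative column memberships

genes = alpha_cut(a)           # rows kept in the crisp bicluster
conditions = alpha_cut(b)      # columns kept in the crisp bicluster
print(genes, conditions)
```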

RESULTS
Yeast data set [Tavazoie et al., 1999; Ball et al., 2000; Aach et al., 2000]
- 2879 genes and 17 conditions.
- α-cut = 0.5 for the defuzzification of $a$ and $b$.
- ε = $10^{-2}$.
[Figure: size of biclusters vs. λ and µ (results averaged over 20 runs); axes: lambda, mu.]

RESULTS
Yeast data set
- PBC is slightly sensitive to the initialization of the memberships, while it is strongly sensitive to the parameters λ and µ.
- PBC can find biclusters of a desired size just by tuning the parameters λ and µ (results averaged over 20 runs).

λ      µ     n_g    n_c   n       G
0.25   115   448    10    4480    56.07
0.19   200   457    16    7312    67.80
0.30   100   654    8     5232    82.20
0.32   100   840    9     7560    111.63
0.31   120   989    13    12857   146.89
0.34   120   1177   13    15301   181.57
0.37   110   1309   13    17017   207.20
0.42   100   1500   13    19500   245.50
0.45   95    1622   12    19464   260.25
0.46   95    1681   13    21853   285.00
0.47   95    1737   13    22581   297.40
0.48   95    1797   13    23361   310.72
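
As a quick arithmetic check of the table, the reported sizes can be compared with the definition $n = n_g \cdot n_c$ (eq. 2). The rows below are copied from the slide; the variable names are illustrative.

```python
# (lambda, mu, n_g, n_c, n, G) rows copied from the Yeast results table
rows = [
    (0.25, 115, 448, 10, 4480, 56.07),
    (0.19, 200, 457, 16, 7312, 67.80),
    (0.30, 100, 654, 8, 5232, 82.20),
    (0.32, 100, 840, 9, 7560, 111.63),
    (0.31, 120, 989, 13, 12857, 146.89),
    (0.34, 120, 1177, 13, 15301, 181.57),
    (0.37, 110, 1309, 13, 17017, 207.20),
    (0.42, 100, 1500, 13, 19500, 245.50),
    (0.45, 95, 1622, 12, 19464, 260.25),
    (0.46, 95, 1681, 13, 21853, 285.00),
    (0.47, 95, 1737, 13, 22581, 297.40),
    (0.48, 95, 1797, 13, 23361, 310.72),
]

# Every reported size equals the product of its gene and condition counts.
sizes_consistent = all(n == n_g * n_c for _, _, n_g, n_c, n, _ in rows)

# In the order listed, both the number of genes and the residual G grow.
gene_counts = [n_g for _, _, n_g, _, _, _ in rows]
residuals = [G for *_, G in rows]
genes_increasing = gene_counts == sorted(gene_counts)
residuals_increasing = residuals == sorted(residuals)
print(sizes_consistent, genes_increasing, residuals_increasing)
```

In the order listed, the number of genes and the residual $G$ increase together, which matches the trade-off between bicluster size and homogeneity that λ and µ are said to control.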

RESULTS
Yeast data set
Comparative study on Yeast data

Method                                      avg. G   avg. n   avg. n_g   avg. n_c   Largest n
DBF [Zhang et al., 2004]                    115      1627     188        11         4000
FLOC [Yang et al., 2003]                    188      1826     195        12.8       2000
Cheng-Church [2000]                         204      1577     167        12         4485
Single-objective GA [Mitra & Banka, 2006]   52.9     571      191        5.13       1408
Multi-objective GA [Mitra & Banka, 2006]    235      10302    1095       9.29       14828
Possibilistic Biclustering (PBC)            297      22571    1736       13         22607

RESULTS
Yeast data set
[Figure: plot of a small and a large bicluster; each panel shows Expression Values vs. Conditions.]

CONCLUSIONS
- The Possibilistic Biclustering (PBC) algorithm extends the possibilistic clustering paradigm to the solution of the biclustering problem.
- The membership $u_{ij}$ of an element $x_{ij}$ of $X$ to the bicluster is obtained by aggregating the memberships (typicalities) of its row (gene) and column (condition) with respect to the bicluster.
- The quality (residual $G$) of the large biclusters obtained is better than that of other biclustering methods (on the Yeast data, avg. $G$ = 297 at avg. $n$ = 22571).
- Further studies:
  - biological validation of the obtained results;
  - automatic selection of the parameters λ and µ;
  - other aggregators for obtaining $u_{ij}$.
