Mean Square Residue Biclustering with Missing Data and Row Inversions

Stefan Gremalschi (a), Gulsah Altun (b), Irina Astrovskaya (a), and Alexander Zelikovsky (a)

(a) Department of Computer Science, Georgia State University, Atlanta, GA 30303, {stefan,iraa,alexz}@cs.gsu.edu
(b) Department of Reproductive Medicine, University of California, San Diego, CA 92093, galtun@ucsd.edu

(Partially supported by a GSU Molecular Basis of Disease Fellowship.)

Abstract. Cheng and Church proposed a greedy deletion-addition algorithm that finds a given number k of biclusters whose mean squared residues (MSRs) are below certain thresholds, after replacing the missing values of the matrix with random numbers. In a previous paper we introduced a dual biclustering method with quadratic optimization; here we extend it to missing data and row inversions, adding three new features. First, we introduce a status for each row of a bicluster and both add and delete rows based on their status in order to minimize the MSR. Cheng and Church invert rows only while adding them to a bicluster; we select the row or the negated row not only at addition but also at deletion, and show that this yields an improvement. Second, we give a proof of the theorem stated by Cheng and Church in [4]. Third, since missing values often occur in the matrices given for biclustering and are usually filled with random numbers, we show that ignoring the missing data is a better approach that avoids the additional noise caused by randomness. Since an ideal bicluster is one whose H value is zero, our results show a significant decrease in the H values of the biclusters, with less noise, compared to the original dual biclustering method and to Cheng and Church's method.

Keywords: biclustering, mean square residue

1 Introduction

Gene expression data are given as matrices in which rows represent genes and columns represent experimental conditions. Each cell of the matrix holds the expression level of a gene under a specific experimental condition. It is well known that genes can be relevant for only a subset of the conditions; conversely, groups of conditions can be clustered using different groups of genes. It is therefore important to cluster in both dimensions simultaneously. This observation led Cheng and Church [4] to introduce biclusters: subsets of genes and subsets of conditions with a high similarity score.

Biclustering algorithms perform simultaneous row-column clustering; their goal is to find homogeneous submatrices. Biclustering has been widely used to find appropriate subsets of experimental conditions in microarray data [1, 5, 15, 7, 9, 11-13, 18, 19]. Cheng and Church's algorithm is based on a natural uniformity model, the mean squared residue. They proposed a greedy deletion-addition algorithm that finds a given number k of biclusters whose MSRs are below certain thresholds. In their method, however, the missing values of the matrix are replaced with random numbers. These random numbers can interfere with the discovery of subsequent biclusters, especially ones that overlap biclusters already found. Yang et al. [16, 15] refer to this as random interference. They generalize the bicluster model to incorporate missing values and propose FLOC (FLexible Overlapped biClustering), a probabilistic move-based algorithm that generalizes the mean squared residue and is built on the concepts of action and gain. However, the FLOC model is still not suitable for non-disjoint clusters, and it requires more user parameters, including the number of biclusters; these additional parameters can negatively affect the clustering process.

In this paper, we propose a related method of handling missing data. We first mathematically characterize ideal biclusters, i.e., biclusters with zero mean squared residue, and show that our way of handling missing data is significantly more tolerant to noise. We also introduce a status for each row: status -1 means that the row is inverted (negated), and status +1 means that the row is kept as it is. We consider the problem of finding the minimum MSR over all possible row inversions. A limited use of row inversion (without row statuses) appears in [4], where rows may be inverted when they are added to a bicluster. Based on our findings in [14], we developed a new dual biclustering algorithm and a quadratic program that treat missing data accordingly and use the best status assignment; matrix entries with missing data are simply not taken into account when computing averages. Comparing our method with Cheng and Church [4], we show that it is better to ignore missing data when adjusting the mean squared residue for finding optimal biclusters. We use a set of methods comprising a dual biclustering algorithm, a quadratic program (QP), and a combination of dual biclustering with QP, which finds a (k, l)-bicluster with minimum MSR using the greedy approach proposed in [14]. Finally, we apply the best row status assignments and obtain an even better average and median MSR over the set of all biclusters.

The remainder of this paper is organized as follows. Section 2 gives the formal definition of the mean squared residue. In Section 3, we give a new definition for adjusting the MSR and prove a necessary and sufficient criterion for a matrix to have perfect correlation. Section 4 defines the inversion-based MSR and shows how to compute it. In Section 5, we introduce the dual problem formulation described in [14] and compare the new adjusted MSR with Cheng and Church's method. The search for biclusters using the new MSR is given in Section 6. The analysis and validation of the experimental study are given in Section 7. Finally, we draw conclusions in Section 8.

2 Mean Squared Residue

The mean squared residue problem was defined by Cheng and Church [4] and by Zhou and Khokhar [13]; we use the same terminology, briefly recalled here as in [14]. The input is an $N \times M$ data matrix $A$ with row set $R = \{r_1, r_2, \ldots, r_N\}$ and column set $C = \{c_1, c_2, \ldots, c_M\}$, where cell $a_{ij}$ is a real value representing the expression level of gene $i$ (row $i$) under condition $j$ (column $j$). Given such a matrix, biclustering finds submatrices, i.e., subgroups of rows (genes) and subgroups of columns (conditions), in which the genes exhibit highly correlated behavior across every condition. The goal is to find a set of biclusters such that each bicluster exhibits some similar characteristic.

Let $A_{IJ} = (I, J)$ denote the submatrix of $A$ restricted to a row subset $I \subseteq R$ and a column subset $J \subseteq C$; it contains exactly the elements $a_{ij}$ with $i \in I$ and $j \in J$. A bicluster $A_{IJ} = (I, J)$ is thus a $k \times l$ submatrix of the data matrix, where $k$ and $l$ are its numbers of rows and columns. The concept of a bicluster was introduced in [4] to capture correlated subsets of genes and subsets of conditions. Let $a_{iJ}$ denote the mean of the $i$-th row of the bicluster $(I, J)$, $a_{Ij}$ the mean of its $j$-th column, and $a_{IJ}$ the mean of all its elements. Following [4],

$$a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}, \quad i \in I, \qquad (1)$$

$$a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}, \quad j \in J, \qquad (2)$$

$$a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} a_{ij}. \qquad (3)$$

According to [4], the residue of an element $a_{ij}$ in a submatrix $A_{IJ}$ equals

$$r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}. \qquad (4)$$

The residue of an element is the difference between its actual value and the value predicted from its row, column, and bicluster means; it reveals the element's degree of coherence with the other entries of the bicluster it belongs to. The quality of a bicluster is evaluated by the mean squared residue $H$, i.e., the mean of all the squared residues of its elements [4]:

$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2. \qquad (5)$$
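To make definitions (1)-(5) concrete, the following short sketch computes the mean squared residue of a submatrix. It is our own illustration of the formulas above, not the authors' code, and the function name is ours.

```python
import numpy as np

def msr(A, rows, cols):
    """Mean squared residue H(I, J) of the submatrix of A given by the
    row index set `rows` and column index set `cols` (equations (1)-(5))."""
    sub = A[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)       # a_iJ
    col_means = sub.mean(axis=0, keepdims=True)       # a_Ij
    overall = sub.mean()                              # a_IJ
    residues = sub - row_means - col_means + overall  # r_ij, eq. (4)
    return (residues ** 2).mean()                     # H(I, J), eq. (5)

# Example: an additive bicluster (a_ij = x_i + y_j) has H = 0,
# anticipating the theorem of Section 3.
x = np.array([1.0, 2.0, 5.0])
y = np.array([0.0, 3.0, 4.0, 7.0])
A = x[:, None] + y[None, :]
print(msr(A, range(3), range(4)))  # ~0.0 up to floating-point error
```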

A submatrix $A_{IJ}$ is called a $\delta$-bicluster if $H(I, J) \le \delta$ for some given threshold $\delta \ge 0$. In general, the biclustering problem is bilateral: maximize the size (area) of the biclusters while minimizing the MSR. These two objectives contradict each other, because smaller biclusters have smaller MSR and vice versa, so there are two natural optimization formulations. Cheng and Church considered the following one: maximize the bicluster size (area) subject to an upper bound on the MSR.

3 Adjusting MSR for Missing Data

Missing data often occur in biological data. The common practice is to fill the gaps with random numbers; however, this adds noise and may result in biclusters of lower quality. The alternative approach is to ignore the missing data, keeping only the originally available information.

Let $A$ be a bicluster $(I, J)$. Denote by $J_i \subseteq J$ the bicluster's columns without missing data in the $i$-th row, and by $I_j \subseteq I$ its rows without missing data in the $j$-th column. The mean of the $i$-th row of the bicluster, the mean of the $j$-th column, and the mean of all the elements of the bicluster are then reformulated as follows:

$$a_{iJ} = \frac{1}{|J_i|} \sum_{j \in J_i} a_{ij}, \quad i \in I, \qquad (6)$$

$$a_{Ij} = \frac{1}{|I_j|} \sum_{i \in I_j} a_{ij}, \quad j \in J, \qquad (7)$$

$$a_{IJ} = \frac{1}{\sum_{j \in J} |I_j|} \sum_{j \in J} \sum_{i \in I_j} a_{ij}. \qquad (8)$$

To compare this approach with Cheng and Church's handling of missing data, we use biclusters with zero H value. A bicluster with $H = 0$ is called an ideal bicluster.

Theorem. Let the $n \times m$ matrix $A$ be a bicluster $(I, J)$. Then $A$ has zero H value if and only if $A$ can be represented as the sum of an $n$-vector $X$ and an $m$-vector $Y$, i.e., $a_{ij} = x_i + y_j$ for all $i \in I$, $j \in J$.

Proof. First, assume that $A$ is an $n \times m$ bicluster $(I, J)$ with zero H value. Zero H value means zero residues $r_{ij}$ for all $i \in I$, $j \in J$, so each element of $A$ can be written as $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$. Setting $X = \{x_i = a_{iJ} - \frac{a_{IJ}}{2}\}_{i \in I}$ and $Y = \{y_j = a_{Ij} - \frac{a_{IJ}}{2}\}_{j \in J}$ gives $A = X + Y$, where the vector addition is defined entrywise as $a_{ij} = x_i + y_j$.

In the other direction, assume that the bicluster $A$ can be represented as the sum of an $n$-vector $X$ and an $m$-vector $Y$; we show that $A$ has zero H value. Since $a_{ij} = x_i + y_j$ for all $i \in I$, $j \in J$, the mean of the $i$-th row is $a_{iJ} = \frac{m x_i + \sum_{j \in J} y_j}{m}$, the mean of the $j$-th column is $a_{Ij} = \frac{\sum_{i \in I} x_i + n y_j}{n}$, and the mean of all elements of the bicluster is $a_{IJ} = \frac{m \sum_{i \in I} x_i + n \sum_{j \in J} y_j}{nm}$. The residues are then indeed all zero:

$$r_{ij} = x_i + y_j - x_i - \frac{\sum_{j \in J} y_j}{m} - \frac{\sum_{i \in I} x_i}{n} - y_j + \frac{\sum_{i \in I} x_i}{n} + \frac{\sum_{j \in J} y_j}{m} = 0.$$

Thus, the bicluster $A$ has zero H value. Q.E.D.
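A minimal sketch of the adjusted MSR of equations (6)-(8), again our own illustration rather than the authors' implementation: missing entries are encoded as NaN and simply excluded from every mean.

```python
import numpy as np

def adjusted_msr(sub):
    """Adjusted mean squared residue (equations (6)-(8)). `sub` is a
    bicluster as a 2-D array with missing entries stored as np.nan;
    missing cells contribute to no mean and to no residue."""
    row_means = np.nanmean(sub, axis=1, keepdims=True)  # a_iJ over J_i
    col_means = np.nanmean(sub, axis=0, keepdims=True)  # a_Ij over I_j
    overall = np.nanmean(sub)                           # a_IJ over present cells
    residues = sub - row_means - col_means + overall    # NaN where data missing
    return np.nanmean(residues ** 2)                    # average over present cells

# Missing values are ignored rather than replaced by random numbers:
B = np.array([[1.0, 4.0, 5.0],
              [2.0, np.nan, 6.0],
              [5.0, 8.0, 9.0]])
print(adjusted_msr(B))
```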

Note. The theorem also covers biclusters that are the product of two vectors: applying a logarithm to such a bicluster produces one that is represented as a sum.

4 MSR with Row Inversions

In the original definition of biclusters, it is possible to invert (negate) certain rows. A row inversion corresponds to negative rather than the usual positive correlation of the inverted row with the other rows of the bicluster, and it may significantly reduce the bicluster's MSR. In contrast to handling inversions algorithmically when adding rows (see [4]), we suggest embedding row inversion in the MSR definition itself. We associate with each row a status, equal to -1 if the row is inverted and +1 otherwise.

Definition. The mean squared residue with row inversions is the minimum MSR over all possible row statuses.

Finding the optimal row status assignment is not a trivial problem. Since the MSR of a matrix does not change when a positive linear transformation is applied, we can show that there is a single global minimum of the MSR among all possible status assignments. A greedy iterative method that flips the status of a row whenever doing so decreases the MSR of the entire matrix will find this minimum. Unfortunately, this greedy method is too slow to apply even once, whereas ideally it would be applied after each node deletion. We therefore suggest the following simple heuristic: iterate over the rows and, for each row, determine which total row squared residue is lower, that of the original row or that of the row with all values inverted (negated), and set the row status to the better choice. In our experiments, this heuristic always found the optimal inversion status assignment.
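A sketch of this heuristic under our own naming, using the plain MSR of equation (5); the paper describes the procedure only in words, so details such as the number of passes and the comparison against the current statuses are our assumptions.

```python
import numpy as np

def row_sq_residue(M, i):
    """Total squared residue of row i within matrix M (from eq. (5))."""
    r = (M - M.mean(axis=1, keepdims=True)
           - M.mean(axis=0, keepdims=True) + M.mean())
    return float((r[i] ** 2).sum())

def choose_row_statuses(sub, n_passes=2):
    """Section 4 heuristic: for each row, keep the original or the negated
    version, whichever has the lower total row squared residue.
    The pass count is our own choice, not specified in the paper."""
    status = np.ones(sub.shape[0])
    for _ in range(n_passes):
        for i in range(sub.shape[0]):
            cur = sub * status[:, None]   # matrix under current statuses
            flipped = cur.copy()
            flipped[i] = -cur[i]          # invert row i only
            if row_sq_residue(flipped, i) < row_sq_residue(cur, i):
                status[i] = -status[i]
    return status
```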

5 Dual Biclustering

In this section, we give a brief overview of the dual biclustering problem and the algorithm we described in [14]. We formulate the dual biclustering problem as follows: given an expression matrix A, find a k x l bicluster with the smallest mean squared residue H. For a set of biclusters, we have:

Given: a matrix A of size n x m, a set S of bicluster sizes, and a total overlapping budget V.
Find: |S| biclusters with total overlapping at most V and minimum total score H.

The algorithm implements the new computation of the MSR, which ignores missing data and uses only the values actually present. The greedy algorithm for finding one bicluster starts with the entire matrix and at each step tries all single-row (single-column) additions (deletions), applies the best operation if it improves the score, and terminates when it reaches the bicluster size k x l. The output bicluster has the smallest MSR found for the given size. As in [4], the algorithm uses the structure of the mean squared residue score to enable faster greedy steps: for a given threshold α, each deletion iteration removes all rows (columns) for which d(i) > α H(I, J). The algorithm also supports the addition of inverted rows to the matrix, allowing the identification of biclusters that contain both co-regulation and inverse co-regulation. The single node deletion and addition algorithms are shown in Figure 1 and Figure 2, respectively; RS_IJ(i, j) denotes the squared residue of element (i, j).

Input: expression matrix A on n genes and m conditions, and a bicluster size (k, l).
Output: bicluster A_{I,J} with the smallest adjusted MSR.
Initialize: I = {1..n}, J = {1..m}, w(i, j) = 0 for all i <= n, j <= m.
Iteration:
1. Compute a_{iJ}, a_{Ij} and H(I, J) based on the adjusted MSR. If |I| = k and |J| = l, output (I, J).
2. For each row compute d(i) = (1/|J_i|) Σ_{j in J_i} RS_IJ(i, j).
3. For each column compute e(j) = (1/|I_j|) Σ_{i in I_j} RS_IJ(i, j).
4. Take the best-scoring row or column and remove it from I or J.

Fig. 1. Single node deletion algorithm.

Input: expression matrix A and a bicluster size (k, l).
Output: bicluster A_{I',J'} with I ⊆ I' and J ⊆ J'.
Iteration:
1. Compute a_{iJ}, a_{Ij} and H(I, J) based on the adjusted MSR.
2. Add the columns j with (1/|I_j|) Σ_{i in I_j} RS_IJ(i, j) <= H(I, J).
3. Recompute a_{iJ}, a_{Ij} and H(I, J) based on the adjusted MSR.
4. Add the rows i with (1/|J_i|) Σ_{j in J_i} RS_IJ(i, j) <= H(I, J).
5. If nothing was added, or |I| = k and |J| = l, halt.

Fig. 2. Single node addition algorithm.

This algorithm is used as a subroutine and applied to the matrix repeatedly. We use bicluster overlapping control (BOC) to avoid finding the same bicluster over and over again: a penalty is applied for using cells present in previously found biclusters. BOC preserves the information carried by the original data, because we do not mask found biclusters with random numbers. The general biclustering scheme is outlined in Figure 3, where w_ij is an element of the weight matrix W, A' is the matrix resulting from node deletion on the original matrix A, and A'' is the matrix resulting from node addition on A'. We use the measure of bicluster overlapping V introduced in [14], which is the complement of the ratio between the number of distinct cells used over all found biclusters and the total area of all biclusters.
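A compact sketch of one deletion step of Figure 1, with NaN-aware means standing in for the adjusted MSR; the names and the single-node structure are ours, and the multiple-deletion variant with threshold α is omitted.

```python
import numpy as np

def deletion_step(sub):
    """One step of Figure 1: remove the row or column with the largest
    mean squared residue contribution (d(i) or e(j)); NaN entries are
    ignored, as in the adjusted MSR of Section 3."""
    res = (sub - np.nanmean(sub, axis=1, keepdims=True)
               - np.nanmean(sub, axis=0, keepdims=True) + np.nanmean(sub))
    sq = res ** 2
    d = np.nanmean(sq, axis=1)   # d(i): mean squared residue of row i
    e = np.nanmean(sq, axis=0)   # e(j): mean squared residue of column j
    if d.max() >= e.max():       # drop whichever single node helps most
        return np.delete(sub, int(d.argmax()), axis=0)
    return np.delete(sub, int(e.argmax()), axis=1)
```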

Input: expression matrix A, parameter α, and a set S of bicluster sizes.
Output: |S| biclusters in matrix A.
Iteration:
1. w(i, j) = 0 for all i <= n, j <= m.
2. while S is not empty do
3.   (k, l) = first element of S
4.   S = S - {(k, l)}
5.   apply multiple node deletion on A, giving (k, l)
6.   apply node addition on A', giving (k, l)
7.   store A'' and update W
8. end

Fig. 3. Dual biclustering algorithm.

6 MSR Minimization via Quadratic Program

We defined dual biclustering as an optimization problem [6], [3] in [14], together with a quadratic program (QP) for biclustering. In this paper, we modify the QP of [14] by reformulating its objective and constraints to handle missing data. The dual biclustering formulation as an optimization problem [14] reads: for a given matrix $A_{n \times m}$, find the bicluster of bounded size (area) $k \times l$ with minimal mean squared residue. If the MSR were used directly as the QP objective, the objective would be of cubic form. Since a QP objective may contain only squared variables, the objective must be defined so that only quadratic terms appear; to meet this requirement, we simulate variable multiplication by addition, as described in [14].

6.1 Integer Quadratic Program

For a given normalized matrix $A_{n \times m}$ and bicluster size $k \times l$, the Integer Quadratic Program is defined as follows:

Minimize
$$\frac{1}{|I||J|} \sum_{i \le n,\, j \le m} (\mathrm{residue}_{ij})^2$$

Subject to
$$|I| = k, \qquad |J| = l,$$
$$\mathrm{residue}_{ij} = a_{ij} x_{ij} - a_{iJ} x_{ij} - a_{Ij} x_{ij} + a_{IJ} x_{ij},$$
$$a_{iJ} = \frac{1}{|J|} \sum_{j \le m} a_{ij}, \qquad a_{Ij} = \frac{1}{|I|} \sum_{i \le n} a_{ij}, \qquad a_{IJ} = \frac{1}{|I||J|} \sum_{i \le n,\, j \le m} a_{ij},$$
$$x_{ij} \ge \mathrm{row}_i + \mathrm{column}_j - 1,$$
$$x_{ij} \le \mathrm{row}_i, \quad i \le n,$$
$$x_{ij} \le \mathrm{column}_j, \quad j \le m,$$
$$\sum_{i \le n} \mathrm{row}_i = k, \qquad \sum_{j \le m} \mathrm{column}_j = l,$$
$$x_{ij}, \mathrm{row}_i, \mathrm{column}_j \in \{0, 1\}.$$
End
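The three linear constraints on $x_{ij}$ implement the product $\mathrm{row}_i \cdot \mathrm{column}_j$ by addition. A tiny check of this standard AND-linearization (our illustration, not code from the paper):

```python
from itertools import product

# For binary r, c: the constraints x >= r + c - 1, x <= r, x <= c
# force x = r * c, which is how the QP avoids cubic terms.
for r, c in product((0, 1), repeat=2):
    feasible_x = [x for x in (0, 1) if x >= r + c - 1 and x <= r and x <= c]
    assert feasible_x == [r * c]
print("x_ij is forced to equal row_i * column_j")
```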

The QP is used as a subroutine and applied to the matrix repeatedly, with a separate QP generated for each bicluster size. To avoid finding the same bicluster over and over again, each discovered bicluster is masked by replacing the values of its submatrix with random values. Row inversion is simulated by appending to the input matrix A its inverted rows, so the resulting matrix has twice as many rows. Missing data are handled as follows: if an element of the matrix contains a missing value, it does not participate in the computation of the mean squared residue H. The row mean a_{iJ} is then the sum of all cells of row i that are not marked missing, divided by their number, and similarly for the column mean a_{Ij} and the bicluster average a_{IJ}. Since the integer QP is too slow and not scalable enough, we use the greedy rounding and random interval rounding methods proposed in [14].

6.2 Combining Dual Biclustering with Rounded QP

Input: expression matrix A, parameters α, ratio_k, ratio_l, and a set S of bicluster sizes.
Output: |S| biclusters in matrix A.
1. while S is not empty do
2.   (k, l) = first element of S
3.   S = S - {(k, l)}
4.   k' = k * ratio_k
5.   l' = l * ratio_l
6.   apply multiple node deletion on A, giving (k', l')
7.   apply node addition on A', giving (k', l')
8.   update W
9.   run the QP on A'', giving (k, l)
10.  round the fractional relaxation and store the result
11. end

Fig. 4. Combined adjusted dual biclustering with rounded QP algorithm.

In this section, we combine the adjusted dual biclustering with the modified rounded QP algorithm. The goal is to reduce the instance size in order to speed up the QP. First, we apply the adjusted dual biclustering algorithm to the input matrix A to shrink the instance, the reduced size being specified by the two parameters ratio_k and ratio_l. Then we run the rounded QP on the output of the dual biclustering algorithm.
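A driver for this pipeline might look like the sketch below; `dual_biclustering` and `rounded_qp` are hypothetical stand-ins for the algorithms of Sections 5 and 6, whose details the paper defers to [14].

```python
def combined_biclustering(A, sizes, ratio_k=2.0, ratio_l=2.0):
    """Figure 4, schematically: shrink the instance greedily, then let the
    rounded QP pick the final (k, l) bicluster from the reduced matrix.
    Both helper functions are hypothetical placeholders, not the authors' API."""
    results = []
    for (k, l) in sizes:
        k_prime, l_prime = int(k * ratio_k), int(l * ratio_l)
        reduced = dual_biclustering(A, k_prime, l_prime)  # Section 5 (greedy)
        results.append(rounded_qp(reduced, k, l))         # Section 6 (QP + rounding)
    return results
```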

This combination improves the running time of the QP and increases the quality of the final bicluster, since an optimization method is applied. The general algorithm scheme is outlined in Figure 4, where W is the weight matrix, A' is the matrix resulting from node deletion, and A'' is the matrix resulting from node addition.

7 Experimental Results

In this section, we analyze the results obtained from dual biclustering with the MSR adjusted for missing data. We describe the comparison criteria, define the swap-rule model, and analyze the p-values of the biclusters. We tested our biclustering algorithms on data from [10] and compared our results with Cheng and Church [4]; for a fair comparison, we used the bicluster sizes published in [4]. A systematic comparison and evaluation of biclustering methods for gene expression data is given in [17]; however, that model uses biologically relevant information, whereas our model is more generic and based on a statistical approach, so we have not used their comparison results in this paper.

7.1 Evaluation of the adjusted MSR

To measure the robustness of the proposed MSR to noise and to evaluate the quality of the obtained biclusters, the experiments were run on imputed data. Let A be an (I, J) bicluster with zero H value and real-data variance $\sigma^2$. The corresponding imputed bicluster $A^p$ is defined by

$$a^p_{ij} = a_{ij} + \varepsilon_{ij}, \qquad (9)$$

where $p$ is the percentage of added noise and $\varepsilon_{ij} \sim N(0, \frac{p}{100}\sigma^2)$ for $i \in I$, $j \in J$.

7.2 The goal of our experiments

The goal of our experiments is to find the percentage of noise at which the algorithm is still able to distinguish a bicluster of size k from non-biclusters in the imputed data. Although one could measure this percentage with respect to submatrices of the bicluster, the probability of finding a distinguishable submatrix once the bicluster itself can no longer be distinguished from a non-bicluster tends to zero, because the imputation error is uniformly distributed.

7.3 Experimental results

Figure 5 compares Cheng and Church, dual biclustering, dual biclustering coupled with QP, adjusted dual biclustering, adjusted dual biclustering coupled with QP, and adjusted dual biclustering with row inversion. The average MSR for adjusted dual with QP is 68 percent (average) and 48 percent (median) of the values published in [4]. These results show that ignoring missing data gives the dual algorithm a much smaller MSR. The effect of noise on the MSR computation using synthesized data can be seen in Figure 6.
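The imputation of equation (9) is easy to reproduce; a sketch under our own naming:

```python
import numpy as np

def impute_noise(A, p, rng=None):
    """Equation (9): add Gaussian noise with variance (p/100) * sigma^2,
    where sigma^2 is the variance of the real data in A."""
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = np.nanvar(A)  # ignore missing entries when estimating variance
    eps = rng.normal(0.0, np.sqrt(p / 100.0 * sigma2), size=A.shape)
    return A + eps
```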

Algorithm                          OC parameter  Covering  Average MSR  (%)     Median MSR  (%)
Cheng and Church*                  n/a           39945     204.29       100     196.3095    100
Cheng and Church**                 n/a           39945     228.56       112     204.96      105
Dual                               1.8           40548     205.77       100.72  123.27      62.79
Dual and QP                        1.7           41037     171.5        75.02   104.47      47.91
Adjusted Dual                      1.8           40548     161.23       70.54   104.66      51.1
Adjusted Dual and QP               1.7           41087     154.66       68      95.46       47
Adjusted Dual with inverted rows   1.6           43028     195.9        95      77.96       39.71

Fig. 5. Comparison of biclustering methods.

[Figure 6: line plot of MSR vs. noise level (0-30%) for synthesized data with 0%, 5%, 10% and 15% missing data.]

Fig. 6. MSR computation for synthesized data.

Figure 7 shows the effect of noise on the adjusted MSR computation versus randomly filled missing data; the adjusted MSR is clearly less affected by noise than randomly filled missing data. Figure 8 shows how noise affects the adjusted MSR and randomly filled missing data for different levels of noise.

[Figure 7: line plot of MSR vs. noise level (0-70%) for 10% missing data, comparing the adjusted MSR with randomly filled values.]

Fig. 7. Adjusted MSR vs. randomly filled missing data.

[Figure 8: line plot of MSR vs. noise level (0-70%) for 0% and 10% missing data and 10% randomly filled data.]

Fig. 8. MSR with randomly filled missing data for different levels of noise.

We measure the statistical significance of the biclusters obtained by our algorithms using p-values. The p-value is computed by running the dual problem algorithm on 100 randomly generated input data sets. The random data are obtained from the matrix A as follows: two cells a_{ij} and d_{kl} are selected at random, together with their diagonally opposite elements b_{kj} and c_{il}. If a_{ij} > b_{kj} and c_{il} < d_{kl}, the algorithm swaps a_{ij} with c_{il} and b_{kj} with d_{kl}; this is called a hit. Otherwise, two elements a_{ij} and d_{kl} are randomly chosen again. The matrix is considered randomized after nm/2 hits. In our case, the p-value is smaller than 0.001, which indicates that the results are not random and are statistically significant.
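A sketch of this swap-rule randomization as we read it; the index handling and the stopping count are our interpretation of the description above, and the attempt cap is our own safeguard.

```python
import numpy as np

def swap_randomize(A, rng=None, max_attempts=10**7):
    """Swap rule: pick cells a_ij and d_kl, look at the diagonal cells
    b_kj and c_il, and swap (a_ij, c_il) and (b_kj, d_kl) whenever
    a_ij > b_kj and c_il < d_kl (a 'hit'). Stop after n*m/2 hits."""
    A = A.copy()
    n, m = A.shape
    rng = np.random.default_rng() if rng is None else rng
    hits = attempts = 0
    while hits < n * m // 2 and attempts < max_attempts:
        attempts += 1
        i, k = rng.integers(n, size=2)
        j, l = rng.integers(m, size=2)
        if A[i, j] > A[k, j] and A[i, l] < A[k, l]:
            A[i, j], A[i, l] = A[i, l], A[i, j]  # swap a_ij with c_il
            A[k, j], A[k, l] = A[k, l], A[k, j]  # swap b_kj with d_kl
            hits += 1
    return A
```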

8 Conclusions

Random numbers can interfere with the discovery of subsequent biclusters, especially ones that overlap biclusters already found. In this paper, we introduced a new approach to handling missing data that does not take entries with missing values into account. We characterized ideal biclusters, i.e., biclusters with zero mean squared residue, and showed that this approach is significantly more stable with respect to increasing noise. Several biclustering methods were modified accordingly. Our experimental results show a significant decrease in the H values of the biclusters compared with counterparts that fill in missing data (e.g., the original Cheng and Church method [4]); the average MSR for adjusted dual with QP is 68 percent (average) and 48 percent (median) of the values published in [4]. These results show that ignoring missing data gives the dual algorithm a much smaller MSR. We also defined the MSR based on the best row inversion status and gave an efficient heuristic for finding this assignment; the new definition further reduces the MSR of a found set of biclusters.

References

1. Angiulli F., Pizzuti C., Gene expression biclustering using random walk strategies. In Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2005), Copenhagen, Denmark, 2005.
2. Baldi P., Hatfield G.W., DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, Cambridge University Press, 2002.
3. Bertsimas D., Tsitsiklis J., Introduction to Linear Optimization, Athena Scientific.
4. Cheng Y., Church G.M., Biclustering of expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), AAAI Press, 93-103, 2000.
5. Madeira S.C., Oliveira A.L., Biclustering algorithms for biological data analysis: a survey. IEEE Transactions on Computational Biology and Bioinformatics, 1(1):24-45, 2004.
6. Papadimitriou C.H., Steiglitz K., Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, Inc., Upper Saddle River, NJ, 1982.
7. Prelic A., Bleuler S., Zimmermann P., Wille A., Bühlmann P., Gruissem W., Hennig L., Thiele L., Zitzler E., A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122-1129, 2006.
8. Shamir R., Lecture notes, http://www.cs.tau.ac.il/~rshamir/ge/05/scribes/lec04.pdf.
9. Tanay A., Sharan R., Shamir R., Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18:136-144, 2002.
10. Tavazoie S., Hughes J.D., Campbell M.J., Cho R.J., Church G.M., Systematic determination of genetic network architecture. Nature Genetics, 22:281-285, 1999.
11. Yang J., Wang H., Wang W., Yu P., Enhanced biclustering on gene expression data. In Proceedings of the 3rd IEEE Conference on Bioinformatics and Bioengineering (BIBE), 321-327, 2003.
12. Zhang Y., Zha H., Chu C.H., A time-series biclustering algorithm for revealing co-regulated genes. In Proc. Int. Symp. Information Technology: Coding and Computing (ITCC 2005), 32-37, Las Vegas, USA, 2005.
13. Zhou J., Khokhar A.A., ParRescue: scalable parallel algorithm and implementation for biclustering over large distributed datasets. In 26th IEEE International Conference on Distributed Computing Systems (ICDCS 2006), 2006.
14. Gremalschi S., Altun G., Mean squared residue based biclustering algorithms. In Proceedings of the International Symposium on Bioinformatics Research and Applications (ISBRA 2008), Springer LNBI 4983:232-243, 2008.
15. Divina F., Aguilar-Ruiz J.S., Biclustering of expression data with evolutionary computation. IEEE Transactions on Knowledge and Data Engineering, 18(5):590-602, 2006.
16. Yang J., Wang W., Wang H., Yu P.S., Enhanced biclustering on expression data. In Proceedings of the 3rd IEEE Conference on Bioinformatics and Bioengineering (BIBE 2003), 321-327, 2003.
17. Prelic A., Bleuler S., Zimmermann P., Wille A., Bühlmann P., Gruissem W., Hennig L., Thiele L., Zitzler E., A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122-1129, 2006.
18. Xiao J., Wang L., Liu X., Jiang T., An efficient voting algorithm for finding additive biclusters with random background. Journal of Computational Biology, 15(10):1275-1293, 2008.
19. Liu X., Wang L., Computing the maximum similarity bi-clusters of gene expression data. Bioinformatics, 23(1):50-56, 2007.