A formal concept analysis approach to consensus clustering of multi-experiment expression data


Hristoskova et al. BMC Bioinformatics 2014, 15:151

METHODOLOGY ARTICLE - Open Access

Anna Hristoskova 1*, Veselka Boeva 2 and Elena Tsiporkova 3

Abstract

Background: Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. An approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution which increases the confidence in the common features of all the datasets and reveals the important differences among them.

Results: We propose a novel generic consensus clustering technique that applies a Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group, resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA, which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach, two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from a multi-experiment study examining the global cell-cycle control of fission yeast. The FCA results derived from both methods demonstrate that, although both algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals.

Conclusions: The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce a good quality clustering solution that is representative for the whole set of expression matrices.

Keywords: Consensus clustering, Particle swarm optimization, Formal concept analysis, Integration analysis, Multi-experiment expression data

*Correspondence: anna.hristoskova@intec.ugent.be
1 Department of Information Technology, Ghent University - iMinds, Gaston Crommenlaan 8 (201), 9050 Ghent, Belgium. Full list of author information is available at the end of the article.

© 2014 Hristoskova et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

Background

DNA microarray technology offers the ability to screen the expression levels of thousands of genes in parallel under different experimental conditions or their evolution in discrete time points. All these measurements contain information on several aspects of gene regulation and function, ranging from understanding the global cell-cycle control of microorganisms [1] to cancer in humans [2,3]. Gene clustering is one of the most frequently used analysis methods for gene expression data. Clustering algorithms are used to divide genes into groups according to the degree of their expression similarity. These groups suggest the correlation and/or co-regulation of the respective genes, which possibly share common biological roles.

The combination of data from multiple microarray studies addressing a similar biological question has gained importance in recent years [4-7] due to the ever increasing number and complexity of the available gene expression datasets. The integration and evaluation of multiple datasets yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. A method for integration analysis of the data from multiple experiments is the aggregation of their clustering results into a consensus clustering which emphasizes the common organization in all the datasets and reveals the significant differences among them.

In this work, we present and validate a novel generic approach to consensus clustering based on Formal Concept Analysis (FCA), where microarray data obtained under different experimental conditions are integrated into a representative consensus clustering solution. The approach initially divides the available microarray experiments into groups of related datasets with respect to a predefined criterion, and a consensus clustering algorithm is then applied to each group of experiments separately. The rationale behind this is that if the experiments are closely related to one another, then they produce a more accurate and robust clustering solution. Next, the clustering solutions produced for the different groups are pooled together and further analyzed by employing FCA, which allows extracting valuable insights from the data and generating a gene partition over the whole experimental compendium. FCA produces a concept lattice where each concept represents a subset of genes that belongs to a number of clusters. The concepts compose the final disjoint clustering partition.

The proposed general FCA-enhanced consensus clustering approach is experimentally validated. For this purpose, two consensus clustering methods are adapted to incorporate an FCA analysis step. These methods are quite different: the first (Integrative) integrates the partitioning results derived from multiple microarray datasets through a weighted aggregation process, while the second employs a Particle Swarm Optimization (PSO) approach to cluster gene expression data across multiple experiments. The FCA results derived from both methods are analysed with respect to cluster consistency and biological relevance. It is shown that although both algorithms optimize different clustering characteristics, FCA is able to construct similar consensus clustering solutions that are representative for the whole set of experiments.

In summary, the main contribution of the introduced FCA-enhanced consensus clustering technique is that it proposes a general approach to the combination of clustering algorithms with Formal Concept Analysis (FCA) for deriving clustering solutions from multiple gene expression matrices. In addition, the approach is demonstrated to be independent of the selected clustering algorithm. In this way one can use a customized algorithm optimized for the specific characteristics of each group of experiments. The further employment of FCA allows performing a subsequent data analysis, which provides useful insights on the biological role of genes contained in the same FCA concepts.

Related work

Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. However, as emphasized in [7], the investigations of gene expression levels have also generated controversy because of the probabilistic nature of the conclusions and the discrepancies between the results of studies addressing the same biological question. Subsequently, the authors proposed data analysis and visualization tools for estimating the degree to which the findings of one study are reproduced by others and for integrating multiple studies in a single analysis. These tools were described in the context of studies of breast cancer, and it was illustrated that it is possible to identify a substantial, biologically relevant subset of the human genome within which the expression levels are reliable. The latter suggests that important biological signals are often preserved or enhanced by multiple experiments.

Another approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering which increases the confidence in the common features in all the datasets and reveals the important differences among them [8]. Methods for the combination of clustering results derived for each dataset separately have been considered in [9-11]. The algorithm proposed in [9] first generates local cluster models and then combines them into a global cluster model of the data. The study in [10] focuses on clustering ensembles, i.e. seeking a combination of multiple partitions that provides improved overall clustering of the given data. The combined partition is found as a solution to the corresponding maximum likelihood problem using the Expectation-Maximization (EM) algorithm in [12].

The authors in [11] consider the problem of combining multiple partitions of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitions. The cluster ensemble problem is formalized as a combinatorial optimization problem in terms of shared mutual information. In contrast to the foregoing approaches, the generic solution proposed in this paper applies FCA in order to construct a consensus clustering that is representative for all the datasets and, in addition, it is independent of the applied clustering algorithm.

In order to validate the proposed FCA-enhanced approach, two consensus clustering algorithms have been developed, Integrative [13] and PSO-based [14], and used in the validation process. Note that a preliminary FCA-enhanced version of the PSO-based algorithm was initially considered in [15]. In [13] we study two microarray data integration techniques that can be applied to the problem of deriving clustering results from a set of microarray experiments. A cluster integration approach is considered, which combines the information contained in multiple microarray experiments at the level of expression or distance matrices and then applies a clustering algorithm on the combined matrix. Furthermore, a technique for the combination of partitioning results derived from multiple microarray datasets, referred to as Integrative consensus clustering, is introduced. It uses a traditional aggregation schema in order to integrate the different partitioning results into a final partition matrix. The PSO-based approach is used to cluster gene expression data across multiple experiments. In this algorithm, referred to as PSO-based consensus clustering, each experiment (dataset) defines a particle which is initialized with a set of k cluster centroids obtained after performing k-means clustering on the experiment. The final (optimal) clustering solution is found by updating the particles using the information on the best clustering solution obtained by each experiment and the entire set of experiments. In this article, the Integrative and PSO-based consensus clustering approaches are extended by incorporating a final FCA step. The experimental results from their validation suggest that although both algorithms optimize different clustering characteristics, FCA is able to construct similar consensus clustering solutions.

Methods

Partitioning algorithms

Three partitioning algorithms are commonly used for the purpose of dividing data objects into k disjoint clusters [16]: k-means, k-medians and k-medoids clustering. All three methods start by initializing a set of k cluster centres, where k is preliminarily determined. Subsequently, each object of the dataset is assigned to the cluster whose centre is the nearest, and the cluster centres are recomputed. This process is repeated until the objects inside every cluster are as close to the centre as possible and no further object reassignments take place. The EM algorithm in [12] is commonly used for that purpose, i.e. to find the optimal partitioning into k groups. The three partitioning methods in question differ in how the cluster centre is defined. In k-means, the cluster centre is defined as the mean data vector averaged over all objects in the cluster.

Instead of the mean, k-medians calculates the median for each dimension in the data vector. Finally, in k-medoids [17], which is a robust version of k-means, the cluster centre is defined as the object which has the smallest sum of distances to the other objects in the cluster, i.e., this is the most centrally located point in a given cluster.

Particle swarm optimization

Particle swarm optimization (PSO) is an evolutionary computation method introduced in [18]. In order to find an optimal or near-optimal solution to a problem, PSO updates the current generation of particles (each particle is a candidate solution to the problem) using the information on the best solution obtained by each particle and the entire population. The hybrid algorithm proposed in [14] combines k-means and PSO for deriving a clustering result from a group of n related microarray datasets M_1, M_2, ..., M_n. Each dataset contains the gene expression levels of m genes in n_i different experimental conditions or time points. In this context, each matrix i is used to generate k cluster centers, which are considered to represent a particle, i.e. the particle is treated as a set of points in an n_i-dimensional space. The final (optimal) clustering solution is found by updating the particles using the information on the best clustering solution obtained by each data matrix and the entire set of matrices.

Assume that the i-th particle is initialized with a set of k cluster centers C^i = {C^i_1, C^i_2, ..., C^i_k} and a set of velocity vectors V^i = {V^i_1, V^i_2, ..., V^i_k} using gene expression matrix M_i. Thus each cluster center is a vector C^i_j = (c^i_{j1}, c^i_{j2}, ..., c^i_{jn_i}) and each velocity vector is a vector V^i_j = (v^i_{j1}, v^i_{j2}, ..., v^i_{jn_i}), i.e. each particle i is a matrix (or a set of k points) in the k x n_i dimensional space. Next, assume that P_g = {P_{g1}, P_{g2}, ..., P_{gk}} is a set of cluster centers in an n_g-dimensional space representing the best clustering solution found so far within the set of matrices, and P^i = {P^i_1, P^i_2, ..., P^i_k} is the set of centroids of the best solution discovered so far by the corresponding matrix. The update equation for the d-th dimension of the j-th velocity vector of the i-th particle is defined as follows

\[ v^i_{jd}(t+1) = w\, v^i_{jd}(t) + c_1 \varphi_1 \left( p^i_{jd} - c^i_{jd}(t) \right) + c_2 \varphi_2\, g(t), \tag{1} \]

where i = 1, ..., n; j = 1, ..., k; d = 1, ..., n_i and

\[ g(t) = \begin{cases} p_{gd} - c^i_{jd}(t), & \text{if } n_g \ge n_i \\ 0, & \text{otherwise.} \end{cases} \tag{2} \]

The variables φ_1 and φ_2 are uniformly generated random numbers in the range [0, 1], c_1 and c_2 are called acceleration constants, whereas w is called inertia weight as defined in [19]. The first part of Equation (1) represents the inertia of the previous velocity, the second part is the cognition part that identifies the personal experience of the particle, and the third part represents the cooperation among particles and is therefore named the social component. Acceleration constants c_1, c_2 and inertia weight w are predefined by the user. Note that the cognition part in the above equation has a modified interpretation. Namely, it represents the private thinking (opinion) of the particle based on its own source of information (dataset). Due to this we adapted the social part (see Equation (2)), since each particle matrix has a different number of columns (n_i) due to the different number of experiment points in each dataset. It was demonstrated in [19] that when w is in the range [0.9, 1.2] PSO will have the best chance to find the global optimum within a reasonable number of iterations. Furthermore, w = 0.72 and c_1 = c_2 = 1.49 were found in [20] to ensure good convergence.

The clustering algorithm combining PSO and k-means can be summarized as follows:

1. Initialize each particle with k cluster centers obtained as a result of applying the k-means algorithm to the corresponding data matrix.
2. Initialize the personal best clustering solution of each matrix with the corresponding clustering solution found in Step 1.
3. for iteration = 1 to max-iteration do
   (a) for i = 1 to n do (i.e. for all datasets)
       i. for j = 1 to m do (i.e. for all genes in the current dataset)
          A. Calculate the distance of gene g_j to all cluster centers.
          B. Assign g_j to the cluster that has the nearest center to g_j.
       ii. end for
       iii. Calculate the fitness function for the clustering solution C^i.
       iv. Update the personal best clustering solution P^i.
   (b) end for
   (c) Find the global best solution P_g.
   (d) Update the cluster centers according to the velocity updating formula proposed in Equation (1).
4. end for

The PSO-based clustering algorithm was first introduced in [21], showing that it outperforms k-means and a few other state-of-the-art clustering algorithms. In this method, each particle represents a possible set of k cluster centroids. The authors in [22] hybridized the approach in [21] with the k-means algorithm for clustering general datasets. A single particle of the swarm is initialized with the result of the k-means algorithm while the rest of the swarm is initialized randomly. In [23] a new approach is proposed based on the combination of PSO and Self Organizing Maps and applied to clustering gene expression data, obtaining promising results. Further, the study in [24] considers a dynamic clustering approach based on PSO and a genetic algorithm. The main advantage of this algorithm is that it can automatically determine the optimal number of clusters and simultaneously cluster the data set with minimal user interference.
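
To make the velocity update in Equations (1)-(2) concrete, the following is a minimal Python sketch of one update for a single particle. It is an illustration under stated assumptions, not the authors' implementation: the constants follow the values quoted above (w = 0.72, c_1 = c_2 = 1.49), the social term simply reuses the first n_i columns of the global best when n_g >= n_i, the position update adds the new velocity to the centres (standard PSO), and all names are placeholders.

```python
import numpy as np

# Inertia weight and acceleration constants as quoted in the text above.
W, C1, C2 = 0.72, 1.49, 1.49

def update_particle(centres, velocity, p_best, g_best, rng):
    """One PSO velocity/position update for a single particle.

    centres, velocity, p_best: (k x n_i) arrays for this particle;
    g_best: (k x n_g) array holding the global best cluster centres."""
    phi1 = rng.random(centres.shape)          # uniform random numbers in [0, 1]
    phi2 = rng.random(centres.shape)
    # Social term per Equation (2): only defined when the global best has at
    # least as many columns as this particle (dimensionality assumption).
    if g_best.shape[1] >= centres.shape[1]:
        social = g_best[:, :centres.shape[1]] - centres
    else:
        social = np.zeros_like(centres)
    velocity = (W * velocity
                + C1 * phi1 * (p_best - centres)   # cognition term
                + C2 * phi2 * social)              # social term
    return centres + velocity, velocity           # new centres, new velocity

# Example: one particle with k = 4 centres over 10 time points.
rng = np.random.default_rng(0)
c = rng.normal(size=(4, 10))
v = np.zeros_like(c)
c, v = update_particle(c, v, p_best=c.copy(), g_best=rng.normal(size=(4, 12)), rng=rng)
```
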
The downside of all the foregoing approaches is that they are not suitable for consolidating multiple partitions, as the conducted clustering analysis is based on a single expression matrix.

Formal concept analysis

Formal Concept Analysis (FCA) [25] is a mathematical formalism allowing to derive a concept lattice from a formal context constituted of a set of objects O, a set of attributes A, and a binary relation defined over the Cartesian product O x A. The context is described as a table whose rows correspond to objects and whose columns to attributes or properties; a cross in a table cell means that the object possesses that property. FCA is used for a number of purposes, among which knowledge formalization and acquisition, ontology design, and data mining. The concept lattice is composed of formal concepts, or simply concepts, organized into a hierarchy by a partial ordering (a subsumption relation allowing to compare concepts). Intuitively, a concept is a pair (X, Y) where X ⊆ O, Y ⊆ A, and X is the maximum set of objects sharing the whole set of attributes in Y and vice versa. The set X is called the extent and the set Y the intent of the concept (X, Y). The subsumption (or sub concept - super concept) relation between concepts is defined as follows:

\[ (X_1, Y_1) \le (X_2, Y_2) \iff X_1 \subseteq X_2 \ (\text{or } Y_2 \subseteq Y_1). \tag{3} \]

Relying on this subsumption relation, the set of all concepts extracted from a context is organized within a complete lattice, called the concept lattice. This means that for any set of concepts there is a smallest super concept and a largest sub concept.

The FCA or concept lattice approach has been applied for extracting local patterns from microarray data [26,27] or for performing microarray data comparison [28,29]. For example, the FCA method proposed in [28] builds a concept lattice from the experimental data together with additional biological information. Each vertex of the lattice corresponds to a subset of genes that are grouped together according to their expression values and some biological information related to the gene function. It is assumed that the lattice structure of the gene sets might reflect biological relationships in the dataset. In [30], an FCA-based method is proposed for extracting groups or classes of co-expressed genes. A concept lattice is constructed where each concept represents a set of co-expressed genes in a number of situations. A serious drawback of the method is the fact that the expression matrix is transformed into a binary table (the input for the FCA step), which may introduce biases or lose information. Thus the authors further propose and compare two FCA-based methods for mining gene expression data and show that they are equivalent [31]. The first one relies on interordinal scaling, encoding all possible intervals of attribute values in a formal context that is processed with classical FCA algorithms. The second one relies on pattern structures without a prior transformation, and is shown to be more computationally efficient and to provide more readable results. Notice that all the mentioned FCA-based methods focus solely on optimizing the clustering of a single expression matrix and consequently, they are not suited for the consolidation of multiple partitions.

FCA-enhanced consensus clustering algorithm

The problem of deriving clustering results from a set of gene expression matrices can be approached in two different ways: 1) information contained in different datasets may be combined at the level of expression (or similarity) matrices and afterwards clustered; 2) given multiple clustering solutions, one per dataset, find a consensus (combined) clustering. In this section, a general FCA-enhanced consensus clustering algorithm for deriving a clustering result from multiple microarray datasets is proposed, which adopts the second approach.

Assume that a particular biological phenomenon is monitored in several high-throughput experiments under n different conditions. Each experiment i (i = 1, 2, ..., n) is supposed to measure the gene expression levels of m_i genes in n_i different experimental conditions or time points. Thus, a set of n different data matrices M_1, M_2, ..., M_n will be produced, one per experiment. The FCA-enhanced consensus clustering consists of three distinctive steps (Figure 1): 1) the expression datasets are divided into several smaller groups using some predefined criterion; 2) a consensus clustering algorithm (e.g. Integrative, PSO-based or other) is applied to each group of datasets separately, which produces a list of different clustering solutions, one per group; 3) these clustering solutions are further transformed into a single clustering result by employing FCA.

In contrast to the consensus clustering algorithms discussed in the foregoing sections, where some partitioning (e.g. Integrative or PSO-based) algorithm is applied to the entire set of experiments in order to produce the final clustering solution, the algorithm proposed herein initially divides the available microarray datasets into groups of related (similar) experiments with respect to a predefined criterion. The rationale behind this is that if the experiments are closely related to one another, then these experiments produce a more accurate and robust clustering solution. Thus, the selected consensus clustering algorithm is applied to each group of experiments separately. This produces a list of different clustering solutions, one per group. Subsequently, these solutions are pooled together and further analyzed by employing FCA, which allows extracting valuable insights from the data and generating a gene partition over the whole experimental compendium. FCA produces a concept lattice where each concept represents a subset of genes that belong to a number of clusters. The different concepts compose the final disjoint clustering partition.

The proposed FCA-enhanced consensus clustering approach has the following characteristics:

1. the clustering uses all data by allowing potentially each group of related experiments to have a different set of genes, i.e. the total set of studied genes is not restricted to those contained in all datasets;
2. it is better tuned to each experimental condition by identifying the initial number of clusters for each group of related experiments separately, depending on the number, composition and quality of the gene profiles;
3. the problem with ties is avoided (i.e. a case when a gene is randomly assigned to a cluster because it belongs to more than one cluster) by employing FCA in order to analyze together all the partitioning results and find the final clustering solution representative of the whole experimental compendium.

The distinctive steps of the FCA-enhanced clustering algorithm, visualized in Figure 1, are explained in detail below.
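
As a compact illustration of this three-step flow (not the authors' code), the skeleton below sketches how the pieces fit together. The names group_datasets, consensus_cluster and fca_partition are hypothetical placeholders standing in for the grouping criterion, the chosen consensus clustering algorithm and the FCA step detailed in the following subsections.

```python
# Illustrative skeleton of the three-step procedure depicted in Figure 1.
def fca_enhanced_consensus(matrices, group_datasets, consensus_cluster, fca_partition):
    # 1. Initialization step: split the expression matrices into r groups of
    #    related experiments (e.g. by synchronization method or similarity).
    groups = group_datasets(matrices)

    # 2. Clustering step: one consensus clustering solution per group; each
    #    solution maps every gene of the group to one of its k_i clusters.
    solutions = [consensus_cluster(group) for group in groups]

    # 3. FCA-based analysis step: pool the K = k_1 + ... + k_r clusters into a
    #    gene-by-cluster formal context and derive the final disjoint partition.
    return fca_partition(solutions)
```
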

Figure 1 Schematic representation of the FCA-enhanced consensus clustering approach: the expression matrices M_1, ..., M_n are divided into r groups of related datasets (1. Initialization step), a consensus clustering algorithm generates r different clustering solutions with k_1, ..., k_r clusters (2. Clustering step), and the constructed concept lattice partitions the genes into a set of disjoint clusters (3. FCA-based analysis step).

Initialization step

Let us consider the aforementioned expression matrices M_1, M_2, ..., M_n monitoring N different genes in total. The initialization step is the most variable part of the algorithm since it closely depends on the concrete clustering algorithm employed. The available gene expression matrices are divided into r groups of related (similar) datasets with respect to some predefined criterion, e.g. the used synchronization method or the expression similarity between the matrices. For each group the set of studied genes needs to be restricted to those contained in all datasets of the group, i.e. the overlapping genes found across all datasets of the group. Then the number of cluster centres is identified for each group of experiments i (i = 1, 2, ..., r) separately. As discussed in [32,33], this can be performed by running the k-means or another clustering algorithm on each data matrix for a range of different numbers of clusters. Subsequently, the quality of the obtained clustering solutions needs to be assessed in some way in order to identify the clustering scheme which best fits the datasets in question. Some commonly used validation measures are the Silhouette Index and Connectivity, presented in the Cluster validation measures section, which are able to identify the best clustering scheme. Finally, the prevailing number of clusters within the concrete group of experiments is selected as representative for the whole group.

Clustering step

The selected consensus clustering (e.g. Integrative, PSO-based or other) algorithm is applied to each group of related experiments i (i = 1, 2, ..., r) separately. This generates a list of r different clustering solutions, one per group. The result is that K (K = k_1 + k_2 + ... + k_r) different clusters are produced by the different groups. This clustering solution is disjoint in terms of the gene expression profiles produced in the different experiments. However, it is not disjoint in terms of the different participating genes, i.e. there will be genes which belong to more than one cluster.

FCA-based analysis step

As discussed above, the N studied genes are grouped during the Clustering step into K clusters that are not guaranteed to be disjoint. This overlapping partition is further analysed and refined into a disjoint one by applying FCA. As mentioned above, FCA is a principled way of automatically deriving a hierarchical conceptual structure from a collection of objects and their properties. The approach takes as input a matrix (referred to as the formal context) specifying a set of objects and the properties thereof, called attributes.
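
Anticipating the gene-by-cluster context defined in the next paragraph, the sketch below illustrates, under simplifying assumptions, how such a context can be processed in Python: genes sharing exactly the same set of cluster memberships form the final disjoint groups, and a naive intersection-closure enumerates all formal concepts of the context. This is an illustrative sketch, not the implementation used in the paper.

```python
from collections import defaultdict

def build_context(memberships):
    """memberships: dict mapping gene -> iterable of cluster labels assigned to
    it by the per-group consensus solutions (the gene-by-cluster context)."""
    return {gene: frozenset(clusters) for gene, clusters in memberships.items()}

def disjoint_partition(context):
    """Genes sharing exactly the same set of clusters end up in the same
    concept; grouping by that signature yields the final disjoint partition."""
    parts = defaultdict(set)
    for gene, clusters in context.items():
        parts[clusters].add(gene)
    return dict(parts)

def formal_concepts(context, all_clusters):
    """Naive concept enumeration: every intent is an intersection of gene
    intents (plus the full cluster set); extents follow from the intents.
    Adequate for contexts of this size (hundreds of genes, ~15 clusters)."""
    intents, changed = {frozenset(all_clusters)}, True
    while changed:
        changed = False
        for gene_intent in set(context.values()):
            for intent in list(intents):
                new = intent & gene_intent
                if new not in intents:
                    intents.add(new)
                    changed = True
    return [(frozenset(g for g, cs in context.items() if intent <= cs), intent)
            for intent in sorted(intents, key=len)]

# Toy usage: three genes, clusters drawn from two experiment groups.
ctx = build_context({"g1": {"C1", "C5"}, "g2": {"C1", "C5"}, "g3": {"C2", "C5"}})
print(disjoint_partition(ctx))                    # two disjoint gene groups
print(len(formal_concepts(ctx, {"C1", "C2", "C5"})))
```
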

In our case, a formal context consists of the set G of the N studied genes (objects), the set of clusters C = {C_1, C_2, ..., C_K} produced by the clustering step (attributes), and an indication of which genes belong to which clusters. Thus, the context is described as a matrix, with the genes corresponding to the rows and the clusters corresponding to the columns of the matrix, and a value of 1 in cell (i, j) whenever gene i belongs to cluster C_j. Subsequently, a formal concept for this context is defined to be a pair (X, Y) such that X ⊆ G, Y ⊆ C, and:

- every gene in X belongs to every cluster in Y;
- for every gene in G that is not in X, there is a cluster in Y that does not contain that gene;
- for every cluster in C that is not in Y, there is a gene in X that does not belong to that cluster.

The family of these concepts obeys the mathematical axioms defining a concept lattice. The constructed lattice consists of concepts where each one represents a subset of genes belonging to a number of clusters. The set of all concepts partitions the genes into a set of disjoint clusters.

Validation setup

Microarray datasets

The aforementioned consensus clustering algorithms are validated and compared on benchmark datasets where the true clustering is known. These datasets are composed of gene expression time series data obtained from a study examining the global cell-cycle control of gene expression in fission yeast Schizosaccharomyces pombe [1]. The study includes eight independent time-course experiments synchronized respectively by:

1. elutriation: three independent biological repeats (elu1, elu2, elu3);
2. cdc25 block-release: two independent biological repeats, of which one in two dye-swapped technical replicates (cdc25-1, cdc25-2.1, cdc25-2.2) and, in addition, one experiment in a sep1 mutant background (cdc25-sep1);
3. a combination of both methods: elutriation and cdc10 block-release (elu-cdc10) as well as elutriation and cdc25 block-release (elu-cdc25).

Thus, nine different expression test sets are available. In the pre-processing phase the rows with more than 25% missing entries are filtered out from each expression matrix and any other missing expression entries are imputed by the DTWimpute algorithm presented in [34]. In this way nine complete matrices are obtained.

The authors in [1] identified 407 genes as cell-cycle regulated, which were subjected to clustering resulting in the formation of 4 separate clusters. Subsequently, the time expression profiles of these genes are extracted from the complete data matrices and thus nine new matrices are constructed. Note that some of these 407 genes are removed from the original matrices during the pre-processing phase, i.e. each dataset may have a different set of genes. Thus a set of 376 different genes is present in the nine pre-processed datasets in total.

Two different benchmark datasets are constructed from the original nine pre-processed matrices and used in the validation process:

1. The genes that are not present in the intersection of the nine pre-processed datasets are removed. The latter produces a subset of 267 genes. Subsequently, the time expression profiles of these genes are extracted from the complete data matrices and thus nine new matrices which form our test corpus 1 are constructed.
2. The initial complete datasets are divided into three groups with respect to the used synchronization method.

The overlapping genes within each group are as follows: a subset of 286 common genes in the elutriation datasets, a subset of 305 common genes in the cdc25 block-release datasets and a subset of 374 common genes in the datasets synchronized by the combination of both methods. For each of the three groups only these common genes are retained. As a result, nine new matrices which form our test corpus 2 are built. Notice that the nine different datasets contain 374 different genes in total.

The benchmark datasets are normalized by applying a data transformation method as proposed in [35].

Consensus clustering algorithms

In order to validate the proposed general FCA-enhanced approach, we applied two different consensus clustering methods: 1) an algorithm integrating multiple partitioning results and 2) a PSO-based clustering method. These two consensus clustering methods approach the initialization of the cluster centres and the production of the final clustering partition in different ways. Both methods initially restrict the studied genes for each group to those contained in all datasets of the group.

The first algorithm, referred to as Integrative, initializes the cluster centres for each group of experiments using the information contained in the datasets of the group in an integrated manner. This step is performed using a matrix constructed by concatenating the expression matrices in each group. The k-means algorithm is then applied to each expression matrix in the group to generate a set of partition matrices for each group. Utilizing information on the quality of the microarrays, weights are assigned to the experiments and are further used in the integration process in order to obtain a more realistic overall partition for each group.

The data transformation method proposed in [35] is used to evaluate the quality of the considered microarrays. It is applied to each expression matrix and the number of standardized genes is considered as a quality measure for this matrix. This step results in the assignment of a weight to each expression matrix in the group, i.e. the different experiments in the group will contribute to the final partitioning result to a different extent. A detailed explanation of the Integrative consensus clustering algorithm can be found in [13].

The second consensus clustering method, referred to as PSO-based, employs a PSO approach to cluster gene expression data across multiple experiments. Each experiment (dataset) defines a particle which is initialized with a set of k cluster centroids obtained after performing the k-means clustering algorithm on the experiment. The final (optimal) clustering solution for each group of experiments is found by updating the particles using the information on the best clustering solution obtained by each experiment and the entire set of experiments in the group. A detailed explanation of the PSO-based consensus clustering algorithm can be found in [14].

Cluster validation measures

One of the most important issues in cluster analysis is the validation of the clustering results. Essentially, cluster validation techniques are designed to find the partitioning that best fits the underlying data, and should therefore be regarded as a key tool in the interpretation of the clustering results. Since none of the clustering algorithms performs uniformly well under all scenarios, it is not reliable to use a single cluster validation measure; instead, at least two measures reflecting different aspects of a partitioning should be used. In this sense, we have implemented two different validation measures for estimating the quality of the clusters:

1. Connectivity: for assessing connectedness;
2. Silhouette Index (SI): for assessing compactness and separation properties of a partitioning.

Connectivity

The Connectivity captures the degree to which genes are connected within a cluster by keeping track of whether the neighbouring genes are put into the same cluster [36]. Let us define m_{i(j)} as the j-th nearest neighbour of gene i, and let χ_{i,m_{i(j)}} be zero if i and m_{i(j)} are in the same cluster and 1/j otherwise. Then for a particular clustering solution C_1, C_2, ..., C_k of matrix M, which contains the expression values of m genes (rows) in n different experimental conditions or time points (columns), the Connectivity is defined as

\[ \mathrm{Conn}(C) = \sum_{i=1}^{m} \sum_{j=1}^{n} \chi_{i, m_{i(j)}}. \]

The Connectivity has a value between zero and infinity and should be minimized.

Silhouette index

The Silhouette Index (SI) reflects the compactness and separation of clusters [37]. Suppose C_1, C_2, ..., C_k is a clustering solution (partition) of matrix M, which contains the expression profiles of m genes. Then, the SI is defined as

\[ s(k) = \frac{1}{m} \sum_{i=1}^{m} \frac{b_i - a_i}{\max\{a_i, b_i\}}, \]

where a_i represents the average distance of gene i to the other genes of the cluster to which the gene is assigned, and b_i represents the minimum of the average distances of gene i to the genes of the other clusters. The values of the Silhouette Index vary from -1 to 1 and a higher value indicates better clustering results.
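
Both measures can be computed directly from their definitions. The sketch below is a straightforward Python/NumPy rendering, assuming Euclidean distance and treating the number of neighbours in the Connectivity sum as a parameter; it is meant as an illustration rather than the exact implementation used in the study (library routines such as scikit-learn's silhouette_score would serve equally well for the SI).

```python
import numpy as np

def connectivity(X, labels, n_neighbors=10):
    """Connectivity: each gene's j-th nearest neighbour that falls outside the
    gene's cluster contributes 1/j; lower values indicate better connectedness."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                      # exclude self-distances
    order = np.argsort(dist, axis=1)[:, :n_neighbors]   # nearest-neighbour indices
    penalties = 1.0 / np.arange(1, n_neighbors + 1)     # 1/j for j = 1..n_neighbors
    same = labels[order] == labels[:, None]
    return float(np.where(same, 0.0, penalties).sum())

def silhouette_index(X, labels):
    """Silhouette Index: mean of (b_i - a_i) / max(a_i, b_i) over all genes
    (assumes at least two clusters, each with at least two members)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i, li in enumerate(labels):
        own = labels == li
        own[i] = False
        a = dist[i, own].mean()
        b = min(dist[i, labels == lj].mean() for lj in np.unique(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy usage on random data: 30 genes, 5 time points, 3 equally sized clusters.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
labels = np.repeat(np.arange(3), 10)
print(silhouette_index(X, labels), connectivity(X, labels))
```
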
Results and discussion

The validation process presented below has two main goals: 1) to illustrate that different clustering methods usually generate gene partitions on expression data which may exhibit considerable inconsistencies and discrepancies in terms of clustering quality and gene composition; and 2) to demonstrate the added value of an FCA-enhanced clustering approach for overcoming and diminishing such differences and preserving relevant biological signals. For this purpose, the following partitions have been generated and compared:

1. two consensus partitions over test corpus 1, using respectively the original versions of the Integrative and the PSO-based consensus clustering algorithms;
2. two times three consensus partitions, one for each group of experiments from test corpus 2, using respectively the grouped versions of the Integrative and the PSO-based consensus clustering algorithms as specified in the foregoing section;
3. two concept lattices derived by applying FCA on the two sets of group partitions (see above), produced respectively from the grouped versions of the Integrative and the PSO-based methods.

Clustering performance

In this section, we evaluate and compare the clustering performance of the two consensus clustering algorithms discussed in the foregoing section on the benchmark datasets described above by using two cluster validation measures: Silhouette Index and Connectivity. The partitioning algorithms such as k-means contain the number of clusters (k) as a parameter, and their major drawback is the lack of prior knowledge for that number.

Therefore, we initially identified the number of cluster centres by running the k-means clustering algorithm on each dataset for values of k between 2 and 10 [32,33]. Subsequently, the quality of the obtained clustering solutions is assessed using the Connectivity validation index. We focused on the values of k at which a significant local change in the value of the index occurs [32]. The optimal number of clusters for the different experiments ranges between 3 and 5, as can be seen in Figure 2. However, k = 4 prevails (encountered in five matrices) and therefore it is used for our experiments.

Figure 2 Optimal number of clusters as determined using the Connectivity validation index for the 9 different experiments. For 5 of the 9 datasets k = 4 is identified as the optimal cluster number.

Figure 3 compares, for test corpus 1, the SI and Connectivity values corresponding to the partitions generated by the standard k-means on each individual matrix versus the values calculated on the individual matrices using the overall consensus partitions produced respectively by the Integrative and the PSO-based consensus clustering algorithms. One can observe that the SI scores obtained by the PSO-based clustering technique are better than those generated by the Integrative clustering and k-means for all the individual matrices. However, under the Connectivity measure, the Integrative solution exhibits better performance than the other algorithms (in 7 of the 9 experiments for the PSO-based clustering algorithm and in 8 of the 9 for k-means). The obtained results are explained by the fact that the PSO-based clustering algorithm favours the production of compact and well separated clusters rather than well connected ones, since it finds the final clustering solution by updating the cluster centres using the information on the best clustering solution generated by each experiment and the entire set of experiments. The Integrative clustering algorithm, on the other hand, has a bias towards well connected clusters as it produces the lowest SI scores in comparison to k-means and the PSO-based algorithms. Clearly, each clustering algorithm may introduce biases due to its specific characteristics. Therefore a generic solution, such as supported by the FCA approach, that diminishes the differences and preserves relevant biological signals is required for further data analysis.

Next, the performance of the grouped versions of the Integrative and PSO-based consensus clustering algorithms on test corpus 2 is considered. Initially, the nine datasets are divided into three groups with respect to the used synchronization method:

1. elutriation datasets: elu1, elu2, elu3;
2. cdc25 block-release datasets: cdc25-1, cdc25-2.1, cdc25-2.2, cdc25-sep1;
3. datasets synchronized by the combination of both methods: elu-cdc10, elu-cdc25.

Afterwards, the number of cluster centres is identified for each group using the Connectivity measure. The selected optimal number of clusters for the three groups of experiments is as follows: elutriation datasets: k = 4; cdc25 block-release datasets: k = 6; and the combined ones: k = 5. As a result, 15 different clusters (elutriation: clusters 0-3, cdc25 block-release: clusters 4-9 and combination of both: clusters 10-14) in total are produced by each of the two consensus clustering methods.
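
A sketch of the k-selection procedure described above (run k-means for k = 2, ..., 10 on a matrix, score each partition with the Connectivity index, and inspect where the index changes sharply) is given below. It reuses the connectivity helper from the earlier sketch and scikit-learn's KMeans purely for illustration; the choice of clustering routine here is an assumption, not something prescribed by the method.

```python
import numpy as np
from sklearn.cluster import KMeans

def scan_cluster_numbers(X, k_range=range(2, 11), n_neighbors=10):
    """Connectivity score of a k-means partition for each candidate k; the
    chosen k is the one at which the index shows a clear local change."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = connectivity(X, np.asarray(labels), n_neighbors)
    return scores

# Applied per matrix; the prevailing best k across the matrices of a group is
# then taken as the cluster number for that whole group.
```
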
Figure 4 compares the SI and Connectivity values calculated on the individual matrices using the partitions produced by the known clustering solution published in [1] and those generated by the grouped Integrative and PSO-based clustering solutions. According to the SI indices in Figure 4(a), the PSO-based solution clearly outperforms the Integrative one. It also exhibits better performance than the Integrative clustering solution in 7 of the 9 experiments under the Connectivity validation measure. In addition, the PSO-based clustering solution is better than the known one in 6 of the 9 experiments under the SI validation index and, respectively, in 3 of the 9 experiments under the Connectivity index. These results are further analyzed in the next section.

Cluster consistency

In this section, we analyse the impact of the different clustering methods on the final gene partition. The degree of pairwise overlap between the different gene clusters generated by the Integrative and the PSO-based consensus clustering algorithms is calculated and compared. Assume that we have a gene cluster c_i produced by the Integrative algorithm and a gene cluster c_j coming from the PSO-based clustering. Then the degree of overlap is calculated as follows:

\[ d_{ij} = \frac{\#\left( c_i \cap c_j \right)}{\max\left( \#c_i, \#c_j \right)}. \]

In this way, d_{ij} will be equal to 1 in case of full overlap between the clusters i and j, 0 in case of no overlap, and between 0 and 1 otherwise.
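
The overlap degree d_{ij} is straightforward to compute when each cluster is represented by its set of gene identifiers; the short sketch below (illustrative naming, not the authors' code) does exactly that for two complete partitions.

```python
def overlap_degree(cluster_i, cluster_j):
    """d_ij = #(c_i ∩ c_j) / max(#c_i, #c_j): 1 for identical clusters,
    0 for disjoint ones, and strictly between 0 and 1 otherwise."""
    a, b = set(cluster_i), set(cluster_j)
    return len(a & b) / max(len(a), len(b))

def overlap_matrix(integrative_partition, pso_partition):
    """Pairwise overlap between two partitions given as lists of gene-ID sets."""
    return [[overlap_degree(ci, cj) for cj in pso_partition]
            for ci in integrative_partition]
```
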

Figure 3 Comparison of the SI (a) and Connectivity (b) values generated by the standard k-means algorithm, and the Integrative and the PSO-based consensus clustering algorithms on test corpus 1, for each of the nine gene expression matrices.

The consensus gene partitions generated respectively by the original Integrative and the PSO-based consensus clustering algorithms on test corpus 1 are compared in Figure 5(a). The best overlap is recorded for the pairs: (PSO 1, Integrative 3), (PSO 2, Integrative 3), (PSO 2, Integrative 2) and (PSO 3, Integrative 2). Evidently, the genes of PSO 0 are almost uniformly distributed between Integrative 1, Integrative 2 and Integrative 3, those of PSO 1 are allocated to Integrative 3, the genes of PSO 2 are mainly distributed between Integrative 2 and Integrative 3, and PSO 3 has a high overlap with Integrative 2.

Figures 5(b), (c), (d) depict the overlap between the consensus clustering assignments of the PSO-based versus the Integrative clustering algorithms on test corpus 2 for each group of experiments separately (respectively elutriation, cdc25 block-release and the combined group). The best overlap is recorded for the following pairs: elutriation: (PSO 0, Integrative 1), (PSO 2, Integrative 0), (PSO 3, Integrative 3); cdc25 block-release: (PSO 4, Integrative 5), (PSO 4, Integrative 4), (PSO 5, Integrative 6), (PSO 6, Integrative 9), (PSO 8, Integrative 9), (PSO 9, Integrative 7); combined: (PSO 10, Integrative 14), (PSO 12, Integrative 14), (PSO 13, Integrative 10) and (PSO 14, Integrative 13).

The high degree of pairwise overlap between the different gene clusters generated by the Integrative and the PSO-based consensus clustering algorithms suggests that there is a certain consistency in the resulting clustering solutions. The next section elaborates on this effect by applying FCA on both consensus clustering solutions consolidating the different groups of experiments.

Results from the FCA-enhanced step

The gene partitions produced by the grouped versions of the Integrative and PSO-based consensus clustering algorithms on test corpus 2 are further analysed by applying FCA.

Figure 4 Comparison of the SI (a) and Connectivity (b) values generated by the known clustering solution published in [1], and those obtained by applying the grouped versions of the Integrative and PSO-based consensus clustering algorithms on test corpus 2.

First, we create a context that consists of the set of 374 studied genes and the set of 15 clusters produced by the grouped version of the Integrative consensus clustering algorithm. It produces a lattice of 133 concepts for this context (see Figure 6(a)). However, 74 concepts have support 0, i.e. the benchmark genes are grouped into 59 disjoint clusters (concepts). Further, a context that consists of the set of 374 studied genes and the set of 15 clusters produced by the grouped version of the PSO-based clustering algorithm is built. Subsequently, a lattice of 190 concepts is generated for this context (see Figure 6(b)). The FCA step partitions the benchmark gene set into 85 disjoint clusters (concepts) in total, since the rest of the concepts have support 0. It is interesting to notice that all the concepts consisting of three clusters in both disjoint partitions (21 such concepts exist in the Integrative partition and 29 in the PSO-based one) contain clusters produced by each of the three groups of experiments (elutriation, cdc25 block-release, combined).

Let us consider the concepts, each consisting of three clusters, with support above 0.03 for both FCA-enhanced clustering partitions, listed in Table 1. The SI and Connectivity values for the expression matrices formed by the genes contained in these concepts are compared to the known clustering solution published in [1]. This was executed by extracting the known clustering solutions for the PSO and Integrative concepts in Table 1. The results in Figure 7 show that, in contrast to the results obtained before the FCA step (see Figures 3 and 4), the SI and Connectivity performances of the two clustering algorithms follow a similar trend. Thus, it appears that the FCA step is able to overcome cluster assignment differences and performance discrepancies mostly attributed to the specificities of the different clustering methods. Additionally, for the elutriation experiments FCA has better SI and Connectivity than the known clustering solution for both PSO and Integrative. However, for cdc25 block-release FCA performs a bit worse than the known solution for both measures.
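
The selection behind Table 1 (concepts whose intent combines exactly three clusters, one per experiment group, and whose support exceeds 0.03) can be sketched as follows, reusing the disjoint_partition output from the earlier FCA sketch. Here support is taken as the fraction of the studied genes carrying exactly that cluster signature, which is an assumption about how the paper computes it.

```python
def select_concepts(partition, n_genes, n_clusters=3, min_support=0.03):
    """partition: signature -> genes mapping returned by disjoint_partition.
    Keep concepts built from exactly `n_clusters` clusters whose gene set
    covers more than `min_support` of all studied genes."""
    return {signature: genes for signature, genes in partition.items()
            if len(signature) == n_clusters and len(genes) / n_genes > min_support}
```
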

Figure 5 Consensus clustering assignment overlap: Integrative vs. PSO-based. Figure (a) compares the consensus gene partitions generated by the consensus clustering algorithms on test corpus 1, while Figures (b), (c), (d) depict the overlap between the consensus clustering assignments of the clustering algorithms on test corpus 2 for each group of experiments separately.

Let us investigate whether some correspondence exists between the PSO-based and the Integrative concepts in Table 1. For this purpose we consider the overlapping pairs of clusters of the grouped FCA partitions identified in the foregoing section (Figures 5(b), (c), (d)). Table 2 presents the correspondence between the PSO-based and the Integrative concepts in Table 1 in terms of the percentage minimum cluster assignment overlap as extracted from the results in Figures 5(b), (c), (d). Note that Table 2 does not include concepts which exhibit significantly low overlap (lower than 20%) with any other concept in Table 1.

Next, each of the 9 different concepts in Table 2 is subjected to analysis with the BiNGO tool [38], in order to determine which Gene Ontology categories are statistically overrepresented in each concept. The results are generated for a cut-off p-value of 0.05 and Benjamini and Hochberg (False Discovery Rate) multiple testing correction. For each gene concept a table is generated consisting of five columns: (1) the GO category identification (GOid); (2) the multiple testing corrected p-value (p-value); (3) the total number of genes annotated to that GO term divided by the total number of genes in the test set (cluster frequency); (4) the number of selected genes versus the total GO number (total frequency); and (5) a detailed description of the selected GO categories (description).

Only 6 of the 9 FCA concepts are assigned GO categories by the BiNGO tool:

- Integrative-{0, 9, 14} contains 41 genes annotated to 26 GO categories (25 have total frequency > 0.0%), most of which refer to the regulation of protein kinase activity, regulation of cell-cycle process or regulation of metabolic process;
- Integrative-{0, 8, 14} contains 22 genes annotated to 25 GO categories (22 have total frequency > 0.0% and cluster frequency > 1.0%), dominated by sister chromatid segregation and beta-glucan process regulation categories;
- Integrative-{1, 6, 14} contains 21 genes annotated to 21 GO categories (15 have total frequency > 0.0%), most of which refer to regulation of mRNA stability;
- PSO-{1, 6, 10} contains 29 genes connected with 19 GO categories (10 have total frequency > 0.0%), most of which refer to cell-cycle control, regulation of DNA replication or sister chromatid segregation;

- PSO-{1, 6, 12} contains 18 genes connected with 5 GO categories (only 3 have total frequency > 0.0%), all referring to the regulation of sister chromatid cohesion and segregation;
- PSO-{1, 8, 12} contains 12 genes annotated to 22 GO categories (16 have total frequency > 0.0%), dominated by RNA metabolic processing related categories.

Figure 6 The concept lattices generated by the FCA-enhanced consensus clustering produced by the Integrative (a) and the PSO-based (b) clustering algorithms respectively.

Table 1 All concepts consisting of three clusters with support above 0.03

PSO-based     Integrative
{1, 6, 10}    {0, 9, 14}
{0, 5, 13}    {1, 6, 14}
{1, 6, 12}    {0, 8, 14}
{2, 4, 10}    {2, 9, 14}
{0, 5, 10}    {1, 9, 14}
{1, 8, 12}

It can be observed that the correspondences between the PSO-based and Integrative concepts presented in Table 2 are also supported by the above GO categories, e.g. sister chromatid segregation, DNA and cell-cycle regulation, metabolic processing, etc. These correspondences suggest that although both algorithms optimize different clustering characteristics (SI and Connectivity indices in Figure 4), FCA is able to construct similar consensus clustering lattices that are representative for all the datasets.

Figure 7 Comparison of the SI (a) and Connectivity (b) values for the expression matrices formed by the genes contained in the concepts from Table 1.

Table 2 Percentage minimum overlap between the concepts from Table 1

                          PSO {1, 6, 10}   PSO {1, 6, 12}   PSO {0, 5, 10}   PSO {1, 8, 12}
Integrative {0, 9, 14}    35%              35%              -                25%
Integrative {1, 6, 14}    20%              20%              40%              -
Integrative {0, 8, 14}    20%              20%              -                -
Integrative {2, 9, 14}    20%              20%              -                20%
Integrative {1, 9, 14}    25%              25%              -                25%

Evidently, the proposed FCA-enhanced approach is a generic consensus clustering technique that is not dependent on the applied clustering algorithm. This means that one can use a customized algorithm suited for the specific characteristics of each dataset group and consolidate the resulting clustering solutions of the involved groups by using FCA.

Conclusions

In this paper we introduced a novel consensus clustering technique which proposes a general approach to the combination of clustering algorithms with Formal Concept Analysis (FCA) for deriving representative clustering solutions from multiple gene expression matrices. This approach involves three distinctive steps: (i) the studied microarray experiments are partitioned into groups of related datasets with respect to a predefined criterion, (ii) a consensus clustering algorithm is applied to each group of experiments separately, (iii) the clustering solutions produced by the different groups are pooled together and further analyzed by employing FCA. The performance of the proposed consensus clustering algorithm is evaluated on a test set of nine time series expression datasets obtained from a study examining the global cell-cycle control of gene expression in fission yeast Schizosaccharomyces pombe. In addition, in step (ii) of the proposed approach two different consensus clustering algorithms (Integrative and PSO-based) are applied and compared.


More information

GENETIC ALGORITHM VERSUS PARTICLE SWARM OPTIMIZATION IN N-QUEEN PROBLEM

GENETIC ALGORITHM VERSUS PARTICLE SWARM OPTIMIZATION IN N-QUEEN PROBLEM Journal of Al-Nahrain University Vol.10(2), December, 2007, pp.172-177 Science GENETIC ALGORITHM VERSUS PARTICLE SWARM OPTIMIZATION IN N-QUEEN PROBLEM * Azhar W. Hammad, ** Dr. Ban N. Thannoon Al-Nahrain

More information

Cluster quality assessment by the modified Renyi-ClipX algorithm

Cluster quality assessment by the modified Renyi-ClipX algorithm Issue 3, Volume 4, 2010 51 Cluster quality assessment by the modified Renyi-ClipX algorithm Dalia Baziuk, Aleksas Narščius Abstract This paper presents the modified Renyi-CLIPx clustering algorithm and

More information

CHAPTER 2 CONVENTIONAL AND NON-CONVENTIONAL TECHNIQUES TO SOLVE ORPD PROBLEM

CHAPTER 2 CONVENTIONAL AND NON-CONVENTIONAL TECHNIQUES TO SOLVE ORPD PROBLEM 20 CHAPTER 2 CONVENTIONAL AND NON-CONVENTIONAL TECHNIQUES TO SOLVE ORPD PROBLEM 2.1 CLASSIFICATION OF CONVENTIONAL TECHNIQUES Classical optimization methods can be classified into two distinct groups:

More information

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract

More information

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland Genetic Programming Charles Chilaka Department of Computational Science Memorial University of Newfoundland Class Project for Bio 4241 March 27, 2014 Charles Chilaka (MUN) Genetic algorithms and programming

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Centroid based clustering of high throughput sequencing reads based on n-mer counts

Centroid based clustering of high throughput sequencing reads based on n-mer counts Solovyov and Lipkin BMC Bioinformatics 2013, 14:268 RESEARCH ARTICLE OpenAccess Centroid based clustering of high throughput sequencing reads based on n-mer counts Alexander Solovyov * and W Ian Lipkin

More information

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 37 CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 4.1 INTRODUCTION Genes can belong to any genetic network and are also coordinated by many regulatory

More information

Pattern Mining in Frequent Dynamic Subgraphs

Pattern Mining in Frequent Dynamic Subgraphs Pattern Mining in Frequent Dynamic Subgraphs Karsten M. Borgwardt, Hans-Peter Kriegel, Peter Wackersreuther Institute of Computer Science Ludwig-Maximilians-Universität Munich, Germany kb kriegel wackersr@dbs.ifi.lmu.de

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Biclustering Bioinformatics Data Sets. A Possibilistic Approach

Biclustering Bioinformatics Data Sets. A Possibilistic Approach Possibilistic algorithm Bioinformatics Data Sets: A Possibilistic Approach Dept Computer and Information Sciences, University of Genova ITALY EMFCSC Erice 20/4/2007 Bioinformatics Data Sets Outline Introduction

More information

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS Ivan Vogel Doctoral Degree Programme (1), FIT BUT E-mail: xvogel01@stud.fit.vutbr.cz Supervised by: Jaroslav Zendulka E-mail: zendulka@fit.vutbr.cz

More information

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding e Scientific World Journal, Article ID 746260, 8 pages http://dx.doi.org/10.1155/2014/746260 Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding Ming-Yi

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Redefining and Enhancing K-means Algorithm

Redefining and Enhancing K-means Algorithm Redefining and Enhancing K-means Algorithm Nimrat Kaur Sidhu 1, Rajneet kaur 2 Research Scholar, Department of Computer Science Engineering, SGGSWU, Fatehgarh Sahib, Punjab, India 1 Assistant Professor,

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

CHAPTER 5 CLUSTER VALIDATION TECHNIQUES

CHAPTER 5 CLUSTER VALIDATION TECHNIQUES 120 CHAPTER 5 CLUSTER VALIDATION TECHNIQUES 5.1 INTRODUCTION Prediction of correct number of clusters is a fundamental problem in unsupervised classification techniques. Many clustering techniques require

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Particle Swarm Optimization Based Approach for Location Area Planning in Cellular Networks

Particle Swarm Optimization Based Approach for Location Area Planning in Cellular Networks International Journal of Intelligent Systems and Applications in Engineering Advanced Technology and Science ISSN:2147-67992147-6799 www.atscience.org/ijisae Original Research Paper Particle Swarm Optimization

More information

Gene expression & Clustering (Chapter 10)

Gene expression & Clustering (Chapter 10) Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

A Session-based Ontology Alignment Approach for Aligning Large Ontologies

A Session-based Ontology Alignment Approach for Aligning Large Ontologies Undefined 1 (2009) 1 5 1 IOS Press A Session-based Ontology Alignment Approach for Aligning Large Ontologies Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University,

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Chapter 6 Continued: Partitioning Methods

Chapter 6 Continued: Partitioning Methods Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal

More information

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

International Journal of Digital Application & Contemporary research Website: (Volume 1, Issue 7, February 2013)

International Journal of Digital Application & Contemporary research Website:   (Volume 1, Issue 7, February 2013) Performance Analysis of GA and PSO over Economic Load Dispatch Problem Sakshi Rajpoot sakshirajpoot1988@gmail.com Dr. Sandeep Bhongade sandeepbhongade@rediffmail.com Abstract Economic Load dispatch problem

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 2016 201 Road map What is Cluster Analysis? Characteristics of Clustering

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

Exploring Econometric Model Selection Using Sensitivity Analysis

Exploring Econometric Model Selection Using Sensitivity Analysis Exploring Econometric Model Selection Using Sensitivity Analysis William Becker Paolo Paruolo Andrea Saltelli Nice, 2 nd July 2013 Outline What is the problem we are addressing? Past approaches Hoover

More information

Center-Based Sampling for Population-Based Algorithms

Center-Based Sampling for Population-Based Algorithms Center-Based Sampling for Population-Based Algorithms Shahryar Rahnamayan, Member, IEEE, G.GaryWang Abstract Population-based algorithms, such as Differential Evolution (DE), Particle Swarm Optimization

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Value Added Association Rules

Value Added Association Rules Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency

More information

Biclustering for Microarray Data: A Short and Comprehensive Tutorial

Biclustering for Microarray Data: A Short and Comprehensive Tutorial Biclustering for Microarray Data: A Short and Comprehensive Tutorial 1 Arabinda Panda, 2 Satchidananda Dehuri 1 Department of Computer Science, Modern Engineering & Management Studies, Balasore 2 Department

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

Simplicial Global Optimization

Simplicial Global Optimization Simplicial Global Optimization Julius Žilinskas Vilnius University, Lithuania September, 7 http://web.vu.lt/mii/j.zilinskas Global optimization Find f = min x A f (x) and x A, f (x ) = f, where A R n.

More information

Missing Data Estimation in Microarrays Using Multi-Organism Approach

Missing Data Estimation in Microarrays Using Multi-Organism Approach Missing Data Estimation in Microarrays Using Multi-Organism Approach Marcel Nassar and Hady Zeineddine Progress Report: Data Mining Course Project, Spring 2008 Prof. Inderjit S. Dhillon April 02, 2008

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Meta- Heuristic based Optimization Algorithms: A Comparative Study of Genetic Algorithm and Particle Swarm Optimization

Meta- Heuristic based Optimization Algorithms: A Comparative Study of Genetic Algorithm and Particle Swarm Optimization 2017 2 nd International Electrical Engineering Conference (IEEC 2017) May. 19 th -20 th, 2017 at IEP Centre, Karachi, Pakistan Meta- Heuristic based Optimization Algorithms: A Comparative Study of Genetic

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing 1. Introduction 2. Cutting and Packing Problems 3. Optimisation Techniques 4. Automated Packing Techniques 5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing 6.

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Iterative Signature Algorithm for the Analysis of Large-Scale Gene Expression Data. By S. Bergmann, J. Ihmels, N. Barkai

Iterative Signature Algorithm for the Analysis of Large-Scale Gene Expression Data. By S. Bergmann, J. Ihmels, N. Barkai Iterative Signature Algorithm for the Analysis of Large-Scale Gene Expression Data By S. Bergmann, J. Ihmels, N. Barkai Reasoning Both clustering and Singular Value Decomposition(SVD) are useful tools

More information

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Introduction to GE Microarray data analysis Practical Course MolBio 2012 Introduction to GE Microarray data analysis Practical Course MolBio 2012 Claudia Pommerenke Nov-2012 Transkriptomanalyselabor TAL Microarray and Deep Sequencing Core Facility Göttingen University Medical

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

ROTS: Reproducibility Optimized Test Statistic

ROTS: Reproducibility Optimized Test Statistic ROTS: Reproducibility Optimized Test Statistic Fatemeh Seyednasrollah, Tomi Suomi, Laura L. Elo fatsey (at) utu.fi March 3, 2016 Contents 1 Introduction 2 2 Algorithm overview 3 3 Input data 3 4 Preprocessing

More information

Telecommunication and Informatics University of North Carolina, Technical University of Gdansk Charlotte, NC 28223, USA

Telecommunication and Informatics University of North Carolina, Technical University of Gdansk Charlotte, NC 28223, USA A Decoder-based Evolutionary Algorithm for Constrained Parameter Optimization Problems S lawomir Kozie l 1 and Zbigniew Michalewicz 2 1 Department of Electronics, 2 Department of Computer Science, Telecommunication

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

An Architecture for Semantic Enterprise Application Integration Standards

An Architecture for Semantic Enterprise Application Integration Standards An Architecture for Semantic Enterprise Application Integration Standards Nenad Anicic 1, 2, Nenad Ivezic 1, Albert Jones 1 1 National Institute of Standards and Technology, 100 Bureau Drive Gaithersburg,

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Open Access Research on the Prediction Model of Material Cost Based on Data Mining

Open Access Research on the Prediction Model of Material Cost Based on Data Mining Send Orders for Reprints to reprints@benthamscience.ae 1062 The Open Mechanical Engineering Journal, 2015, 9, 1062-1066 Open Access Research on the Prediction Model of Material Cost Based on Data Mining

More information

Constraints in Particle Swarm Optimization of Hidden Markov Models

Constraints in Particle Swarm Optimization of Hidden Markov Models Constraints in Particle Swarm Optimization of Hidden Markov Models Martin Macaš, Daniel Novák, and Lenka Lhotská Czech Technical University, Faculty of Electrical Engineering, Dep. of Cybernetics, Prague,

More information

CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS

CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS 6.1 Introduction Gradient-based algorithms have some weaknesses relative to engineering optimization. Specifically, it is difficult to use gradient-based algorithms

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering is one of the fundamental and ubiquitous tasks in exploratory data analysis a first intuition about the

More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information

Evolving Variable-Ordering Heuristics for Constrained Optimisation

Evolving Variable-Ordering Heuristics for Constrained Optimisation Griffith Research Online https://research-repository.griffith.edu.au Evolving Variable-Ordering Heuristics for Constrained Optimisation Author Bain, Stuart, Thornton, John, Sattar, Abdul Published 2005

More information

Evolving SQL Queries for Data Mining

Evolving SQL Queries for Data Mining Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper

More information

PROBLEM FORMULATION AND RESEARCH METHODOLOGY

PROBLEM FORMULATION AND RESEARCH METHODOLOGY PROBLEM FORMULATION AND RESEARCH METHODOLOGY ON THE SOFT COMPUTING BASED APPROACHES FOR OBJECT DETECTION AND TRACKING IN VIDEOS CHAPTER 3 PROBLEM FORMULATION AND RESEARCH METHODOLOGY The foregoing chapter

More information

Inital Starting Point Analysis for K-Means Clustering: A Case Study

Inital Starting Point Analysis for K-Means Clustering: A Case Study lemson University TigerPrints Publications School of omputing 3-26 Inital Starting Point Analysis for K-Means lustering: A ase Study Amy Apon lemson University, aapon@clemson.edu Frank Robinson Vanderbilt

More information

Clustering. Supervised vs. Unsupervised Learning

Clustering. Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data

Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Ryan Atallah, John Ryan, David Aeschlimann December 14, 2013 Abstract In this project, we study the problem of classifying

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information