Mining Gene Expression Data Using PCA Based Clustering

Vol. 5, No. 1, January-June 2012, Published by Serials Publications

N.P. Gopalan (1) and B. Sathiyabhama (2)*
(1) Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India, gopalan@nitt.edu
(2) Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, Tamilnadu, India, sathya674@yahoo.co.in
* Corresponding author: sathya674@yahoo.co.in

ABSTRACT: The amount of laboratory data in molecular biology and bioinformatics grows exponentially each year owing to advanced technologies such as DNA microarrays, so new efficient and effective clustering methods must be developed to process it. Numerous clustering techniques have been applied to gene expression data to extract biologically significant patterns, but issues such as clustering quality, high dimensionality of the input data and computational efficiency still need to be addressed. A novel hybrid clustering algorithm is proposed that blends Principal Component Analysis (PCA) with enhanced correlation-based clustering. PCA is a classical statistical technique for finding patterns in high-dimensional data. The empirical results show that this approach provides more stable clustering performance in terms of quality and efficiency. The resulting clusters offer potential insight into gene function, molecular biological processes and regulatory mechanisms.

Keywords: Clustering analysis; Bioinformatics; Gene expression data; Principal Component Analysis

1. INTRODUCTION

DNA microarray technology has made it possible to monitor simultaneously the expression levels of thousands of genes during important biological processes and across collections of related samples.
It holds enormous promise in areas such as revealing the function of genes in various cell populations, tumor classification, drug target identification, understanding cellular pathways, and predicting the outcome of therapy [1], [2]. A major application of microarray technology is gene expression profiling to predict outcome in multiple tumor types [3]. Data mining methods can be applied to various gene expression data sets, including cancer data sets, to identify distinct genes for classifying tumors.

Cluster analysis is a data mining technique that seeks to partition a given data set into groups based on specified features, so that the data points within a group are more similar to each other than to the points in different groups [2]. Clustering techniques are useful for identifying (as yet unknown) subclasses of tumors, or for identifying clusters of genes that are co-regulated or share the same function [4]. These methods have been successful in separating certain types of genes associated with different types of leukemia and lymphoma [3]. Genes that are grouped into biologically relevant clusters have similar expression patterns and are called co-expressed genes. Clustering has become an efficient and indispensable tool for in-silico analysis of gene expression data [5], [6], [7], [8], [9].

A variant of the hierarchical clustering algorithm is used by Eisen et al. [7] to identify groups of co-expressed yeast genes. A two-way clustering technique [5] is used to detect clusters of correlated genes and tissues. Self-Organizing Maps (SOM) [9] are used to identify clusters in the yeast cell cycle data set and the human hematopoietic differentiation data set. Biologically meaningful clusters of the yeast Cho data have been determined using the genetic enhanced K-Means clustering method [10]. A variety of clustering validation measures are used in the literature to evaluate the validity of clustering results [11], [12].
Numerous validation indexes, such as the Jaccard coefficient, the simple matching coefficient and Hubert's Γ (gamma) statistic (HGS) [13], are used in practice to evaluate the stability of parameters and the reliability of clustering algorithms, but they are applied only in the post-validation phase. Clustering techniques suffer from poor clustering quality and destabilization of clusters [4], [14], [15], [16]. Vincent Tseng et al. have used a correlation-based clustering algorithm for partitioning co-regulated genes. To improve the quality of clustering, the validation technique is integrated into the clustering process [13]. In the initial stage of clustering, this algorithm adds highly negatively correlated elements in addition to positively correlated elements. In a later phase, it removes the cluster members that were added inaccurately. Hence it consumes more computational resources.

Recently the authors have developed a variant of sparse matrices to represent the gene expression similarity matrix [17], [18]. A sparse matrix is a suitable data structure for effective memory utilization. The authors also improved the validation statistic by replacing the basic Hubert's statistic with a fast heuristic, the Enhanced HGS (EHGS). Computational intelligence [18], [19] is generally accepted to include evolutionary computation and is used to increase the precision of the resolved structure. The genetic algorithm (GA) has proven to be a robust and effective search method that requires very little information about the problem to explore a large search space. A blend of computational intelligence and clustering approaches provides rapid, automated feature selection and pattern recognition for a wide assortment of gene expression profiles [19].

Most clustering algorithms suffer from the high dimensionality and huge size of the data. Analyzing these fast-growing gene expression data sets efficiently and effectively requires a good clustering algorithm, but the dimensionality and size of the data pose challenging problems in both computational and biomedical research, and the difficult task ahead is transforming gene expression data into subject-specific knowledge. Various methods have been developed to reduce the size of gene expression data [20], [21], [22]. In the proposed work, the clustering algorithm is integrated with a dimensionality reduction technique, Principal Component Analysis (PCA), whose goal is to reduce the dimensionality of the data to facilitate visualization and further analysis. PCA is often used as a pre-processing step before the clustering analysis of large data sets and is widely used on gene expression data.

2. RESEARCH METHODOLOGY

The high dimensionality of gene expression data sets and the high percentage of irrelevant or redundant genes make it very difficult either to classify samples or to pick out substantial genes in a context where little domain knowledge is available.
To address this problem, PCA has been applied to analyze gene expression data. PCA is a classical statistical technique that reduces the dimensionality of the data by transforming it to a set of variables that summarize the features of the data without much loss of information [22]. The principal components (PCs) are uncorrelated and ordered. PCA is closely related to a mathematical technique called Singular Value Decomposition (SVD), and it is applied before the clustering process, so that only the relevant data is passed to the clustering step.

SVD takes a gene expression data matrix A of order n X p, where the n rows represent the genes and the p columns (p is approximately equal to n) represent the experimental conditions. The SVD theorem is as follows:

    A_{n \times p} = U_{n \times n} \, S_{n \times p} \, V^{T}_{p \times p}    (1)

    U^{T} U = I_{n \times n}, \qquad V^{T} V = I_{p \times p}    (2)

U and V are orthogonal. The columns of U are the left singular vectors (the gene coefficient vectors), and S has the same dimensions as A. SVD thus represents an expansion of the original data in a coordinate system in which the covariance matrix is diagonal. Computing the SVD consists of finding the eigenvalues and eigenvectors of

    A A^{T} \quad \text{and} \quad A^{T} A    (3)

The components are selected according to the eigenvectors. Together they form a feature vector, which is the basis of data compression. The eigenvector with the highest eigenvalue is the first principal component of the data set: it is the direction that passes through the middle of the data and captures the most significant relationship between the data dimensions. The least significant components are ignored.

The clustering algorithm then forms the similarity matrix from the PCs, using only the relevant, biologically significant data. Unlike traditional clustering algorithms, the proposed approach uses a constraint-based addition procedure to add elements to the clusters. It never removes an element from a cluster once added, and outliers are filtered out during the initial phase of the clustering process.
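The SVD-based reduction described above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' implementation: the function name `pca_reduce`, the column centering, the random test matrix and the choice of three components are all illustrative and do not come from the paper.

```python
import numpy as np

def pca_reduce(A, n_components):
    """Project the genes (rows of A) onto the leading principal components.

    Assumption: each condition (column) is mean-centered before the SVD,
    which is the usual PCA convention; the paper does not state this step.
    """
    A = np.asarray(A, dtype=float)
    centered = A - A.mean(axis=0)
    # Thin SVD: centered = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of every gene along the first n_components directions.
    scores = U[:, :n_components] * s[:n_components]
    # Fraction of total variance captured by each retained component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, Vt[:n_components], explained[:n_components]

# Hypothetical data: 50 genes measured under 8 conditions.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
scores, components, explained = pca_reduce(A, 3)
```

The rows of `components` are eigenvectors of the centered A^T A, so keeping only the first few corresponds to ignoring the least significant components; a similarity matrix can then be built from `scores` instead of from the raw expression values.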
Because elements are never removed once added and outliers are filtered early, the stability and quality of the clustering process are improved. To assess the predictive power of the clustering algorithm and the quality of the clustering results, a combination of EHGS [17] and the figure of merit (FOM) [11] is used. A typical gene expression data set contains the expression levels of n genes measured under m experimental conditions. The expression levels of co-regulated genes are expected to vary similarly across the m conditions, so clustering the genes by the similarity of these measurements should isolate clusters of biologically related genes. The EHGS is as follows:

    \Gamma = \frac{\frac{1}{M}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} A(i,j)\,B(i,j) - \mu_A \mu_B}{\sigma_A \sigma_B}, \qquad M = \frac{n(n-1)}{2}    (4)

where A is the similarity matrix, B(i, j) = 1 if genes i and j are placed in the same cluster and 0 otherwise, and \mu and \sigma denote the mean and standard deviation of the M upper-triangular entries of the corresponding matrix.

A clustering algorithm is said to have good predictive power if genes in the same cluster tend to have similar expression levels. In the original FOM [11], one experimental condition is left out when producing the clusters; this leave-one-out condition is assumed to be the least significant constraint, and the clustering is evaluated with reference to it. This gives no guarantee that the left-out condition is the appropriate one for determining the predictive power of the clustering algorithm. Nor is there any proof that the left-out condition is not biologically significant, because every condition is equally likely to be the one left out. Hence the proposed algorithm uses a variant of FOM, a set of scalar quantities that determine the predictive power of the clusters. A threshold parameter is attached to every cluster produced for each pre-defined biological condition: T_i is the threshold parameter for the ith cluster (where 1 ≤ i ≤ k) and k is the number of clusters. This heuristic helps to reduce redundant computations.

The idea behind the FOM is that the data from conditions 0, 1, 2, ..., (m-1) are used to estimate the predictive power of the algorithm. Suppose k clusters C_1, C_2, ..., C_k are obtained, with cluster sizes s_1, s_2, ..., s_k such that \sum_i s_i = n. Let R(x, e) be the expression level of gene x under condition e in the input matrix, and let FOM(e, k) be the FOM for k clusters when condition e is used for validation together with the threshold values T_i. Following the form of the FOM in [11], with T_i in place of the cluster average, the FOM is defined as

    \mathrm{FOM}(e,k) = \sqrt{\frac{1}{n}\sum_{i=1}^{k}\sum_{x \in C_i}\bigl(R(x,e) - T_i\bigr)^{2}}    (5)

Numerous strategies for setting the threshold value are available in the literature [23], but there is no theoretical proof that a chosen value is appropriate for clustering. In the proposed approach, a metric suited to gene expression data clustering is assigned: the threshold value T_i is the expected average distance (according to the distance metric) of the objects assigned to cluster C_i. A mixture of the enhanced correlation-based variant and the FOM operators is used to select the appropriate cluster for the gene expression data and to validate the clusters at the same time. Hence the proposed algorithm, Gene clustering using Correlation Search Technique (G-CST), is scalable, efficient and resilient in determining biologically significant patterns.

3. THE PROPOSED (G-CST) ALGORITHM

    Input:  gene expression data (n X n matrix)
    Output: biologically significant clusters

    begin
      1. Initialization:
           A: input gene expression similarity matrix of order n X n
           perform PCA (eigen decomposition of A A^T and A^T A)
           M = n(n - 1) / 2
           S_A = sum_{i=1}^{n-1} sum_{j=i+1}^{n} A(i, j)
           S_B = S_AB = 0
           G = {1, 2, 3, ..., n}
           C_GST = {}
      2. while (G is not empty) do
         begin
           C_open = {};
           for i := 1 to n do a(i) = 0;
           Max_EHGS = max Γ;
           pick an element c from G with maximum neighbours;
           remove c from G;
           C_open = {c};
           for i := 1 to n do a(i) = A(c, i);
           // Adding elements to the cluster
           repeat
             evaluate Γ by equation (4) and FOM(e, k) by equation (5);
             while (the Max_EHGS and FOM(e, k) criteria hold) do
             begin
               select c from G with highest a(c);
               remove c from G;
               S_B = S_B + |C_open|;
               S_AB = S_AB + a(c);
               C_open = C_open ∪ {c};
               for i := 1 to n do a(i) = a(i) + A(c, i);
             end
           until all elements in A(i, j) are assigned;
           C_GST = C_GST ∪ {C_open};
         end
      3. Return the collection of clusters C_GST;
    end

Figure 1: Pseudo Code for G-CST Clustering

The algorithm consists of an initialization step and an iterative step. In the initialization step, the algorithm first computes the PCA by determining the eigenvectors. The similarity matrix is constructed from the feature vectors, after which the appropriate population is generated. The iterative step successively selects elements and allocates each one to the appropriate cluster.

3.1 Clustering and Validation

The input to the algorithm is a raw gene expression data matrix, which is converted to a sparse symmetric similarity matrix of the gene expression data set. The algorithm constructs clusters one at a time. The current cluster is denoted by C_open. Each cluster is started from a seed and constructed incrementally by adding items to C_open. The addition of data items is computed using EHGS [17], defined as Γ. The current maximum is represented as add(k). An element k is added if it has a high positive correlation Γ_max, i.e. high similarity. The algorithm also places gene data items of low similarity in different clusters according to the Γ value. The value of Γ lies in (-1, 1), and a higher value of Γ represents better clustering quality. A data item is added to the cluster if it satisfies the maximum-neighbours criterion and a threshold value. In general, the threshold value depends on the number of patterns and the number of features in the data set.

The set of clusters is stabilized by consecutive addition operations. To start a new cluster, the data item with the maximum number of neighbours, or closest data items, is used. A threshold value is also applied when an element is added. This automatically filters out the outlier data items and inserts the remaining items into their respective clusters. The mixture of validation measures provides increased predictive performance relative to other methods of pattern recognition. These are the principal heuristics attached to this algorithm, and they are responsible for assigning clusters to all the valid items. Unlike in the correlation-based clustering algorithm devised by Vincent Tseng et al. [13], added items never need to be removed from a cluster.

4. EMPIRICAL RESULTS

The performance of the proposed approach is compared with that of the K-Means, E-CAST [23] and E-CST algorithms on cancer gene expression data sets. Several other algorithms are available in the literature for comparison; these were chosen because they are the most closely related to the proposed algorithm. Data sets [24] from breast cell lines are used to evaluate the proposed methodology. The predictive power of the clustering algorithms is estimated with a mixture of FOM and EHGS. To obtain reliable clustering results, the proposed approach, K-Means, E-CAST and the Enhanced Correlation Search Technique (E-CST) are executed 25, 20, 25 and 20 times respectively.

Transfection with a single oncogene is expected to generate similar expression profiles, presumably because only a few genes are biologically influenced. It is therefore desirable to see whether the profiles of the different phenotypes can be partitioned. Owing to noise in the data and similarity between the different samples, common clustering techniques such as K-Means and E-CAST failed to produce good quality clusters.
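For reference when reading the comparison, the add-only cluster growth described in Section 3.1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the G-CST algorithm itself: it omits the PCA step and the EHGS and FOM checks, and it uses a single hypothetical `threshold` on the mean similarity to the open cluster as the only admission criterion. The function name `grow_clusters` and the toy matrix are invented for the example.

```python
import numpy as np

def grow_clusters(S, threshold):
    """Greedy add-only clustering on a symmetric similarity matrix S.

    Assumption: an element is admitted while its mean similarity to the
    open cluster is at least `threshold`; elements are never removed.
    """
    S = np.asarray(S, dtype=float)
    unassigned = set(range(S.shape[0]))
    clusters = []
    while unassigned:
        def neighbour_count(g):
            # Number of still-unassigned elements similar enough to g.
            return sum(1 for h in unassigned if h != g and S[g, h] >= threshold)
        # Seed the new cluster with the element that has the most neighbours.
        seed = max(sorted(unassigned), key=neighbour_count)
        unassigned.remove(seed)
        cluster = [seed]
        affinity = S[seed].copy()  # a(i): summed similarity of i to the open cluster
        while unassigned:
            cand = max(sorted(unassigned), key=lambda g: affinity[g])
            if affinity[cand] / len(cluster) < threshold:
                break  # no remaining element is close enough; close the cluster
            unassigned.remove(cand)
            cluster.append(cand)
            affinity += S[cand]  # a(i) = a(i) + A(c, i)
        clusters.append(cluster)
    return clusters

# Toy similarity matrix with two obvious groups: {0, 1, 2} and {3, 4}.
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.85, 0.2, 0.1],
    [0.8, 0.85, 1.0, 0.0, 0.1],
    [0.1, 0.2, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
])
clusters = grow_clusters(S, threshold=0.5)
```

Because the number of clusters falls out of the threshold rather than being supplied by the user, this kind of growth procedure needs no cluster count as input, which is the property the paper contrasts with K-Means.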
Expression levels of the four cell lines were measured in two separate sets of four measurements. The cluster structures of these data sets are determined in advance. From the given data set, users can set parameters to generate various kinds of gene expression data sets that vary in the number of clusters and in the number of genes in each cluster. Seed genes are generated first, and they must have the same number of constraints for all the clusters. If the seed genes and the threshold values are appropriately incorporated and tested in the algorithm, the resulting clusters have high intra-cluster similarity and low inter-cluster similarity. Outliers and noise are purged during the initial phase of the clustering process.

The proposed approach (G-CST) is compared with the other clustering algorithms. Table 1 gives the details of the data sets, cluster structure and clustering patterns for the proposed approach, E-CAST, K-Means and E-CST, together with their computational time (running time in Table 1). The newly designed algorithm outperforms the others, quantitatively and qualitatively, in computational time and memory utilization. E-CST comes closest to it in accuracy. In addition, the results illustrate that the quality of clustering is better in the proposed algorithm. This can provide more accurate results and insight into molecular processes, morphological characteristics and gene control functions.

Figures 2 and 3 depict a large contiguous group of genes that share similar expression patterns over a set of conditions. This type of clustering structure brings out the biological significance of the underlying genes. The curved lines in Figures 2 and 3 represent the sum of the average FOM(i, k) and EHGS measures over the experimental conditions. The newly designed algorithm and E-CST are very sensitive to outliers.
The number of clusters is a crucial parameter in traditional clustering algorithms, whereas the proposed approach produces the clusters automatically without any user input. The result of this clustering analysis may be a group of co-regulated genes (i.e. genes that exhibit similar experimental behavior) placed in the same cluster. Such groups express the relationships between the clusters and the functional categories in biological activities. The behavior of the clustering algorithms on the data sets presented here demonstrates a feature of gene expression that makes this method particularly useful. It is known that genes expressed together share common functions. Gene expression patterns suffice to separate genes into functional categories across a relatively small and redundant collection of conditions. It has been observed that adding more, and more diverse, conditions can only enhance these observations.

The behavior of the clustering algorithms on gene expression data set 2 is very similar to that on data set 1, as shown in Table 1. When the number of clusters is small, E-CAST and K-Means have comparable FOM and EHGS, which are lower than those of the new approach and E-CST. When the number of clusters is large, the proposed algorithm has comparable FOM and EHGS. In Figure 2, the curves have a knee-shaped structure between one and two clusters, which shows that cluster separation is at a minimum for data set 2. The data sets considered for evaluation exhibit declining validity measures under all algorithms as the number of clusters increases. Two factors contribute to this. First, the algorithms may be finding higher quality clusters as they subdivide large, coarse clusters into smaller, more homogeneous ones. Second, simply increasing the number of clusters tends to decrease the validity measures.
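The dependence of the FOM on the number of clusters can be explored with a small sketch of the leave-one-out figure of merit. Note the assumptions: this follows the original definition of Yeung et al. [11], which measures deviation from the cluster mean, not the thresholded variant proposed in this paper, and `rank_split` is a deliberately trivial, hypothetical stand-in for a real clustering algorithm.

```python
import numpy as np

def fom_for_condition(R, labels, e):
    """FOM(e, k): root mean squared deviation of condition e from cluster means."""
    n = R.shape[0]
    total = 0.0
    for c in np.unique(labels):
        values = R[labels == c, e]
        total += np.sum((values - values.mean()) ** 2)
    return np.sqrt(total / n)

def aggregate_fom(R, cluster_fn, k):
    """Sum FOM(e, k) over all conditions, clustering each time without condition e."""
    total = 0.0
    for e in range(R.shape[1]):
        reduced = np.delete(R, e, axis=1)   # leave condition e out
        labels = cluster_fn(reduced, k)     # cluster on the remaining conditions
        total += fom_for_condition(R, labels, e)
    return total

def rank_split(X, k):
    """Hypothetical stand-in clusterer: split genes into k groups by mean expression."""
    order = np.argsort(X.mean(axis=1))
    labels = np.empty(X.shape[0], dtype=int)
    for c, chunk in enumerate(np.array_split(order, k)):
        labels[chunk] = c
    return labels

# Toy data: 3 genes at level 0 and 3 genes at level 5, under 4 conditions.
R = np.vstack([np.zeros((3, 4)), np.full((3, 4), 5.0)])
fom_one_cluster = aggregate_fom(R, rank_split, 1)
fom_two_clusters = aggregate_fom(R, rank_split, 2)
```

A lower aggregate FOM means the left-out condition is better predicted by the clusters, so plotting it against k for each algorithm gives curves of the kind discussed for Figures 2 and 3.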
There is an obvious negative slope in both figures, which shows that clustering results with low values tend to correspond closely to the given functional categorization. The mixture of FOM and EHGS provides a meaningful estimate of cluster quality.
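As a companion to the FOM, the correlation-type validity measure can be sketched as the normalized Hubert Γ statistic between a similarity matrix and a partition. This is the standard textbook form, offered as an assumption about what the EHGS measures; the fast incremental heuristic of [17] is not reproduced here, and the function name `hubert_gamma` and the example values are illustrative.

```python
import numpy as np

def hubert_gamma(S, labels):
    """Normalized Hubert gamma between similarities and cluster co-membership.

    Only the upper triangle (i < j) is used, so each gene pair counts once.
    """
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    iu = np.triu_indices(S.shape[0], k=1)
    a = S[iu]                                                   # pairwise similarities
    b = (labels[:, None] == labels[None, :])[iu].astype(float)  # 1 if same cluster
    sa, sb = a.std(), b.std()
    if sa == 0 or sb == 0:
        raise ValueError("gamma is undefined when A or B is constant")
    return float(np.mean((a - a.mean()) * (b - b.mean())) / (sa * sb))

# Toy example: four genes in two clusters.
labels = np.array([0, 0, 1, 1])
S = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
gamma = hubert_gamma(S, labels)
```

Values near 1 indicate that highly similar genes share a cluster and dissimilar genes do not, which matches the paper's statement that a higher value of the statistic represents better clustering quality.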

Table 1: Experimental Results for 2 Gene Expression Data Sets

    Algorithm        No. of clusters   Enhanced statistic   No. of outliers   Running time     Patterns (approx.)
    E-CST            63,                                    ,10               O(n log n)       3000, 200
    E-CAST           58,                                    ,20               O(n^2)           2500, 180
    K-means          27,                                    ,12               O(n^2)           2000, 75
    Proposed G-CST   67,                                    < 100, 10         >= O(n log n)    > 3000, 200

Table 1 also shows the approximate number of outliers across all clusters together. In G-CST the number of outliers is clearly much lower, since only relevant genes are considered when building clusters. All the biologically significant patterns are extracted from the clusters, which are formed on the basis of the constraints specified by the proposed algorithm.

Figure 2: Clustering Behavior of Data Set 1

Figure 3: Clustering Behavior of Data Set 2

The proposed approach is superior to the existing approaches in quality, efficiency, stability and memory utilization. Figures 2 and 3 emphasize its ability to capture sharp coherent tendencies among gene expression data. In addition, the functionally enriched clusters show that these clusters carry significant biological meaning.

5. CONCLUSION AND FUTURE WORK

As the number of microarray experiments continues to increase drastically, and as these techniques become more and more a part of personalized healthcare, the computational methods that support this expansion must develop as well. Most of the clustering algorithms used in practice have certain inherent difficulties. The novel approach presented here clusters gene expression data sets and produces good results, and it shows great promise for gleaning information from gene expression profiles. To evaluate the performance of this method, cancer gene expression data sets have been used, and it has been compared with the E-CST, E-CAST and K-Means clustering algorithms.
It is clear that the healthcare industry requires methods to move microarray data rapidly into practical use. Future work includes application to more real data sets and a theoretical analysis of how the threshold parameter is determined. A key roadblock remains the discovery of exact patterns with predictive accuracy while retaining high accuracy in clustering. Combinations of computational intelligence approaches hold promise for rapid, automated pattern recognition across a wide assortment of data. A blend of parallel approaches, such as genetic algorithms and characterization-guided clustering, may improve performance and will play an increasingly important role in gene expression analysis.

References

[1] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher, Comprehensive Identification of Cell Cycle-Regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray Hybridization, Molecular Biology of the Cell, 9(12).
[2] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proc. ACM SIGMOD Int'l Conf. Management of Data.
[3] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M. Caligiuri, et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 286.
[4] M.S. Chen, J. Han, and P.S. Yu, Data Mining: An Overview from a Database Perspective, IEEE Trans. Knowledge and Data Eng., 8(6), Dec.
[5] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine, Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays, Proc. Nat'l Academy of Sciences, 96, 1999.

[6] A. Ben-Dor and Z. Yakhini, Clustering Gene Expression Patterns, J. Computational Biology, 6.
[7] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, Clustering Analysis and Display of Genome Wide Expression Patterns, Proc. Nat'l Academy of Sciences, 95.
[8] M.K. Kerr and G.A. Churchill, Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments, Proc. Nat'l Academy of Sciences, 98(16).
[9] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation, Proc. Nat'l Academy of Sciences, 96(6).
[10] N.P. Gopalan and B. Sathiyabhama, Scalable Biclustering Gene Expression Data using Genetic Enhanced K-Means Algorithm, Proc. National Conference on High Performance Computing - VISION 06.
[11] K.Y. Yeung, D.R. Haynor, and W.L. Ruzzo, Validating Clustering for Gene Expression Data, Bioinformatics, 17(4).
[12] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, N.J.: Prentice Hall.
[13] Vincent S. Tseng and Ching-Pin Kao, Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4), Dec.
[14] S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM Int'l Conf. Management of Data.
[15] S. Guha, R. Rastogi, and K. Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, 15th Int'l Conf. Data Eng.
[16] T. Kohonen, The Self-Organizing Map, Proc. IEEE, 78(9).
[17] B. Sathiyabhama and N.P. Gopalan, Correlation Search Technique for Clustering Cancer Gene Expression Data, WSEAS International Conferences, Lisbon, Sep.
[18] B. Sathiyabhama and N.P. Gopalan, Enhanced Correlation Search Technique for Clustering Cancer Gene Expression Data, WSEAS Transactions on Information Science and Applications, 12, 3, 2006.
[19] G.B. Fogel and D.W. Corne, Evolutionary Computation in Bioinformatics, Morgan Kaufmann, San Francisco.
[20] E.P. Xing and R.M. Karp, CLIFF: Clustering of High-Dimensional Microarray Data via Iterative Feature Filtering Using Normalized Cuts, Bioinformatics, 17(4).
[21] Ka Yee Yeung and Walter L. Ruzzo, An Empirical Study on Principal Component Analysis for Clustering Gene Expression Data, Technical Report UW-CSE, Department of Computer Science and Engineering, University of Washington.
[22] N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar, and N.V. Fedoroff, Fundamental Patterns Underlying Gene Expression Profiles: Simplicity from Complexity, Proceedings of the National Academy of Sciences USA, 97.
[23] Abdelghani Bellaachia et al., E-CAST: A Data Mining Algorithm for Gene Expression Data, Proc. BIOKDD02: Workshop on Data Mining in Bioinformatics (with SIGKDD02 conference).
[24] H. Kluger, B. Kacinski, Y. Kluger, Mironenko, Gilmore Hebert, M., J. Chang, A.S. Perkins, and E. Sapi, Microarray Analysis of Invasive and Metastatic in a Breast Cancer Model, poster presented at the Gordon Conference on Cancer, Newport, RI, 2001.

Double Self-Organizing Maps to Cluster Gene Expression Data

Double Self-Organizing Maps to Cluster Gene Expression Data Double Self-Organizing Maps to Cluster Gene Expression Data Dali Wang, Habtom Ressom, Mohamad Musavi, Cristian Domnisoru University of Maine, Department of Electrical & Computer Engineering, Intelligent

More information

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification 1 Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification Feng Chu and Lipo Wang School of Electrical and Electronic Engineering Nanyang Technological niversity Singapore

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Adrian Alexa Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken slides by Jörg Rahnenführer NGFN - Courses

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

Missing Data Estimation in Microarrays Using Multi-Organism Approach

Missing Data Estimation in Microarrays Using Multi-Organism Approach Missing Data Estimation in Microarrays Using Multi-Organism Approach Marcel Nassar and Hady Zeineddine Progress Report: Data Mining Course Project, Spring 2008 Prof. Inderjit S. Dhillon April 02, 2008

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS

AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS H.S Behera Department of Computer Science and Engineering, Veer Surendra Sai University

More information

An empirical study on Principal Component Analysis for clustering gene expression data

An empirical study on Principal Component Analysis for clustering gene expression data An empirical study on Principal Component Analysis for clustering gene expression data Ka Yee Yeung Walter L Ruzzo Technical Report UW-CSE-2000-11-03 November, 2000 Department of Computer Science & Engineering

More information
