Clustering of SNP Data with Application to Genomics

Michael K. Ng, Mark J. Li, Sio I. Ao
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
mng@math.hkbu.edu.hk

Yiu-ming Cheung
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong

Pak C. Sham
Genome Research Center, The University of Hong Kong, Pokfulam Road, Hong Kong

Joshua Z. Huang
E-Business Technology Institute, The University of Hong Kong, Pokfulam Road, Hong Kong

Research supported in part by HKRGC 7035/04P, 7035/05P and HKBU FRGs.

Abstract

Single nucleotide polymorphisms (SNPs) are very common throughout the genome and hence are potentially valuable for mapping disease susceptibility loci by detecting association between SNP markers and disease. Many methods may only be applicable when marker haplotypes, rather than genotypes (categorical data), are available for analysis. In this paper, we explore the properties of k-modes clustering algorithms for categorical data when they are applied to SNP data to detect association between SNP markers and disease. Subspace k-modes clustering properties are also considered and tested.

1 Introduction

Because of their ubiquity, there has been considerable interest in using single nucleotide polymorphisms (SNPs) to fine-map susceptibility loci [5]. It is estimated that 90% of naturally occurring sequence variations are SNPs [9]. These variations are sufficiently finely spaced that one may reasonably expect to find, within a defined chromosomal region, SNPs that manifest detectable linkage disequilibrium in some human populations. Detecting association between SNPs and disease may provide useful evidence for the existence of a susceptibility locus within such a region, allowing one to proceed to more intensive investigations which can lead to identification of the gene and pathogenic polymorphisms.

Several strategies have been proposed that utilize two-point methods to localize the position of a disease locus [11]. However, SNPs
studied individually might be expected to provide relatively little information for detecting association between a disease and a chromosomal region [25, 27], especially if more than one mutation is present. Potentially, the amount of information available from SNPs could be increased dramatically by utilizing information from several marker loci simultaneously, with the aim of detecting association with a marker haplotype rather than just one biallelic marker. Composite likelihood methods combining disease associations with a series of linked markers from haplotypes have been proposed by Collins et al. [10], Lam et al. [23] and McPeek & Strahs [24].

A problem with any method which directly utilizes case and control haplotypes is that such haplotypes are rarely available for autosomal markers. In practice, if one wishes to use haplotypes then one must generally rely on a combination of deduction and estimation. Methods which explicitly model the development of linkage disequilibrium relationships between marker and disease loci may be expected to perform badly if the assumptions of the model are violated. Methods which assume only one mutational event may perform poorly if more than one has occurred. Finally, genotyping and map errors may cause particular problems for methods which rely on a regular relationship between physical position and linkage disequilibrium parameters.

An alternative approach which might tackle some of these difficulties would be to utilize data mining (clustering) methods to investigate association between a disease phenotype and a multilocus genotype, without assuming the availability of haplotypes required by most of the above-mentioned methods, and without any attempt to model the process whereby disease-related haplotypes might have been generated. Since multilocus genotype data are categorical, the main contribution of this paper is to use k-modes clustering algorithms to detect association between a disease and multiple marker genotypes.

The outline of this paper is as follows. In Section 2, we review a k-modes clustering algorithm. In Section 3, a real data set is employed to test the performance of the k-modes clustering algorithm and to compare it with other methods such as logistic regression, neural networks and decision trees. Finally, we consider subspace clustering algorithms and present some preliminary clustering results.

2 The K-modes Clustering Method

Since it was first published in 1997, the k-modes algorithm [19] has become a popular technique for solving categorical data clustering problems in different application domains. The k-modes algorithm extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process so as to minimize the clustering cost function. These extensions remove the numeric-only limitation of the k-means algorithm and enable the k-means clustering process to be used to efficiently cluster large categorical data sets from real-world databases. An equivalent nonparametric approach to deriving clusters from categorical data is presented in [21].

We note that the mode of a cluster can be viewed as the representative haplotype of that cluster. On the assumption that some association has been maintained through linkage disequilibrium, particular haplotypes should be more commonly found on chromosomes bearing pathogenic mutations. Such haplotypes should act in a similar way to multiallelic markers, and should be better able to produce detectable association, especially when there are multiple mutation events.

We assume the set of individuals to be clustered is stored in a table T defined by a set of SNP locus attributes (or simply
attributes), $A_1, A_2, \ldots, A_m$. Each attribute $A_j$ describes a domain of values, denoted by $\mathrm{DOM}(A_j)$, associated with a defined set of genotypes. An example is given in the following table:

Individual (Case or Control)   Locus 1   Locus 2   ...   Locus m
...
n

For instance, if $A_1$ has two alleles, 1 and 3, then the three possible genotypes are 11, 13 and 33. Genotypes for each attribute are treated as categorical (nominal), so that no genetic model assumptions are incorporated. A domain $\mathrm{DOM}(A_j)$ is defined as categorical if it is finite and unordered, i.e., for any $a, b \in \mathrm{DOM}(A_j)$, either $a = b$ or $a \neq b$. An individual $X$ in T can be logically represented as a conjunction of attribute-value pairs

$$[A_1 = x_1] \wedge [A_2 = x_2] \wedge \cdots \wedge [A_m = x_m],$$

where $x_j \in \mathrm{DOM}(A_j)$ for $1 \le j \le m$. Without ambiguity, we represent $X$ as a vector $[x_1, x_2, \ldots, x_m]$. $X$ is called a categorical individual if it has only categorical genotype values. We assume that every individual has exactly $m$ attribute genotype values. If the value of an attribute $A_j$ is missing, then we denote the attribute value of $A_j$ by a category $\epsilon$, which means empty.

Let $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ be a set of $n$ individuals. Individual $X_i$ is represented as $[x_{i,1}, x_{i,2}, \ldots, x_{i,m}]$. We write $X_i = X_k$ if $x_{i,j} = x_{k,j}$ for $1 \le j \le m$. The relation $X_i = X_k$ does not mean that $X_i$ and $X_k$ are the same individual in the table, but rather that the two individuals have equal genotype values in attributes $A_1, A_2, \ldots, A_m$.

The k-modes algorithm [19] makes the following modifications to the k-means algorithm: (i) using a simple matching dissimilarity measure for categorical individuals, (ii) replacing the means of clusters with the modes, and (iii) using a frequency-based method to find the modes. These modifications remove the numeric-only limitation of the k-means algorithm but maintain its efficiency in clustering large categorical data sets [19].

Let $X$ and $Y$ be two categorical individuals represented by $[x_1, x_2, \ldots, x_m]$ and $[y_1, y_2, \ldots, y_m]$ respectively. The simple matching dissimilarity measure between $X$
and $Y$ is defined as follows:

$$d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j), \quad \text{where} \quad \delta(x_j, y_j) = \begin{cases} 0, & x_j = y_j, \\ 1, & x_j \neq y_j. \end{cases} \tag{1}$$

It is easy to verify that the function $d$ defines a metric space on the set of categorical individuals. Traditionally, the simple matching approach is often used for binary variables which are converted from categorical variables. We note that $d$ is also a kind of generalized Hamming distance.

The k-modes algorithm uses the k-means paradigm to cluster categorical data. The objective of clustering a set of $n$ categorical individuals into $k$ clusters is to find $W$ and $Z$ that minimize

$$F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li} \, d(Z_l, X_i) \tag{2}$$

subject to

$$w_{li} \in \{0, 1\}, \quad 1 \le l \le k, \ 1 \le i \le n, \tag{3}$$

$$\sum_{l=1}^{k} w_{li} = 1, \quad 1 \le i \le n, \tag{4}$$

and

$$0 < \sum_{i=1}^{n} w_{li} < n, \quad 1 \le l \le k, \tag{5}$$

where $k \ (\le n)$ is a known number of clusters, $W = [w_{li}]$ is a $k$-by-$n$ $\{0, 1\}$ matrix, $Z = [Z_1, Z_2, \ldots, Z_k]$, and $Z_i$ is the $i$th cluster center with the categorical attributes $A_1, A_2, \ldots, A_m$. Recall that $Z_i$ can be viewed as the representative haplotype of the corresponding cluster.

Minimization of $F$ in (2) with the constraints in (3), (4) and (5) forms a class of constrained nonlinear optimization problems whose solutions are unknown. The usual method for optimizing $F$ in (2) is to use partial optimization for $Z$ and $W$. In this method we first fix $Z$ and find necessary conditions on $W$ to minimize $F$. Then we fix $W$ and minimize $F$ with respect to $Z$. This process is formalized in the k-modes algorithm as follows.

Algorithm: The k-modes algorithm

1. Choose an initial mode $Z^{(1)}$ of each cluster. Determine $W^{(1)}$ such that $F(W, Z^{(1)})$ is minimized. Set $t = 1$.
2. Determine $Z^{(t+1)}$ such that $F(W^{(t)}, Z^{(t+1)})$ is minimized. If $F(W^{(t)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t)})$, then stop; otherwise go to Step 3.
3. Determine $W^{(t+1)}$ such that $F(W^{(t+1)}, Z^{(t+1)})$ is minimized. If $F(W^{(t+1)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t+1)})$, then stop; otherwise set $t = t + 1$ and go to Step 2.

The matrices $W$ and $Z$ are calculated as follows. Let $\hat{Z}$ be fixed and consider the problem $\min_W F(W, \hat{Z})$ subject to (3), (4) and (5). The minimizer $\hat{W}$ is given by

$$\hat{w}_{li} = \begin{cases} 1, & \text{if } d(\hat{Z}_l, X_i) \le d(\hat{Z}_h, X_i), \ 1 \le h \le k, \\ 0, & \text{otherwise.} \end{cases}$$

Let $\mathbf{X}$ be a set of categorical individuals described by categorical attributes $A_1, A_2, \ldots, A_m$ and $\mathrm{DOM}(A_j) = \{a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(n_j)}\}$, where $n_j$ is the number of categories of attribute $A_j$ for $1 \le j \le m$. Let the cluster centers $Z_l$ be represented by $[z_{l,1}, z_{l,2}, \ldots, z_{l,m}]$ for $1 \le l \le k$. Then the quantity $\sum_{l=1}^{k} \sum_{i=1}^{n} w_{li} \, d(Z_l, X_i)$ is minimized iff $z_{l,j} = a_j^{(r)} \in \mathrm{DOM}(A_j)$, where

$$\left| \{ w_{li} \mid x_{i,j} = a_j^{(r)}, \ w_{li} = 1 \} \right| \ \ge \ \left| \{ w_{li} \mid x_{i,j} = a_j^{(t)}, \ w_{li} = 1 \} \right|, \quad 1 \le t \le n_j,$$

for $1 \le j \le m$. Here $|\mathcal{X}|$ denotes the number of
elements in the set $\mathcal{X}$.

3 Experimental Results

In this section, a real data set is employed to test the performance of the k-modes clustering algorithm and to compare it with other methods such as logistic regression, neural networks and decision trees. We analyze the case/control populations in a data set from the Genome Research Center, The University of Hong Kong. The data consist of 488 cases (patients) recruited from hospitals in Hong Kong and 520 controls (normal) recruited from the community. 144 SNPs on chromosome 3p were picked by CLUSTAG, which we developed [4], giving an average marker density of 1 tagging SNP per 25 kb.

The following table shows the summary classification results for the k-modes clustering algorithm and the other methods. Each classification result is the average over ten runs of the algorithm. In the tests of logistic regression, decision tree and neural network, 60% of the data is used for training and 40% is used for validation.

Method                        Validation Accuracy
Logistic Regression
Decision Tree
Neural Network
k-modes Clustering (k = 2)
k-modes Clustering (k = 4)
k-modes Clustering (k = 6)

4 Subspace Clustering

When more SNPs (a genome-wide genotyping) are used to detect the association between a disease and multiple marker genotypes, we may need to consider subspace clustering techniques. More precisely, in a typical data set that contains the genotype data of several thousand SNPs in different individuals, we expect to find only several tens of SNPs having genotype patterns that are highly specific to each cluster of individuals. These SNPs are called the relevant SNPs, as opposed to the irrelevant SNPs that do not help much in identifying the cluster members (i.e., individuals of the same type). Because a large number of SNPs are irrelevant to each cluster, two individuals in the same cluster could have low similarity when measured by a similarity function (the matching distance in Section 2) that considers the genotypes of all SNPs. The clusters may thus be undetectable by the k-modes clustering algorithms.

The subspace clustering problem is defined for such a scenario. Each subspace cluster is a set of individuals with an associated set of relevant SNPs such that, in the subspace formed by the relevant SNPs, the individuals are similar to each other but dissimilar to individuals outside the cluster. In general, subspace clustering seeks to group objects into clusters on subsets of dimensions or attributes of a data set. It pursues two tasks: identification of the subsets of dimensions where clusters can be found, and discovery of the clusters from different subsets of dimensions.

According to the way in which the subsets of dimensions are identified, we can divide subspace clustering methods into two categories. The methods in the first category determine the exact subsets of dimensions where clusters are discovered. We call these methods hard subspace clustering; see for instance [1, 2, 3, 8, 17, 26, 28, 29]. The methods in the second category determine the subsets of dimensions according to the contributions of the dimensions in discovering the corresponding clusters. The contribution of a dimension is measured by a weight that is assigned to the dimension in the clustering process. We call these methods soft subspace clustering because every dimension contributes to the discovery of clusters, but the dimensions with larger weights form the subsets of dimensions of the clusters; see for instance [6, 12, 13, 14, 15, 16, 22].

We modify the weighting clustering algorithm EWKM [6] to formulate the k-modes subspace clustering algorithm. The CYP2D6 data of Hosking et al. [18] is used to test the algorithm. Four functional CYP2D6 polymorphisms predicting 99% of slow metabolizers were typed in 1018 Caucasians, and 41 predicted slow metabolizers were identified. The remaining 977 are therefore regarded as normal. A full description of the data, the database IDs and primers for the marker
SNPs are given in Hosking et al. [18].

We produced 100 clustering results for the k-modes algorithm and for EWKM with different values of γ. Here γ is the parameter used to control the number of relevant SNPs included in a cluster; for details, see [?]. If we take the clustering accuracy as the criterion for a good clustering result, the chance of obtaining a good result with EWKM is close to 60% over a large range of γ values [02,70], which is better than the k-modes algorithm. The distribution of the clustering accuracies is shown in Table 1.

Table 1. Distribution of accuracies in 100 runs for the k-modes and EWKM algorithms (columns: clustering accuracy, k-modes, EWKM (γ)).

Our preliminary results show that useful information can be obtained in case-control association studies by using a subspace clustering method to analyze high-dimensional categorical data from multiple SNPs. The method has the advantage of being almost entirely atheoretical, with no attempt to model the population history which has produced disease-related haplotypes. It might therefore be less sensitive than other methods to map errors or model violations. The preliminary investigations we have carried out show that subspace clustering methods can provide a simple and practical way of dealing with the multilocus genotypes obtained from standard case-control studies. The proposed method allows multiple markers to be analyzed simultaneously, even when haplotypes are unavailable, and does not rely on any model of population history or any genetic map to account for present patterns of linkage disequilibrium.

References

[1] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park, Fast algorithms for projected clustering, Proc. ACM SIGMOD, pp. 61-72, 1999.
[2] C. Aggarwal and P. Yu, Finding generalized projected clusters in high dimensional spaces, Proc. ACM SIGMOD, pp. 70-81, 2000.
[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proc. ACM SIGMOD, 1998.
[4] S. Ao, K. Yip, M. Ng, D. Cheung, P. Yee, I. Melhado and P. Sham, CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs, Bioinformatics, vol. 21, 2005.
[5] A. Brookes, The essence of SNPs, Gene, vol. 234.
[6] Y. Chan, W. Ching, M. Ng, and Z. Huang, An optimization algorithm for clustering using weighted dissimilarity measures, Pattern Recognition, vol. 37, no. 5, 2004.
[7] A. Chaturvedi, P. Green and J. Carroll, K-modes clustering, Journal of Classification, vol. 18, pp. 35-55, 2001.
[8] C. H. Cheng, A. W. Fu, and Y. Zhang, Entropy-based subspace clustering for mining numerical data, Proc. of the 5th ACM SIGKDD International Conference on Knowledge and Data Mining, pp. 84-93, 1999.
[9] F. Collins, L. Brooks and A. Chakravarti, A DNA polymorphism discovery resource for research on human genetic variation, Genome Research, vol. 8, 1998.
[10] A. Collins and N. Morton, Mapping a disease locus by allelic association, Proc. Natl. Acad. Sci. USA, vol. 95, 1998.
[11] B. Devlin and N. Risch, A comparison of linkage disequilibrium measures for fine-scale mapping, Genomics, vol. 29, 1995.
[12] C. Domeniconi, Locally adaptive techniques for pattern classification, Dissertation for Doctor of Philosophy, 2002.
[13] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, Subspace clustering of high dimensional data, Proc. of SIAM International Conference on Data Mining, 2004.
[14] J. Friedman and J. Meulman, Clustering objects on subsets of attributes, J. R. Statist. Soc. B, vol. 66, no. 4, 2004.
[15] H. Frigui and O. Nasraoui, Unsupervised learning of prototypes and attribute weights, Pattern Recognition, vol. 37, no. 3, 2004.
[16] H. Frigui and O. Nasraoui, Simultaneous clustering and dynamic keyword weighting for text documents, Survey of Text Mining, Michael Berry, Ed., Springer, pp. 45-70, 2004.
[17] S. Goil, H. Nagesh, and A. Choudhary, MAFIA: Efficient and scalable subspace clustering for very large data sets, Technical Report CPDC-TR, Northwestern University, 1999.
[18] L. Hosking, R. Boyd, C. Xu, M. Nissum, K. Cantone, I. Purvis, R. Khakhar, M. Barnes, U. Liberwirth, K. Hagen-Mann, M. Ehm, and J. Riley, Linkage disequilibrium mapping identifies a 390 kb region associated with CYP2D6 poor drug metabolising activity, Pharmacogenomics J., vol. 2, 2002.
[19] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, vol. 2, no. 3, 1998.
[20] Z. Huang and M. Ng, A fuzzy k-modes algorithm for clustering categorical data, IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, 1999.
[21] Z. Huang and M. Ng, A note on k-modes clustering, Journal of Classification, vol. 20, 2003.
[22] L. Jing, M. Ng, J. Xu, and Z. Huang, Subspace clustering of text documents with feature weighting k-means algorithm, PAKDD, 2005.
[23] J. Lam, K. Roeder and B. Devlin, Haplotype fine mapping by evolutionary trees, Am. J. Hum. Genet., vol. 66, 2000.
[24] M. McPeek and A. Strahs, Assessment of linkage disequilibrium by the decay of haplotype sharing with application to fine-scale genetic mapping, Am. J. Hum. Genet., vol. 65, 1999.
[25] P. Sham, J. Zhao and D. Curtis, The effect of marker characteristics on the power to detect linkage disequilibrium due to single or multiple ancestral mutations, Ann. Hum. Genet., vol. 64, 2000.
[26] K. Woo and J. Lee, FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting, PhD thesis, Korea Advanced Institute of Science and Technology, Taejon, Korea, 2002.
[27] M. Xiong and L. Jin, Comparison of the power and accuracy of biallelic and microsatellite markers in population-based gene-mapping methods, Am. J. Hum. Genet., vol. 64, 1999.
[28] J. Yang, W. Wang, H. Wang, and P. Yu, δ-clusters: capturing subspace correlation in a large data set, Proc. 18th International Conference on Data Engineering, 2002.
[29] K. Yip, D. Cheung, and M. Ng, A practical projected clustering algorithm, IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, 2004.
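
Illustrative sketch of the k-modes procedure

To make the k-modes procedure of Section 2 concrete, the following Python sketch implements the simple matching dissimilarity of Eq. (1), the nearest-mode assignment rule for W, and the frequency-based mode update for Z. It is a minimal sketch and not the authors' implementation. The function names, the `init_modes` parameter, breaking ties by the lowest cluster index, keeping the old mode when a cluster becomes empty, and the single stopping test (stop as soon as the cost F no longer decreases, instead of the two separate tests in Steps 2 and 3) are all assumptions made for illustration.

```python
from collections import Counter
import random


def matching_dissimilarity(x, y):
    """Simple matching dissimilarity of Eq. (1): number of loci whose genotypes differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def assign(individuals, modes):
    """W-step: give each individual the index of its nearest mode (ties: lowest index)."""
    return [
        min(range(len(modes)), key=lambda l: matching_dissimilarity(modes[l], x))
        for x in individuals
    ]


def update_modes(individuals, labels, modes):
    """Z-step: for each cluster and locus, take the most frequent genotype."""
    new_modes = []
    for l, old_mode in enumerate(modes):
        members = [x for x, lab in zip(individuals, labels) if lab == l]
        if not members:
            # Assumption: an empty cluster simply keeps its previous mode.
            new_modes.append(tuple(old_mode))
            continue
        new_modes.append(
            tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))
        )
    return new_modes


def cost(individuals, labels, modes):
    """Objective F(W, Z) of Eq. (2) for a hard assignment given as a label list."""
    return sum(matching_dissimilarity(modes[l], x) for x, l in zip(individuals, labels))


def k_modes(individuals, k, init_modes=None, seed=0, max_iter=100):
    """Alternate Z-steps and W-steps until the cost stops decreasing."""
    if init_modes is None:
        # Assumption: initial modes are k randomly chosen individuals.
        init_modes = random.Random(seed).sample(list(individuals), k)
    modes = [tuple(z) for z in init_modes]
    labels = assign(individuals, modes)
    current = cost(individuals, labels, modes)
    for _ in range(max_iter):
        new_modes = update_modes(individuals, labels, modes)
        new_labels = assign(individuals, new_modes)
        new_cost = cost(individuals, new_labels, new_modes)
        if new_cost >= current:
            break
        modes, labels, current = new_modes, new_labels, new_cost
    return labels, modes, current


if __name__ == "__main__":
    # Hypothetical toy genotypes at three loci with alleles 1 and 3.
    toy = [
        ("11", "13", "33"), ("11", "13", "33"), ("11", "13", "13"),
        ("33", "11", "11"), ("33", "11", "11"), ("33", "13", "11"),
    ]
    print(k_modes(toy, 2, init_modes=[toy[2], toy[5]]))
```

Each individual is a tuple of genotype strings, one per locus, so a returned mode can be read directly as the representative multilocus genotype of its cluster. Because the result depends on the initial modes, in practice one would run the sketch from several starting points, as the paper does when averaging over runs.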

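Illustrative sketch of soft subspace weighting

The paper does not spell out how EWKM is modified for categorical data, so the following Python sketch is only an assumed illustration of the soft subspace idea in Section 4: each cluster carries one weight per SNP, and SNPs on which the cluster members agree with the mode receive larger weights. The exponential form of the weights, the function names, and the way γ enters are assumptions chosen for illustration and are not taken from the paper or from [6].

```python
import math


def attribute_weights(members, mode, gamma):
    """Assumed weighting: loci with fewer mismatches against the mode get larger weights.

    members: list of genotype tuples assigned to one cluster
    mode:    the cluster mode (one genotype per locus)
    gamma:   positive parameter; larger values flatten the weights
    """
    mismatches = [
        sum(1 for x in members if x[j] != mode[j]) for j in range(len(mode))
    ]
    scores = [math.exp(-d / gamma) for d in mismatches]
    total = sum(scores)
    return [s / total for s in scores]  # weights are positive and sum to 1


def weighted_dissimilarity(x, mode, weights):
    """Weighted simple matching: sum of the weights of the mismatching loci."""
    return sum(w for a, b, w in zip(x, mode, weights) if a != b)


if __name__ == "__main__":
    # Hypothetical cluster: locus 1 is perfectly consistent, locus 2 is noisy.
    cluster = [("11", "13", "33"), ("11", "33", "13"), ("11", "11", "33")]
    mode = ("11", "13", "33")
    print(attribute_weights(cluster, mode, gamma=1.0))
```

In this sketch a small γ concentrates the weight on a few highly consistent SNPs, while a large γ spreads it almost evenly, which is one way a single parameter could control how many SNPs are effectively treated as relevant to a cluster.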

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA)

TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA) TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA) 1 S. ADAEKALAVAN, 2 DR. C. CHANDRASEKAR 1 Assistant Professor, Department of Information Technology, J.J. College of Arts and Science, Pudukkottai,

More information

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets for data streams over Weighted Sliding Windows Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology

More information

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil Parallel Genome-Wide Analysis With Central And Graphic Processing Units Muhamad Fitra Kacamarga mkacamarga@binus.edu James W. Baurley baurley@binus.edu Bens Pardamean bpardamean@binus.edu Abstract The

More information

SEMBIOSPHERE: A SEMANTIC WEB APPROACH TO RECOMMENDING MICROARRAY CLUSTERING SERVICES

SEMBIOSPHERE: A SEMANTIC WEB APPROACH TO RECOMMENDING MICROARRAY CLUSTERING SERVICES SEMBIOSPHERE: A SEMANTIC WEB APPROACH TO RECOMMENDING MICROARRAY CLUSTERING SERVICES KEVIN Y. YIP 1, PEISHEN QI 1, MARTIN SCHULTZ 1, DAVID W. CHEUNG 5 AND KEI-HOI CHEUNG 1,2,3,4 1 Computer Science, 2 Center

More information

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique NDoT: Nearest Neighbor Distance Based Outlier Detection Technique Neminath Hubballi 1, Bidyut Kr. Patra 2, and Sukumar Nandi 1 1 Department of Computer Science & Engineering, Indian Institute of Technology

More information

Overview. Background. Locating quantitative trait loci (QTL)

Overview. Background. Locating quantitative trait loci (QTL) Overview Implementation of robust methods for locating quantitative trait loci in R Introduction to QTL mapping Andreas Baierl and Andreas Futschik Institute of Statistics and Decision Support Systems

More information

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University Chapter MINING HIGH-DIMENSIONAL DATA Wei Wang 1 and Jiong Yang 2 1. Department of Computer Science, University of North Carolina at Chapel Hill 2. Department of Electronic Engineering and Computer Science,

More information

A Classifier with the Function-based Decision Tree

A Classifier with the Function-based Decision Tree A Classifier with the Function-based Decision Tree Been-Chian Chien and Jung-Yi Lin Institute of Information Engineering I-Shou University, Kaohsiung 84008, Taiwan, R.O.C E-mail: cbc@isu.edu.tw, m893310m@isu.edu.tw

More information

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Clustering of Data with Mixed Attributes based on Unified Similarity Metric Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

HapBlock A Suite of Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNP Selection Based on Haplotype and Genotype Data

HapBlock A Suite of Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNP Selection Based on Haplotype and Genotype Data HapBlock A Suite of Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNP Selection Based on Haplotype and Genotype Data Introduction The suite of programs, HapBlock, is developed

More information

Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data

Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data Zrinka Puljiz and Haris Vikalo Electrical and Computer Engineering Department The University of Texas at Austin

More information

IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 5, NO. 1, FEBRUARY

IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 5, NO. 1, FEBRUARY IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 5, NO. 1, FEBRUARY 2001 41 Brief Papers An Orthogonal Genetic Algorithm with Quantization for Global Numerical Optimization Yiu-Wing Leung, Senior Member,

More information

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information

Hybrid Models Using Unsupervised Clustering for Prediction of Customer Churn

Hybrid Models Using Unsupervised Clustering for Prediction of Customer Churn Hybrid Models Using Unsupervised Clustering for Prediction of Customer Churn Indranil Bose and Xi Chen Abstract In this paper, we use two-stage hybrid models consisting of unsupervised clustering techniques

More information

Mining High Average-Utility Itemsets

Mining High Average-Utility Itemsets Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification 1 Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification Feng Chu and Lipo Wang School of Electrical and Electronic Engineering Nanyang Technological niversity Singapore

More information

Shuheng Zhou. Annotated Bibliography

Shuheng Zhou. Annotated Bibliography Shuheng Zhou Annotated Bibliography High-dimensional Statistical Inference S. Zhou, J. Lafferty and L. Wasserman, Compressed Regression, in Advances in Neural Information Processing Systems 20 (NIPS 2007).

More information

Haplotype Analysis. 02 November 2003 Mendel Short IGES Slide 1

Haplotype Analysis. 02 November 2003 Mendel Short IGES Slide 1 Haplotype Analysis Specifies the genetic information descending through a pedigree Useful visualization of the gene flow through a pedigree A haplotype for a given individual and set of loci is defined

More information

HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm

HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm R. A. Ahmed B. Borah D. K. Bhattacharyya Department of Computer Science and Information Technology, Tezpur University, Napam, Tezpur-784028,

More information

K-Means Clustering With Initial Centroids Based On Difference Operator

K-Means Clustering With Initial Centroids Based On Difference Operator K-Means Clustering With Initial Centroids Based On Difference Operator Satish Chaurasiya 1, Dr.Ratish Agrawal 2 M.Tech Student, School of Information and Technology, R.G.P.V, Bhopal, India Assistant Professor,

More information

A Quantified Approach for large Dataset Compression in Association Mining

A Quantified Approach for large Dataset Compression in Association Mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 3 (Nov. - Dec. 2013), PP 79-84 A Quantified Approach for large Dataset Compression in Association Mining

More information

Expectation Maximization (EM) and Gaussian Mixture Models

Expectation Maximization (EM) and Gaussian Mixture Models Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation

More information

Collaborative Filtering using a Spreading Activation Approach

Collaborative Filtering using a Spreading Activation Approach Collaborative Filtering using a Spreading Activation Approach Josephine Griffith *, Colm O Riordan *, Humphrey Sorensen ** * Department of Information Technology, NUI, Galway ** Computer Science Department,

More information

GPU Data Mining in Neuroimaging Genomics

GPU Data Mining in Neuroimaging Genomics GPU Data Mining in Neuroimaging Genomics Bob Zigon Beckman Coulter Indianapolis, Indiana May 10, 2017 1 / 20 Outline Background ANOVA for Voxels and SNPs VEGAS for Voxels and Genes High Speed GPU Monte-Carlo

More information

Progress Report: Collaborative Filtering Using Bregman Co-clustering

Progress Report: Collaborative Filtering Using Bregman Co-clustering Progress Report: Collaborative Filtering Using Bregman Co-clustering Wei Tang, Srivatsan Ramanujam, and Andrew Dreher April 4, 2008 1 Introduction Analytics are becoming increasingly important for business

More information

A Comparison of Resampling Methods for Clustering Ensembles

A Comparison of Resampling Methods for Clustering Ensembles A Comparison of Resampling Methods for Clustering Ensembles Behrouz Minaei-Bidgoli Computer Science Department Michigan State University East Lansing, MI, 48824, USA Alexander Topchy Computer Science Department

More information

Particle swarm optimizer for variable weighting in clustering high-dimensional data

Particle swarm optimizer for variable weighting in clustering high-dimensional data Mach Learn (2011) 82: 43 70 DOI 10.1007/s10994-009-5154-2 Particle swarm optimizer for variable weighting in clustering high-dimensional data Yanping Lu Shengrui Wang Shaozi Li Changle Zhou Received: 30

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

A Naïve Soft Computing based Approach for Gene Expression Data Analysis

A Naïve Soft Computing based Approach for Gene Expression Data Analysis Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2124 2128 International Conference on Modeling Optimization and Computing (ICMOC-2012) A Naïve Soft Computing based Approach for

More information

Evaluating Subspace Clustering Algorithms

Evaluating Subspace Clustering Algorithms Evaluating Subspace Clustering Algorithms Lance Parsons lparsons@asu.edu Ehtesham Haque Ehtesham.Haque@asu.edu Department of Computer Science Engineering Arizona State University, Tempe, AZ 85281 Huan

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION 1 M.S.Rekha, 2 S.G.Nawaz 1 PG SCALOR, CSE, SRI KRISHNADEVARAYA ENGINEERING COLLEGE, GOOTY 2 ASSOCIATE PROFESSOR, SRI KRISHNADEVARAYA

More information

A General Greedy Approximation Algorithm with Applications

A General Greedy Approximation Algorithm with Applications A General Greedy Approximation Algorithm with Applications Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, NY 10598 tzhang@watson.ibm.com Abstract Greedy approximation algorithms have been

More information

Unsupervised Feature Selection Using Multi-Objective Genetic Algorithms for Handwritten Word Recognition

Unsupervised Feature Selection Using Multi-Objective Genetic Algorithms for Handwritten Word Recognition Unsupervised Feature Selection Using Multi-Objective Genetic Algorithms for Handwritten Word Recognition M. Morita,2, R. Sabourin 3, F. Bortolozzi 3 and C. Y. Suen 2 École de Technologie Supérieure, Montreal,

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

NOVEL HYBRID GENETIC ALGORITHM WITH HMM BASED IRIS RECOGNITION

NOVEL HYBRID GENETIC ALGORITHM WITH HMM BASED IRIS RECOGNITION NOVEL HYBRID GENETIC ALGORITHM WITH HMM BASED IRIS RECOGNITION * Prof. Dr. Ban Ahmed Mitras ** Ammar Saad Abdul-Jabbar * Dept. of Operation Research & Intelligent Techniques ** Dept. of Mathematics. College

More information

NSGA-II for Biological Graph Compression

NSGA-II for Biological Graph Compression Advanced Studies in Biology, Vol. 9, 2017, no. 1, 1-7 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/asb.2017.61143 NSGA-II for Biological Graph Compression A. N. Zakirov and J. A. Brown Innopolis

More information

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland Genetic Programming Charles Chilaka Department of Computational Science Memorial University of Newfoundland Class Project for Bio 4241 March 27, 2014 Charles Chilaka (MUN) Genetic algorithms and programming

More information

GWAsimulator: A rapid whole-genome simulation program

GWAsimulator: A rapid whole-genome simulation program GWAsimulator: A rapid whole-genome simulation program Version 1.1 Chun Li and Mingyao Li September 21, 2007 (revised October 9, 2007) 1. Introduction...1 2. Download and compile the program...2 3. Input

More information

MetaData for Database Mining

MetaData for Database Mining MetaData for Database Mining John Cleary, Geoffrey Holmes, Sally Jo Cunningham, and Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand. Abstract: At present, a machine

More information

CLUSTERING IN BIOINFORMATICS

CLUSTERING IN BIOINFORMATICS CLUSTERING IN BIOINFORMATICS CSE/BIMM/BENG 8 MAY 4, 0 OVERVIEW Define the clustering problem Motivation: gene expression and microarrays Types of clustering Clustering algorithms Other applications of

More information

Multi-Modal Data Fusion: A Description

Multi-Modal Data Fusion: A Description Multi-Modal Data Fusion: A Description Sarah Coppock and Lawrence J. Mazlack ECECS Department University of Cincinnati Cincinnati, Ohio 45221-0030 USA {coppocs,mazlack}@uc.edu Abstract. Clustering groups

More information

K-modes Clustering Algorithm for Categorical Data

K-modes Clustering Algorithm for Categorical Data K-modes Clustering Algorithm for Categorical Data Neha Sharma Samrat Ashok Technological Institute Department of Information Technology, Vidisha, India Nirmal Gaud Samrat Ashok Technological Institute

More information

Clustering Lecture 9: Other Topics. Jing Gao SUNY Buffalo

Clustering Lecture 9: Other Topics. Jing Gao SUNY Buffalo Clustering Lecture 9: Other Topics Jing Gao SUNY Buffalo 1 Basics Outline Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Miture model Spectral methods Advanced topics

More information

Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data

Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data 1 P. Valarmathie, 2 Dr MV Srinath, 3 Dr T. Ravichandran, 4 K. Dinakaran 1 Dept. of Computer Science and Engineering, Dr. MGR University,

More information

Text Document Clustering Using DPM with Concept and Feature Analysis

Text Document Clustering Using DPM with Concept and Feature Analysis Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information

Package EBglmnet. January 30, 2016

Package EBglmnet. January 30, 2016 Type Package Package EBglmnet January 30, 2016 Title Empirical Bayesian Lasso and Elastic Net Methods for Generalized Linear Models Version 4.1 Date 2016-01-15 Author Anhui Huang, Dianting Liu Maintainer

More information

Biclustering Bioinformatics Data Sets. A Possibilistic Approach

Biclustering Bioinformatics Data Sets. A Possibilistic Approach Possibilistic algorithm Bioinformatics Data Sets: A Possibilistic Approach Dept Computer and Information Sciences, University of Genova ITALY EMFCSC Erice 20/4/2007 Bioinformatics Data Sets Outline Introduction

More information

Classification and Optimization using RF and Genetic Algorithm

Classification and Optimization using RF and Genetic Algorithm International Journal of Management, IT & Engineering Vol. 8 Issue 4, April 2018, ISSN: 2249-0558 Impact Factor: 7.119 Journal Homepage: Double-Blind Peer Reviewed Refereed Open Access International Journal

More information

Towards New Heterogeneous Data Stream Clustering based on Density

Towards New Heterogeneous Data Stream Clustering based on Density , pp.30-35 http://dx.doi.org/10.14257/astl.2015.83.07 Towards New Heterogeneous Data Stream Clustering based on Density Chen Jin-yin, He Hui-hao Zhejiang University of Technology, Hangzhou,310000 chenjinyin@zjut.edu.cn

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information