Clustering of SNP Data with Application to Genomics
Michael K. Ng, Mark J. Li, Sio I. Ao
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
mng@math.hkbu.edu.hk

Yiu-ming Cheung
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong

Pak C. Sham
Genome Research Center, The University of Hong Kong, Pokfulam Road, Hong Kong

Joshua Z. Huang
E-Business Technology Institute, The University of Hong Kong, Pokfulam Road, Hong Kong

Research supported in part by HKRGC 7035/04P, 7035/05P and HKBU FRGs.

Abstract

Single nucleotide polymorphisms (SNPs) are very common throughout the genome and hence are potentially valuable for mapping disease susceptibility loci by detecting association between SNP markers and disease. Many methods may only be applicable when marker haplotypes, rather than genotypes (categorical data), are available for analysis. In this paper, we explore the properties of k-modes (categorical data) clustering algorithms applied to SNP data for detecting association between SNP markers and disease. Subspace k-modes clustering properties are also considered and tested.

1 Introduction

Because of their ubiquity, there has been considerable interest in using single nucleotide polymorphisms (SNPs) to fine-map susceptibility loci [5]. It is estimated that 90% of naturally occurring sequence variations are SNPs [9]. These variations are sufficiently finely spaced that one may reasonably expect to find SNPs within a defined chromosomal region which can be sufficient to manifest detectable linkage disequilibrium in some human populations. Detecting association between SNPs and disease may provide useful evidence for the existence of a susceptibility locus within such a region, allowing one to proceed to more intensive investigations which can lead to identification of the gene and pathogenic polymorphisms. Several strategies have been proposed that utilize two-point methods to localize the position of a disease locus [11]. However, SNPs
studied individually might be expected to provide relatively little information for detecting association between a disease and a chromosomal region [25, 27], especially if more than one mutation is present. Potentially, the amount of information available from SNPs could be increased dramatically by utilizing information from several marker loci simultaneously, with the aim of detecting association with a marker haplotype rather than just one biallelic marker. Composite likelihood methods combining disease associations with a series of linked markers from haplotypes have been proposed by Collins et al. [10], Lam et al. [23] and McPeek & Strahs [24].

A problem with any method which directly utilizes case and control haplotypes is that such haplotypes are rarely available for autosomal markers. In practice, if one wishes to use haplotypes then one must generally rely on a combination of deduction and estimation. Methods which explicitly model the pattern of development of linkage disequilibrium relationships between marker and disease loci may be expected to perform badly if the assumptions of the model are violated. Methods which assume only one mutational event may perform poorly if more than one has occurred. Finally, genotyping and map errors may cause particular problems for methods which rely on a regular relationship between physical position and linkage disequilibrium parameters.

An alternative approach which might tackle some of these difficulties would be to utilize data mining (clustering) methods to investigate association between a disease phenotype and a multilocus genotype, without assuming the availability of haplotypes required by most of the above-mentioned methods, and without any attempt to model the process whereby disease-related haplotypes might have been generated. Since multilocus genotype data are categorical, the main contribution of this paper is to use k-modes clustering algorithms to detect association between a disease and multiple marker genotypes.

The outline of this paper is as follows. In Section 2, we review a k-modes clustering algorithm. In Section 3, a real data set is employed to test the performance of the k-modes clustering algorithm and to compare it with other methods such as logistic regression, neural networks and decision trees. Finally, in Section 4, we consider subspace clustering algorithms and present some preliminary clustering results.

2 The K-modes Clustering Method

Since it was first published in 1997, the k-modes algorithm [19] has become a popular technique for solving categorical data clustering problems in different application domains. The k-modes algorithm extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process so as to minimize the clustering cost function. These extensions remove the numeric-only limitation of the k-means algorithm and enable the k-means clustering process to be used to cluster large categorical data sets from real-world databases efficiently. An equivalent nonparametric approach to deriving clusters from categorical data is presented in [21].

We note that the modes of clusters can be viewed as the representative haplotypes of the corresponding clusters. On the assumption that some association has been maintained through linkage disequilibrium, particular haplotypes should be more commonly found on chromosomes bearing pathogenic mutations. Such haplotypes should act in a similar way to multiallelic markers, and should be better able to produce detectable association, especially when there are multiple mutation events.

We assume the set of individuals to be clustered is stored in a table T defined by a set of SNP locus attributes (or simply
attributes), $A_1, A_2, \ldots, A_m$. Each attribute $A_j$ describes a domain of values, denoted by $\mathrm{DOM}(A_j)$, associated with a defined set of genotypes. An example is given in the following table, which has one row per individual $1, \ldots, n$ and one column per locus:

Individual (Case or Control) | Locus 1 | Locus 2 | ... | Locus m

For instance, if $A_1$ has two alleles, 1 and 3, the three possible genotypes are 11, 13 and 33. Genotypes for each attribute are treated as categorical (nominal), so that no genetic model assumptions are incorporated. A domain $\mathrm{DOM}(A_j)$ is defined as categorical if it is finite and unordered, that is, for any $a, b \in \mathrm{DOM}(A_j)$, either $a = b$ or $a \neq b$. An individual $X$ in T can be logically represented as a conjunction of attribute-value pairs $[A_1 = x_1] \wedge [A_2 = x_2] \wedge \cdots \wedge [A_m = x_m]$, where $x_j \in \mathrm{DOM}(A_j)$ for $1 \le j \le m$. Without ambiguity, we represent $X$ as a vector $[x_1, x_2, \ldots, x_m]$. $X$ is called a categorical individual if it has only categorical genotype values. We consider that every individual has exactly $m$ attribute genotype values. If the value of an attribute $A_j$ is missing, then we denote the attribute value of $A_j$ by a category $\epsilon$, which means empty.

Let $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ be a set of $n$ individuals. Individual $X_i$ is represented as $[x_{i,1}, x_{i,2}, \ldots, x_{i,m}]$. We write $X_i = X_k$ if $x_{i,j} = x_{k,j}$ for $1 \le j \le m$. The relation $X_i = X_k$ does not mean that $X_i$ and $X_k$ are the same individual in the table, but rather that the two individuals have equal genotype values in attributes $A_1, A_2, \ldots, A_m$.

The k-modes algorithm [19] makes the following modifications to the k-means algorithm: (i) it uses a simple matching dissimilarity measure for categorical individuals, (ii) it replaces the means of clusters with modes, and (iii) it uses a frequency-based method to find the modes. These modifications remove the numeric-only limitation of the k-means algorithm but maintain its efficiency in clustering large categorical data sets [19].

Let $X$ and $Y$ be two categorical individuals represented by $[x_1, x_2, \ldots, x_m]$ and $[y_1, y_2, \ldots, y_m]$ respectively. The simple matching dissimilarity measure between $X$
and $Y$ is defined as follows:

$$ d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j), \quad \text{where} \quad \delta(x_j, y_j) = \begin{cases} 0, & x_j = y_j, \\ 1, & x_j \neq y_j. \end{cases} \tag{1} $$

It is easy to verify that the function $d$ defines a metric space on the set of categorical individuals. Traditionally, the simple matching approach is often used in binary variables which are converted from categorical variables. We note that $d$ is also a kind of generalized Hamming distance.

The k-modes algorithm uses the k-means paradigm to cluster categorical data. The objective of clustering a set of $n$ categorical individuals into $k$ clusters is to find $W$ and $Z$ that minimize

$$ F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}\, d(Z_l, X_i) \tag{2} $$
subject to

$$ w_{li} \in \{0, 1\}, \quad 1 \le l \le k, \ 1 \le i \le n, \tag{3} $$

$$ \sum_{l=1}^{k} w_{li} = 1, \quad 1 \le i \le n, \tag{4} $$

and

$$ 0 < \sum_{i=1}^{n} w_{li} < n, \quad 1 \le l \le k, \tag{5} $$

where $k$ ($\le n$) is a known number of clusters, $W = [w_{li}]$ is a $k$-by-$n$ $\{0, 1\}$ matrix, $Z = [Z_1, Z_2, \ldots, Z_k]$, and $Z_l$ is the $l$th cluster center with the categorical attributes $A_1, A_2, \ldots, A_m$. Recall that $Z_l$ can be viewed as the representative haplotype of the corresponding cluster.

Minimization of $F$ in (2) with the constraints in (3), (4) and (5) forms a class of constrained nonlinear optimization problems whose solutions are unknown. The usual method for optimizing $F$ in (2) is partial optimization for $Z$ and $W$. In this method we first fix $Z$ and find necessary conditions on $W$ to minimize $F$. Then we fix $W$ and minimize $F$ with respect to $Z$. This process is formalized in the k-modes algorithm as follows.

Algorithm (the k-modes algorithm)

1. Choose an initial mode $Z^{(1)}$ for each cluster. Determine $W^{(1)}$ such that $F(W, Z^{(1)})$ is minimized. Set $t = 1$.
2. Determine $Z^{(t+1)}$ such that $F(W^{(t)}, Z^{(t+1)})$ is minimized. If $F(W^{(t)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t)})$, then stop; otherwise go to Step 3.
3. Determine $W^{(t+1)}$ such that $F(W^{(t+1)}, Z^{(t+1)})$ is minimized. If $F(W^{(t+1)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t+1)})$, then stop; otherwise set $t = t + 1$ and go to Step 2.

The matrices $W$ and $Z$ are calculated as follows. Let $\hat{Z}$ be fixed and consider the problem $\min_{W} F(W, \hat{Z})$ subject to (3), (4) and (5). The minimizer $\hat{W}$ is given by

$$ \hat{w}_{li} = \begin{cases} 1, & \text{if } d(\hat{Z}_l, X_i) \le d(\hat{Z}_h, X_i), \ 1 \le h \le k, \\ 0, & \text{otherwise.} \end{cases} $$

Let $\mathbf{X}$ be a set of categorical individuals described by categorical attributes $A_1, A_2, \ldots, A_m$, and let $\mathrm{DOM}(A_j) = \{a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(n_j)}\}$, where $n_j$ is the number of categories of attribute $A_j$ for $1 \le j \le m$. Let the cluster centers $Z_l$ be represented by $[z_{l,1}, z_{l,2}, \ldots, z_{l,m}]$ for $1 \le l \le k$. Then the quantity $\sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}\, d(Z_l, X_i)$ is minimized iff $z_{l,j} = a_j^{(r)} \in \mathrm{DOM}(A_j)$, where

$$ \left| \{ w_{li} \mid x_{i,j} = a_j^{(r)}, \ w_{li} = 1 \} \right| \ \ge \ \left| \{ w_{li} \mid x_{i,j} = a_j^{(t)}, \ w_{li} = 1 \} \right|, \quad 1 \le t \le n_j, $$

for $1 \le j \le m$. Here $|\mathbf{X}|$ denotes the number of
elements in the set $\mathbf{X}$.

3 Experimental Results

In this section, a real data set is employed to test the performance of the k-modes clustering algorithm and to compare it with other methods such as logistic regression, neural networks and decision trees. We analyze the case/control populations in a data set from the Genome Research Center, The University of Hong Kong. The data consist of 488 cases (patients) recruited from hospitals in Hong Kong and 520 controls (normal) recruited from the community. 144 SNPs on chromosome 3p were picked by CLUSTAG, developed by us [4], giving an average marker density of 1 tagging SNP per 25 kb. The following table shows the summary classification results for the k-modes clustering algorithm and the other methods. Each classification result is the average over ten runs of the algorithm. In the tests of logistic regression, decision tree and neural network, 60% of the data are used for training and 40% for validation.

Method | Validation Accuracy
Logistic Regression |
Decision Tree |
Neural Network |
k-modes Clustering (k = 2) |
k-modes Clustering (k = 4) |
k-modes Clustering (k = 6) |

4 Subspace Clustering

When more SNPs (a genome-wide genotyping) are used to detect the association between a disease and multiple marker genotypes, we may need to consider subspace clustering techniques. More precisely, in a typical data set that contains the genotype data of several thousand SNPs in different individuals, we expect to find only several tens of SNPs having genotype patterns that are highly specific to each cluster of individuals. These SNPs are called relevant SNPs, as opposed to the irrelevant SNPs that do not help much in identifying the cluster members (i.e., individuals of the same type). Because a large number of SNPs are irrelevant to each cluster, two individuals in the same cluster could have low similarity when measured by a similarity function (the matching distance in Section 2)
that considers the genotypes of all SNPs. The clusters may thus be undetectable by the k-modes clustering algorithm. The subspace clustering problem is defined for such a scenario. Each subspace cluster is a set of individuals with an associated set of relevant SNPs such that, in the subspace formed by the relevant SNPs, the individuals are similar to each other but dissimilar to individuals outside the cluster.

In general, subspace clustering seeks to group objects into clusters on subsets of the dimensions or attributes of a data set. It pursues two tasks: identification of the subsets of dimensions where clusters can be found, and discovery of the clusters from different subsets of dimensions. According to the way in which the subsets of dimensions are identified, we can divide subspace clustering methods into two categories. The methods in the first category determine the exact subsets of dimensions where clusters are discovered. We call these hard subspace clustering methods; see, for instance, [1, 2, 3, 8, 17, 26, 28, 29]. The methods in the second category determine the subsets of dimensions according to the contributions of the dimensions in discovering the corresponding clusters. The contribution of a dimension is measured by a weight that is assigned to the dimension in the clustering process. We call these soft subspace clustering methods, because every dimension contributes to the discovery of clusters but the dimensions with larger weights form the subsets of dimensions of the clusters; see, for instance, [6, 12, 13, 14, 15, 16, 22].

We modify the weighting clustering algorithm EWKM [6] to formulate the k-modes subspace clustering algorithm. The CYP2D6 data of Hosking et al. [18] are used to test the algorithm. Four functional CYP2D6 polymorphisms predicting 99% of slow metabolizers were typed on 1018 Caucasians, and 41 predicted slow metabolizers were identified. The remaining 977 are therefore called normal. A full description of the data, the database IDs and primers for the marker
SNPs are given in Hosking et al. [18]. We produced 100 clustering results for the k-modes algorithm and for EWKM with different values of γ. Here γ is the parameter used to control the number of relevant SNPs included in a cluster; for details, see [?]. Taking the clustering accuracy as the measure of a good clustering result, the chance of obtaining a good result with EWKM is close to 60% over a large range of γ values [02,70], which is better than with the k-modes algorithm. The distribution of the clustering accuracies is shown in Table 1.

Table 1. Distribution of accuracies in 100 runs for the k-modes and EWKM algorithms (columns: clustering accuracy, k-modes, EWKM (γ)).

Our preliminary results show that useful information can be obtained in case-control association studies by using a subspace clustering method to analyze high-dimensional categorical data from multiple SNPs. The method has the advantage of being almost entirely atheoretical, with no attempt to model the population history which has produced disease-related haplotypes. It might therefore be less sensitive than other methods to map errors or model violations. The preliminary investigations we have carried out show that subspace clustering methods can provide a simple and practical way of dealing with the multilocus genotypes obtained from standard case-control studies. The proposed method allows multiple markers to be analyzed simultaneously, even when haplotypes are unavailable, and does not rely on any model of population history or any genetic map to account for present patterns of linkage disequilibrium.

References

[1] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park, Fast algorithms for projected clustering, Proc. ACM SIGMOD, pp. 61-72, 1999.
[2] C. Aggarwal and P. Yu, Finding generalized projected clusters in high dimensional spaces, Proc. ACM SIGMOD, pp. 70-81, 2000.
[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data mining
applications, Proc. ACM SIGMOD, 1998.
[4] S. Ao, K. Yip, M. Ng, D. Cheung, P. Yee, I. Melhado and P. Sham, CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs, Bioinformatics, vol. 21, 2005.
[5] A. Brookes, The essence of SNPs, Gene, vol. 234.
[6] Y. Chan, W. Ching, M. Ng, Z. Huang, An optimization algorithm for clustering using weighted dissimilarity measures, Pattern Recognition, vol. 37, no. 5, 2004.
[7] A. Chaturvedi, P. Green and J. Carroll, K-modes clustering, Journal of Classification, vol. 18, pp. 35-55, 2001.
[8] C. H. Cheng, A. W. Fu, and Y. Zhang, Entropy-based subspace clustering for mining numerical data, Proc. of the 5th ACM SIGKDD International Conference on Knowledge and Data Mining, pp. 84-93, 1999.
[9] F. Collins, L. Brooks and A. Chakravarti, A DNA polymorphism discovery resource for research on human genetic variation, Genome Research, vol. 8, 1998.
[10] A. Collins and N. Morton, Mapping a disease locus by allelic association, Proc. Natl. Acad. Sci. USA, vol. 95, 1998.
[11] B. Devlin and N. Risch, A comparison of linkage disequilibrium measures for fine-scale mapping, Genomics, vol. 29, 1995.
[12] C. Domeniconi, Locally adaptive techniques for pattern classification, Dissertation for Doctor of Philosophy, 2002.
[13] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, Subspace clustering of high dimensional data, Proc. of SIAM International Conference on Data Mining, 2004.
[14] J. Friedman and J. Meulman, Clustering objects on subsets of attributes, J. R. Statist. Soc. B, vol. 66, no. 4, 2004.
[15] H. Frigui and O. Nasraoui, Unsupervised learning of prototypes and attribute weights, Pattern Recognition, vol. 37, no. 3, 2004.
[16] H. Frigui and O. Nasraoui, Simultaneous clustering and dynamic keyword weighting for text documents, Survey of Text Mining, Michael Berry, Ed., Springer, pp. 45-70, 2004.
[17] S. Goil, H. Nagesh, and A. Choudhary, MAFIA: Efficient and scalable subspace clustering for very large data sets, Technical Report CPDC-TR, Northwestern University, 1999.
[18] L. Hosking, R. Boyd, C. Xu, M. Nissum, K. Cantone, I. Purvis, R. Khakhar, M. Barnes, U. Liberwirth, K. Hagen-Mann, M. Ehm, J. Riley, Linkage disequilibrium mapping identifies a 390 kb region associated with CYP2D6 poor drug metabolising activity, Pharmacogenomics J., vol. 2, 2002.
[19] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, vol. 2, no. 3, 1998.
[20] Z. Huang and M. Ng, A fuzzy k-modes algorithm for clustering categorical data, IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, 1999.
[21] Z. Huang and M. Ng, A note on k-modes clustering, Journal of Classification, vol. 20, 2003.
[22] L. Jing, M. Ng, J. Xu, and Z. Huang, Subspace clustering of text documents with feature weighting k-means algorithm, PAKDD, 2005.
[23] J. Lam, K. Roeder and B. Devlin, Haplotype fine mapping by evolutionary trees, Am. J. Hum. Genet., vol. 66, 2000.
[24] M. McPeek and A. Strahs, Assessment of linkage disequilibrium by the decay of haplotype sharing with application to fine-scale genetic mapping, Am. J. Hum. Genet., vol. 65, 1999.
[25] P. Sham, J. Zhao and D. Curtis, The effect of marker characteristics on the power to detect linkage disequilibrium due to single or multiple ancestral mutations, Ann. Hum. Genet., vol. 64, 2000.
[26] K. Woo and J. Lee, FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting, PhD thesis, Korea Advanced Institute of Science and Technology, Taejon, Korea, 2002.
[27] M. Xiong and L. Jin, Comparison of the power and accuracy of biallelic and microsatellite markers in population-based gene-mapping methods, Am. J. Hum. Genet., vol. 64, 1999.
[28] J. Yang, W. Wang, H. Wang, and P. Yu, δ-clusters: capturing subspace correlation in a large data set, in Data Engineering, 2002. Proceedings. 18th International Conference on, 2003.
[29] K. Yip, D. Cheung, and M. Ng, A practical projected clustering algorithm, IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, 2004.
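
Illustrative sketch of the k-modes procedure

The alternating procedure of Section 2 (assign each individual to its nearest mode for fixed Z, then update each mode by per-locus frequency for fixed W, stopping when the cost F no longer changes) can be illustrated with a minimal Python sketch. This is not the implementation used for the experiments. The function names, the toy genotype table, the choice of initial modes, the tie-breaking rule (lowest cluster index, first most frequent genotype) and the handling of empty clusters are all assumptions made for illustration.

```python
from collections import Counter


def matching_distance(x, y):
    """Simple matching dissimilarity (1): number of loci whose genotypes differ."""
    return sum(a != b for a, b in zip(x, y))


def assign(data, modes):
    """Fix the modes Z; give each individual to its nearest mode.
    Ties go to the lowest cluster index (an illustrative choice)."""
    return [min(range(len(modes)), key=lambda l: matching_distance(modes[l], x))
            for x in data]


def update_modes(data, labels, modes):
    """Fix the assignment W; set each mode entry to the most frequent genotype
    at that locus among the cluster members."""
    new_modes = []
    for l, old in enumerate(modes):
        members = [x for x, lab in zip(data, labels) if lab == l]
        if not members:
            # Empty cluster: keep the previous mode (assumption, not from the paper).
            new_modes.append(old)
            continue
        new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                               for col in zip(*members)))
    return new_modes


def cost(data, labels, modes):
    """Objective F in (2) for a hard assignment given as a label list."""
    return sum(matching_distance(modes[l], x) for x, l in zip(data, labels))


def k_modes(data, init_modes, max_iter=100):
    """Alternate mode updates and reassignments until F stops changing."""
    modes = [tuple(z) for z in init_modes]
    labels = assign(data, modes)
    for _ in range(max_iter):
        new_modes = update_modes(data, labels, modes)
        new_labels = assign(data, new_modes)
        unchanged = cost(data, new_labels, new_modes) == cost(data, labels, modes)
        modes, labels = new_modes, new_labels
        if unchanged:
            break
    return labels, modes, cost(data, labels, modes)


# Hypothetical toy table: six individuals, three loci, genotypes 11 / 13 / 33.
data = [
    ("11", "13", "33"),
    ("11", "13", "13"),
    ("11", "11", "33"),
    ("33", "33", "11"),
    ("33", "13", "11"),
    ("33", "33", "13"),
]
init_modes = [data[0], data[3]]  # initial modes chosen by hand for the example
labels, modes, total_cost = k_modes(data, init_modes)
print(labels, modes, total_cost)
```

In practice the initial modes would typically be drawn at random and the algorithm run several times, which is consistent with the averaging over runs reported in Section 3. The returned modes play the role of the representative haplotypes of the clusters.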
More informationA Classifier with the Function-based Decision Tree
A Classifier with the Function-based Decision Tree Been-Chian Chien and Jung-Yi Lin Institute of Information Engineering I-Shou University, Kaohsiung 84008, Taiwan, R.O.C E-mail: cbc@isu.edu.tw, m893310m@isu.edu.tw
More informationClustering of Data with Mixed Attributes based on Unified Similarity Metric
Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1
More informationImproving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
More informationKeywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.
Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationEncoding Words into String Vectors for Word Categorization
Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,
More informationHapBlock A Suite of Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNP Selection Based on Haplotype and Genotype Data
HapBlock A Suite of Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNP Selection Based on Haplotype and Genotype Data Introduction The suite of programs, HapBlock, is developed
More informationIterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data
Iterative Learning of Single Individual Haplotypes from High-Throughput DNA Sequencing Data Zrinka Puljiz and Haris Vikalo Electrical and Computer Engineering Department The University of Texas at Austin