Distance based Clustering for Categorical Data
Extended Abstract

Dino Ienco and Rosa Meo
Dipartimento di Informatica, Università di Torino, Italy

Abstract. Learning distances between the values of categorical attributes is a useful data mining task that enables distance-based techniques such as clustering and classification by similarity. In this article we propose a new context-based similarity measure, DILCA (DIstance Learning of Categorical Attributes), that learns distances between the values of a categorical attribute. We couple our similarity measure with a well-known hierarchical distance-based clustering algorithm (Ward's hierarchical clustering) and compare the results with those obtained by state-of-the-art methods in this research field.

1 Introduction

In this paper we present a new method, named DILCA, to compute distances between the values of a categorical variable, and we apply this technique to clustering categorical data with a hierarchical approach. Computing distances between two examples is a crucial step in many data mining tasks. Examples of distance-based approaches include clustering (k-means) and distance-based classification algorithms (k-NN, SVM) [3]; all these algorithms perform distance computations as an inner step. Computing the proximity between two instances on the basis of continuous data is a well-understood task. An example of a distance measure between two examples described only by continuous attributes is the Minkowski distance (in particular, the Manhattan and Euclidean distances). This measure depends only on the difference between the values of the single attributes. On the contrary, for categorical data, deciding on the distance between two values is not straightforward, since categorical attribute values are not ordered. As a consequence, it is not possible to compute the difference between two values in a direct way. For categorical data the simplest known measure is the overlap [1].
Overlap is a similarity measure that increases proportionally to the number of attributes on which the two examples match. Notice that overlap does not distinguish between the different values taken by an attribute, since it only measures equality between pairs of values: the similarity is the same for any pair of different values.
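As an illustration, the overlap measure described above can be sketched in a few lines of Python (the function name and example records are ours, for illustration only):

```python
from typing import Sequence

def overlap(a: Sequence, b: Sequence) -> float:
    """Overlap similarity: fraction of attributes on which the two records match."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two records over three categorical attributes.
r1 = ("red", "small", "round")
r2 = ("red", "large", "round")
r3 = ("blue", "large", "square")

print(overlap(r1, r2))  # 2 of 3 attributes match -> 0.666...
print(overlap(r1, r3))  # no attribute matches -> 0.0
```

Note that `overlap(r1, r2)` and any other pair differing in exactly one attribute get the same score, regardless of which values differ; this is precisely the limitation the paper addresses.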
In this paper we contribute to this issue by developing a new distance measure on the values of categorical attributes that takes into account the context in which the attribute appears. The context is constituted by the other attributes describing the example. In the literature some context-based measures have been employed, but again they refer to continuous data, like the Mahalanobis distance [3]. Any single value v of a categorical attribute A_i occurs in a certain set S of dataset examples. Let us denote by S_1 the set of examples having value v_1 for A_i, and by S_2 the set of examples having value v_2 for the same attribute A_i. Our proposal is based on the observation of the frequency with which the values of a certain set of attributes A_j (the context of A_i) occur in the sets of examples S_1 and S_2. The distance between two values v_1 and v_2 of A_i is determined by the difference between the frequencies with which the values of the other attributes A_j occur in the two sets S_1 and S_2. The details of our method are explained in Section 3.

2 Clustering categorical data

Clustering is an important task in data mining [3], in information retrieval [4] and in a wide range of applications [2]. Given a set of instances, the goal of clustering is to find a partition of the instances according to a predefined distance measure or to some objective function. The problem becomes more difficult when instances are described by categorical attributes. Many approaches to categorical clustering exist in the literature. One of the first works in the field, K-MODES [8], extends the k-means algorithm to categorical data. Each cluster is represented by a centroid that contains the most frequent value for each attribute. Therefore, in K-MODES the similarity between an unlabeled data point and the cluster representative can simply be calculated by the overlap distance [1].
Another approach to categorical clustering is ROCK [5], an agglomerative hierarchical clustering algorithm. It employs links to measure the proximity between data points: for a given instance d_i, an instance d_j is a neighbor of d_i if the Jaccard similarity J(d_i, d_j) between them exceeds a prespecified threshold θ. ROCK starts by assigning each instance to a singleton cluster and merges clusters on the basis of the number of neighbors. [7] introduces LIMBO, a scalable hierarchical categorical clustering algorithm built on the Information Bottleneck (IB) framework. As a hierarchical algorithm, LIMBO has the advantage that it produces clusterings of different sizes in a single execution. CLICKS [6] is a density-based clustering algorithm based on graph/hypergraph partitioning. The key intuition here is to encode the data set into weighted summarization structures such as graphs. In general, the cost of clustering with a graph structure is acceptable, provided that the underlying data is low-dimensional. CLICKS finds clusters in categorical datasets by searching for k-partite maximal cliques.
For the evaluation of our approach, we embed our distance computation in one of the hierarchical clustering algorithms most used in the literature: Ward's agglomerative algorithm [19]. In our experiments we compare DILCA with both the ROCK and LIMBO clustering algorithms.

3 DILCA

The key contributions of DILCA are the following: we propose a new methodology to compute a matrix of distances between any pair of values of a specific categorical attribute X; this approach is independent of the specific clustering algorithm, since DILCA can be considered simply as a way to compute distances for categorical attributes; and we obtained good results in clustering when the clustering algorithm uses the DILCA distance.

We start the presentation of DILCA with some notation. Let us consider a data set D = {d_1, d_2, ..., d_n} of instances defined over F = {X, Y, ..., Z}, a set of m categorical attributes. Assume that X is a categorical attribute and we are interested in determining the distances between any pair of its values. We refer to the cardinality of a single attribute (or feature) X as |X|. We denote a specific value i of a feature X by x_i. Our approach is based on the following key points:

1. selection of the context of a given categorical attribute X: a relevant subset of F whose values are helpful for the prediction of the target X;
2. computation of the distance between values of the categorical target X using its context.

Selection of a relevant subset of attributes with respect to the given one. To solve this first step we investigate the problem of selecting a good set of features with respect to a given one. This is a classical problem in data mining, also known as feature selection. In this research field many approaches on how to measure the correlation/association between two variables have been proposed [13]. An interesting one is Symmetric Uncertainty (SU), introduced in [14]. This measure quantifies the mutual dependence of two variables.
Its numerator is the mutual information, which measures how much knowledge of one of the two variables reduces uncertainty about the other. This quantity is normalized by the total uncertainty on the variables, given by the sum of the entropies H(X) and H(Y), as follows:

    SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y))

The value of this measure ranges between 0 and 1. The value 1 denotes that knowledge of one of the two variables completely predicts the value of the other; the value 0 denotes that X and Y are independent.
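The SU formula above can be sketched directly from its definition; a minimal Python version (function names and toy columns are ours) is:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy H of a list of categorical values."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) estimated from the empirical joint distribution."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X;Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    return 0.0 if hx + hy == 0 else 2 * mutual_information(xs, ys) / (hx + hy)

X = ["a", "a", "b", "b"]
Y = ["u", "u", "v", "v"]   # perfectly predicts X
Z = ["u", "v", "u", "v"]   # independent of X
print(symmetric_uncertainty(X, Y))  # 1.0
print(symmetric_uncertainty(X, Z))  # 0.0
```

The two extreme cases in the example match the boundary values stated in the text: a perfectly predictive attribute yields SU = 1, an independent one yields SU = 0.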
The advantage of SU with respect to other measures based on a difference of entropies, such as Information Gain (IG), is that SU is not biased by the number of values of the variables. In DILCA, given an attribute X, we want to select a set of context attributes, context(X), such that the attributes in context(X) have the best SU with the attribute X. The number of attributes considered as the context of X is a key point for the solution of the problem: we want this number to be correlated with the SU values obtained between each of them and X. We use a heuristic, based on the expectation of SU for X, to set this number. Given the distribution of the values of X, we compute SU for each attribute Y, with Y ≠ X. We denote this Symmetric Uncertainty by SU_X(Y). The mean of this quantity is:

    E[SU_X] = ( Σ_{Y ∈ context(X)} SU_X(Y) ) / |context(X)|

where |context(X)| denotes the cardinality of the set context(X). To determine the context of X we use the attributes that satisfy the following inequality:

    SU_X(Y) ≥ σ * E[SU_X]

where σ is a trade-off parameter, given by the user, that controls the influence of the mean on the choice of the threshold. The parameter σ ranges in the interval [0, 1]. By means of SU we are thus able to select the set of attributes that constitutes the context of a particular attribute: precisely those attributes whose values are associated with the values of the target categorical attribute X.

Computation of the distance measure between values of the same attribute. The second step of our approach is to compute the distance between each pair of values of the target categorical attribute. We denote by P(x_i|y_k) the conditional probability of the value x_i of X given the value y_k of the context attribute Y.
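The context-selection heuristic above (threshold σ times the mean SU) can be sketched as follows; the `su` helper, function names, and the toy dataset are ours, for illustration:

```python
from collections import Counter
from math import log2

def su(xs, ys):
    """Symmetric Uncertainty SU(X, Y) = 2*I(X;Y)/(H(X)+H(Y))."""
    n = len(xs)
    def H(vs):
        return -sum(c / n * log2(c / n) for c in Counter(vs).values())
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in joint.items())
    hx, hy = H(xs), H(ys)
    return 0.0 if hx + hy == 0 else 2 * mi / (hx + hy)

def select_context(data, target, sigma=0.5):
    """Keep attributes Y with SU_X(Y) >= sigma * mean SU, as in the heuristic."""
    scores = {a: su(data[target], data[a]) for a in data if a != target}
    mean = sum(scores.values()) / len(scores)
    return [a for a, s in scores.items() if s >= sigma * mean]

# Toy dataset: one column per attribute, keyed by attribute name.
data = {
    "X": ["a", "a", "b", "b", "a", "b"],
    "Y": ["u", "u", "v", "v", "u", "v"],   # strongly associated with X
    "Z": ["p", "q", "p", "q", "q", "p"],   # weakly associated with X
}
print(select_context(data, "X", sigma=1.0))  # only the strong attribute survives
```

With σ = 1.0 only attributes at or above the mean SU are kept; lowering σ admits more weakly associated attributes into the context.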
In DILCA, the distance between x_i and x_j is computed by the formula:

    dist(x_i, x_j) = sqrt( Σ_{Y ∈ context(X)} Σ_{y_k ∈ Y} ( P(x_i|y_k) - P(x_j|y_k) )^2 )    (1)

For each context attribute Y, and each value y_k ∈ Y, we compute the difference between the conditional probabilities of the two values x_i and x_j given y_k: P(x_i|y_k) and P(x_j|y_k). Then we use the Euclidean distance to combine these differences into the final distance. Algorithm 1 describes the procedure adopted to compute the distance matrix between the values of a categorical attribute. The algorithm takes as its first parameter the correlation matrix matrixSU, obtained by computing the Symmetric Uncertainty between any pair of attributes in the data set. The second argument is the target
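A minimal Python sketch of formula (1) follows; it takes the target column, a list of context columns, and two values of the target attribute (function and variable names are ours):

```python
from collections import Counter
from math import sqrt

def dilca_distance(x_col, context_cols, xi, xj):
    """Context-based distance between two values of a categorical attribute,
    following Eq. (1): Euclidean distance over conditional probabilities."""
    total = 0.0
    for y_col in context_cols:
        y_counts = Counter(y_col)
        joint = Counter(zip(x_col, y_col))
        for yk, nyk in y_counts.items():
            p_i = joint[(xi, yk)] / nyk   # P(xi | yk)
            p_j = joint[(xj, yk)] / nyk   # P(xj | yk)
            total += (p_i - p_j) ** 2
    return sqrt(total)

X = ["a", "a", "b", "b"]
Y = ["u", "u", "v", "v"]          # single context attribute
print(dilca_distance(X, [Y], "a", "b"))  # sqrt((1-0)^2 + (0-1)^2) = sqrt(2)
```

Here the context attribute separates the two values perfectly, so the distance reaches its maximum for one context attribute with two values, sqrt(2).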
attribute, X. The third argument, σ, is the trade-off parameter. At the first line we select the vector of correlation values between X and all the other attributes: VectorSU_X = MatrixSU[X]. As a second step we compute the mean of the correlations and then, with respect to this mean and the parameter σ, the algorithm selects the attributes that form the context of X. From line 9, starting from the attributes in the context of X, the distance between any two values of X is computed with formula (1).

Algorithm 1 Distance(matrixSU, X, σ)
 1: VectorSU_X = MatrixSU[X]
 2: threshold = computeMean(VectorSU_X)
 3: context(X) = ∅
 4: for all Y in VectorSU_X do
 5:   if VectorSU_X[Y] ≥ σ * threshold then
 6:     insert(Y, context(X))
 7:   end if
 8: end for
 9: for all x_i, x_j ∈ X, x_i ≠ x_j do
10:   DistanceMatrix[x_i][x_j] = sqrt( Σ_{Y ∈ context(X)} Σ_{y_k ∈ Y} (P(x_i|y_k) - P(x_j|y_k))^2 )
11: end for
12: return DistanceMatrix

4 Evaluation Measures for Clustering

In our work we use two objective criteria to evaluate the results.

Accuracy (Acc): This measure uses the original class labels to evaluate the clustering result. Assume that the instances in D are already classified in c classes G = {g_1, g_2, ..., g_c}, and denote by K a clustering of the instances of D into c clusters {cl_1, cl_2, ..., cl_c}. Consider a one-to-one mapping f from classes to clusters, such that each class g_i is mapped to the cluster f(g_i). The classification error of the mapping is defined as:

    E = Σ_{i=1}^{c} |g_i \ f(g_i)|

where |g_i \ f(g_i)| measures the number of tuples in class g_i that received the wrong label. The optimal mapping between clusters and classes is the one that minimizes the classification error; we denote its error by E_min. The Accuracy is then obtained as follows:

    Acc = 1 - E_min / |D|

Normalized Mutual Information (NMI): This measure provides an evaluation that is impartial with respect to the number of clusters [18].
It reaches its maximum value of 1 when the clustering partition completely matches the original partition into classes. We can consider NMI as an indicator of the purity
of the clustering result. NMI is computed as the average mutual information between any pair of a cluster and a class:

    NMI = ( Σ_{i=1}^{I} Σ_{j=1}^{J} n_ij log( n * n_ij / (n_i * n_j) ) ) / sqrt( ( Σ_{i=1}^{I} n_i log(n_i / n) ) * ( Σ_{j=1}^{J} n_j log(n_j / n) ) )

where n_ij is the number of objects that occur both in cluster i and in class j; n_i is the number of objects in cluster i; n_j is the number of objects in class j; and n is the total number of objects. I and J are, respectively, the number of clusters and the number of classes.

5 Datasets for Categorical Attribute Evaluation

In our experimental section, to evaluate DILCA on categorical attributes only, we use two real-world data sets obtained from the UCI Machine Learning Repository [11] (Congressional Votes and Mushroom) and one synthetic dataset (SynA) obtained from a synthetic data generator [12].

Dataset SynA: This dataset contains 1000 instances, has 5 different classes and is generated from a random distribution. Each instance has 50 categorical attributes whose domains range over 20 values.

5.1 Experiment Discussion

In this section we evaluate the performance of DILCA coupled with Ward's hierarchical clustering, which we name HCL_DILCA, and compare it with ROCK and LIMBO. For all the approaches we set the number of clusters equal to the number of classes. To implement our approach we use the WEKA library [3], an open-source Java library that implements machine learning and data mining algorithms. To run the experiments we use a PC with an Intel(R) Pentium(R) M 1.86GHz processor, 1024MB of RAM and OpenSuse as OS. For each algorithm we use the following settings: for HCL_DILCA we vary the parameter σ between 0 and 1 with a step of 0.1 at each execution; for ROCK we vary the θ parameter between 0.2 and 1 with a step of 0.05; for LIMBO we vary the φ parameter between 0 and 1. For all the algorithms we report the best result obtained.
We chose these parameter settings because we observed that ROCK is more sensitive to variations of its parameter than LIMBO. We also observed that in many cases the ROCK algorithm produces one giant cluster that includes instances from several classes. In the tables of Figure 1 we report the results of the comparative evaluation of the clustering algorithms. For each data set and each specific execution, we specify the setting of the parameters and report the Accuracy as a percentage and the value of the Normalized Mutual Information. We use bold face to mark the best result for each dataset.
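The two evaluation measures of Section 4 can be sketched together in Python; the brute-force search over label mappings and all names are ours, and the permutation search is only practical for the small cluster counts used here:

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt

def accuracy(classes, clusters):
    """Acc: best one-to-one mapping of classes to clusters
    (brute force over permutations; fine for small c)."""
    class_ids, cluster_ids = sorted(set(classes)), sorted(set(clusters))
    best = 0
    for perm in permutations(cluster_ids):
        mapping = dict(zip(class_ids, perm))
        best = max(best, sum(mapping[g] == k for g, k in zip(classes, clusters)))
    return best / len(classes)

def nmi(clusters, classes):
    """Normalized mutual information between a clustering and the true classes."""
    n = len(clusters)
    nij = Counter(zip(clusters, classes))
    ni, nj = Counter(clusters), Counter(classes)
    num = sum(c * log(n * c / (ni[i] * nj[j])) for (i, j), c in nij.items())
    den = sqrt(sum(c * log(c / n) for c in ni.values()) *
               sum(c * log(c / n) for c in nj.values()))
    return num / den if den else 0.0

classes  = ["g1", "g1", "g2", "g2"]
clusters = [1, 1, 2, 1]
print(accuracy(classes, clusters))   # best mapping g1->1, g2->2: 0.75
print(nmi([1, 1, 2, 2], classes))    # perfect clustering: 1.0
```

A perfect clustering yields both Acc = 1 and NMI = 1; a clustering independent of the classes drives NMI to 0 even when chance agreement keeps Acc above zero.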
Votes (2 clusters, 435 instances)

    Algorithm              Acc.     NMI
    HCL_DILCA (σ = 0.5)    89.89%
    ROCK (θ = 0.75)        83.90%
    LIMBO (φ = 0.75)       87.12%

Mushroom (2 clusters, 8,124 instances)

    Algorithm              Acc.     NMI
    HCL_DILCA (σ = 1.0)    89.02%
    ROCK (θ = 0.8)         50.57%
    LIMBO (φ = 1.0)        88.95%

Fig. 1. Experiments on the Congressional Votes and Mushroom data sets

Synthetic dataset A (5 clusters, 1000 instances)

    Algorithm              Acc.     NMI
    HCL_DILCA (σ = 1.0)    94.3%
    ROCK (θ = 0.05)        80.3%
    LIMBO (φ = 0.25)       87.6%

Fig. 2. Experiments on Synthetic dataset A

6 Scalability of DILCA

In this section we study the scalability of the proposed distance-learning approach, both alone and coupled with Ward's hierarchical clustering algorithm. To evaluate scalability we compare HCL_DILCA with LIMBO only, since LIMBO is known to outperform ROCK in this regard. We use another synthetic dataset with 1,000 instances and 1,000 attributes. From this dataset we build 10 further datasets, each with the same number of instances but a progressively larger number of features: from 100 to 1,000. Figure 3 shows the results of this evaluation: HCL_DILCA is faster than LIMBO, especially as the size of the dataset increases. In fact, LIMBO's computational complexity is higher than DILCA's for the distance computation between categorical attributes: DILCA's cost depends only on the number of features, while the cost of forming the clusters depends on the underlying clustering algorithm.

7 Conclusion

In this work we presented a new context-based distance measure for categorical data. We believe that the proposed framework is general enough to be applied to any data mining task that involves nominal data and a distance computation over it.
As future work we want to apply our distance-learning approach to other distance-based tasks, such as outlier detection, nearest-neighbour classification and kernel-based algorithms, and to extend our approach to handle datasets with mixed data types.
Fig. 3. The time performance of HCL coupled with DILCA and of LIMBO (time in seconds vs. dataset size)

References

1. C. Stanfill and D. Waltz: Toward memory-based reasoning. Commun. ACM 29(12).
2. A. K. Jain, M. N. Murty and P. J. Flynn: Data Clustering: A Review. ACM Comput. Surv. 31(3).
3. J. Han and M. Kamber: Data Mining: Concepts and Techniques, 2nd ed.
4. M. Charikar, C. Chekuri, T. Feder and R. Motwani: Incremental Clustering and Dynamic Information Retrieval. SIAM J. Comput. 33(6).
5. S. Guha, R. Rastogi and K. Shim: ROCK: A Robust Clustering Algorithm for Categorical Attributes. Information Systems 25(5).
6. M. J. Zaki and M. Peters: CLICKS: Mining Subspace Clusters in Categorical Data via K-partite Maximal Cliques. ICDE.
7. P. Andritsos, P. Tsaparas, R. J. Miller and K. C. Sevcik: LIMBO: Scalable Clustering of Categorical Data. EDBT.
8. Z. Huang: Extensions to the k-means Algorithm for Clustering Large Data Sets with Categorical Values. Data Min. Knowl. Discov. 2(3).
9. Y. Yang, X. Guan and J. You: CLOPE: A fast and effective clustering algorithm for transactional data. KDD.
10. I. H. Witten and E. Frank: Data Mining: Practical Machine Learning Tools and Techniques.
11. C. Blake and C. Merz: UCI Repository of Machine Learning Databases, mlearn/mlrepository.html.
12. Dataset Generator: Perfect data for an imperfect world.
13. I. Guyon and A. Elisseeff: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3.
14. L. Yu and H. Liu: Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. ICML.
15. J. Ross Quinlan: C4.5: Programs for Machine Learning.
16. C. M. Bishop: Pattern Recognition and Machine Learning.
17. V. Chandola, A. Banerjee and V. Kumar: Outlier Detection: A Survey.
18. A. Strehl and J. Ghosh: Cluster ensembles - a knowledge reuse framework for combining partitionings. AAAI.
19. M. R. Anderberg: Cluster Analysis for Applications. Academic Press, 2nd ed., 1973.
More informationDENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE
DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE Sinu T S 1, Mr.Joseph George 1,2 Computer Science and Engineering, Adi Shankara Institute of Engineering
More informationBENCHMARKING ATTRIBUTE SELECTION TECHNIQUES FOR MICROARRAY DATA
BENCHMARKING ATTRIBUTE SELECTION TECHNIQUES FOR MICROARRAY DATA S. DeepaLakshmi 1 and T. Velmurugan 2 1 Bharathiar University, Coimbatore, India 2 Department of Computer Science, D. G. Vaishnav College,
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 11, November 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationAn Improvement of Centroid-Based Classification Algorithm for Text Classification
An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,
More informationClustering part II 1
Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:
More informationOutlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data
Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University
More informationEstimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees
Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,
More informationClustering Categorical Data based on Information Loss Minimization
Clustering Categorical Data based on Information Loss Minimization Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller, Kenneth C. Sevcik University of Toronto {periklis,tsap,miller,kcs}@cs.toronto.edu
More informationCluster based boosting for high dimensional data
Cluster based boosting for high dimensional data Rutuja Shirbhate, Dr. S. D. Babar Abstract -Data Dimensionality is crucial for learning and prediction systems. Term Curse of High Dimensionality means
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More informationEnhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques
24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE
More informationResearch Article Term Frequency Based Cosine Similarity Measure for Clustering Categorical Data using Hierarchical Algorithm
Research Journal of Applied Sciences, Engineering and Technology 11(7): 798-805, 2015 DOI: 10.19026/rjaset.11.2043 ISSN: 2040-7459; e-issn: 2040-7467 2015 Maxwell Scientific Publication Corp. Submitted:
More informationAutomatic Threshold Calculation for the Categorical Distance Measure ConDist
Automatic Threshold Calculation for the Categorical Distance Measure ConDist Markus Ring 1, Dieter Landes 1, and Andreas Hotho 2 1 Faculty of Electrical Engineering and Informatics, Coburg University of
More informationComparative Study of Subspace Clustering Algorithms
Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that
More informationEfficiently Handling Feature Redundancy in High-Dimensional Data
Efficiently Handling Feature Redundancy in High-Dimensional Data Lei Yu Department of Computer Science & Engineering Arizona State University Tempe, AZ 85287-5406 leiyu@asu.edu Huan Liu Department of Computer
More informationKeywords: hierarchical clustering, traditional similarity metrics, potential based similarity metrics.
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 14027-14032 Potential based similarity metrics for implementing hierarchical clustering
More informationCluster analysis. Agnieszka Nowak - Brzezinska
Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that
More informationK-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection
K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Cluster Analysis Reading: Chapter 10.4, 10.6, 11.1.3 Han, Chapter 8.4,8.5,9.2.2, 9.3 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber &
More informationA FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM
A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM Akshay S. Agrawal 1, Prof. Sachin Bojewar 2 1 P.G. Scholar, Department of Computer Engg., ARMIET, Sapgaon, (India) 2 Associate Professor, VIT,
More informationFeature Selection Using Modified-MCA Based Scoring Metric for Classification
2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification
More informationFilter methods for feature selection. A comparative study
Filter methods for feature selection. A comparative study Noelia Sánchez-Maroño, Amparo Alonso-Betanzos, and María Tombilla-Sanromán University of A Coruña, Department of Computer Science, 15071 A Coruña,
More informationInternational Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani
LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models
More informationWhat is Clustering? Clustering. Characterizing Cluster Methods. Clusters. Cluster Validity. Basic Clustering Methodology
Clustering Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods Data Clustering 1 What is Clustering? Form o unsupervised learning - no inormation
More informationAutomatic Group-Outlier Detection
Automatic Group-Outlier Detection Amine Chaibi and Mustapha Lebbah and Hanane Azzag LIPN-UMR 7030 Université Paris 13 - CNRS 99, av. J-B Clément - F-93430 Villetaneuse {firstname.secondname}@lipn.univ-paris13.fr
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationAssociation Rule Mining and Clustering
Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:
More informationContents. List of Figures. List of Tables. List of Algorithms. I Clustering, Data, and Similarity Measures 1
Contents List of Figures List of Tables List of Algorithms Preface xiii xv xvii xix I Clustering, Data, and Similarity Measures 1 1 Data Clustering 3 1.1 Definition of Data Clustering... 3 1.2 The Vocabulary
More informationOn Sample Weighted Clustering Algorithm using Euclidean and Mahalanobis Distances
International Journal of Statistics and Systems ISSN 0973-2675 Volume 12, Number 3 (2017), pp. 421-430 Research India Publications http://www.ripublication.com On Sample Weighted Clustering Algorithm using
More informationScalable Clustering Using Rank Based Preprocessing Technique for Mixed Data Sets Using Enhanced Rock Algorithm
African Journal of Basic & Applied Sciences 7 (3): 129-136, 2015 ISSN 2079-2034 IDOSI Publications, 2015 DOI: 10.5829/idosi.ajbas.2015.7.3.22291 Scalable Clustering Using Rank Based Preprocessing Technique
More informationA Comparison of Resampling Methods for Clustering Ensembles
A Comparison of Resampling Methods for Clustering Ensembles Behrouz Minaei-Bidgoli Computer Science Department Michigan State University East Lansing, MI, 48824, USA Alexander Topchy Computer Science Department
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster
More informationMulti-Modal Data Fusion: A Description
Multi-Modal Data Fusion: A Description Sarah Coppock and Lawrence J. Mazlack ECECS Department University of Cincinnati Cincinnati, Ohio 45221-0030 USA {coppocs,mazlack}@uc.edu Abstract. Clustering groups
More informationClustering Part 3. Hierarchical Clustering
Clustering Part Dr Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Hierarchical Clustering Two main types: Agglomerative Start with the points
More informationBasic Data Mining Technique
Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm
More informationClustering Using Elements of Information Theory
Clustering Using Elements of Information Theory Daniel de Araújo 1,2, Adrião Dória Neto 2, Jorge Melo 2, and Allan Martins 2 1 Federal Rural University of Semi-Árido, Campus Angicos, Angicos/RN, Brasil
More informationActive Sampling for Constrained Clustering
Paper: Active Sampling for Constrained Clustering Masayuki Okabe and Seiji Yamada Information and Media Center, Toyohashi University of Technology 1-1 Tempaku, Toyohashi, Aichi 441-8580, Japan E-mail:
More informationDATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm
DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)
More informationInformation Integration of Partially Labeled Data
Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 2
Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical
More informationKeywords: clustering algorithms, unsupervised learning, cluster validity
Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based
More informationClustering: An art of grouping related objects
Clustering: An art of grouping related objects Sumit Kumar, Sunil Verma Abstract- In today s world, clustering has seen many applications due to its ability of binding related data together but there are
More informationOn Multiple Query Optimization in Data Mining
On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl
More informationData Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier
Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationDetermination of Similarity Threshold in Clustering Problems for Large Data Sets
Determination of Similarity Threshold in Clustering Problems for Large Data Sets Guillermo Sánchez-Díaz 1 and José F. Martínez-Trinidad 2 1 Center of Technologies Research on Information and Systems, The
More informationA Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)
A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center
More information