Distance-based Clustering for Categorical Data


Distance-based Clustering for Categorical Data

Extended Abstract

Dino Ienco and Rosa Meo
Dipartimento di Informatica, Università di Torino, Italy
{ienco,meo}@di.unito.it

Abstract. Learning distances from categorical attributes is a very useful data mining task that enables distance-based techniques, such as clustering and classification by similarity. In this article we propose a new context-based similarity measure that learns distances between the values of a categorical attribute (DILCA, DIstance Learning of Categorical Attributes). We couple our similarity measure with a well-known hierarchical distance-based clustering algorithm (Ward's hierarchical clustering) and compare the results with those obtained by state-of-the-art methods for this research field.

1 Introduction

In this paper we present a new method, named DILCA, to compute distances between the values of a categorical variable, and we apply this technique to clustering categorical data with a hierarchical approach. Computing the distance between two examples is a crucial step for many data mining tasks. Examples of distance-based approaches include clustering (k-means) and distance-based classification algorithms (k-NN, SVM) [3]; all these algorithms perform distance computations as an inner step. Computing the proximity between two instances on the basis of continuous data is a well-understood task. An example of a distance measure between two examples described only by continuous attributes is the Minkowski distance (in particular the Manhattan and Euclidean distances). This measure depends only on the difference between the values of the single attributes. On the contrary, for categorical data, deciding on the distance between two values is not straightforward, since categorical attribute values are not ordered. As a consequence, it is not possible to compute the difference between two values in a direct way. For categorical data the simplest known measure is the overlap [1]. Overlap is a similarity measure that increases proportionally to the number of attributes on which the two examples match. Notice that overlap does not distinguish between the different values taken by an attribute, since it measures only the equality between pairs of values: when the values in the two examples differ, the similarity is the same for any pair of different values.
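For illustration, the following minimal Python sketch of the overlap measure (ours, not code from the paper) makes this limitation visible:

def overlap(a, b):
    """Overlap similarity: the fraction of attributes on which
    the two examples take exactly the same value."""
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

# Any pair of *different* values contributes nothing, regardless of
# which values they are: this is the limitation DILCA addresses.
print(overlap(["red", "small", "round"], ["red", "large", "round"]))  # 0.666...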

In this paper we contribute to this issue by developing a new distance measure on the values of categorical attributes that takes into account the context in which the attribute appears. The context is constituted by the other attributes describing the example. In the literature some context-based measures have been employed, but again they refer to continuous data, like the Mahalanobis distance [3]. Any single value v of a categorical attribute A_i occurs in a certain set S of dataset examples. Let us denote by S_1 the set of examples having value v_1 for A_i, and by S_2 the set of examples having value v_2 for the same attribute A_i. Our proposal is based on the observation of the frequencies with which the values of a certain set of attributes A_j (the context of A_i) occur in the sets of examples S_1 and S_2. The distance between two values v_1 and v_2 of A_i is determined by the difference between the frequencies with which the values of the other attributes A_j occur in the two sets S_1 and S_2. The details of our method are explained in Section 3.

2 Clustering categorical data

Clustering is an important task in data mining [3], in information retrieval [4] and in a wide range of applications [2]. Given a set of instances, the goal of clustering is to find a partition of the instances according to a predefined distance measure or to some objective function. The problem becomes more difficult when instances are described by categorical attributes. In the literature many approaches to categorical clustering exist. One of the first works in this field, K-MODES [8], extends the K-Means algorithm to categorical data. Each cluster is represented by a centroid that contains the most frequent value for each attribute; the similarity between an unlabeled data point and a cluster representative can therefore be computed simply by the overlap distance [1]. Another approach to categorical clustering is ROCK [5], an agglomerative hierarchical algorithm that employs links to measure the proximity between data points. For a given instance d_i, an instance d_j is a neighbor of d_i if the Jaccard similarity J(d_i, d_j) between them exceeds a prespecified threshold θ. ROCK starts by assigning each instance to a singleton cluster and then merges clusters on the basis of the number of shared neighbors. [7] introduces LIMBO, a scalable hierarchical categorical clustering algorithm built on the Information Bottleneck (IB) framework. As a hierarchical algorithm, LIMBO has the advantage of producing clusterings of different sizes in a single execution. CLICKS [6] is a density-based clustering algorithm based on graph/hypergraph partitioning. The key intuition is to encode the data set into weighted summarization structures such as graphs. In general, the cost of clustering with a graph structure is acceptable provided that the underlying data is low dimensional. CLICKS finds clusters in categorical datasets based on a search method for k-partite maximal cliques.

For the evaluation of our approach, we embed our distance computation into one of the hierarchical clustering algorithms most used in the literature: Ward's agglomerative algorithm [19]. In our experiments we compare DILCA with both the ROCK and LIMBO clustering algorithms.

3 DILCA

The key contributions of DILCA are the following: we propose a new methodology to compute a matrix of distances between any pair of values of a specific categorical attribute X; the approach is independent of the specific clustering algorithm, since DILCA can be regarded simply as a way to compute distances for a categorical attribute; and we obtained good results in clustering when the clustering algorithm uses the DILCA distance.

We start the presentation of DILCA with some notation. Let us consider a data set D = {d_1, d_2, ..., d_n} of instances defined over F = {X, Y, ..., Z}, a set of m categorical attributes. Assume that X is a categorical attribute and that we are interested in determining the distances between any pair of its values. We refer to the cardinality of a single attribute (or feature) X as |X|, and we denote a specific value i of a feature X by x_i. Our approach is based on the following two key steps:

1. selection of the context of a given categorical attribute X: a relevant subset of F whose values are helpful for the prediction of the target X;
2. computation of the distance between the values of the categorical target X using its context.

Selection of a relevant subset of the attributes with respect to the given one. To solve the first step we investigate the problem of selecting a good set of features with respect to a given one. This is a classical problem in data mining, known as feature selection. In this research field many approaches to measuring the correlation/association between two variables have been proposed [13]. An interesting one is the Symmetric Uncertainty (SU), introduced in [14], a measure that quantifies the mutual dependence of two variables. Its numerator is the mutual information, which measures how much the knowledge of one of the two variables reduces the uncertainty about the other. This uncertainty is normalized by the total uncertainty of the variables, given by the sum of the entropies H(X) and H(Y):

    SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y))

The value of this measure ranges between 0 and 1. The value 1 denotes that the knowledge of one of the two variables completely predicts the value of the other; the value 0 denotes that X and Y are independent.
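To make the measure concrete, the following minimal Python sketch (our illustration, not the authors' code) estimates SU between two categorical columns from their empirical distributions:

import math
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy of a categorical column."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)),
    where IG(X|Y) = H(X) + H(Y) - H(X, Y) is the mutual information."""
    hx, hy = entropy(xs), entropy(ys)
    ig = hx + hy - entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    return 2 * ig / (hx + hy) if hx + hy > 0 else 0.0

# Here Y predicts X perfectly, so SU = 1; an unrelated Y would give SU near 0.
print(symmetric_uncertainty(["a", "a", "b", "b"], ["u", "u", "v", "v"]))  # 1.0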

The advantage of SU with respect to other measures based on a difference of entropies, such as Information Gain (IG), is that SU is not biased by the number of values of the variables. In DILCA, given an attribute X, we want to select a set of context attributes, context(X), such that the attributes in context(X) have the best SU with the attribute X. The number of attributes considered as the context of X is a key point for the solution of the problem: we want this number to be correlated with the SU values obtained between each candidate attribute and X. We use a heuristic to set this number, based on the expectation of SU for X. Given the distribution of the values of X, we compute SU for each attribute Y with Y ≠ X; we denote this Symmetric Uncertainty by SU_X(Y). The mean of this quantity is:

    E[SU_X] = ( Σ_{Y ∈ context(X)} SU_X(Y) ) / |context(X)|

where |context(X)| denotes the cardinality of the set context(X). To determine the context of X we use the attributes that satisfy the following inequality:

    SU_X(Y) ≥ σ · E[SU_X]

where σ is a trade-off parameter, given by the user, that controls the influence of the mean on the choice of the threshold. The parameter σ ranges in the interval [0, 1]. By means of the application of SU we are thus able to select the set of attributes that constitutes the context of a particular attribute: precisely those attributes whose values are associated with the values of the target categorical attribute X.

Computation of the distance measure between values of the same attribute. The second step of our approach is to compute the distance between each pair of values of the target categorical attribute. We denote by P(x_i|y_k) the conditional probability of the value x_i of X given the value y_k of the context attribute Y. In DILCA, the distance between x_i and x_j is computed by the formula:

    dist(x_i, x_j) = sqrt( Σ_{Y ∈ context(X)} Σ_{y_k ∈ Y} (P(x_i|y_k) - P(x_j|y_k))^2 )    (1)

For each context attribute Y and each value y_k ∈ Y, we compute the difference between the conditional probabilities of the two values x_i and x_j given y_k, namely P(x_i|y_k) and P(x_j|y_k); then we apply the Euclidean distance to obtain the final distance.
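A minimal Python sketch of formula (1) follows (our illustration under the notation above; estimating the conditional probabilities from co-occurrence counts is our assumption about a detail the abstract leaves implicit):

import math
from collections import Counter

def conditional_probs(target_col, context_col):
    """Estimate P(x | y) from co-occurrence counts, as a dict keyed by (y, x)."""
    joint = Counter(zip(context_col, target_col))
    y_counts = Counter(context_col)
    return {(y, x): c / y_counts[y] for (y, x), c in joint.items()}

def dilca_distance(x_i, x_j, target_col, context_cols):
    """Formula (1): Euclidean distance between the conditional probability
    profiles of x_i and x_j over all attributes in the context."""
    total = 0.0
    for col in context_cols:
        p = conditional_probs(target_col, col)
        for y in set(col):
            total += (p.get((y, x_i), 0.0) - p.get((y, x_j), 0.0)) ** 2
    return math.sqrt(total)

# Toy example: one context attribute whose values co-occur differently
# with the target values "a" and "b".
X = ["a", "a", "b", "b", "c"]
Y = ["u", "u", "v", "v", "v"]
print(dilca_distance("a", "b", X, [Y]))  # about 1.20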

Algorithm 1 describes the procedure adopted to compute the distance matrix between the values of a categorical attribute. The algorithm takes as parameters the correlation matrix matrixSU, obtained by computing the Symmetric Uncertainty between every pair of attributes in the data set; the target attribute X; and the trade-off parameter σ. At line 1 we select the vector of the correlation values between X and all the other attributes: VectorSU_X = matrixSU[X]. As a second step, we compute the mean of the correlations and then, with respect to this mean and the parameter σ, the algorithm selects the attributes that form the context of X. From line 9, starting from the attributes in the context of X, the distance between any two values of X is computed with formula (1).

Algorithm 1 Distance(matrixSU, X, σ)
1: VectorSU_X = matrixSU[X]
2: threshold = computeMean(VectorSU_X)
3: context(X) = ∅
4: for all Y ∈ VectorSU_X do
5:   if VectorSU_X[Y] ≥ σ · threshold then
6:     insert(Y, context(X))
7:   end if
8: end for
9: for all x_i, x_j ∈ X, x_i ≠ x_j do
10:   DistanceMatrix[x_i][x_j] = sqrt( Σ_{Y ∈ context(X)} Σ_{y_k ∈ Y} (P(x_i|y_k) - P(x_j|y_k))^2 )
11: end for
12: return DistanceMatrix

4 Evaluation Measures for Clustering

In our work we use two objective criteria to evaluate the results.

Accuracy, Acc: This measure uses the original class labels to evaluate the clustering result. Assume that the instances in D are already classified into c classes G = {g_1, g_2, ..., g_c}, and let K denote a clustering of the instances of D into c clusters {cl_1, cl_2, ..., cl_c}. Consider a one-to-one mapping f from classes to clusters, such that each class g_i is mapped to the cluster f(g_i). The classification error of the mapping is defined as:

    E = Σ_{i=1}^{c} |g_i - f(g_i)|

where |g_i - f(g_i)| measures the number of tuples in class g_i that received the wrong label. The optimal mapping between clusters and classes is the one that minimizes the classification error; we denote its error by E_min. The Accuracy is then obtained as:

    Acc = 1 - E_min / |D|
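A minimal sketch of this accuracy computation (ours; using SciPy's Hungarian-method solver to search the optimal one-to-one mapping is an implementation choice the paper does not prescribe):

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(classes, clusters):
    """Acc = 1 - E_min / |D|: find the one-to-one class-to-cluster mapping
    that maximizes the number of correctly placed instances."""
    cs, ks = sorted(set(classes)), sorted(set(clusters))
    # contingency[i][j] = number of instances of class i placed in cluster j
    cont = np.zeros((len(cs), len(ks)), dtype=int)
    for g, k in zip(classes, clusters):
        cont[cs.index(g), ks.index(k)] += 1
    rows, cols = linear_sum_assignment(-cont)  # Hungarian method, maximizing
    return cont[rows, cols].sum() / len(classes)

# Three of the four instances can be matched under the best mapping.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 1]))  # 0.75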

Normalized Mutual Information, NMI: This measure provides an evaluation that is impartial with respect to the number of clusters [18]. It reaches its maximum value of one when the clustering partition matches the original partition into classes completely, so we can regard NMI as an indicator of the purity of the clustering result. NMI is computed as the mutual information between clusters and classes, normalized by the entropies of the two partitions:

    NMI = ( Σ_{i=1}^{I} Σ_{j=1}^{J} n_ij log( (n · n_ij) / (n_i · n_j) ) ) / sqrt( ( Σ_{i=1}^{I} n_i log(n_i/n) ) · ( Σ_{j=1}^{J} n_j log(n_j/n) ) )

where n_ij is the cardinality of the set of objects that occur both in cluster i and in class j; n_i is the number of objects in cluster i; n_j is the number of objects in class j; and n is the total number of objects. I and J are respectively the number of clusters and the number of classes.
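A direct transcription of the formula into Python (a sketch of ours, using natural logarithms and guarding the degenerate single-partition case):

import math
from collections import Counter

def nmi(clusters, classes):
    """NMI with the square-root normalization of the formula above."""
    n = len(clusters)
    n_ij = Counter(zip(clusters, classes))
    n_i, n_j = Counter(clusters), Counter(classes)
    num = sum(c * math.log(n * c / (n_i[i] * n_j[j]))
              for (i, j), c in n_ij.items())
    den_i = sum(c * math.log(c / n) for c in n_i.values())
    den_j = sum(c * math.log(c / n) for c in n_j.values())
    return num / math.sqrt(den_i * den_j) if den_i * den_j != 0 else 0.0

# A clustering that reproduces the classes exactly scores 1.
print(nmi([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0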

5 Datasets for Categorical Attribute Evaluation

In our experimental section, to evaluate DILCA on categorical attributes only, we use two real-world data sets obtained from the UCI Machine Learning Repository [11] (Congressional Votes and Mushroom) and one synthetic dataset (SynA) obtained from a synthetic data generator [12].

Dataset SynA: This dataset contains 1,000 instances, has 5 different classes and is generated from a random distribution. Each instance has 50 categorical attributes whose domains range over 20 values.

5.1 Experiment Discussion

In this section we evaluate the performance of DILCA coupled with Ward's hierarchical clustering, which we name HCL DILCA, and compare it with ROCK and LIMBO. For all the approaches we set the number of clusters equal to the number of classes. To implement our approach we use the WEKA library [10], a Java open-source library that implements machine learning and data mining algorithms. To run the experiments we use a PC with an Intel(R) Pentium(R) M processor at 1.86 GHz, 1024 MB of RAM and OpenSuse as operating system. For each algorithm we use the following setting: for HCL DILCA we vary the parameter σ between 0 and 1 with a step of 0.1 at each execution; for ROCK we vary the θ parameter between 0.2 and 1 with a step of 0.05; for LIMBO we vary the φ parameter between 0 and 1. For all the algorithms we take the best result obtained. We chose these parameter settings because we observed that ROCK is more sensitive to variations of its parameter than LIMBO. We also observed that in many cases the ROCK algorithm produces one giant cluster that includes instances from several classes.

In the tables of Figures 1 and 2 we report the results of the comparative evaluation of the clustering algorithms. For each data set and each specific execution we specify the setting of the parameters and report the Accuracy (in percentage) and the value of the Normalized Mutual Information. We use bold face to mark the best result for each dataset.

Votes (2 clusters, 435 instances)

Algorithm               Acc.     NMI
HCL DILCA (σ = 0.5)     89.89%
ROCK (θ = 0.75)         83.90%
LIMBO (φ = 0.75)        87.12%

Mushroom (2 clusters, 8,124 instances)

Algorithm               Acc.     NMI
HCL DILCA (σ = 1.0)     89.02%
ROCK (θ = 0.8)          50.57%
LIMBO (φ = 1.0)         88.95%

Fig. 1. Experiments on the Congressional Votes and Mushroom data sets

Synthetic dataset A (5 clusters, 1,000 instances)

Algorithm               Acc.     NMI
HCL DILCA (σ = 1.0)     94.3%
ROCK (θ = 0.05)         80.3%
LIMBO (φ = 0.25)        87.6%

Fig. 2. Experiments on Synthetic dataset A

6 Scalability of DILCA

In this section we study the scalability of the proposed distance learning approach, both alone and coupled with Ward's hierarchical clustering algorithm. For the scalability comparison we consider HCL DILCA and LIMBO only, since LIMBO is known to outperform ROCK in this regard. We use another synthetic dataset with 1,000 instances and 1,000 attributes; from it we build 10 further datasets, each with the same number of instances but a progressively larger number of features, from 100 to 1,000. Figure 3 shows the results of this evaluation: HCL DILCA is faster than LIMBO, and increasingly so as the size of the dataset grows. In fact, LIMBO's computational complexity is higher than that of DILCA for the distance computation between categorical attribute values: the cost of DILCA depends only on the number of features, while the cost of forming the clusters depends on the underlying clustering algorithm.

7 Conclusion

In this work we presented a new context-based distance measure to manage categorical data. We believe that the proposed framework is general enough to be applied to any data mining task that involves nominal data and a distance computation over it. As future work we want to apply our distance learning approach to other distance-based tasks, such as outlier detection, nearest-neighbour classification and kernel-based algorithms, and we want to extend our approach to manage datasets with mixed data types.

Fig. 3. The time performance of HCL coupled with DILCA and LIMBO (time in seconds vs. number of instances)

References

1. C. Stanfill and D. Waltz: Toward Memory-Based Reasoning. Commun. ACM 29(12), 1986.
2. A. K. Jain, M. N. Murty and P. J. Flynn: Data Clustering: A Review. ACM Comput. Surv. 31(3), 1999.
3. J. Han and M. Kamber: Data Mining: Concepts and Techniques, 2nd ed., 2006.
4. M. Charikar, C. Chekuri, T. Feder and R. Motwani: Incremental Clustering and Dynamic Information Retrieval. SIAM J. Comput. 33(6), 2004.
5. S. Guha, R. Rastogi and K. Shim: ROCK: A Robust Clustering Algorithm for Categorical Attributes. Information Systems 25(5), 2000.
6. M. J. Zaki and M. Peters: CLICKS: Mining Subspace Clusters in Categorical Data via K-partite Maximal Cliques. ICDE, 2005.
7. P. Andritsos, P. Tsaparas, R. J. Miller and K. C. Sevcik: LIMBO: Scalable Clustering of Categorical Data. EDBT, 2004.
8. Z. Huang: Extensions to the k-means Algorithm for Clustering Large Data Sets with Categorical Values. Data Min. Knowl. Discov. 2(3), 1998.
9. Y. Yang, X. Guan and J. You: CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data. KDD, 2002.
10. I. H. Witten and E. Frank: Data Mining: Practical Machine Learning Tools and Techniques.
11. C. Blake and C. Merz: UCI Repository of Machine Learning Databases, mlearn/mlrepository.html.
12. Dataset Generator: Perfect Data for an Imperfect World.
13. I. Guyon and A. Elisseeff: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 2003.
14. L. Yu and H. Liu: Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. ICML, 2003.
15. J. R. Quinlan: C4.5: Programs for Machine Learning, 1993.
16. C. M. Bishop: Pattern Recognition and Machine Learning, 2006.
17. V. Chandola, A. Banerjee and V. Kumar: Outlier Detection: A Survey, 2007.
18. A. Strehl and J. Ghosh: Cluster Ensembles - A Knowledge Reuse Framework for Combining Partitionings. AAAI, 2002.
19. M. R. Anderberg: Cluster Analysis for Applications. Academic Press, 2nd ed., 1973.
