High Accuracy Clustering Algorithm for Categorical Dataset

Proc. of Int. Conf. on Recent Trends in Information, Telecommunication and Computing, ITC

Aman Ahmad Ansari 1 and Gaurav Pathak 2
1 NIMS Institute of Engineering & Technology, Jaipur, India, ansariaa1jan@gmail.com
2 NIMS Institute of Engineering & Technology, Jaipur, India, pathakg86@gmail.com

DOI: 02.ITC
Association of Computer Electronics and Electrical Engineers, 2014

Abstract

Clustering is the step-by-step process of forming groups of objects such that the attributes of all objects in a group are nearly similar. A cluster is therefore a collection of objects with nearly the same attribute values: an object is similar to the other objects in its cluster and different from the objects in other clusters. Clustering is used in a wide range of applications such as pattern recognition, image processing, data analysis, and machine learning. Nowadays, more attention is paid to categorical data than to numerical data; in categorical data, ranges of a numerical attribute are organized into classes such as small, medium, and high. A wide range of algorithms exists for clustering categorical data. Our approach enhances the well-known k-modes clustering algorithm to improve its accuracy. We propose a new approach named High Accuracy Clustering Algorithm for Categorical Datasets.

Index Terms: clustering, k-modes algorithm, categorical data, data mining.

I. INTRODUCTION

Data mining refers to extracting or mining knowledge from large amounts of data [1]; the term is also used as a synonym for KDD (knowledge discovery in databases). Data mining techniques include the following.

Association Analysis: Discovering association rules that show attribute-value conditions occurring frequently together in a given data set.

Classification: Learning to assign data objects to predefined classes. This requires supervised learning, i.e., the training data has to specify what is to be learned.

Clustering: The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. The data objects of a cluster can be treated collectively as one group. Figure 1 shows an example, the clustering of objects into three groups.

Figure 1. Clustering of a set of points

During a cholera outbreak in London in 1854, John Snow used a special map to plot the reported cases of the disease [2]. A key observation, after the creation of the map, was the close association between the density of disease cases and a single, centrally located well.

Most clustering algorithms focus on data sets whose objects are defined on a set of numerical values. Data sets also contain non-numerical values to be clustered; when each object is described by multiple such attributes, we speak of categorical data sets.

Clustering cannot be a one-step process. Jain and Dubes divide the clustering process into the following stages [9]:
a) Data Collection
b) Initial Screening
c) Representation
d) Clustering Tendency
e) Clustering Strategy
f) Validation
g) Interpretation

This list of stages is given for exposition purposes, since we do not propose solutions for each of them. We mainly focus on the problem of Clustering Strategy, by proposing a new algorithm for categorical data, and on the problem of Clustering Tendency, by proposing a heuristic for identifying appropriate values for the number of clusters that exist in a data set.

II. PROBLEM DEFINITION

Previous clustering algorithms for categorical data sets are not very accurate and do not give the same result on every execution with the same categorical data set. We want to solve this problem, i.e., to cluster categorical data with high accuracy.

III. CLUSTERING TECHNIQUES

A. Rules for Clustering Techniques

Every clustering algorithm is defined by the following:
1. The measure used to assess similarity or dissimilarity between pairs of objects.
2. The particular strategy followed in order to merge intermediate results. This strategy obviously affects the way the end clusters are produced, since we may merge intermediate clusters according to the distance of their closest or furthest points, or the distance of the average of their points [5].
3. An objective function that needs to be minimized or maximized, as appropriate, in order to produce the final results.

B. Basic Clustering Techniques

1. Partitional: Given n objects, a partitional clustering algorithm constructs k partitions of the data so that an objective function is optimized. Some of these algorithms have high complexity because they generate all possible groupings and try to find the optimal solution. Even for a small number of objects, the number of possible partitions is large. Because of this, solutions start with an initial, usually random, partition and proceed with its refinement.
A better approach is to run the partitional algorithm for several different sets of k initial points and keep track of the results. The majority of these algorithms can be considered greedy algorithms, i.e., algorithms that choose the best solution at each step and may not lead to optimal results in the end. The best solution at each step is the placement of an object in the cluster whose representative point is nearest to the object. k-means [4], PAM (Partitioning Around Medoids) [5], and CLARA (Clustering LARge Applications) [5] fall under this category. All of these are applicable to numerical attributes.

2. Categorical data clustering algorithms: These are for categorical data, where Euclidean or other numerically oriented distance measures are not meaningful. These algorithms are close to both the partitional and the hierarchical types. For each category there exists a plethora of sub-categories, e.g., density-based clustering oriented toward geographical data. An exception to this is the class of approaches to handling categorical data. Visualization of such data is not straightforward and there is no inherent geometrical structure in them; hence the approaches that have appeared in the literature mainly use concepts carried by the data, such as co-occurrences in tuples. On the other hand, data sets that include some categorical attributes are abundant. Moreover, there are data sets with a mixture of attribute types, such as the United States Census data set [7] and data sets used in data integration [6].

IV. RELATED WORK

Several algorithms exist for clustering categorical data objects, among them k-modes, ROCK, and COOLCAT [10]. In the present work we extend the k-modes algorithm, specifically to improve its accuracy.

A. K-modes Algorithm

The first algorithm for categorical data sets is the k-modes algorithm, an extension of k-means [11]. The k-modes algorithm partitions a categorical data set of n objects into k clusters. It is based on the k-means paradigm but uses modes in place of means for categorical data and a frequency-based method to update the modes. The k-modes algorithm chooses k random objects as the initial cluster modes and uses a different dissimilarity measure to calculate the distance between two objects. For two objects X and Y described by m attributes, the dissimilarity measure is

$$ d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j) \qquad (1) $$

where

$$ \delta(x_j, y_j) = \begin{cases} 0, & x_j = y_j \\ 1, & x_j \neq y_j \end{cases} $$

Let Q = {q_1, q_2, q_3, ..., q_m} be the mode of a cluster X. Then

$$ D(X, Q) = \sum_{X_i \in X} d(X_i, Q) \qquad (2) $$

where Q may be, but is not necessarily, an object of the data set.

Algorithm: k-modes
Input:
k: number of clusters
D: data set that contains n objects
Output: set of k clusters
Method:
1. Randomly choose k objects as the initial cluster modes, one for each cluster.
2. Allocate each object to the cluster whose mode is most similar to that object, according to equation (1).
3. Update the modes by calculating the most frequent value of each attribute over all objects in the cluster.
4. Repeat
a. Reallocate each object to the cluster whose mode is most similar to that object, if that cluster is not its current cluster.
b. Update the modes of the changed clusters.
5. Until no changes.

V. PROPOSED METHODOLOGY

The proposed clustering algorithm extends the k-modes clustering algorithm with a new dissimilarity measure and selects the initial modes with the select_init_mode algorithm, whereas k-modes selects its initial modes randomly.

A. Selection of Initial Modes

The result of a clustering process depends on the initial modes.
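To make this dependence concrete, the baseline k-modes procedure of Section IV.A can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (hamming, cluster_mode, k_modes) are hypothetical, modes are recomputed in batch after each full pass rather than after each reallocation, ties are broken by first occurrence, and an empty cluster simply keeps its previous mode. The seed parameter is what makes the random choice of initial modes, and therefore the outcome, vary from run to run.

```python
import random
from collections import Counter


def hamming(x, q):
    """Simple matching dissimilarity of eq. (1): number of mismatching attributes."""
    return sum(1 for a, b in zip(x, q) if a != b)


def cluster_mode(objects):
    """Most frequent value of each attribute over a non-empty list of objects."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


def k_modes(data, k, seed=None, max_iter=100):
    """Batch k-modes sketch; data is a list of equal-length tuples."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)  # step 1: k random objects become the initial modes
    labels = None
    for _ in range(max_iter):
        # steps 2 and 4a: put each object in the cluster with the nearest mode
        new_labels = [min(range(k), key=lambda l: hamming(x, modes[l])) for x in data]
        if new_labels == labels:  # step 5: stop when no object changes cluster
            break
        labels = new_labels
        # steps 3 and 4b: recompute modes (assumption: empty clusters keep their mode)
        for l in range(k):
            members = [x for x, lab in zip(data, labels) if lab == l]
            if members:
                modes[l] = cluster_mode(members)
    return labels, modes
```

Calling k_modes(data, 2, seed=0) and k_modes(data, 2, seed=1) on the same data may start from different initial modes and can therefore end in different partitions.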
If a clustering algorithm sets its initial modes at random, its clustering result may not have the same accuracy on every run for a particular data set. We propose the select_init_mode algorithm to overcome this problem. This algorithm uses k-modes to calculate modes and stores n_p sets of modes in a mode-pool P.

Algorithm: select_init_mode
Input:
n_p: number of sets of k modes in the mode-pool
k: number of clusters
D: data set having n objects
Output:
P: mode-pool
Method:
1. Initialize the counter i.
2. Repeat
a. Execute the k-modes clustering algorithm.
b. Store the resulting set of modes in the mode-pool.
c. Increment i.
3. Until i reaches n_p, i.e., the mode-pool holds n_p sets of modes.

B. Dissimilarity Measure

Similarity expresses how close data objects are to one another; it is quantified by what is called a measure, an index, or a coefficient [3]. Dissimilarity can be measured in many ways, one of which is distance, and distance itself can be computed with any of a variety of distance measures. The dissimilarity measure used by k-modes does not represent the real semantic distance between an object and a cluster. For example, take a categorical data set with 3 attributes, A1 = {1, 2}, A2 = {1, 2}, and A3 = {1, 2, 3, 4, 5}, and 7 objects. Suppose that running the k-modes clustering algorithm with k = 2 has clustered the first 6 objects as shown in Table I.

TABLE I. CLUSTER 1 AND CLUSTER 2

Let the 7th object of the data set be X = [2 1 1]. For this object the dissimilarities are d(X, C1) = 1 and d(X, C2) = 1, so the k-modes dissimilarity measure cannot decide between the two clusters and the object may be assigned improperly, even though inspection of the clusters shows that it should be assigned to cluster 2. To solve this problem, we propose a new dissimilarity measure that takes into account the frequency of attribute values among the objects of each cluster. The new dissimilarity measure is

$$ d(X, Y) = \sum_{j=1}^{m} \theta(x_j, y_j) \qquad (3) $$

where

$$ \theta(x_j, y_j) = \begin{cases} 1 - \dfrac{O_{ljm}}{O_l}, & x_j = y_j \\ 1, & x_j \neq y_j \end{cases} $$

Here O_l is the number of objects in the l-th cluster, and O_ljm is the number of objects with value a_j of the j-th attribute in the l-th cluster. With this dissimilarity measure, the 7th object is reliably allocated to cluster 2.

C. Proposed Algorithm

Input:
n_p: number of sets of modes in the mode-pool
k: number of clusters
D: data set having n objects
Output: set of k clusters
Method:
1. Execute the select_init_mode algorithm; it returns the mode-pool.
2. For each mode, select the most frequent value of every attribute across the corresponding n_p modes in the mode-pool. Initialize all modes with these values.
3. Allocate each object to the cluster with which its dissimilarity is lowest, according to equation (3).
4. Update the modes by calculating the most frequent value of each attribute over all objects in the cluster.
5. Repeat
a. Reallocate each object to the cluster with which its dissimilarity is lowest, if that cluster is not its current cluster.
b. Update the modes of the changed clusters.
6. Until no changes.

VI. IMPLEMENTATION & RESULT

To implement the proposed algorithm we designed a tool interface. Figure 2 shows the initial window of the tool. It takes the input file on which clustering is to be applied, and it also takes the number of clusters from the user.

Figure 2. Input Frame

Figure 3. Result Frame
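Independently of the tool shown in the figures, the frequency-based dissimilarity of equation (3) can be sketched as follows. This is a hedged illustration, not the authors' code: the function name freq_dissimilarity is hypothetical, the ratio O_ljm / O_l is computed over the current members of the cluster (excluding the object being placed, which is an assumption), and the two small clusters are made-up data chosen for illustration, not the contents of Table I.

```python
def freq_dissimilarity(x, mode, members):
    """Eq. (3): sum over attributes of theta(x_j, q_j) for one cluster.

    theta is 1 on a mismatch; on a match it is 1 minus the fraction of
    cluster members that share the mode's value for that attribute.
    """
    size = len(members)  # O_l, must be > 0
    total = 0.0
    for j, (xj, qj) in enumerate(zip(x, mode)):
        if xj != qj:
            total += 1.0
        else:
            count = sum(1 for obj in members if obj[j] == qj)  # O_ljm
            total += 1.0 - count / size
    return total


# Made-up clusters (NOT Table I) and their modes.
cluster_a = [(1, 1, 1), (1, 1, 2), (1, 2, 3)]
mode_a = (1, 1, 1)
cluster_b = [(2, 1, 4), (2, 1, 4), (2, 1, 5)]
mode_b = (2, 1, 4)

x = (2, 1, 1)
# Under eq. (1) the object mismatches each mode in exactly one attribute (a tie).
# Under eq. (3) the matches with cluster_b are backed by every member,
# so cluster_b gets the lower dissimilarity.
d_a = freq_dissimilarity(x, mode_a, cluster_a)
d_b = freq_dissimilarity(x, mode_b, cluster_b)
print(d_a, d_b, "-> cluster b" if d_b < d_a else "-> cluster a")
```

The design point is that a match counts for more when the matched value is common in the cluster, which is what breaks ties that the simple matching measure leaves open.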

From the window shown in Figure 3, the output of the k-modes algorithm and of the proposed algorithm can be viewed by using the appropriate button. We experimented with two real-life categorical data sets, the Mushroom data set and the Congressional Voting data set, both taken from the UCI Machine Learning Repository [8].

Clustering Accuracy: The clustering accuracy r is defined as

$$ r = \frac{\sum_{i=1}^{k} a_i}{n} \qquad (4) $$

where a_i is the number of objects occurring in both the i-th cluster and its corresponding class, k is the number of clusters, and n is the number of objects in the data set. The clustering error is defined as

$$ e = 1 - r \qquad (5) $$

We compare the proposed algorithm with the existing k-modes algorithm. For a fixed number of clusters k, the clustering errors e of both algorithms are compared in Figure 4.

A. Datasets

Congressional Voting Data Set: This data set includes the votes of each U.S. House of Representatives congressman on sixteen key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (all three interpreted as yes); voted against, paired against, and announced against (all three interpreted as no); voted present, voted present to avoid a conflict of interest, and did not vote or otherwise make a position known (all three interpreted as unknown) [8].

Figure 4. Congressional Voting data (Clustering Error vs No. of clusters)

Mushroom Data Set: We used the mushroom database as input to our system. This database is drawn from The Audubon Society Field Guide to North American Mushrooms (1981) and has 8124 data objects. Each object has 22 attributes (e.g., color, odor, and shape) and a label characterizing the mushroom specimen as either poisonous (3916 records) or edible (4208 records) [8].

Soybean Disease Data Set: We used the Soybean Disease database as input to our system. This data set has 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value "dna" means "does not apply". The values of the attributes are encoded numerically, with the first value encoded as 0, the second as 1, and so forth. An unknown value is encoded as "?". This data set has 307 data objects [8].

The proposed algorithm was also tested on other categorical data sets [8] such as Zoo, Soybeans, and US Census Data.

VII. CONCLUSIONS

Clustering is applicable in nearly every area, e.g., image processing, bug prediction, pattern evolution, and machine learning. We therefore need a clustering algorithm that works efficiently as well as accurately on all types of databases: numerical, categorical, and mixtures of both.
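Since accuracy is the quality attribute this work targets, it may help to see how the measure of equations (4) and (5) can be computed. The sketch below is an assumption-laden illustration, not the authors' evaluation code: the function name clustering_accuracy is hypothetical, and a_i is taken to be the size of the majority class within cluster i, which is one common way of choosing the class that corresponds to a cluster. The labels and classes in the example are made up.

```python
from collections import Counter


def clustering_accuracy(labels, classes):
    """r = (sum of a_i) / n, with a_i = size of the majority class in cluster i."""
    per_cluster = {}
    for lab, cls in zip(labels, classes):
        per_cluster.setdefault(lab, Counter())[cls] += 1
    correct = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    return correct / len(labels)


# Made-up example: cluster 0 holds two "e" and one "p", cluster 1 holds three "p".
labels = [0, 0, 0, 1, 1, 1]
classes = ["e", "e", "p", "p", "p", "p"]
r = clustering_accuracy(labels, classes)  # eq. (4)
e = 1 - r                                 # eq. (5)
print(r, e)
```

Running the same computation over several executions of a clustering algorithm on one data set shows directly whether its accuracy is stable from run to run.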

In this paper we address only the accuracy of the clustering algorithm, so that the algorithm produces a more accurate and nearly identical result on every execution on the same data set. Our algorithm worked well in this scenario, providing an accurate result on every execution. We applied the algorithm only to simple real-life categorical data sets: the Mushroom database and the Congressional Voting data set. In the future, the algorithm could be applied to bug data sets to help developers find clusters of bugs that share the same cause, which helps in bug fixing both during development and after deployment. At present the algorithm works only for categorical data sets, but it may be enhanced to work with numerical data sets as well.

REFERENCES

[1] Jiawei Han, Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann.
[2] E. W. Gilbert: "Pioneer Maps of Health and Disease in England", Geographical Journal.
[3] Anil K. Jain and Richard C. Dubes: "Algorithms for Clustering Data", Prentice-Hall.
[4] Amir Ahmad, Lipika Dey: "A k-mean clustering algorithm for numeric data", Data & Knowledge Engineering, 2007.
[5] Leonard Kaufman and Peter J. Rousseeuw: "Finding Groups in Data: An Introduction to Cluster Analysis", John Wiley & Sons.
[6] Renée J. Miller, Mauricio A. Hernández, Laura M. Haas: "The Clio Project: Managing Heterogeneity", SIGMOD Record.
[7] US Census data set.
[8] UCI Repository of Machine Learning Databases.
[9] Serge Abiteboul, Richard Hull, and Victor Vianu: "Foundations of Databases", Addison-Wesley.
[10] Daniel Barbará, Julia Couto, and Yi Li: "COOLCAT: An Entropy-based Algorithm for Categorical Clustering", CIKM.
[11] Zhihua Cai, Dianhong Wang, and Liangxiao Jiang: "A New Algorithm for Clustering Categorical Data", ICIC.

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center

More information

Mining Quantitative Association Rules on Overlapped Intervals

Mining Quantitative Association Rules on Overlapped Intervals Mining Quantitative Association Rules on Overlapped Intervals Qiang Tong 1,3, Baoping Yan 2, and Yuanchun Zhou 1,3 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China {tongqiang,

More information

Generalized k-means algorithm on nominal dataset

Generalized k-means algorithm on nominal dataset Data Mining IX 43 Generalized k-means algorithm on nominal dataset S. H. Al-Harbi & A. M. Al-Shahri Information Technology Center, Riyadh, Saudi Arabia Abstract Clustering has typically been a problem

More information

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai-600106,

More information

K-modes Clustering Algorithm for Categorical Data

K-modes Clustering Algorithm for Categorical Data K-modes Clustering Algorithm for Categorical Data Neha Sharma Samrat Ashok Technological Institute Department of Information Technology, Vidisha, India Nirmal Gaud Samrat Ashok Technological Institute

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 CLUSTERING CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 1. K-medoids: REFERENCES https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf

More information

An Enhanced K-Medoid Clustering Algorithm

An Enhanced K-Medoid Clustering Algorithm An Enhanced Clustering Algorithm Archna Kumari Science &Engineering kumara.archana14@gmail.com Pramod S. Nair Science &Engineering, pramodsnair@yahoo.com Sheetal Kumrawat Science &Engineering, sheetal2692@gmail.com

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p.

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. Title A fuzzy k-modes algorithm for clustering categorical data Author(s) Huang, Z; Ng, MKP Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. 446-452 Issued Date 1999 URL http://hdl.handle.net/10722/42992

More information

On the Consequence of Variation Measure in K- modes Clustering Algorithm

On the Consequence of Variation Measure in K- modes Clustering Algorithm ORIENTAL JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY An International Open Free Access, Peer Reviewed Research Journal Published By: Oriental Scientific Publishing Co., India. www.computerscijournal.org ISSN:

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

An Improved Fuzzy K-Medoids Clustering Algorithm with Optimized Number of Clusters

An Improved Fuzzy K-Medoids Clustering Algorithm with Optimized Number of Clusters An Improved Fuzzy K-Medoids Clustering Algorithm with Optimized Number of Clusters Akhtar Sabzi Department of Information Technology Qom University, Qom, Iran asabzii@gmail.com Yaghoub Farjami Department

More information



More information

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Clustering of Data with Mixed Attributes based on Unified Similarity Metric Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1

More information

Automated Clustering-Based Workload Characterization

Automated Clustering-Based Workload Characterization Automated Clustering-Based Worload Characterization Odysseas I. Pentaalos Daniel A. MenascŽ Yelena Yesha Code 930.5 Dept. of CS Dept. of EE and CS NASA GSFC Greenbelt MD 2077 George Mason University Fairfax

More information

Efficient Clustering of Web Documents Using Hybrid Approach in Data Mining

Efficient Clustering of Web Documents Using Hybrid Approach in Data Mining Efficient Clustering of Web Documents Using Hybrid Approach in Data Mining Pralhad Sudam Gamare 1, Ganpati A. Patil 2 1 P.G. Student, Computer Science and Technology, Department of Technology-Shivaji University-Kolhapur,

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm

HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm R. A. Ahmed B. Borah D. K. Bhattacharyya Department of Computer Science and Information Technology, Tezpur University, Napam, Tezpur-784028,

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Research on Data Mining Technology Based on Business Intelligence. Yang WANG

Research on Data Mining Technology Based on Business Intelligence. Yang WANG 2018 International Conference on Mechanical, Electronic and Information Technology (ICMEIT 2018) ISBN: 978-1-60595-548-3 Research on Data Mining Technology Based on Business Intelligence Yang WANG Communication

More information


CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM CHAPTER 4 K-MEANS AND UCAM CLUSTERING 4.1 Introduction ALGORITHM Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

K-Means Clustering With Initial Centroids Based On Difference Operator

K-Means Clustering With Initial Centroids Based On Difference Operator K-Means Clustering With Initial Centroids Based On Difference Operator Satish Chaurasiya 1, Dr.Ratish Agrawal 2 M.Tech Student, School of Information and Technology, R.G.P.V, Bhopal, India Assistant Professor,

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Cluster Analysis Reading: Chapter 10.4, 10.6, 11.1.3 Han, Chapter 8.4,8.5,9.2.2, 9.3 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber &

More information

Using Categorical Attributes for Clustering

Using Categorical Attributes for Clustering Using Categorical Attributes for Clustering Avli Saxena, Manoj Singh Gurukul Institute of Engineering and Technology, Kota (Rajasthan), India Abstract The traditional clustering algorithms focused on clustering

More information



More information

K-Mean Clustering Algorithm Implemented To E-Banking

K-Mean Clustering Algorithm Implemented To E-Banking K-Mean Clustering Algorithm Implemented To E-Banking Kanika Bansal Banasthali University Anjali Bohra Banasthali University Abstract As the nations are connected to each other, so is the banking sector.

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Enhanced Bug Detection by Data Mining Techniques

Enhanced Bug Detection by Data Mining Techniques ISSN (e): 2250 3005 Vol, 04 Issue, 7 July 2014 International Journal of Computational Engineering Research (IJCER) Enhanced Bug Detection by Data Mining Techniques Promila Devi 1, Rajiv Ranjan* 2 *1 M.Tech(CSE)

More information

Cluster Center Initialization for Categorical Data Using Multiple Attribute Clustering

Cluster Center Initialization for Categorical Data Using Multiple Attribute Clustering Cluster Center Initialization for Categorical Data Using Multiple Attribute Clustering Shehroz S. Khan Amir Ahmad Abstract The K-modes clustering algorithm is well known for its efficiency in clustering

More information

ISSN: [Saurkar* et al., 6(4): April, 2017] Impact Factor: 4.116


More information

Comparative Study of Clustering Algorithms using R

Comparative Study of Clustering Algorithms using R Comparative Study of Clustering Algorithms using R Debayan Das 1 and D. Peter Augustine 2 1 ( M.Sc Computer Science Student, Christ University, Bangalore, India) 2 (Associate Professor, Department of Computer

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information


CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information


CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2
