High Accuracy Clustering Algorithm for Categorical Dataset


Proc. of Int. Conf. on Recent Trends in Information, Telecommunication and Computing (ITC), 2014

Aman Ahmad Ansari (1) and Gaurav Pathak (2)
1 NIMS Institute of Engineering & Technology, Jaipur, India. Email: ansariaa1jan@gmail.com
2 NIMS Institute of Engineering & Technology, Jaipur, India. Email: pathakg86@gmail.com

Abstract — Clustering is the step-by-step process by which we group objects whose attributes are nearly similar. A cluster is therefore a collection of objects with nearly the same attribute values: an object in a cluster is similar to the other objects in the same cluster and different from the objects in other clusters. Clustering is used in a wide range of applications such as pattern recognition, image processing, data analysis, and machine learning. Nowadays more attention is being paid to categorical data than to numerical data, where the range of a numerical attribute is organized into classes such as small, medium, and high. A wide range of algorithms exists for clustering categorical data. Our approach enhances the well-known k-modes clustering algorithm to improve its accuracy. We propose a new approach named High Accuracy Clustering Algorithm for Categorical Datasets.

Index Terms — clustering, k-modes algorithm, categorical data, data mining.

I. INTRODUCTION

Data mining refers to extracting or mining knowledge from large amounts of data [1]; the term is used as a synonym for KDD (knowledge discovery in databases). Common data mining techniques include:

Association analysis: discovering association rules that show attribute-value conditions occurring frequently together in a given data set.

Classification: learning to assign data objects to predefined classes. This requires supervised learning, i.e., the training data must specify what is to be learned.
Clustering: the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters; a cluster of data objects can be treated collectively as one group. The example shown in Figure 1 clusters a set of points into three groups.

Figure 1. Clustering of a set of points.

DOI: 02.ITC.2014.5.47 — Association of Computer Electronics and Electrical Engineers, 2014

During a cholera outbreak in London in 1854, John Snow used a special map to plot the cases of the disease that were reported [2]. A key observation, after the creation of the map, was the close association between the density of disease cases and a single well located at a central location.

Most clustering algorithms focus on data sets whose objects are defined over a set of numerical values. But data sets containing non-numerical values also need to be clustered: in categorical data sets, each object is described by multiple categorical attributes.

Clustering cannot be a one-step process. Jain and Dubes divide the clustering process into the following stages [9]: a) data collection; b) initial screening; c) representation; d) clustering tendency; e) clustering strategy; f) validation; g) interpretation. This list of stages is given for exposition purposes, since we do not propose solutions for each of them. We mainly focus on the problem of clustering strategy, by proposing a new algorithm for categorical data, and on the problem of clustering tendency, by proposing a heuristic for identifying appropriate values for the number of clusters in a data set.

II. PROBLEM DEFINITION

Previous clustering algorithms for categorical data sets are not very accurate and do not give the same result at every execution on the same categorical data set. We want to solve this problem: clustering of categorical data with high accuracy.

III. CLUSTERING TECHNIQUES

A. Rules for Clustering Techniques

Every clustering algorithm must specify:

1. The measure used to assess similarity or dissimilarity between pairs of objects.
2. The strategy followed to merge intermediate results. This strategy affects how the final clusters are produced, since intermediate clusters may be merged according to the distance of their closest points, of their furthest points, or of the average of their points [5].
3. An objective function to be minimized or maximized, as appropriate, in order to produce the final result.

B. Basic Clustering Techniques

1. Partitional: given n objects, a partitional clustering algorithm constructs k partitions of the data so that an objective function is optimized. Some of these algorithms have high complexity because they generate all possible groupings and try to find the optimal solution; even for a small number of objects, the number of possible partitions can be very large. Because of this, practical solutions start with an initial, usually random, partition and proceed by refining it.
A better approach is to run the partitional algorithm with several different sets of k initial points and keep track of the best result. The majority of these algorithms can be considered greedy: at each step they choose the locally best solution, which may not lead to an optimal result in the end. The best solution at each step is the placement of a given object in the cluster whose representative point is nearest to the object. k-means [4], PAM (Partitioning Around Medoids) [5], and CLARA (Clustering LARge Applications) [5] fall into this category; all of them apply to numerical attributes.

2. Categorical data clustering algorithms: these are designed for categorical data, where Euclidean and other numerically oriented distance measures are not meaningful. Such algorithms are close in spirit to the partitional and hierarchical types. For each category there exists a plethora of sub-categories, e.g., density-based clustering oriented toward geographical data. The class of approaches for handling categorical data is an exception: visualization of such data is not straightforward and there is no inherent geometrical structure in them, so the approaches that have appeared in the literature mainly use concepts carried by the data themselves, such as co-occurrences in tuples. On the other hand, data sets that include some categorical

attributes are abundant. Moreover, there are data sets with a mixture of attribute types, such as the United States Census data set [7] and data sets used in data integration [6].

IV. RELATED WORK

Algorithms such as k-modes, ROCK, and COOLCAT [10] exist for clustering categorical data objects; in the present work we extend the k-modes algorithm, focusing especially on accuracy.

A. K-modes Algorithm

The first algorithm for categorical data sets was the k-modes algorithm, an extension of k-means [11]. K-modes partitions a categorical data set of n objects into k clusters. It is based on the k-means paradigm, but uses modes in place of means for categorical data, together with a frequency-based method to update the modes. K-modes chooses k random objects as the initial cluster modes and uses a different dissimilarity measure to calculate the distance between two objects. Let X = (x_1, ..., x_m) and Y = (y_1, ..., y_m) be two objects described by m categorical attributes. The dissimilarity measure is

d(X, Y) = sum over j = 1..m of δ(x_j, y_j),   (1)

where δ(x_j, y_j) = 0 if x_j = y_j, and δ(x_j, y_j) = 1 if x_j ≠ y_j.

Let Q = {q_1, q_2, ..., q_m} be the mode of a cluster C. The cost of the cluster is

D(C, Q) = sum over all objects X in C of d(X, Q),   (2)

where Q can be an object of the data set, but is not necessarily one.

Algorithm: k-modes
Input: k: the number of clusters; D: a data set containing n objects.
Output: a set of k clusters.
Method:
1. Randomly choose k objects as the initial cluster modes, one for each cluster.
2. Allocate each object to the cluster whose mode is most similar to it, according to Eq. (1).
3. Update the modes by taking the most frequent value of each attribute over all objects in the cluster.
4. Repeat:
   a. Reallocate each object to the cluster whose mode is most similar to it, if that cluster is not its current cluster.
   b. Update the modes of the changed clusters.
5. Until no changes occur.

V. PROPOSED METHODOLOGY

The proposed clustering algorithm extends the k-modes clustering algorithm with a new dissimilarity measure and selects its initial modes using the select_init_mode algorithm, unlike k-modes, which selects its initial modes randomly.

A. Selection of Initial Modes

The result of the clustering process depends on the initial modes.
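As an illustration (not part of the original implementation; all function names here are our own), the k-modes procedure of Section IV.A, with the simple matching dissimilarity of Eq. (1) and the frequency-based mode update, can be sketched in Python as follows:

```python
import random
from collections import Counter

def matching_dissimilarity(x, y):
    """Eq. (1): count the attributes on which the two objects differ."""
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

def compute_mode(objects):
    """Frequency-based mode: the most frequent value of each attribute."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*objects)]

def k_modes(data, k, max_iter=100, seed=None):
    """Baseline k-modes: random initial modes, then reallocate/update."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)                 # step 1: random initial modes
    assignment = [None] * len(data)
    for _ in range(max_iter):
        changed = False
        clusters = [[] for _ in range(k)]
        for i, x in enumerate(data):            # steps 2 and 4a: nearest mode
            c = min(range(k), key=lambda l: matching_dissimilarity(x, modes[l]))
            if assignment[i] != c:
                assignment[i] = c
                changed = True
            clusters[c].append(x)
        for l in range(k):                      # steps 3 and 4b: update modes
            if clusters[l]:
                modes[l] = compute_mode(clusters[l])
        if not changed:                         # step 5: stop when stable
            break
    return assignment, modes
```

Because the initial modes are drawn at random, repeated runs on the same data can produce different partitions, which is exactly the instability the next subsection addresses.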
So, if a clustering algorithm sets its initial modes in a random manner, its clustering result may not have the same accuracy at every run on a particular data set. Here we propose an algorithm, select_init_mode, to overcome this problem. The algorithm uses k-modes to calculate modes and stores n_p sets of modes in a mode-pool P.

Algorithm: select_init_mode
Input: n_p: the number of sets of k modes in the mode-pool; k: the number of clusters; D: a data set of n objects.

Output: P: the mode-pool.
Method:
1. Set i = 0.
2. Repeat:
   a. Execute the k-modes clustering algorithm.
   b. Store the resulting set of modes in the mode-pool.
   c. Increment i.
3. Until i = n_p.

B. Dissimilarity Measure

Similarity can be defined as how far from, or close to, one another the data objects are; this notion is also called a measure, index, or coefficient [3]. Dissimilarity can be measured in many ways, one of which is distance, and distance itself can be computed with any of a variety of distance measures. The dissimilarity measure used by k-modes does not represent the real semantic distance between an object and a cluster. For example, take a categorical data set with 3 attributes, A1 = {1, 2}, A2 = {1, 2}, and A3 = {1, 2, 3, 4, 5}, and 7 objects. Using the k-modes clustering algorithm with k = 2, suppose the first 6 objects are clustered as shown in Table I below.

TABLE I. CLUSTER 1 AND CLUSTER 2

Let the 7th object of the data set be X = [2, 1, 1]. For this object the dissimilarities are d(X, C1) = 1 and d(X, C2) = 1, so we cannot assign it unambiguously, even though inspection of the clusters shows that it should be assigned to cluster 2. Using the k-modes dissimilarity measure we cannot be sure this object is allocated to cluster 2. To solve this problem, we propose a new dissimilarity measure that takes into account the frequency of the attribute values of the objects in each cluster. The new dissimilarity between an object X and the mode Q_l of the l-th cluster is

d'(X, Q_l) = sum over j = 1..m of θ(x_j, q_j),   (3)

where θ(x_j, q_j) = 1 − O_ljm / O_l if x_j = q_j, and θ(x_j, q_j) = 1 if x_j ≠ q_j; here O_l is the number of objects in the l-th cluster, and O_ljm is the number of objects in the l-th cluster whose j-th attribute has the value x_j. Using this dissimilarity measure, we can be sure the 7th object is allocated to cluster 2.

C. Proposed Algorithm

Input: n_p: the number of sets of modes in the mode-pool; k: the number of clusters; D: a data set of n objects.
Output: a set of k clusters.
Method:

1. Execute the select_init_mode algorithm; it returns the mode-pool P.
2. For each of the k modes, select the most frequent value of every attribute across the corresponding set of n_p modes in the mode-pool, and initialize the modes with these values.
3. Allocate each object to the cluster with which its dissimilarity, according to Eq. (3), is lowest.
4. Update the modes by taking the most frequent value of each attribute over all objects in the cluster.
5. Repeat:
   a. Reallocate each object to the cluster with which its dissimilarity is lowest, if that cluster is not its current cluster.
   b. Update the modes of the changed clusters.
6. Until no changes occur.

VI. IMPLEMENTATION & RESULT

For the implementation of the proposed algorithm we designed a tool interface.

Figure 2. Input frame.

Figure 2 shows the initial window of the tool. It takes the input file on which clustering is to be applied, and it also takes the number of clusters from the user.

Figure 3. Result frame.
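The two ingredients of the proposed method can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' code: in particular, aggregate_modes assumes that the l-th mode of every run in the mode-pool corresponds to the same cluster, a pairing the paper does not specify.

```python
from collections import Counter

def frequency_dissimilarity(x, mode, cluster_objects):
    """Eq. (3): a mismatch on attribute j contributes 1; a match contributes
    1 - O_ljm / O_l, so matching a value that is rare in the cluster still
    counts as somewhat dissimilar."""
    n_l = len(cluster_objects)                  # O_l: objects in the cluster
    total = 0.0
    for j, (xj, qj) in enumerate(zip(x, mode)):
        if xj != qj:
            total += 1.0
        else:
            # O_ljm: objects in the cluster sharing value xj on attribute j
            o_ljm = sum(1 for obj in cluster_objects if obj[j] == xj)
            total += 1.0 - o_ljm / n_l
    return total

def aggregate_modes(pool, k):
    """Step 2 of the proposed algorithm: initial mode l takes, per attribute,
    the most frequent value among the l-th modes of the pooled runs."""
    init_modes = []
    for l in range(k):
        candidates = [modes[l] for modes in pool]   # l-th mode of each run
        init_modes.append([Counter(col).most_common(1)[0][0]
                           for col in zip(*candidates)])
    return init_modes
```

For instance, in a 2-object cluster where both objects share the mode's first value but only one shares its second, an exact match to the mode yields dissimilarity 0 + 0.5 = 0.5 rather than the 0 that Eq. (1) would give.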

From the window shown in Figure 3, we can see the output of the k-modes algorithm and of the proposed algorithm by pressing the appropriate button. We experimented with two real-life categorical data sets, the Mushroom data set and the Congressional Voting data set, taken from the UCI Machine Learning Repository [8].

Clustering accuracy: the clustering accuracy r is defined as

r = (sum over i = 1..k of a_i) / n,   (4)

where a_i is the number of correctly placed objects in cluster i (the objects belonging to the class that dominates that cluster), k is the number of clusters, and n is the number of objects in the data set. The clustering error is defined as

e = 1 − r.   (5)

We compare the proposed algorithm with the existing k-modes algorithm. For a fixed number of clusters k, the clustering errors e of both algorithms are compared in Figure 4.

A. Datasets

Congressional Voting data set: it includes the votes of every member of the United States House of Representatives on sixteen key votes identified by the CQA. The CQA lists nine different vote types: voted for, paired for, and announced for (all three interpreted as yes); voted against, paired against, and announced against (all three interpreted as no); and voted present, voted present to avoid a conflict of interest, and did not vote or otherwise make a position known (these three interpreted as unknown) [8].

Figure 4. Congressional Voting data (clustering error vs. number of clusters).

Mushroom data set: we used the mushroom database as input to our system. This database is drawn from The Audubon Society Field Guide to North American Mushrooms (1981) and contains 8124 data objects. Each object has 22 attributes (e.g., color, odor, and shape) and a label characterizing the mushroom specimen as either poisonous (3916 records) or edible (4208 records) [8].

Soybean Disease data set: we used the Soybean Disease database as input to our system. This data set has 19 classes, of which only the first 15 have been used in prior work; the folklore seems to be that the last four classes are unjustified by the data, since they have so few examples.
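Reading a_i as the count of majority-class objects in cluster i (the usual interpretation of Eq. (4); the labels here are hypothetical), accuracy and error can be computed as:

```python
from collections import Counter

def clustering_accuracy(assignments, labels):
    """Eq. (4) and (5): r = (sum_i a_i) / n and e = 1 - r, where a_i is the
    number of objects of the majority true class in cluster i."""
    clusters = {}
    for c, y in zip(assignments, labels):       # group true labels by cluster
        clusters.setdefault(c, []).append(y)
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in clusters.values())
    r = correct / len(labels)
    return r, 1.0 - r
```

For example, two clusters holding labels ['p', 'p', 'e'] and ['e', 'e'] give r = (2 + 2) / 5 = 0.8 and e = 0.2.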
There are 35 categorical attributes, some nominal and some ordered. The value dna means "does not apply". Attribute values are encoded numerically, with the first value encoded as 0, the second as 1, and so forth; an unknown value is encoded as "?". This data set has 307 data objects [8]. The proposed algorithm was also tested on other categorical data sets [8] such as Zoo, Soybean, and US Census data.

VII. CONCLUSIONS

Clustering is applicable in almost every area, ranging from image processing, bug prediction, and pattern evolution to machine learning. We therefore need a clustering algorithm that works efficiently and accurately on all types of databases: numerical, categorical, and mixtures of both.

In this paper we worked only on the accuracy attribute of a clustering algorithm, so that we can obtain more accurate and nearly identical results at every execution of the algorithm on the same data set. Our algorithm performed well in this scenario, providing an accurate result at every execution. We applied the algorithm only to simple real-life categorical data sets: the Mushroom database and the Congressional Voting data set. In future work the algorithm could be applied to bug data sets to help developers find clusters of bugs that have the same cause, which would help in bug fixing both during development and after deployment. At present the algorithm works only for categorical data sets, but in the future it may be enhanced to work well with numerical data sets also.

REFERENCES

[1] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
[2] E. W. Gilbert: "Pioneer Maps of Health and Disease in England", Geographical Journal, 1958.
[3] Anil K. Jain and Richard C. Dubes: "Algorithms for Clustering Data", Prentice-Hall, 1988.
[4] Amir Ahmad and Lipika Dey: "A k-mean clustering algorithm for mixed numeric and categorical data", Data & Knowledge Engineering, 2007.
[5] Leonard Kaufman and Peter J. Rousseeuw: "Finding Groups in Data: An Introduction to Cluster Analysis", John Wiley & Sons, 1990.
[6] Renée J. Miller, Mauricio A. Hernández, and Laura M. Haas: "The Clio Project: Managing Heterogeneity", SIGMOD Record, 2001.
[7] US Census data set, http://www.census.gov.
[8] UCI Repository of Machine Learning Databases, http://archive.ics.uci.edu/ml/datasets.html
[9] Serge Abiteboul, Richard Hull, and Victor Vianu: "Foundations of Databases", Addison-Wesley, 1995.
[10] Daniel Barbará, Julia Couto, and Yi Li: "COOLCAT: An Entropy-based Algorithm for Categorical Clustering", CIKM 2002.
[11] Zhihua Cai, Dianhong Wang, and Liangxiao Jiang: "A New Algorithm for Clustering Categorical Data", ICIC 2006.