High Accuracy Clustering Algorithm for Categorical Dataset


Proc. of Int. Conf. on Recent Trends in Information, Telecommunication and Computing (ITC), 2014

Aman Ahmad Ansari (1) and Gaurav Pathak (2)
1 NIMS Institute of Engineering & Technology, Jaipur, India. Email: ansariaa1jan@gmail.com
2 NIMS Institute of Engineering & Technology, Jaipur, India. Email: pathakg86@gmail.com

Abstract — Clustering is the step-by-step process by which we group objects whose attributes are nearly similar. A cluster is therefore a collection of objects with nearly the same attribute values: an object in a cluster is similar to the other objects in the same cluster and different from the objects in other clusters. Clustering is used in a wide range of applications such as pattern recognition, image processing, data analysis, and machine learning. Nowadays more attention is being paid to categorical data than to numerical data, where the range of a numerical attribute is organized into classes such as small, medium, and high. A wide range of algorithms exists for clustering categorical data. Our approach enhances the well-known k-modes clustering algorithm to improve its accuracy. We propose a new approach named High Accuracy Clustering Algorithm for Categorical Datasets.

Index Terms — clustering, k-modes algorithm, categorical data, data mining.

I. INTRODUCTION

Data mining refers to extracting or mining knowledge from large amounts of data [1]; the term is used as a synonym for KDD (knowledge discovery in databases). Common data mining techniques include:

Association analysis: discovering association rules that show attribute-value conditions occurring frequently together in a given data set.

Classification: learning to assign data objects to predefined classes. This requires supervised learning, i.e., the training data must specify what is to be learned.
Clustering: the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters; a cluster of data objects can be treated collectively as one group. The example shown in Figure 1 clusters a set of points into three groups.

Figure 1. Clustering of a set of points.

DOI: 02.ITC.2014.5.47 — Association of Computer Electronics and Electrical Engineers, 2014

During a cholera outbreak in London in 1854, John Snow used a special map to plot the cases of the disease that were reported [2]. A key observation, after the creation of the map, was the close association between the density of disease cases and a single well located at a central location.

Most clustering algorithms focus on data sets whose objects are defined over a set of numerical values. But data sets containing non-numerical values also need to be clustered: in categorical data sets, each object is described by multiple categorical attributes.

Clustering cannot be a one-step process. Jain and Dubes divide the clustering process into the following stages [9]: a) data collection; b) initial screening; c) representation; d) clustering tendency; e) clustering strategy; f) validation; g) interpretation. This list of stages is given for exposition purposes, since we do not propose solutions for each of them. We mainly focus on the problem of clustering strategy, by proposing a new algorithm for categorical data, and on the problem of clustering tendency, by proposing a heuristic for identifying appropriate values for the number of clusters in a data set.

II. PROBLEM DEFINITION

Previous clustering algorithms for categorical data sets are not very accurate and do not give the same result at every execution on the same categorical data set. We want to solve this problem: clustering of categorical data with high accuracy.

III. CLUSTERING TECHNIQUES

A. Rules for Clustering Techniques

Every clustering algorithm must specify:

1. The measure used to assess similarity or dissimilarity between pairs of objects.
2. The strategy followed to merge intermediate results. This strategy affects how the final clusters are produced, since intermediate clusters may be merged according to the distance of their closest points, of their furthest points, or of the average of their points [5].
3. An objective function to be minimized or maximized, as appropriate, in order to produce the final result.

B. Basic Clustering Techniques

1. Partitional: given n objects, a partitional clustering algorithm constructs k partitions of the data so that an objective function is optimized. Some of these algorithms have high complexity because they generate all possible groupings and try to find the optimal solution; even for a small number of objects, the number of possible partitions can be very large. Because of this, practical solutions start with an initial, usually random, partition and proceed by refining it.
A better approach is to run the partitional algorithm with several different sets of k initial points and keep track of the best result. The majority of these algorithms can be considered greedy: at each step they choose the locally best solution, which may not lead to an optimal result in the end. The best solution at each step is the placement of a given object in the cluster whose representative point is nearest to the object. k-means [4], PAM (Partitioning Around Medoids) [5], and CLARA (Clustering LARge Applications) [5] fall into this category; all of them apply to numerical attributes.

2. Categorical data clustering algorithms: these are designed for categorical data, where Euclidean and other numerically oriented distance measures are not meaningful. Such algorithms are close in spirit to the partitional and hierarchical types. For each category there exists a plethora of sub-categories, e.g., density-based clustering oriented toward geographical data. The class of approaches for handling categorical data is an exception: visualization of such data is not straightforward and there is no inherent geometrical structure in them, so the approaches that have appeared in the literature mainly use concepts carried by the data themselves, such as co-occurrences in tuples. On the other hand, data sets that include some categorical

attributes are abundant. Moreover, there are data sets with a mixture of attribute types, such as the United States Census data set [7] and data sets used in data integration [6].

IV. RELATED WORK

Algorithms such as k-modes, ROCK, and COOLCAT [10] exist for clustering categorical data objects; in the present work we extend the k-modes algorithm, focusing especially on accuracy.

A. K-modes Algorithm

The first algorithm for categorical data sets was the k-modes algorithm, an extension of k-means [11]. K-modes partitions a categorical data set of n objects into k clusters. It is based on the k-means paradigm, but uses modes in place of means for categorical data, together with a frequency-based method to update the modes. K-modes chooses k random objects as the initial cluster modes and uses a different dissimilarity measure to calculate the distance between two objects. Let X = (x_1, ..., x_m) and Y = (y_1, ..., y_m) be two objects described by m categorical attributes. The dissimilarity measure is

d(X, Y) = sum over j = 1..m of δ(x_j, y_j),   (1)

where δ(x_j, y_j) = 0 if x_j = y_j, and δ(x_j, y_j) = 1 if x_j ≠ y_j.

Let Q = {q_1, q_2, ..., q_m} be the mode of a cluster C. The cost of the cluster is

D(C, Q) = sum over all objects X in C of d(X, Q),   (2)

where Q can be an object of the data set, but is not necessarily one.

Algorithm: k-modes
Input: k: the number of clusters; D: a data set containing n objects.
Output: a set of k clusters.
Method:
1. Randomly choose k objects as the initial cluster modes, one for each cluster.
2. Allocate each object to the cluster whose mode is most similar to it, according to Eq. (1).
3. Update the modes by taking the most frequent value of each attribute over all objects in the cluster.
4. Repeat:
   a. Reallocate each object to the cluster whose mode is most similar to it, if that cluster is not its current cluster.
   b. Update the modes of the changed clusters.
5. Until no changes occur.

V. PROPOSED METHODOLOGY

The proposed clustering algorithm extends the k-modes clustering algorithm with a new dissimilarity measure and selects its initial modes using the select_init_mode algorithm, unlike k-modes, which selects its initial modes randomly.

A. Selection of Initial Modes

The result of the clustering process depends on the initial modes.
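As an illustration (not part of the original implementation; all function names here are our own), the k-modes procedure of Section IV.A, with the simple matching dissimilarity of Eq. (1) and the frequency-based mode update, can be sketched in Python as follows:

```python
import random
from collections import Counter

def matching_dissimilarity(x, y):
    """Eq. (1): count the attributes on which the two objects differ."""
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

def compute_mode(objects):
    """Frequency-based mode: the most frequent value of each attribute."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*objects)]

def k_modes(data, k, max_iter=100, seed=None):
    """Baseline k-modes: random initial modes, then reallocate/update."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)                 # step 1: random initial modes
    assignment = [None] * len(data)
    for _ in range(max_iter):
        changed = False
        clusters = [[] for _ in range(k)]
        for i, x in enumerate(data):            # steps 2 and 4a: nearest mode
            c = min(range(k), key=lambda l: matching_dissimilarity(x, modes[l]))
            if assignment[i] != c:
                assignment[i] = c
                changed = True
            clusters[c].append(x)
        for l in range(k):                      # steps 3 and 4b: update modes
            if clusters[l]:
                modes[l] = compute_mode(clusters[l])
        if not changed:                         # step 5: stop when stable
            break
    return assignment, modes
```

Because the initial modes are drawn at random, repeated runs on the same data can produce different partitions, which is exactly the instability the next subsection addresses.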
So, if a clustering algorithm sets its initial modes in a random manner, its clustering result may not have the same accuracy at every run on a particular data set. Here we propose an algorithm, select_init_mode, to overcome this problem. The algorithm uses k-modes to calculate modes and stores n_p sets of modes in a mode-pool P.

Algorithm: select_init_mode
Input: n_p: the number of sets of k modes in the mode-pool; k: the number of clusters; D: a data set of n objects.

Output: P: the mode-pool.
Method:
1. Set i = 0.
2. Repeat:
   a. Execute the k-modes clustering algorithm.
   b. Store the resulting set of modes in the mode-pool.
   c. Increment i.
3. Until i = n_p.

B. Dissimilarity Measure

Similarity can be defined as how far from, or close to, one another the data objects are; this notion is also called a measure, index, or coefficient [3]. Dissimilarity can be measured in many ways, one of which is distance, and distance itself can be computed with any of a variety of distance measures. The dissimilarity measure used by k-modes does not represent the real semantic distance between an object and a cluster. For example, take a categorical data set with 3 attributes, A1 = {1, 2}, A2 = {1, 2}, and A3 = {1, 2, 3, 4, 5}, and 7 objects. Using the k-modes clustering algorithm with k = 2, suppose the first 6 objects are clustered as shown in Table I below.

TABLE I. CLUSTER 1 AND CLUSTER 2

Let the 7th object of the data set be X = [2, 1, 1]. For this object the dissimilarities are d(X, C1) = 1 and d(X, C2) = 1, so we cannot assign it unambiguously, even though inspection of the clusters shows that it should be assigned to cluster 2. Using the k-modes dissimilarity measure we cannot be sure this object is allocated to cluster 2. To solve this problem, we propose a new dissimilarity measure that takes into account the frequency of the attribute values of the objects in each cluster. The new dissimilarity between an object X and the mode Q_l of the l-th cluster is

d'(X, Q_l) = sum over j = 1..m of θ(x_j, q_j),   (3)

where θ(x_j, q_j) = 1 − O_ljm / O_l if x_j = q_j, and θ(x_j, q_j) = 1 if x_j ≠ q_j; here O_l is the number of objects in the l-th cluster, and O_ljm is the number of objects in the l-th cluster whose j-th attribute has the value x_j. Using this dissimilarity measure, we can be sure the 7th object is allocated to cluster 2.

C. Proposed Algorithm

Input: n_p: the number of sets of modes in the mode-pool; k: the number of clusters; D: a data set of n objects.
Output: a set of k clusters.
Method:

1. Execute the select_init_mode algorithm; it returns the mode-pool P.
2. For each of the k modes, select the most frequent value of every attribute across the corresponding set of n_p modes in the mode-pool, and initialize the modes with these values.
3. Allocate each object to the cluster with which its dissimilarity, according to Eq. (3), is lowest.
4. Update the modes by taking the most frequent value of each attribute over all objects in the cluster.
5. Repeat:
   a. Reallocate each object to the cluster with which its dissimilarity is lowest, if that cluster is not its current cluster.
   b. Update the modes of the changed clusters.
6. Until no changes occur.

VI. IMPLEMENTATION & RESULT

For the implementation of the proposed algorithm we designed a tool interface.

Figure 2. Input frame.

Figure 2 shows the initial window of the tool. It takes the input file on which clustering is to be applied, and it also takes the number of clusters from the user.

Figure 3. Result frame.
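The two ingredients of the proposed method can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' code: in particular, aggregate_modes assumes that the l-th mode of every run in the mode-pool corresponds to the same cluster, a pairing the paper does not specify.

```python
from collections import Counter

def frequency_dissimilarity(x, mode, cluster_objects):
    """Eq. (3): a mismatch on attribute j contributes 1; a match contributes
    1 - O_ljm / O_l, so matching a value that is rare in the cluster still
    counts as somewhat dissimilar."""
    n_l = len(cluster_objects)                  # O_l: objects in the cluster
    total = 0.0
    for j, (xj, qj) in enumerate(zip(x, mode)):
        if xj != qj:
            total += 1.0
        else:
            # O_ljm: objects in the cluster sharing value xj on attribute j
            o_ljm = sum(1 for obj in cluster_objects if obj[j] == xj)
            total += 1.0 - o_ljm / n_l
    return total

def aggregate_modes(pool, k):
    """Step 2 of the proposed algorithm: initial mode l takes, per attribute,
    the most frequent value among the l-th modes of the pooled runs."""
    init_modes = []
    for l in range(k):
        candidates = [modes[l] for modes in pool]   # l-th mode of each run
        init_modes.append([Counter(col).most_common(1)[0][0]
                           for col in zip(*candidates)])
    return init_modes
```

For instance, in a 2-object cluster where both objects share the mode's first value but only one shares its second, an exact match to the mode yields dissimilarity 0 + 0.5 = 0.5 rather than the 0 that Eq. (1) would give.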

From the window shown in Figure 3, we can see the output of the k-modes algorithm and of the proposed algorithm by pressing the appropriate button. We experimented with two real-life categorical data sets, the Mushroom data set and the Congressional Voting data set, taken from the UCI Machine Learning Repository [8].

Clustering accuracy: the clustering accuracy r is defined as

r = (sum over i = 1..k of a_i) / n,   (4)

where a_i is the number of correctly placed objects in cluster i (the objects belonging to the class that dominates that cluster), k is the number of clusters, and n is the number of objects in the data set. The clustering error is defined as

e = 1 − r.   (5)

We compare the proposed algorithm with the existing k-modes algorithm. For a fixed number of clusters k, the clustering errors e of both algorithms are compared in Figure 4.

A. Datasets

Congressional Voting data set: it includes the votes of every member of the United States House of Representatives on sixteen key votes identified by the CQA. The CQA lists nine different vote types: voted for, paired for, and announced for (all three interpreted as yes); voted against, paired against, and announced against (all three interpreted as no); and voted present, voted present to avoid a conflict of interest, and did not vote or otherwise make a position known (these three interpreted as unknown) [8].

Figure 4. Congressional Voting data (clustering error vs. number of clusters).

Mushroom data set: we used the mushroom database as input to our system. This database is drawn from The Audubon Society Field Guide to North American Mushrooms (1981) and contains 8124 data objects. Each object has 22 attributes (e.g., color, odor, and shape) and a label characterizing the mushroom specimen as either poisonous (3916 records) or edible (4208 records) [8].

Soybean Disease data set: we used the Soybean Disease database as input to our system. This data set has 19 classes, of which only the first 15 have been used in prior work; the folklore seems to be that the last four classes are unjustified by the data, since they have so few examples.
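Reading a_i as the count of majority-class objects in cluster i (the usual interpretation of Eq. (4); the labels here are hypothetical), accuracy and error can be computed as:

```python
from collections import Counter

def clustering_accuracy(assignments, labels):
    """Eq. (4) and (5): r = (sum_i a_i) / n and e = 1 - r, where a_i is the
    number of objects of the majority true class in cluster i."""
    clusters = {}
    for c, y in zip(assignments, labels):       # group true labels by cluster
        clusters.setdefault(c, []).append(y)
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in clusters.values())
    r = correct / len(labels)
    return r, 1.0 - r
```

For example, two clusters holding labels ['p', 'p', 'e'] and ['e', 'e'] give r = (2 + 2) / 5 = 0.8 and e = 0.2.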
There are 35 categorical attributes, some nominal and some ordered. The value dna means "does not apply". Attribute values are encoded numerically, with the first value encoded as 0, the second as 1, and so forth; an unknown value is encoded as "?". This data set has 307 data objects [8]. The proposed algorithm was also tested on other categorical data sets [8] such as Zoo, Soybean, and US Census data.

VII. CONCLUSIONS

Clustering is applicable in almost every area, ranging from image processing, bug prediction, and pattern evolution to machine learning. We therefore need a clustering algorithm that works efficiently and accurately on all types of databases: numerical, categorical, and mixtures of both.

In this paper we worked only on the accuracy attribute of a clustering algorithm, so that we can obtain more accurate and nearly identical results at every execution of the algorithm on the same data set. Our algorithm performed well in this scenario, providing an accurate result at every execution. We applied the algorithm only to simple real-life categorical data sets: the Mushroom database and the Congressional Voting data set. In future work the algorithm could be applied to bug data sets to help developers find clusters of bugs that have the same cause, which would help in bug fixing both during development and after deployment. At present the algorithm works only for categorical data sets, but in the future it may be enhanced to work well with numerical data sets also.

REFERENCES

[1] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
[2] E. W. Gilbert: "Pioneer Maps of Health and Disease in England", Geographical Journal, 1958.
[3] Anil K. Jain and Richard C. Dubes: "Algorithms for Clustering Data", Prentice-Hall, 1988.
[4] Amir Ahmad and Lipika Dey: "A k-mean clustering algorithm for mixed numeric and categorical data", Data & Knowledge Engineering, 2007.
[5] Leonard Kaufman and Peter J. Rousseeuw: "Finding Groups in Data: An Introduction to Cluster Analysis", John Wiley & Sons, 1990.
[6] Renée J. Miller, Mauricio A. Hernández, and Laura M. Haas: "The Clio Project: Managing Heterogeneity", SIGMOD Record, 2001.
[7] US Census data set, http://www.census.gov.
[8] UCI Repository of Machine Learning Databases, http://archive.ics.uci.edu/ml/datasets.html
[9] Serge Abiteboul, Richard Hull, and Victor Vianu: "Foundations of Databases", Addison-Wesley, 1995.
[10] Daniel Barbará, Julia Couto, and Yi Li: "COOLCAT: An Entropy-based Algorithm for Categorical Clustering", CIKM 2002.
[11] Zhihua Cai, Dianhong Wang, and Liangxiao Jiang: "A New Algorithm for Clustering Categorical Data", ICIC 2006.