HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm

R. A. Ahmed, B. Borah, D. K. Bhattacharyya
Department of Computer Science and Information Technology, Tezpur University, Napam, Tezpur, Assam, India
{bgb,dkb}@tezu.ernet.in

J. K. Kalita
Department of Computer Science, University of Colorado, Colorado Springs, CO, USA
jkalita@uccs.edu

October 10, 2005

Abstract

Clustering is an important data mining technique. There are many algorithms that cluster either numeric or categorical data, but few algorithms cluster mixed-type data sets with both numerical and categorical attributes. In this paper, we propose a similarity measure between two clusters that enables hierarchical clustering of data with numerical and categorical attributes. This similarity measure is derived from a frequency vector of attribute values in a cluster. Experimental results establish that our algorithm produces good-quality clusters.

Key words: Mixed type of attributes, clustering, frequency vector of attributes, similarity of clusters.

1 Introduction

Data mining has become increasingly important in recent years. In data mining, clustering is a useful technique for discovering interesting distributions and patterns in the underlying data. It is a technique for grouping data points such that points within a single group, called a cluster, have similar characteristics (or are close to each other), while points in different groups are dissimilar. Clustering techniques are broadly classified into two categories: partitional and hierarchical. Given a set of objects and a clustering criterion, partitional

clustering obtains a partition of the objects into k clusters such that each cluster contains at least one object and each object belongs to exactly one cluster. The number of clusters to be found, k, is pre-specified. A hierarchical clustering is a nested sequence of partitions. An agglomerative hierarchical clustering starts by placing each object in its own cluster and then merges these atomic clusters into larger clusters, until all objects are in a single cluster. Divisive hierarchical clustering reverses the process by starting with all objects in one cluster and subdividing it into smaller pieces. The goal of clustering is to discover dense and sparse regions in a data set.

Most work in clustering has focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between objects. However, many data sets consist of categorical attributes on which distance functions are not naturally defined. As an example, the cap color attribute of the MUSHROOM data set in the popular UCI Machine Learning Repository [1] can take values from the domain {brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow}. It is hard to reason that one color is like or unlike another color in a way similar to real numbers. An important characteristic of categorical domains is that they typically have a small number of attribute values; categorical attributes with large domain sizes typically do not contain information that is useful for grouping tuples into classes [2].

Real-world databases contain both numeric and categorical attributes, and as a consequence most clustering algorithms cannot be applied to such data. There has been a need for clustering algorithms capable of dealing with data sets containing different types of attributes, numeric or categorical. This paper presents a bottom-up hierarchical clustering algorithm that can cluster mixed-type data.
In the clustering process, two clusters are merged using a similarity measure we define. The similarity measure is based on attribute value frequency counts and works well for data sets containing a mixture of categorical and numerical attributes. It works equally well if the data set contains only numeric or only categorical data. The similarity measure can be computed efficiently, as it requires only one pass through the two clusters whose similarity needs to be computed. We demonstrate the effectiveness of our algorithm using a number of synthetic and real-world data sets.

The rest of the paper is organized as follows. Section 2 gives an overview of related clustering algorithms. We present our new clustering algorithm in Section 3. Section 4 gives the experimental results. Section 5 contains conclusions and directions for future work.

2 Related Work

The k-means [3] clustering technique is very popular for partitioning data sets with numerical attributes. In [4], Ralambondrainy proposed an extended version of the k-means algorithm which converts categorical attributes into binary attributes. This method incurs increased time and space costs if the categorical

attributes have many categories. Further, real values between 0 and 1 representing cluster means do not indicate cluster characteristics. Huang proposed the k-modes algorithm to tackle the problem of clustering large categorical data sets [5, 6]. The k-modes algorithm extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes to minimize the clustering cost. Huang also combined the k-modes algorithm with the k-means algorithm, resulting in the so-called k-prototypes algorithm for clustering objects described by mixed numerical and categorical attributes. These extensions remove the numeric-only limitation of the k-means algorithm and enable it to be used for efficient clustering of very large data sets from real-world databases. However, the k-modes algorithm is unstable due to non-uniqueness of the modes: the clustering results depend strongly on the selection of modes during clustering.

Another clustering algorithm for categorical and mixed data was proposed by Le [7]. The algorithm chooses the k largest sets from the Non-Expandable Strongly Connected sets built using the Breadth First Search algorithm; therefore, it does not depend on selecting points like means or modes. The remaining objects are assigned to clusters by testing the minimum distance of the object from all clusters.

3 Our Algorithm

3.1 Notation

We assume that the set of objects to be clustered is stored in a data set S defined by a set of attributes A_1, ..., A_m with domains D_1, ..., D_m, respectively. Each object in S is represented by a tuple X ∈ D_1 × ... × D_m. We consider only two general data types, namely numeric and categorical; the domains of attributes associated with these two types are called numerical and categorical, respectively. A numerical domain consists of continuous real values.
As such, each numerical data object is considered a point in a multidimensional metric space that uses a distance metric such as the Euclidean or the Manhattan measure. A domain D_i is defined as categorical if it is finite, discrete-valued and not naturally ordered. Logically, each data object X in the data set is represented as a conjunction of attribute-value pairs

    A_1 = x_1 ∧ ... ∧ A_m = x_m    (1)

where x_i ∈ D_i for 1 ≤ i ≤ m. For simplicity, we represent X as a tuple

    X = (x_1, ..., x_m) ∈ D_1 × ... × D_m.    (2)

If all D_i's are categorical domains, then objects in S are called categorical objects. We consider the clustering problem for mixed-type data objects where some domains are numeric, while others are categorical.
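In code, such mixed-type tuples can be represented directly. The following is a minimal sketch; the schema, attribute names and helper function are illustrative choices of ours, not part of the paper:

```python
# Each object is a tuple X = (x_1, ..., x_m); a parallel descriptor records
# which positions are numeric and which are categorical.
ATTR_TYPES = ["categorical", "numeric", "categorical"]  # illustrative schema

dataset = [
    ("brown", 0.42, "yes"),
    ("gray",  0.17, "no"),
    ("brown", 0.35, "yes"),
]

def objects_equal(x, y):
    """X_i = X_k iff the objects agree on every attribute value."""
    return all(a == b for a, b in zip(x, y))
```

Note that, as in the paper, two equal tuples need not be the same real-world object; equality only means component-wise agreement on all m attributes.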

Let S = {X_1, X_2, ..., X_n} be a set of n objects. Object X_i is represented as

    X_i = (x_i1, x_i2, ..., x_im)    (3)

where m is the total number of attributes. We write X_i = X_k if x_ij = x_kj for 1 ≤ j ≤ m. The relation X_i = X_k does not mean that X_i and X_k are the same object in the real-world database; it means the two objects have equal values for the attributes A_1, A_2, ..., A_m. Consider two clusters C_i ⊆ S and C_j ⊆ S with cardinalities |C_i| and |C_j| respectively. We want to measure the similarity between the two clusters C_i and C_j as defined in the following sections.

3.2 Attribute Frequency Vector of a Categorical Attribute

We define the attribute frequency vector AttV_i(A_j) of categorical attribute A_j of a given cluster C_i as

    AttV_i(A_j) = {(d, rf_i(d)) | d ∈ D_j}    (4)

where D_j is the domain of the j-th attribute A_j, and the relative frequency rf_i(d) in cluster C_i is defined as

    rf_i(d) = freq(d) / |C_i|    (5)

where freq(d) is defined as

    freq(d) = Σ_{l=1}^{|C_i|} 1 such that x_lj = d.    (6)

Here, 0 ≤ freq(d) ≤ |C_i| and hence 0 ≤ rf_i(d) ≤ 1.

3.3 Frequency Vector

We then define the frequency vector of a cluster C_i having mixed-type attributes as

    FreqV_i = (V_i1, V_i2, ..., V_im)    (7)

where m is the total number of attributes, and V_ij is defined by

    V_ij = Mean_i(A_j), if A_j is a numerical attribute;
           AttV_i(A_j), if A_j is a categorical attribute.    (8)

For a numerical attribute A_j, the mean of the j-th attribute of cluster C_i is defined as

    Mean_i(A_j) = (1/|C_i|) Σ_{k=1}^{|C_i|} x_kj    (9)

and for a categorical attribute A_j, AttV_i(A_j) is defined by Equation 4. Here 0 ≤ Mean_i(A_j) ≤ 1.
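The frequency vector of Equations 4-9 can be sketched in a few lines of Python. This is an illustration under our own naming choices; numeric values are assumed already normalized to [0, 1] as the paper requires:

```python
from collections import Counter

def frequency_vector(cluster, attr_types, domains):
    """FreqV_i of a cluster: the mean for numeric attributes and the
    relative-frequency map AttV_i for categorical attributes."""
    n = len(cluster)
    freqv = []
    for j, kind in enumerate(attr_types):
        column = [row[j] for row in cluster]
        if kind == "numeric":
            freqv.append(sum(column) / n)                         # Mean_i(A_j)
        else:
            counts = Counter(column)
            freqv.append({d: counts[d] / n for d in domains[j]})  # rf_i(d)
    return freqv

# A cluster with one categorical and one numeric attribute.
cluster = [("a3", 0.2), ("a3", 0.4), ("a4", 0.6)]
fv = frequency_vector(cluster, ["categorical", "numeric"],
                      {0: ["a1", "a2", "a3", "a4"]})
# fv[0] maps every domain value to its relative frequency; fv[1] is the mean.
```

Values absent from the cluster get relative frequency 0, matching the (a1, 0) style entries of the paper's example tables.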

3.4 Similarity of an Attribute of Two Clusters

The similarity of the k-th attribute of two clusters C_i and C_j is S_k(C_i, C_j), defined by

    S_k(C_i, C_j) = 1 − |V_ik − V_jk|, if A_k is numerical;
                    V_ik · V_jk,       if A_k is categorical    (10)

where the attribute frequency vector V_ik is defined by Equation 8, and the vector product for the k-th categorical attribute is defined as

    V_ik · V_jk = Σ_{l=1}^{|D_k|} rf_i(d_l) × rf_j(d_l)    (11)

where the relative frequency rf_i(d_l) is defined by Equation 5. Here, −1 ≤ V_ik − V_jk ≤ 1, so 0 ≤ 1 − |V_ik − V_jk| ≤ 1, and 0 ≤ V_ik · V_jk ≤ 1; hence 0 ≤ S_k(C_i, C_j) ≤ 1. All numerical values are assumed to be within the range [0, 1], with 0 and 1 inclusive.

3.5 Similarity of Clusters

The similarity of two clusters C_i and C_j is defined by

    Sim(C_i, C_j) = Σ_{k=1}^{m} S_k(C_i, C_j).    (12)

Here 0 ≤ Sim(C_i, C_j) ≤ m, where m is the total number of attributes.

Example 1. Let us take two clusters C_1 and C_2 as shown in Tables 1 and 2. The clusters are taken from a data set containing 3 categorical and 2 numerical attributes. The numerical attributes are normalized to the value range [0, 1]. The domains of the categorical attributes are:

    D_1 = {a1, a2, a3, a4}
    D_2 = {b1, b2, b3, b4, b5, b6}
    D_3 = {c1, c2, c3, c4}
    D_4 = {d1, d2, d3, d4, d5, d6}
    D_5 = {e1, e2, e3, e4, e5}

Let us find the similarity of clusters C_1 and C_2, for which the frequency vectors are computed in Tables 3 and 4 respectively. The similarities of the individual attributes S_k(C_1, C_2) (for k = 1, 2, ..., 5) are:

    A0   A1   A2   A3   A4
    a3   b         c
    a3   b         c
    a4   b         c
    a4   b         c
    a3   b         c

    Table 1: Cluster C_1

    A0   A1   A2   A3   A4
    a2   b         c
    a2   b         c
    a2   b         c

    Table 2: Cluster C_2

    S_1(C_1, C_2) = 0 × 3/3 + 3/5 × 0 + 2/5 × 0 = 0
    S_2(C_1, C_2) = 0 × 3/3 + 1/5 × 0 + 1/5 × 0 + 2/5 × 0 + 1/5 × 0 = 0
    S_3(C_1, C_2) = 1 − |3.469/5 − 1.530/3| = 0.816
    S_4(C_1, C_2) = 3/5 × 2/3 + 2/5 × 1/3 = 0.533
    S_5(C_1, C_2) = 1 − |0.033/5 − 0.011/3| = 0.997

Hence Sim(C_1, C_2) = Σ_{k=1}^{5} S_k(C_1, C_2) = 2.346.

3.6 The Algorithm

Our algorithm is given below.

Input: k = the number of required clusters.

1. Begin with n clusters, each consisting of one object.
2. Repeat step 3 a total of n − k times.
3. Find the pair of most similar clusters C_i and C_j using the similarity measure Sim(C_i, C_j) of Equation 12, and merge C_i and C_j into a single cluster. There is one fewer cluster after each merge.

Output: The remaining k disjoint clusters.
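The merge loop can be sketched end-to-end in Python. This is a naive illustration of the definitions (recomputing every pairwise similarity each pass), not the similarity-matrix implementation analyzed in the complexity section; all function names and the cluster representation are our own:

```python
from collections import Counter
from itertools import combinations

def freq_vector(cluster, attr_types):
    """FreqV of a cluster: mean for numeric attributes, relative
    frequencies for categorical ones (Equations 7-9)."""
    n = len(cluster)
    fv = []
    for j, kind in enumerate(attr_types):
        col = [x[j] for x in cluster]
        if kind == "numeric":
            fv.append(sum(col) / n)
        else:
            fv.append({d: c / n for d, c in Counter(col).items()})
    return fv

def sim(fv_a, fv_b):
    """Sim(C_i, C_j): sum over attributes of S_k (Equations 10-12)."""
    total = 0.0
    for va, vb in zip(fv_a, fv_b):
        if isinstance(va, dict):   # categorical: dot product of frequencies
            total += sum(rf * vb.get(d, 0.0) for d, rf in va.items())
        else:                      # numeric, values normalized to [0, 1]
            total += 1.0 - abs(va - vb)
    return total

def himic(data, attr_types, k):
    """Agglomerative loop: merge the most similar pair n - k times."""
    clusters = [[x] for x in data]
    while len(clusters) > k:
        fvs = [freq_vector(c, attr_types) for c in clusters]
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(fvs[p[0]], fvs[p[1]]))
        clusters[i].extend(clusters.pop(j))   # i < j, so index i is safe
    return clusters

# Toy mixed-type data: two obvious groups.
data = [("a", 0.1), ("a", 0.2), ("b", 0.9), ("b", 0.8)]
groups = himic(data, ["categorical", "numeric"], k=2)
```

Recomputing frequency vectors on every pass makes this sketch far slower than the O(n²) bookkeeping described next, but it follows the three algorithm steps literally.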

    V_00: (a1, 0), (a2, 0), (a3, 3/5), (a4, 2/5)
    V_01: (b1, 0), (b2, 0), (b3, 1/5), (b4, 1/5), (b5, 2/5), (b6, 1/5)
    V_02: 3.469/5
    V_03: (c1, 0), (c2, 0), (c3, 3/5), (c4, 2/5)
    V_04: 0.033/5

    Table 3: Frequency vector FreqV_1 of cluster C_1

    V_00: (a1, 0), (a2, 3/3), (a3, 0), (a4, 0)
    V_01: (b1, 0), (b2, 3/3), (b3, 0), (b4, 0), (b5, 0), (b6, 0)
    V_02: 1.530/3
    V_03: (c1, 0), (c2, 0), (c3, 2/3), (c4, 1/3)
    V_04: 0.011/3

    Table 4: Frequency vector FreqV_2 of cluster C_2

3.7 Complexity Analysis

The complexity of the hierarchical clustering algorithm becomes O(n²) [9] if implemented as stated below.

1. The initial similarity matrix can be computed in O(n²) for n data records. While computing the matrix, the maximum entry of each row is found and the column where it occurs is recorded.

2. Prior to searching for the pair of clusters to be merged in the k-th merge, there are n + 1 − k clusters remaining, and they are described by n − k active rows of the similarity matrix. To find the most similar pair, a search is performed for the maximum of the row maxima, which were found initially in step 1 and are repeatedly updated in step 3. At the k-th stage the search involves n − k − 1 comparisons. The total number of comparisons over n − 1 stages is of the order of O(n²).

3. After the most similar pair is found at the k-th stage, it is necessary to update the n − k similarity matrix entries corresponding to the cluster resulting from the merger. Additionally, it is necessary to find the new row maximum corresponding to the new cluster and to check the remaining updated values against their respective row maxima. For n − 1 stages the number of comparisons and the number of updates are each O(n²). Finally, it is necessary to search for a new row maximum in any row where the cluster or column corresponding to the maximum is involved in the merger; that is, the most similar pair in such rows includes one of the clusters involved in the merger. The computation burden for

this latter aspect of updating is highly contingent on the pattern of numbers in the similarity matrix and the particular clustering method employed; however, in the typical case one row would be expected to be updated for this reason at each stage, and the expected length of the row at the k-th stage is (n − k)/2. Therefore, the expected complexity of this computation is O(n²).

4. To avoid testing each row of the similarity matrix as to whether it represents an active cluster, a list of active clusters is maintained. At each stage one element of the list is removed and the remaining ones are pushed down. It is necessary to search this list for the location of the cluster to be deleted; the expected number of comparisons over n − 1 stages is O(n²).

Adding up the complexities of all steps gives a total complexity of O(n²).

4 Experimental Results

In this section, we present an experimental evaluation of our algorithm and compare its performance with counterparts such as ROCK [8] and k-sets [7]. The primary use of clustering algorithms is to discover the grouping structures inherent in data. For this purpose, an assumption is first made that a certain structure may exist in a given data set, and then a clustering algorithm is used to verify the assumption and recover the structure. We adopted an external criterion which measures the degree of correspondence between the clusters obtained from our clustering algorithm and the classes assigned a priori.

4.1 Accuracy Calculation of the Clustering Result

The accuracy measure r of the clustering result, called the clustering accuracy (exactness), is defined as follows:

    r = (1/n) Σ_{i=1}^{k} a_i    (13)

where a_i is the maximum number of data objects of cluster i belonging to the same original class in the test data (the correct answer) and n is the number of data objects in the database.
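Equation 13 translates directly into code. In this sketch (the index-based encoding of clusters is our own choice), each cluster is a list of indices into the data set and `labels` holds the a-priori classes:

```python
from collections import Counter

def clustering_accuracy(clusters, labels):
    """r = (1/n) * sum_i a_i, where a_i is the size of the majority
    original class inside cluster i (Equation 13)."""
    n = sum(len(c) for c in clusters)
    a = [Counter(labels[idx] for idx in c).most_common(1)[0][1]
         for c in clusters]
    return sum(a) / n

labels = ["D", "D", "C", "C"]
r_perfect = clustering_accuracy([[0, 1], [2, 3]], labels)  # 1.0
r_mixed = clustering_accuracy([[0, 1, 2], [3]], labels)    # 3/4 = 0.75
```

The clustering error of Equation 14 is then simply `1 - clustering_accuracy(...)`.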
Further, the clustering error e is defined as:

    e = 1 − r    (14)

4.2 Data Sets

We have experimented with three commonly used real-life data sets, described below. The data sets are available in the UCI Machine Learning Repository.

The Soybean (small) disease data set is chosen to test our algorithm because all of its attributes are categorical. This data set has 47 instances, each described by 35 attributes, all categorical. Each instance is labeled as one of four diseases: Diaporthe Stem Canker (D),

Charcoal Rot (C), Rhizoctonia Root Rot (R), and Phytophthora Rot (P). Except for Phytophthora Rot, which has 17 instances, all other diseases have 10 instances each.

The Congressional Voting data set has 435 instances described by 16 attributes (e.g., education spending, crime, etc.). All attributes are Boolean with Yes or No values. The data set contains 168 Republican (R) and 267 Democrat (D) instances.

The Credit Approval data set has 690 instances, each described by 6 numeric and 9 categorical attributes. The instances are classified into 2 classes: approved (A), labeled +, and rejected (R), labeled −. There are 37 instances having missing values on 7 attributes. The data set contains 307 approved and 383 rejected instances.

4.3 Results on the Soybean Data Set

We used our algorithm to cluster the Soybean small disease data set with 4 clusters (k = 4). The result is shown in Table 5, where C_i denotes the clusters produced by the algorithm, and D (Diaporthe Stem Canker), C (Charcoal Rot), R (Rhizoctonia Root Rot) and P (Phytophthora Rot) are the names of the original classes.

    Cluster No.   D   C   R   P
    C_1
    C_2
    C_3
    C_4

    Table 5: The clustering result on the Soybean data set with our algorithm (k = 4).

The algorithm discovers soybean disease clusters that are completely identical to the original classes, with accuracy r = 1.0. This clustering accuracy is the highest among the compared methods, as shown in Table 6.

    Algorithm                                  Accuracy (average)
    Fuzzy k-modes                              0.79
    Fuzzy k-modes with tabu search             0.99
    k-modes                                    0.89
    k-modes with refinement initialization     0.98
    k-sets                                     1.00
    Our algorithm                              1.00

    Table 6: Accuracy of clustering algorithms on the Soybean disease data; results of k-sets and the others are obtained from [7].

4.4 Results on the Congressional Voting Data Set

In the test on the Congressional Voting data set we took 432 of the 435 records as input (records having missing values in all the attributes are excluded). The result for this data set (with k = 2) is shown in Table 7. We found 2 clusters, C_1 and C_2, of sizes 173 and 259 respectively. In the cluster for Republicans (cluster C_1) around 7.51% of the members are Democrats, and in the cluster for Democrats (cluster C_2) around 2.31% of the members are Republicans. We found the accuracy for this data set to be 0.96.

    Cluster No.   Republican   Democrat
    C_1
    C_2

    Table 7: Results of our algorithm on the Congressional Voting data set with k = 2, using 432 of the 435 records.

Table 8 contains the results on the Congressional Voting data set using the ROCK and k-sets algorithms.

    ROCK (with 372 of the 435 records):
    Cluster No.   Republican   Democrat
    C_1
    C_2

    k-sets (with 350 of the 435 records):
    Cluster No.   Republican   Democrat
    C_1
    C_2

    Table 8: Results of the ROCK and k-sets algorithms on the Congressional Voting data set. Results are taken from [7, 8].

The accuracies on this data set are shown in Table 9; we observed that the accuracy of our clustering algorithm is the highest with respect to the ROCK and k-sets algorithms.

    Algorithm          Accuracy (average)
    ROCK               0.93
    k-sets             0.93
    Our algorithm      0.96

    Table 9: Accuracies of clustering algorithms on the Congressional Voting data set.

    Cluster No.   A   R
    C_1
    C_2

    Table 10: Results of our algorithm on the Credit Approval data set (666 of the 690 records). A = approved class, R = rejected class.

    Cluster No.   A   R
    C_1
    C_2

    Table 11: Results of the k-sets algorithm on the Credit Approval data set (666 of the 690 records). A = approved class, R = rejected class.

4.5 Results on the Credit Approval Data Set

For the Credit Approval data set we excluded 24 instances having missing values in numeric attributes, as done in [7]. Using our algorithm we obtained two clusters, C_1 and C_2, as shown in Table 10. In the cluster of the rejected class (C_1) we have 18.93% approved instances, and in the approved class (C_2) we found 8.40% rejected instances. The accuracy of our clustering algorithm is 0.85. Table 11 shows the result of the k-sets algorithm. The comparison of the accuracies of our algorithm with the k-prototypes, tabu-search k-prototypes and k-sets algorithms (accuracy measures are taken from [7]) is shown in Table 12, which indicates that our algorithm bests the others.

    Algorithm                      Accuracy (average)
    k-prototypes                   0.77
    Tabu search k-prototypes       0.80
    k-sets                         0.83
    Our algorithm                  0.85

    Table 12: Accuracies of clustering algorithms on the Credit Approval data set; results of k-sets and the others are obtained from [7].

5 Conclusion

In this paper, we have presented the concept of frequency vectors for clusters, and based on this idea we provided a method to measure the similarity between a pair of clusters with categorical as well as numerical attributes. We also developed a bottom-up hierarchical clustering algorithm that employs this similarity measure for merging clusters. Our method naturally extends to non-metric similarity measures that are relevant in situations where clusters are the only source of knowledge. Our method clusters not only categorical or numerical data but also mixtures of both. We have demonstrated the effectiveness of our algorithm on a number of standard data sets. In future work, we will attempt to extend the frequency vector method to find subspace clusters as well.

References

[1] C. Blake, E. Keogh and C.J. Merz, UCI Repository of Machine Learning Databases.
[2] V. Ganti, J. Gehrke and R. Ramakrishnan, CACTUS: Clustering categorical data using summaries, in Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD), USA, 1999.
[3] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, Vol. 2, No. 2.
[4] H. Ralambondrainy, A conceptual version of the k-means algorithm, Pattern Recognition Letters, Vol. 15, No. 11.
[5] Z. Huang, A fast clustering algorithm to cluster very large categorical data sets in data mining, in Research Issues on Data Mining and Knowledge Discovery.
[6] Z. Huang, Clustering large data sets with mixed numeric and categorical values, in Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, World Scientific, Singapore.
[7] S.Q. Le and T.B. Ho, A k-sets clustering algorithm for categorical and mixed data, in Proceedings of the 6th SANKEN (ISIR) International Symposium.
[8] S. Guha, R. Rastogi and K. Shim, ROCK: A robust clustering algorithm for categorical attributes, in Proc. of the 15th Int'l Conf. on Data Engineering.
[9] M. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.


More information

Module: CLUTO Toolkit. Draft: 10/21/2010

Module: CLUTO Toolkit. Draft: 10/21/2010 Module: CLUTO Toolkit Draft: 10/21/2010 1) Module Name CLUTO Toolkit 2) Scope The module briefly introduces the basic concepts of Clustering. The primary focus of the module is to describe the usage of

More information

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE Sinu T S 1, Mr.Joseph George 1,2 Computer Science and Engineering, Adi Shankara Institute of Engineering

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

PARTCAT: A Subspace Clustering Algorithm for High Dimensional Categorical Data

PARTCAT: A Subspace Clustering Algorithm for High Dimensional Categorical Data 2006 International Joint Conference on Neural Networks Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-21, 2006 PARTCAT: A Subspace Clustering Algorithm for High Dimensional Categorical

More information

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique NDoT: Nearest Neighbor Distance Based Outlier Detection Technique Neminath Hubballi 1, Bidyut Kr. Patra 2, and Sukumar Nandi 1 1 Department of Computer Science & Engineering, Indian Institute of Technology

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

INF4820, Algorithms for AI and NLP: Hierarchical Clustering INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Improved Performance of Unsupervised Method by Renovated K-Means

Improved Performance of Unsupervised Method by Renovated K-Means Improved Performance of Unsupervised Method by Renovated P.Ashok Research Scholar, Bharathiar University, Coimbatore Tamilnadu, India. ashokcutee@gmail.com Dr.G.M Kadhar Nawaz Department of Computer Application

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Building a Concept Hierarchy from a Distance Matrix

Building a Concept Hierarchy from a Distance Matrix Building a Concept Hierarchy from a Distance Matrix Huang-Cheng Kuo 1 and Jen-Peng Huang 2 1 Department of Computer Science and Information Engineering National Chiayi University, Taiwan 600 hckuo@mail.ncyu.edu.tw

More information

Using the Kolmogorov-Smirnov Test for Image Segmentation

Using the Kolmogorov-Smirnov Test for Image Segmentation Using the Kolmogorov-Smirnov Test for Image Segmentation Yong Jae Lee CS395T Computational Statistics Final Project Report May 6th, 2009 I. INTRODUCTION Image segmentation is a fundamental task in computer

More information

Chuck Cartledge, PhD. 23 September 2017

Chuck Cartledge, PhD. 23 September 2017 Introduction Definitions Numerical data Hands-on Q&A Conclusion References Files Big Data: Data Analysis Boot Camp Agglomerative Clustering Chuck Cartledge, PhD 23 September 2017 1/30 Table of contents

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

PATENT DATA CLUSTERING: A MEASURING UNIT FOR INNOVATORS

PATENT DATA CLUSTERING: A MEASURING UNIT FOR INNOVATORS International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume 1 Number 1, May - June (2010), pp. 158-165 IAEME, http://www.iaeme.com/ijcet.html

More information

High Accuracy Clustering Algorithm for Categorical Dataset

High Accuracy Clustering Algorithm for Categorical Dataset Proc. of Int. Conf. on Recent Trends in Information, Telecommunication and Computing, ITC High Accuracy Clustering Algorithm for Categorical Dataset Aman Ahmad Ansari 1 and Gaurav Pathak 2 1 NIMS Institute

More information

Determination of Similarity Threshold in Clustering Problems for Large Data Sets

Determination of Similarity Threshold in Clustering Problems for Large Data Sets Determination of Similarity Threshold in Clustering Problems for Large Data Sets Guillermo Sánchez-Díaz 1 and José F. Martínez-Trinidad 2 1 Center of Technologies Research on Information and Systems, The

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Hierarchical and Ensemble Clustering

Hierarchical and Ensemble Clustering Hierarchical and Ensemble Clustering Ke Chen Reading: [7.8-7., EA], [25.5, KPM], [Fred & Jain, 25] COMP24 Machine Learning Outline Introduction Cluster Distance Measures Agglomerative Algorithm Example

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

A Comparative Analysis between K Means & EM Clustering Algorithms

A Comparative Analysis between K Means & EM Clustering Algorithms A Comparative Analysis between K Means & EM Clustering Algorithms Y.Naser Eldin 1, Hythem Hashim 2, Ali Satty 3, Samani A. Talab 4 P.G. Student, Department of Computer, Faculty of Sciences and Arts, Ranyah,

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING

A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING A COMPARATIVE STUDY ON K-MEANS AND HIERARCHICAL CLUSTERING Susan Tony Thomas PG. Student Pillai Institute of Information Technology, Engineering, Media Studies & Research New Panvel-410206 ABSTRACT Data

More information

Genetic Algorithm and Simulated Annealing based Approaches to Categorical Data Clustering

Genetic Algorithm and Simulated Annealing based Approaches to Categorical Data Clustering Genetic Algorithm and Simulated Annealing based Approaches to Categorical Data Clustering Indrajit Saha and Anirban Mukhopadhyay Abstract Recently, categorical data clustering has been gaining significant

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Data Clustering With Leaders and Subleaders Algorithm

Data Clustering With Leaders and Subleaders Algorithm IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719, Volume 2, Issue 11 (November2012), PP 01-07 Data Clustering With Leaders and Subleaders Algorithm Srinivasulu M 1,Kotilingswara

More information

Agglomerative clustering on vertically partitioned data

Agglomerative clustering on vertically partitioned data Agglomerative clustering on vertically partitioned data R.Senkamalavalli Research Scholar, Department of Computer Science and Engg., SCSVMV University, Enathur, Kanchipuram 631 561 sengu_cool@yahoo.com

More information

Unsupervised learning, Clustering CS434

Unsupervised learning, Clustering CS434 Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015

International Journal of Modern Trends in Engineering and Research   e-issn No.: , Date: 2-4 July, 2015 International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Privacy Preservation Data Mining Using GSlicing Approach Mr. Ghanshyam P. Dhomse

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

Swarm Based Fuzzy Clustering with Partition Validity

Swarm Based Fuzzy Clustering with Partition Validity Swarm Based Fuzzy Clustering with Partition Validity Lawrence O. Hall and Parag M. Kanade Computer Science & Engineering Dept University of South Florida, Tampa FL 33620 @csee.usf.edu Abstract

More information

Clustering: An art of grouping related objects

Clustering: An art of grouping related objects Clustering: An art of grouping related objects Sumit Kumar, Sunil Verma Abstract- In today s world, clustering has seen many applications due to its ability of binding related data together but there are

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP Dynamics of Continuous, Discrete and Impulsive Systems Series B: Applications & Algorithms 14 (2007) 103-111 Copyright c 2007 Watam Press FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

XML Clustering by Bit Vector

XML Clustering by Bit Vector XML Clustering by Bit Vector WOOSAENG KIM Department of Computer Science Kwangwoon University 26 Kwangwoon St. Nowongu, Seoul KOREA kwsrain@kw.ac.kr Abstract: - XML is increasingly important in data exchange

More information

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York A Systematic Overview of Data Mining Algorithms Sargur Srihari University at Buffalo The State University of New York 1 Topics Data Mining Algorithm Definition Example of CART Classification Iris, Wine

More information

A Genetic k-modes Algorithm for Clustering Categorical Data

A Genetic k-modes Algorithm for Clustering Categorical Data A Genetic k-modes Algorithm for Clustering Categorical Data Guojun Gan, Zijiang Yang, and Jianhong Wu Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada M3J 1P3 {gjgan,

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Relative Constraints as Features

Relative Constraints as Features Relative Constraints as Features Piotr Lasek 1 and Krzysztof Lasek 2 1 Chair of Computer Science, University of Rzeszow, ul. Prof. Pigonia 1, 35-510 Rzeszow, Poland, lasek@ur.edu.pl 2 Institute of Computer

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 09: Vector Data: Clustering Basics Instructor: Yizhou Sun yzsun@cs.ucla.edu October 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering World Journal of Computer Application and Technology 5(2): 24-29, 2017 DOI: 10.13189/wjcat.2017.050202 http://www.hrpub.org Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

More information

Unsupervised learning on Color Images

Unsupervised learning on Color Images Unsupervised learning on Color Images Sindhuja Vakkalagadda 1, Prasanthi Dhavala 2 1 Computer Science and Systems Engineering, Andhra University, AP, India 2 Computer Science and Systems Engineering, Andhra

More information

On the Consequence of Variation Measure in K- modes Clustering Algorithm

On the Consequence of Variation Measure in K- modes Clustering Algorithm ORIENTAL JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY An International Open Free Access, Peer Reviewed Research Journal Published By: Oriental Scientific Publishing Co., India. www.computerscijournal.org ISSN:

More information

Lecture 15 Clustering. Oct

Lecture 15 Clustering. Oct Lecture 15 Clustering Oct 31 2008 Unsupervised learning and pattern discovery So far, our data has been in this form: x 11,x 21, x 31,, x 1 m y1 x 12 22 2 2 2,x, x 3,, x m y We will be looking at unlabeled

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 06 07 Department of CS - DM - UHD Road map Cluster Analysis: Basic

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information