The Application of K-medoids and PAM to the Clustering of Rules


A. P. Reynolds, G. Richards, and V. J. Rayward-Smith
School of Computing Sciences, University of East Anglia, Norwich

Abstract. Earlier research has resulted in the production of an all-rules algorithm for data mining that produces all conjunctive rules above given confidence and coverage thresholds. While this is a useful tool, it may produce a large number of rules. This paper describes the application of two clustering algorithms to these rules, in order to identify sets of similar rules and to better understand the data.

1 Introduction

Data mining often involves the production of accurate yet simple rules that describe records of interest within a database. For example, in a motor insurance database, the records of interest may be those motorists who have claimed on their insurance. Often, the rules produced are conjunctions of simple clauses, where each clause applies to one field within the database (e.g. [2]). For example,

    if age < 25 and car type = sports then claim = yes

is such a rule. Although this approach produces useful rules, these rules may describe only small subsets of the records of interest. To describe larger subsets accurately, collections of such rules are required. B. de la Iglesia et al. [3] have researched the application of multi-objective genetic algorithms to data mining, using confidence and coverage as two separate objectives. This approach leads to the production of a collection of Pareto-optimal rules.

Richards et al. [1] have created software that produces all the rules, above given confidence and coverage thresholds, that apply to the data. This all-rules algorithm can produce many thousands of rules. For example, on the adult database [4], seeking rules with greater than 15% coverage and greater than 60% confidence results in 4785 rules. Examining these rules by eye immediately reveals that many are very similar. Others, while appearing to be different, may in fact match similar sets of records. In order to determine which rules match similar sets of records, we applied two clustering algorithms to the rules produced by the all-rules algorithm: k-medoids and Partitioning About Medoids (PAM).
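To make the setting concrete, the following minimal Python sketch shows how a conjunctive rule of this kind might be represented and how its confidence and coverage could be computed over a toy table. The records, field names and the exact definitions of the two measures are illustrative assumptions, not taken from the paper or from the all-rules software.

    # Illustrative sketch only: a conjunctive rule as a pair (antecedent, consequent),
    # where each side is a list of single-field clauses. The toy records and the
    # measure definitions below are assumptions, not the authors' implementation.
    records = [
        {"age": 22, "car type": "sports", "claim": "yes"},
        {"age": 41, "car type": "saloon", "claim": "no"},
        {"age": 24, "car type": "sports", "claim": "yes"},
        {"age": 35, "car type": "sports", "claim": "no"},
    ]

    rule = (
        [lambda r: r["age"] < 25, lambda r: r["car type"] == "sports"],  # antecedent
        [lambda r: r["claim"] == "yes"],                                 # consequent
    )

    def matches(clauses, record):
        """True if the record satisfies every clause of the conjunction."""
        return all(clause(record) for clause in clauses)

    antecedent, consequent = rule
    matched = [r for r in records if matches(antecedent, r)]
    supporting = [r for r in matched if matches(consequent, r)]

    # One common reading of the two measures (the paper's precise definitions may differ):
    confidence = len(supporting) / len(matched) if matched else 0.0
    coverage = len(matched) / len(records)
    print(f"confidence = {confidence:.2f}, coverage = {coverage:.2f}")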

2 Rule Dissimilarity

Let D be the database from which rules of the form α → β are deduced. If ρ is such a rule then the set of support for ρ, S(ρ), is the set of records in D on which the rule holds, i.e.

    S(ρ) = {r ∈ D : α(r) ∧ β(r)}.

If ρ1 and ρ2 are two arbitrary rules, we can define a dissimilarity measure

    d(ρ1, ρ2) = |S(ρ1) ∪ S(ρ2)| − |S(ρ1) ∩ S(ρ2)| = |S(ρ1) \ S(ρ2)| + |S(ρ2) \ S(ρ1)|.

Dividing this measure by the number of records in the database gives the simple matching coefficient [5]. Alternatively, we can use the Jaccard coefficient [6] on the sets of support and define

    n(ρ1, ρ2) = d(ρ1, ρ2) / |S(ρ1) ∪ S(ρ2)|.

Both of these dissimilarity measures are pseudometrics [7]. Intuitively, two rules that are mutually supported by a thousand records and differ over only six are more similar than two rules that are mutually supported by no records and differ over five. For this reason, the results described in this paper are those obtained using the Jaccard dissimilarity coefficient.
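Since both measures depend only on the sets of support, they are cheap to evaluate once each rule's support has been stored, for example as a set of record indices. A small Python sketch under that assumption (the function names are ours, not the paper's):

    def rule_dissimilarity(s1, s2):
        """d(rho1, rho2): the number of records supporting exactly one of the two rules."""
        return len(s1 ^ s2)  # size of the symmetric difference

    def jaccard_dissimilarity(s1, s2):
        """n(rho1, rho2): the symmetric difference normalised by the size of the union."""
        union = len(s1 | s2)
        return len(s1 ^ s2) / union if union else 0.0

    # The paper's intuition: a thousand shared records and six differences, versus
    # no shared records and five differences.
    a = set(range(1000)) | {1000, 1001, 1002}
    b = set(range(1000)) | {2000, 2001, 2002}
    c, d = {1, 2}, {3, 4, 5}
    print(rule_dissimilarity(a, b), jaccard_dissimilarity(a, b))  # 6 and roughly 0.006
    print(rule_dissimilarity(c, d), jaccard_dissimilarity(c, d))  # 5 and 1.0

Under the unnormalised measure the second pair looks marginally more similar (5 against 6), while the Jaccard coefficient ranks the first pair as far more similar, which is the behaviour motivating its use here.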

3 K-means

The k-means algorithm [8] is a well known technique for performing clustering on objects in R^n. Each cluster is centred about a point called the centroid, where the centroid's coordinates are the mean of the coordinates of the objects in the cluster. K-means can be applied to the clustering of rules, but there are two reasons for not doing so. Firstly, although each rule may be represented as a binary string, centroids will not have this structure and will make little sense as rules. More importantly, the k-means algorithm requires that distances to the centroids be calculated for each object at each iteration. Since the original database may have tens of thousands of records, such calculations will be slow.

4 K-medoids

The k-medoids algorithm, as used to produce the results of this paper, is an adaptation of the k-means algorithm. Rather than calculate the mean of the items in each cluster, a representative item, or medoid, is chosen for each cluster at each iteration. Medoids for each cluster are calculated by finding the object i within the cluster that minimizes

    Σ_{j ∈ C_i} d(i, j),

where C_i is the cluster containing object i and d(i, j) is the distance between objects i and j. There are two advantages to using existing rules as the centres of the clusters. Firstly, a medoid rule serves to usefully describe the cluster. Secondly, there is no need for repeated calculation of distances at each iteration, since the k-medoids algorithm can simply look up distances from a distance matrix.

The k-medoids algorithm can be summarized as follows:

1. Choose k objects at random to be the initial cluster medoids.
2. Assign each object to the cluster associated with the closest medoid.
3. Recalculate the positions of the k medoids.
4. Repeat Steps 2 and 3 until the medoids become fixed.

Step 3 could be performed by calculating Σ_{j ∈ C_i} d(i, j) for each object i from scratch at each iteration. However, many objects remain in the same cluster from one iteration of the algorithm to the next. Improvements in speed can be obtained by adjusting the sums whenever an object leaves or enters a cluster. Step 2 can also be made more efficient, for larger values of k. For each object, an array of the other objects, sorted on distance, is maintained. The closest medoid can be found by scanning through this array until a medoid is encountered, rather than comparing the distances to all of the medoids.

Figure 1 shows a comparison of the time requirements for each version of the code. The first series shows the requirements of the naive implementation. The second shows the speed improvements obtained by adjusting the sums required in Step 3, rather than recalculating them. The third shows the extra improvements obtained by maintaining, for each object, an array of the other objects sorted on distance. Each point in the graph indicates the time required to run the k-medoids algorithm 100 times on 4785 rules extracted from the adult database, using an AMD Athlon XP 1800+ (1.53 GHz) with 480 MB of RAM.

[Fig. 1. Time requirements of the k-medoids algorithm. Vertical axis: time taken (s); series: calculate distance sums, adjust distance sums, objects sorted on distance.]
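A minimal Python sketch of this basic k-medoids loop, working from a precomputed distance matrix as the paper suggests. It implements the naive form of Step 3 rather than the incremental-sum or sorted-array speed-ups, the function name is ours, and degenerate cases such as empty clusters are not handled.

    import random

    def k_medoids(dist, k, seed=0):
        """Basic k-medoids over a precomputed distance matrix (list of lists)."""
        rng = random.Random(seed)
        n = len(dist)
        medoids = rng.sample(range(n), k)          # Step 1: random initial medoids
        while True:
            # Step 2: assign every object to the cluster of its closest medoid.
            clusters = {m: [] for m in medoids}
            for i in range(n):
                closest = min(medoids, key=lambda m: dist[i][m])
                clusters[closest].append(i)
            # Step 3 (naive): the new medoid of a cluster is the member that
            # minimises the sum of distances to the other members.
            new_medoids = [
                min(members, key=lambda i: sum(dist[i][j] for j in members))
                for members in clusters.values()
            ]
            # Step 4: stop once the set of medoids no longer changes.
            if set(new_medoids) == set(medoids):
                return new_medoids, clusters
            medoids = new_medoids

For the rules clustered in this paper, dist[i][j] would hold the Jaccard dissimilarity n(ρi, ρj) between the sets of support of rules i and j.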

5 Partitioning About Medoids (PAM)

Partitioning about medoids (PAM) also clusters objects about k medoids, where k is specified in advance. However, the algorithm takes the form of a steepest ascent hillclimber, using a simple swap neighbourhood operation. In each iteration, a medoid object i and a non-medoid object j are selected that produce the best clustering when their rôles are switched. The objective function used is the sum of the distances from each object to the closest medoid. As this search phase of the algorithm is slow, the initial set of medoids is constructed in a greedy build phase. Starting with an empty set of medoids, objects are added one at a time until k medoids have been selected. At each step, the new medoid is selected so as to minimize the objective function.

6 Silhouettes

Kaufman and Rousseeuw [5] suggest the use of silhouettes both to determine which objects lie well within their clusters and which do not, and also to judge the quality of the clustering obtained. For each object i, let a(i) be the average distance of i from all other objects in cluster C_i. For every other cluster C ≠ C_i, let d(i, C) be the average distance of i from the objects in C. After computing d(i, C) for all clusters C ≠ C_i, let b(i) be the smallest. The cluster for which this minimum is attained is called the neighbour of i. The number s(i) is given by

    s(i) = (b(i) − a(i)) / max(a(i), b(i))

and provides a measure of how well object i fits into cluster C_i rather than the neighbouring cluster. If s(i) is close to one, object i can be said to be well classified. If s(i) is close to zero, it is unclear whether i should belong to cluster C_i or to its neighbour. A negative value suggests that i has been misclassified. The following summary values are also defined:

- The mean of s(i) for all objects i in a cluster is called the average silhouette width of that cluster.
- The mean of s(i) for all the objects is called the average silhouette width for the entire data set and is denoted by s̄(k), where k is the number of clusters.

Kaufman and Rousseeuw suggest that s̄(k) can be used for the selection of a best value of k, by choosing that k for which s̄(k) is maximal.
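For comparison with the k-medoids sketch above, here is an illustrative Python sketch of the build and swap phases of PAM described in Section 5, again over a precomputed distance matrix. It is our own simplified reading of the algorithm, not the authors' or Kaufman and Rousseeuw's implementation, and it recomputes the objective from scratch for every candidate swap.

    def objective(dist, medoids):
        """Sum over all objects of the distance to the closest medoid."""
        return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

    def pam(dist, k):
        n = len(dist)
        # Build phase: greedily add the object that most reduces the objective.
        medoids = []
        while len(medoids) < k:
            candidates = (i for i in range(n) if i not in medoids)
            medoids.append(min(candidates, key=lambda i: objective(dist, medoids + [i])))
        # Swap phase: best-improvement hill climbing over medoid/non-medoid swaps.
        while True:
            best_cost = objective(dist, medoids)
            best_swap = None
            for pos in range(len(medoids)):
                for j in range(n):
                    if j in medoids:
                        continue
                    trial = medoids[:pos] + [j] + medoids[pos + 1:]
                    cost = objective(dist, trial)
                    if cost < best_cost:
                        best_cost, best_swap = cost, (pos, j)
            if best_swap is None:
                return medoids
            medoids[best_swap[0]] = best_swap[1]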

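The silhouette quantities of Section 6 can likewise be computed directly from the distance matrix. The following short Python sketch is our own code; conventions such as s(i) = 0 for singleton clusters are assumptions rather than taken from the paper, and at least two clusters are assumed.

    def silhouette_values(dist, labels):
        """Return s(i) for every object, given one cluster label per object."""
        clusters = {}
        for i, lab in enumerate(labels):
            clusters.setdefault(lab, []).append(i)
        s = []
        for i, lab in enumerate(labels):
            own = [j for j in clusters[lab] if j != i]
            if not own:
                s.append(0.0)      # assumed convention for singleton clusters
                continue
            a = sum(dist[i][j] for j in own) / len(own)            # within-cluster average
            b = min(sum(dist[i][j] for j in members) / len(members)  # nearest other cluster
                    for other, members in clusters.items() if other != lab)
            s.append((b - a) / max(a, b))
        return s

    def average_silhouette_width(dist, labels):
        """The mean of s(i) over all objects, used to compare values of k."""
        values = silhouette_values(dist, labels)
        return sum(values) / len(values)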
7 Results

We applied these two clustering algorithms, with small values of k, to the 4785 rules derived from the adult database. Figure 2 shows the sum of distances from rules to medoids. Figure 3 shows the silhouette widths for the clusterings. PAM typically outperforms the k-medoids algorithm according to both measures. The results presented are for small values of k, since we are endeavouring to simplify our understanding of the data.

[Fig. 2. Sum of the distances of rules to the closest medoids. Series: PAM and 100 × k-medoids, for k from 2 to 15.]
[Fig. 3. Silhouette widths of clusterings obtained from PAM and k-medoids, for k from 2 to 15.]

On the basis of silhouette width, PAM obtains the best clusterings when k equals two or seven. In both cases, the strongest cluster contains all 1433 rules with a clause of the form Cap gain >= x. This cluster has an average silhouette width of 0.74 in both cases and has

    IF Age >= 32 AND Cap gain >= 3103 THEN *Salary = >50K

as its medoid. In the two cluster case, the other 3352 rules appear in the second cluster, which has an average silhouette width of 0.27. In the seven cluster case, these rules are split into six clusters, with average silhouette widths from 0.07 to 0.37.

It is unclear from Figures 2 and 3 whether better clusterings might be obtained for larger values of k. The time requirements for PAM increase considerably as the number of clusters increases, as shown in Figure 4, but Figure 1 shows that the k-medoids algorithm may be used with much larger values of k. Figure 5 shows the silhouette widths of clusterings obtained for values of k up to 2000. These results suggest that the best clusterings are obtained for values of k around 1200. However, such large values of k do not adequately simplify the data.

[Fig. 4. Time requirements for the PAM algorithm. Vertical axis: time taken (s).]
[Fig. 5. Results of 100 runs of the k-medoids algorithm for k up to 2000. Vertical axis: average silhouette width.]

8 Conclusions and Further Research

The PAM and k-medoids algorithms have been applied to the clustering of rules obtained from the all-rules algorithm. PAM obtains better results according to both the sum of distances to medoids and the overall silhouette width for most values of k. However, it requires considerably more time than k-medoids for larger values of k.

The clusters obtained seem to be reasonable, especially the cluster of all rules containing a clause of the form Cap gain >= x, which is the only strong cluster in the two and seven cluster cases.

There are a number of potential areas for further research:

- Experimentation with other medoid based clustering algorithms, such as CLARANS [9], may result in improved clusterings and efficiency.
- The application of hierarchical clustering algorithms [5] to the clustering of rules might also reveal useful information.
- Better methods for summarizing clusters, giving more information than just the cluster medoid, are required if we are to improve our understanding of the data via clustering of rules.
- Numeric fields lead to the production of many similar rules differing only in the bounds on these fields. Every rule has the same weight when applying k-medoids or PAM, so if essentially the same rule is produced many times, a bias is created towards clusters centred about such rules. More interesting clusters might be found if such rules were given a lesser weight or if the all-rules algorithm were prevented from producing almost identical rules.
- Our difference measures are semantic-based, i.e. related to the sets of support. An alternative strategy would be syntactic-based, i.e. based on the actual expressions used in the rules. A comparison of the clusters obtained using the two strategies is a matter for further research.

References

1. G. Richards and V. J. Rayward-Smith. Discovery of association rules in tabular data. In Proceedings of the IEEE First International Conference on Data Mining, San Jose, California, USA, pages 465-473. IEEE Computer Society, 2001.
2. R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-Based Rule Mining in Large, Dense Databases. In Proceedings of the 15th International Conference on Data Engineering, pages 188-197. IEEE, 1999.
3. B. de la Iglesia, M. S. Philpott, A. J. Bagnall, and V. J. Rayward-Smith. Data Mining Rules Using Multi-Objective Evolutionary Algorithms. In Proceedings of the 2003 IEEE Congress on Evolutionary Computation, 2003.
4. C. L. Blake and C. J. Merz. UCI Repository of Machine Learning Databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
5. Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons Inc., 1990.
6. P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547-579, 1901.
7. J. C. Gower and P. Legendre. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5-48, 1986.
8. J. B. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. University of California Press, 1967.
9. Raymond T. Ng and Jiawei Han. CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, 14(5):1003-1016, 2002.