Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907

The Game of Clustering

Abstract

Clustering is a technique for discovering patterns and structure in data. Often, the most difficult part of this process is determining the optimal number of patterns or clusters. In this paper, we introduce a novel approach to clustering that treats the clusters as "players" in a game. During this "clustering game", players compete for ownership of data points according to a set of game rules. We describe a number of possible rule sets, and conduct an initial evaluation to ascertain which are suitable for the clustering problem. This is followed by a comparison of different game rules on four characteristic data sets. Results are promising, showing that the game framework can successfully find optimal, or near optimal, solutions, without knowing the number of clusters in advance! An analysis of these results highlights possible extensions to the game strategies.

1 The Clustering Problem

Clustering is the grouping of similar objects. The aim is to divide n objects, each represented by a point in p-dimensional space, into clusters that reflect the underlying structure of the data. This problem is NP-hard, even when the number of clusters is known [1]. Numerous algorithms, both deterministic and stochastic, have been developed for the clustering problem [2, 3, 4]. However, these techniques do not guarantee optimality, and the majority need the number of clusters a priori. Even when an algorithm can determine the number of clusters, this is often external to the clustering procedure (the process is repeated for differing values of k, and the optimal clustering selected). We propose a new approach to clustering that treats the process as a game. The players in this game are clusters, which compete amongst themselves for ownership of objects according to a set of rules or strategies. These rules determine how the game is initialised, how turns are allocated, which objects are attacked, how attacks are resolved, and finally, when to stop the game. In this paper, we explore the game approach and investigate which strategies are appropriate for specific data sets.

2 The Game of Clustering

The "clustering game" starts by randomly distributing the objects between an initial number of clusters. In turn, each player may attack any object they do not own; the object is defended by the cluster that owns it. If successful, the attacking cluster gains the object. If unsuccessful, ownership of the object does not change. Players are eliminated from the game when they lose their last object. The game concludes when the stopping criterion is met. We define a game by the following strategies: initialisation (the number of starting players, and the initial distribution of objects), attack allocation (how attacks are assigned to players), object selection (which object a given cluster should attack), attack resolution (whether the attack is successful), and a stopping criterion (when to end the game).

2.1 Initialisation

Given n objects, a simple strategy is to start with n players, each owning a single object. Alternatively, we can randomly distribute the objects between a smaller number of clusters. For the latter, it is important that the starting number of clusters is sufficiently high, or that there is some method of introducing new players, to counter the attrition of cluster numbers.

2.2 Attack Allocation

We define an attack as an attempt, by one cluster, to gain an object belonging to another cluster. A sequence of attacks by a single player is known as a turn. Turns are allocated to players according to their starting order; a set of turns (one for each player) forms a round. There are several ways of allocating attacks during a game. The single attack strategy limits each player to one attack per turn, whereas proportional allocation assigns attacks in proportion to the number of objects each cluster owns. Continue while winning allocates an extra attack to the current player for each successful attack; this can be used in conjunction with either of the above schemes.

2.3 Object Selection

This strategy identifies a victim for the attacking cluster. The random strategy chooses any point, provided it does not belong to the attacking cluster; probabilistic nearest neighbour selects closer objects with higher probability. The probabilities for the latter scheme can be calculated as a decreasing geometric series. These values can also be scaled relative to the defending cluster size, so that attacks are focused on smaller players.

2.4 Attack Resolution

When a cluster attacks an object, the resolution strategy determines whether the attack is successful. First, two "strengths" are calculated: one for the attacking cluster, and one for the defending cluster. These values quantify the plausibility of each cluster owning the object. Then, ownership of the object is changed with a probability based on these values:

P(successful attack) = Strength_attacker / (Strength_attacker + Strength_defender)

We define five strength measures for our clustering game.
These are based on the following assumptions: objects close to a cluster's centroid should probably belong to that cluster, strength should rapidly decrease as the distance from the cluster centroid increases, and the shape and spread of the attacking (or defending) cluster should have some influence on the strength.

The distance measure is a function of the distance between the attacked object and the relevant cluster centroid:

Strength_attacker = exp(1 - d(x_attacker, x_object))

where x_attacker is the centroid of the attacking cluster, x_object gives the object's coordinates, and d(x_attacker, x_object) is the Euclidean distance between the attacking cluster's centroid and the object. The defender's strength is calculated in a similar manner.

The relative distance strength removes the attacked object from the defending cluster before calculating the distance to the cluster centroid:

Strength_attacker = e - exp(d(x_attacker, x_object) / d_max)
Strength_defender = e - exp(d(x_defender-object, x_object) / d_max)

Here d_max is the maximum distance between any two points in the data set, and x_defender-object is the centroid of the defending cluster calculated as if the attacked object were not a cluster member.

Sum of squared distances within is based on the sum of the squared distances between cluster centroids and the clusters' objects. If the sum of the attacking and defending clusters' squared distances in the current arrangement is smaller than when the object is placed with the attacker, we want the probability of a successful attack to be low (as re-assigning the object would increase the overall spread of the two clusters). Thus, the strengths are calculated as follows:

Strength_attacker = Σ_{i in attacker} d(x_attacker, x_i)^2 + Σ_{i in defender} d(x_defender, x_i)^2
Strength_defender = Σ_{i in attacker+object} d(x_attacker+object, x_i)^2 + Σ_{i in defender-object} d(x_defender-object, x_i)^2

Multivariate normal density strength assumes that the cluster has a multivariate normal distribution (with the cluster's centroid and covariance values), and returns the density at the object's position:

Strength_attacker = 1 / ((2π)^(p/2) |Σ_attacker|^(1/2)) exp(-(x_object - x_attacker)^T Σ_attacker^(-1) (x_object - x_attacker) / 2)

where p is the number of dimensions and Σ_attacker is the covariance matrix for the attacking cluster. Figure 2-1 shows the multivariate normal density plot for the four cluster Ruspini [3] data set.

Figure 2-1: Multivariate density interpretation of the Ruspini data

Our final strength measure is based on gravitational attraction:

Strength_attacker = m_attacker m_object / d(x_attacker, x_object)^2

where m_attacker and m_object are the masses of the attacking cluster and the attacked object, respectively. We assign each object a mass of one unit, giving the attacking cluster a mass equal to its number of objects, and the attacked object a mass of one. The defence strength is calculated in a similar fashion.

We always want a cluster to have some probability of winning an object, so both the attacking and defending strengths must be non-zero. Additionally, some of the described functions may return undefined values. For example, if the defending cluster has only one object (which must then be the attacked object), the distance between this object and its cluster centroid is zero, so the gravitational strength would be infinite. The multivariate density strength is undefined if the cluster's covariance matrix is singular. We avoid undefined values by setting a default distance value, and by adapting the covariance information we do have. The default distance value is taken as the minimum non-zero distance between any two points in the data set. Unknown covariance values are calculated as the average of known values.

2.5 Stopping Criterion

The game is limited to a maximum number of rounds. However, this is not necessarily the best place to stop. Obviously, the game should stop when only one cluster remains. We also want the game to end when a stable clustering is found; we assume stability is reached when there are no successful attacks for several rounds of play.

3 Rule Set Evaluation

To judge the performance of our clustering game, we evaluate our algorithm over five sample data sets and compare the results to the known solutions. Figure 3-1 contains the five two-dimensional data sets we used; the correct clusters are shown in each case. The Ruspini [3] data comprises 75 objects arranged into four clusters. The other four data sets were designed to test the abilities of our algorithm. Sparse is a simple data set (25 objects, seven clusters) with well separated, small clusters; Dense has a greater number of objects (98) that form five compact clusters; Elliptical contains 75 objects divided between four clusters with elongated shapes; and Unstructured has 17 evenly spread objects.

Correctness is calculated as the percentage of objects correctly clustered with respect to the structure of the data set. If the number of clusters found exceeds that of the known solution, k, only objects in the best k clusters count toward the correctness. Testing started with a comparison of various rule sets on the Ruspini data, the aim being to find a few strategy combinations that produced acceptable results. We then selected several of these combinations and tested them on the remaining data sets to determine any limitations of the strategies. Each clustering game started with the objects divided randomly between ten clusters. In our experiments, we tested different combinations of attack allocation, object selection, and strength mechanism strategies.
In each case, the game was stopped after ten rounds of unsuccessful attacks, or upon reaching the maximum number of rounds. All experiments were performed on a Sun Sparc 5 workstation running SunOS under normal load. A non-linear additive feedback random number generator was used to return successive pseudo-random numbers in the range 0 to 2^31 - 1. The program was written in C and compiled using the GNU gcc compiler with maximum optimization options.

Figure 3-1: Experimental data sets: (a) Ruspini, (b) Sparse, (c) Dense, (d) Elliptical, (e) Unstructured

4 Results

The first rule set we tested was: random object selection, single attack allocation, and distance strength. Figure 4-1 shows the final clustering of the Ruspini data for these strategies. The objects of each of the correct clusters are shared between several players, and none of the initial players have been eliminated from the game. This rule set attained a correctness of 53.3% after 181 attacks. Changing the selection strategy to probabilistic nearest neighbour improved the correctness and decreased the execution time; the game ended after 478 attacks with a correctness of 7.7% (Figure 4-2). This strategy also retained the ten initial players. Figure 4-3 shows the correctness of the two selection schemes over the course of the game.

Next we compared the attack allocation strategies (Figure 4-4). The single attack strategy's final correctness was 7.7% after 478 turns, proportional allocation reached 65.3% correctness, and continue while winning (when starting with a single attack) was the most successful strategy, with a final correctness of 76.% after only 361 attacks. All ten players finished each of these games.

Figures 4-5 through 4-9 contain the final clusterings and the correctness plots for the five strength measures, and Table 4-1 summarises their performance. The games with the distance and multivariate normal density strengths both finished with all ten clusters (Figures 4-5, 4-8); in each case single clusters from the correct solution were shared between a number of players. These games ended after a small number of attacks. The relative distance and sum of squared distances within strengths resulted in too few clusters (Figures 4-6 and 4-7). The correctness plots show that both games did find the correct, or near correct, clustering, but the game continued (and the clustering worsened) after this point. The game with gravitational strength clustered the Ruspini data set correctly (Figure 4-9). At this stage we selected the distance, multivariate normal density, and gravitational strength measures as strategies of interest, and compared their performance on the remaining four data sets (Figures 4-10 through 4-13). Summaries of the clustering performance for these data sets are presented in Tables 4-2 through 4-5.

Figure 4-1: Clustering of Ruspini data using random object selection, single attack turns, and distance strength

Table 4-1: Strength measure performance for Ruspini data, with probabilistic nearest neighbour selection and single attack while winning allocation (rows: distance; relative distance; sum squared distances; multivariate normal density; gravity; columns: clusters found, correctness (%))

None of the games were able to cluster the Sparse data correctly, the main problem being the allocation of the three single-point clusters. Both the distance and density measures added these to nearby groups; the game using gravity strength did this for only one object (Figure 4-10). All of the clustering games performed well on the Dense data set (Figure 4-11); in fact, the gravity strength measure produced a perfect clustering, while the other measures found too many clusters. None of the strength measures was sufficient to correctly cluster the Elliptical data; there were too many clusters in all cases (Figure 4-12). The distance clustering divides the elliptical clusters into smaller spherical shapes, the multivariate density strength has overlapping clusters, and the gravity clustering has a few oddly clustered points. Our clustering game successfully clustered the Unstructured data set with both the distance and the gravitational strength measures (Figure 4-13); the number of attacks for the gravitational measure was considerably less than that for the distance strength. The multivariate density strength divided the objects between seven clusters.

Table 4-2: Strength measure performance for Sparse data with probabilistic nearest neighbour selection and single attack while winning allocation

Finally, we timed our clustering game for the Ruspini data, using the distance strength, single attack while winning turn allocation, and probabilistic nearest neighbour selection strategies. The game clustered the data set in an average of 6.69 seconds of CPU time (standard deviation 2.74); the average number of attacks was 485 (standard deviation 1683).

Figure 4-2: Clustering of Ruspini data using probabilistic nearest neighbour selection, with single attack turns and distance strength

Figure 4-3: Correctness versus number of attacks for random and probabilistic nearest neighbour selection, with single attack turns and distance strength, on Ruspini data

Table 4-3: Strength measure performance for Dense data with probabilistic nearest neighbour selection and single attack while winning allocation

5 Discussion

The performance of our clustering game is highly dependent on the choice of strength measure. Gravitational strength clustered three of the five data sets perfectly; one of the remaining sets had only one point clustered incorrectly. Both the relative distance and the sum of squared distances within measures were too aggressive, eliminating too many players from the game (and thus finding too few clusters in the data set). The games using these strengths found near perfect solutions earlier in their play. In contrast, the distance and multivariate normal density measures were not aggressive enough, and a number of players shared objects belonging to a single correct cluster. A strength cooling factor that reduces the strength of attacks as the game progresses, similar to the cooling schedule in simulated annealing, may help reduce the aggressiveness of certain strength measures. Another approach may be to introduce a method of dividing clusters when they become too large or too spread out. The non-aggressive measures may benefit from the addition of a merging step, which can join two clusters in a single attack step. This should be based on the relative position and spread of the two clusters.

Table 4-4: Strength measure performance for Elliptical data with probabilistic nearest neighbour selection and single attack while winning allocation

Figure 4-4: Correctness versus number of attacks for the single, proportional, and continue while winning attack allocation strategies, using probabilistic nearest neighbour selection and distance strength on the Ruspini data

Table 4-5: Strength measure performance for Unstructured data with probabilistic nearest neighbour selection and single attack while winning allocation

Of the strength measures, gravitational attraction was clearly the best, resulting in optimal, or near optimal, clusterings in all cases. As expected, the one data set this measure had difficulty with was the elliptical set, as the gravitational measure does not account for cluster shape. The distance strength clustered the unstructured data correctly, but was unable to find the correct number of clusters in the other sets. Games using the multivariate density measure allowed clusters to be elliptical in shape and to overlap; however, under this scheme the strength of a cluster rapidly approaches zero as we move away from the cluster centroid. This means the probability of a successful attack is low, and the game tends to divide the objects between too many clusters. The execution time of our clustering game was reduced by probabilistic nearest neighbour selection and by continue while winning turn allocation.

6 Conclusions

A novel game-based technique has been proposed for finding clusters in data. This approach treats clusters as players in a game, competing for object ownership. We found that this technique can successfully discover optimal, or near optimal, solutions for a range of data sets. However, the performance of the clustering game is highly dependent on the choice of game rules, particularly the strength measure used to resolve object ownership. The major advantage of this approach is that it offers a "natural" method of discovering the correct number of clusters in the data: the elimination of players during the game. The clustering game also recognises unstructured data as such. This method is a fast, simple, and, given suitable game strategies, accurate technique for recognising structure in data.

Figure 4-5: Performance of clustering game using distance strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Bibliography

1. Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, San Francisco, 1979.
2. John A. Hartigan. Clustering Algorithms. John Wiley and Sons, 1975.
3. Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, Inc., 1990.
4. Brian S. Everitt. Cluster Analysis. Halsted Press, third edition, 1993.

Acknowledgements

We thank our supervisors, Dr. Nick Spadaccini and Dr. Lyndon While, for their helpful suggestions.

Figure 4-6: Performance of clustering game using relative distance strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Figure 4-7: Performance of clustering game using sum of squared distances within strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Figure 4-8: Performance of clustering game using multivariate normal density strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Figure 4-9: Performance of clustering game using gravitational strength, with probabilistic nearest neighbour selection and single attack while winning, for Ruspini data ((a) final clustering; (b) correctness versus number of attacks)

Figure 4-10: Performance of strength measures for Sparse data, with probabilistic nearest neighbour selection and single attack while winning ((a) distance; (b) multivariate normal density; (c) gravitational)

Figure 4-11: Performance of strength measures for Dense data, with probabilistic nearest neighbour selection and single attack while winning ((a) distance; (b) multivariate normal density; (c) gravitational)

Figure 4-12: Performance of strength measures for Elliptical data, with probabilistic nearest neighbour selection and single attack while winning ((a) distance; (b) multivariate normal density; (c) gravitational)

Figure 4-13: Performance of strength measures for Unstructured data, with probabilistic nearest neighbour selection and single attack while winning ((a) distance; (b) multivariate normal density; (c) gravitational)


Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

of Perceptron. Perceptron CPU Seconds CPU Seconds Per Trial

of Perceptron. Perceptron CPU Seconds CPU Seconds Per Trial Accelerated Learning on the Connection Machine Diane J. Cook Lawrence B. Holder University of Illinois Beckman Institute 405 North Mathews, Urbana, IL 61801 Abstract The complexity of most machine learning

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2002 Paper 105 A New Partitioning Around Medoids Algorithm Mark J. van der Laan Katherine S. Pollard

More information

Supplementary text S6 Comparison studies on simulated data

Supplementary text S6 Comparison studies on simulated data Supplementary text S Comparison studies on simulated data Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath Corresponding author: shorvath@mednet.ucla.edu Overview In this document we illustrate

More information

Classifier C-Net. 2D Projected Images of 3D Objects. 2D Projected Images of 3D Objects. Model I. Model II

Classifier C-Net. 2D Projected Images of 3D Objects. 2D Projected Images of 3D Objects. Model I. Model II Advances in Neural Information Processing Systems 7. (99) The MIT Press, Cambridge, MA. pp.949-96 Unsupervised Classication of 3D Objects from D Views Satoshi Suzuki Hiroshi Ando ATR Human Information

More information

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented which, for a large-dimensional exponential family G,

More information

2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m.

2. Find the smallest element of the dissimilarity matrix. If this is D lm then fuse groups l and m. Cluster Analysis The main aim of cluster analysis is to find a group structure for all the cases in a sample of data such that all those which are in a particular group (cluster) are relatively similar

More information

Week 7 Picturing Network. Vahe and Bethany

Week 7 Picturing Network. Vahe and Bethany Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups

More information

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION 1 M.S.Rekha, 2 S.G.Nawaz 1 PG SCALOR, CSE, SRI KRISHNADEVARAYA ENGINEERING COLLEGE, GOOTY 2 ASSOCIATE PROFESSOR, SRI KRISHNADEVARAYA

More information

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract Clustering Sequences with Hidden Markov Models Padhraic Smyth Information and Computer Science University of California, Irvine CA 92697-3425 smyth@ics.uci.edu Abstract This paper discusses a probabilistic

More information

Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California

Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California Two-Dimensional Visualization for Internet Resource Discovery Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California 90089-0781 fshli, danzigg@cs.usc.edu

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

11/14/2010 Intelligent Systems and Soft Computing 1

11/14/2010 Intelligent Systems and Soft Computing 1 Lecture 8 Artificial neural networks: Unsupervised learning Introduction Hebbian learning Generalised Hebbian learning algorithm Competitive learning Self-organising computational map: Kohonen network

More information

BMVC 1996 doi: /c.10.41

BMVC 1996 doi: /c.10.41 On the use of the 1D Boolean model for the description of binary textures M Petrou, M Arrigo and J A Vons Dept. of Electronic and Electrical Engineering, University of Surrey, Guildford GU2 5XH, United

More information

Powered Outer Probabilistic Clustering

Powered Outer Probabilistic Clustering Proceedings of the World Congress on Engineering and Computer Science 217 Vol I WCECS 217, October 2-27, 217, San Francisco, USA Powered Outer Probabilistic Clustering Peter Taraba Abstract Clustering

More information

An Agent-Based Adaptation of Friendship Games: Observations on Network Topologies

An Agent-Based Adaptation of Friendship Games: Observations on Network Topologies An Agent-Based Adaptation of Friendship Games: Observations on Network Topologies David S. Dixon University of New Mexico, Albuquerque NM 87131, USA Abstract. A friendship game in game theory is a network

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of

More information

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM CHAPTER 4 K-MEANS AND UCAM CLUSTERING 4.1 Introduction ALGORITHM Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

The Application of K-medoids and PAM to the Clustering of Rules

The Application of K-medoids and PAM to the Clustering of Rules The Application of K-medoids and PAM to the Clustering of Rules A. P. Reynolds, G. Richards, and V. J. Rayward-Smith School of Computing Sciences, University of East Anglia, Norwich Abstract. Earlier research

More information

Forestry Applied Multivariate Statistics. Cluster Analysis

Forestry Applied Multivariate Statistics. Cluster Analysis 1 Forestry 531 -- Applied Multivariate Statistics Cluster Analysis Purpose: To group similar entities together based on their attributes. Entities can be variables or observations. [illustration in Class]

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Bumptrees for Efficient Function, Constraint, and Classification Learning

Bumptrees for Efficient Function, Constraint, and Classification Learning umptrees for Efficient Function, Constraint, and Classification Learning Stephen M. Omohundro International Computer Science Institute 1947 Center Street, Suite 600 erkeley, California 94704 Abstract A

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Football result prediction using simple classification algorithms, a comparison between k-nearest Neighbor and Linear Regression

Football result prediction using simple classification algorithms, a comparison between k-nearest Neighbor and Linear Regression EXAMENSARBETE INOM TEKNIK, GRUNDNIVÅ, 15 HP STOCKHOLM, SVERIGE 2016 Football result prediction using simple classification algorithms, a comparison between k-nearest Neighbor and Linear Regression PIERRE

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Automated Clustering-Based Workload Characterization

Automated Clustering-Based Workload Characterization Automated Clustering-Based Worload Characterization Odysseas I. Pentaalos Daniel A. MenascŽ Yelena Yesha Code 930.5 Dept. of CS Dept. of EE and CS NASA GSFC Greenbelt MD 2077 George Mason University Fairfax

More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information

Inital Starting Point Analysis for K-Means Clustering: A Case Study

Inital Starting Point Analysis for K-Means Clustering: A Case Study lemson University TigerPrints Publications School of omputing 3-26 Inital Starting Point Analysis for K-Means lustering: A ase Study Amy Apon lemson University, aapon@clemson.edu Frank Robinson Vanderbilt

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD CAR-TR-728 CS-TR-3326 UMIACS-TR-94-92 Samir Khuller Department of Computer Science Institute for Advanced Computer Studies University of Maryland College Park, MD 20742-3255 Localization in Graphs Azriel

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Analyse multivariable - mars-avril 2008 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

Satisfactory Peening Intensity Curves

Satisfactory Peening Intensity Curves academic study Prof. Dr. David Kirk Coventry University, U.K. Satisfactory Peening Intensity Curves INTRODUCTION Obtaining satisfactory peening intensity curves is a basic priority. Such curves will: 1

More information

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data

Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data EUSFLAT-LFA 2011 July 2011 Aix-les-Bains, France Cluster Tendency Assessment for Fuzzy Clustering of Incomplete Data Ludmila Himmelspach 1 Daniel Hommers 1 Stefan Conrad 1 1 Institute of Computer Science,

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

Lower Bounds for Insertion Methods for TSP. Yossi Azar. Abstract. optimal tour. The lower bound holds even in the Euclidean Plane.

Lower Bounds for Insertion Methods for TSP. Yossi Azar. Abstract. optimal tour. The lower bound holds even in the Euclidean Plane. Lower Bounds for Insertion Methods for TSP Yossi Azar Abstract We show that the random insertion method for the traveling salesman problem (TSP) may produce a tour (log log n= log log log n) times longer

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION

University of Maryland. fzzj, basili, Empirical studies (Desurvire, 1994) (Jeries, Miller, USABILITY INSPECTION AN EMPIRICAL STUDY OF PERSPECTIVE-BASED USABILITY INSPECTION Zhijun Zhang, Victor Basili, and Ben Shneiderman Department of Computer Science University of Maryland College Park, MD 20742, USA fzzj, basili,

More information

Worst-case running time for RANDOMIZED-SELECT

Worst-case running time for RANDOMIZED-SELECT Worst-case running time for RANDOMIZED-SELECT is ), even to nd the minimum The algorithm has a linear expected running time, though, and because it is randomized, no particular input elicits the worst-case

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty

Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty Michael Comstock December 6, 2012 1 Introduction This paper presents a comparison of two different machine learning systems

More information

COSC 6339 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2017.

COSC 6339 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2017. COSC 6339 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 217 Clustering Clustering is a technique for finding similarity groups in data, called

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Ben Karsin University of Hawaii at Manoa Information and Computer Science ICS 63 Machine Learning Fall 8 Introduction

More information

the number of states must be set in advance, i.e. the structure of the model is not t to the data, but given a priori the algorithm converges to a loc

the number of states must be set in advance, i.e. the structure of the model is not t to the data, but given a priori the algorithm converges to a loc Clustering Time Series with Hidden Markov Models and Dynamic Time Warping Tim Oates, Laura Firoiu and Paul R. Cohen Computer Science Department, LGRC University of Massachusetts, Box 34610 Amherst, MA

More information

Computer Experiments: Space Filling Design and Gaussian Process Modeling

Computer Experiments: Space Filling Design and Gaussian Process Modeling Computer Experiments: Space Filling Design and Gaussian Process Modeling Best Practice Authored by: Cory Natoli Sarah Burke, Ph.D. 30 March 2018 The goal of the STAT COE is to assist in developing rigorous,

More information

Modelling of non-gaussian tails of multiple Coulomb scattering in track fitting with a Gaussian-sum filter

Modelling of non-gaussian tails of multiple Coulomb scattering in track fitting with a Gaussian-sum filter Modelling of non-gaussian tails of multiple Coulomb scattering in track fitting with a Gaussian-sum filter A. Strandlie and J. Wroldsen Gjøvik University College, Norway Outline Introduction A Gaussian-sum

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Clustering Algorithms Contents K-means Hierarchical algorithms Linkage functions Vector quantization SOM Clustering Formulation

More information

2. CNeT Architecture and Learning 2.1. Architecture The Competitive Neural Tree has a structured architecture. A hierarchy of identical nodes form an

2. CNeT Architecture and Learning 2.1. Architecture The Competitive Neural Tree has a structured architecture. A hierarchy of identical nodes form an Competitive Neural Trees for Vector Quantization Sven Behnke and Nicolaos B. Karayiannis Department of Mathematics Department of Electrical and Computer Science and Computer Engineering Martin-Luther-University

More information

Clustering Using Elements of Information Theory

Clustering Using Elements of Information Theory Clustering Using Elements of Information Theory Daniel de Araújo 1,2, Adrião Dória Neto 2, Jorge Melo 2, and Allan Martins 2 1 Federal Rural University of Semi-Árido, Campus Angicos, Angicos/RN, Brasil

More information

Rearrangement of DNA fragments: a branch-and-cut algorithm Abstract. In this paper we consider a problem that arises in the process of reconstruction

Rearrangement of DNA fragments: a branch-and-cut algorithm Abstract. In this paper we consider a problem that arises in the process of reconstruction Rearrangement of DNA fragments: a branch-and-cut algorithm 1 C. E. Ferreira 1 C. C. de Souza 2 Y. Wakabayashi 1 1 Instituto de Mat. e Estatstica 2 Instituto de Computac~ao Universidade de S~ao Paulo e-mail:

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Isabelle Guyon Notes written by: Johann Leithon. Introduction The process of Machine Learning consist of having a big training data base, which is the input to some learning

More information

COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015.

COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015. COSC 6397 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 215 Clustering Clustering is a technique for finding similarity groups in data, called

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER A.Shabbir 1, 2 and G.Verdoolaege 1, 3 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Max Planck Institute

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Condence Intervals about a Single Parameter:

Condence Intervals about a Single Parameter: Chapter 9 Condence Intervals about a Single Parameter: 9.1 About a Population Mean, known Denition 9.1.1 A point estimate of a parameter is the value of a statistic that estimates the value of the parameter.

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction A Monte Carlo method is a compuational method that uses random numbers to compute (estimate) some quantity of interest. Very often the quantity we want to compute is the mean of

More information

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices Syracuse University SURFACE School of Information Studies: Faculty Scholarship School of Information Studies (ischool) 12-2002 Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework

An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework IEEE SIGNAL PROCESSING LETTERS, VOL. XX, NO. XX, XXX 23 An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework Ji Won Yoon arxiv:37.99v [cs.lg] 3 Jul 23 Abstract In order to cluster

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information