Flock by Leader: A Novel Machine Learning Biologically Inspired Clustering Algorithm

Abdelghani Bellaachia 1, Anasse Bari 1

1 The George Washington University, School of Engineering and Applied Sciences, Computer Science Department, nd Street NW, Washington DC 20052, USA
{bell, bari}@gwu.edu

Abstract. An April 2010 Nature research report announced that biological physicists had recently discovered a leadership pattern in flocks of pigeons. The most authoritative birds in a flock take the lead, and followers follow the leaders' directions. Pigeon leaders' roles vary over time. Following this discovery, made by zoologists at the University of Oxford and Eötvös University, we extend in this paper the flocking model widely used in computer science. We define a new biologically inspired clustering algorithm, entitled FlockbyLeader, that detects hierarchical leaders, discovers their followers, and enables them to flock based on local proximity in an artificial virtual space to create clusters. We offer empirical evidence that the algorithm outperforms both the existing flocking algorithm and the K-means algorithm. We analyze the performance of the algorithm on datasets widely used in the literature.

Keywords: Swarm Intelligence, Information Retrieval, Machine Learning, Data Mining, Social Networks Analysis, Bioinformatics.

1 Introduction

The long-standing mystery behind the phenomenon of flocking birds, and the question of whether leaders orchestrate the flock's logistics, has finally been resolved. Biological physicists from Oxford University and Eötvös University's Department of Zoology found that flying pigeons flock following an organized chain of instructions. The recently published Nature research report [1] revealed that GPS loggers fitted into backpacks carried by flocks of pigeons allowed bird scientists to find hierarchies within flocks.
It is now confirmed that certain flock members are authoritative over other birds. Dr. Biro of Oxford University's Department of Zoology states [1]: "We found that, whilst most birds have a say in decision-making, a flexible system of 'rank' ensures that some birds are more likely to lead and others to follow."

In computer science, flocking behavior is also known as Swarm Intelligence. Swarm Intelligence is the property of a system in which the collective behaviors of unsophisticated agents interacting locally with their environment cause coherent functional global patterns to emerge [3]. Flocking modeling was initially introduced by Craig Reynolds [5]. Reynolds termed the generic simulated flocking birds "boids". The behavior of each bird (boid) is described by three simple rules: separation, cohesion, and alignment. Separation allows a boid to keep a certain distance from its nearest flock-mates, cohesion permits a boid to join a local flock, and alignment enables a boid to move towards the average heading of local flock-mates. Flocking models have been used successfully in applications including, but not limited to, robotics and computer animation. X. Cui et al. [2] were among the first researchers in the Swarm Intelligence literature to apply flocking behavior to information retrieval. More recently, Bellaachia and Bari were the first in the Swarm Intelligence literature to introduce a flocking-based framework for community detection in dynamic social networks, where a social network is modeled as an artificial life [7].

Footnote 1: Corresponding author.

2 Motivation

The flocking model, also known as the Craig Reynolds model [5], introduced in 1987, lacks an important component described in the introduction: leadership in flock dynamics. The flocking clustering algorithm used in machine learning [2], [7], [13] relies on pair-wise proximity to find similar data points. The recent discovery mentioned in the introduction suggests mining leaders within the dataset. Thus, instead of a one-to-one proximity to discover similar data points, the algorithm performs a leader-to-many proximity by detecting local leaders and followers that will form subflocks. The existing algorithm in the literature relies on a set of predefined heuristics that can significantly affect the clustering results [13].
Our proposed algorithm is motivated by the following open questions on the existing flocking algorithm: (1) Is it possible to minimize the number of moves of agents (birds) and yet maintain relatively good clustering results? (2) Is it possible to make the algorithm parameter-free and make the maximum distance (d_max) a dynamic, adaptive threshold? In this paper, we incorporate the recently discovered leadership dynamics in pigeon flocks into the existing flocking model, and we introduce a new biologically inspired algorithm based on the extended model.

The rest of this paper is structured as follows: the third section presents a formal definition of a Swarm Clustering Framework, which serves as a clustering platform for several data mining applications that we have recently tackled in our research, including but not limited to microarray bioinformatics [8], social network analysis [7], and information retrieval [2]. The fourth section introduces the Flock by Leader algorithm; the fifth section presents the experimental results; and the last section provides the conclusion.

3 Swarm Clustering Framework

We define a multidisciplinary data mining framework that can be used in different clustering applications such as [8], [7] and [2]. We present the fundamental components that constitute a Swarm Clustering Framework, under which the FlockbyLeader algorithm will be defined in the next section.

A swarm network can be modeled using algebraic graph theory. Formally, a graph $G = (V, E)$ consists of a set of vertices $V$ and a set of edges $E$ containing unordered pairs of distinct vertices. The graph has no self-loops and is undirected if $(i, j) \in E \Leftrightarrow (j, i) \in E$. The scalar $|V|$ is referred to as the order of the graph, and $|E|$ is referred to as its size. Let $X = \{x_1, \dots, x_n\}$ be a set of heterogeneous data points to be clustered. We define a swarm clustering framework that consists of four main components: (1) Swarm Metric Space, (2) Swarm Virtual Space, (3) Agents Position Graph, and (4) Feature Similarity Graph. Consider the following definitions:

Definition 1 (Swarm Metric Space). A Swarm Metric Space is a metric space $(X, \rho)$ that consists of a set $X$ and a distance function $\rho$ that satisfies three properties of a metric: reflexivity, symmetry, and the triangle inequality. An instance of a Swarm Metric Space is $\rho$ defined as the Euclidean distance in a d-dimensional space:

$$\rho(x, y) = \sqrt{\sum_{k=1}^{d} (x_k - y_k)^2} \tag{1}$$

The swarm metric space is instantiated and defined by the user depending on the application, as Figure 1 shows.

Definition 2 (Swarm Virtual Space). The Swarm Virtual Space of a set $X$ is the Euclidean 2-dimensional space where the n data points are initially deployed at random. We refer to those points as agents. Every data point in $X$ is uniquely indexed by an agent. Agents in the virtual space move according to the flocking clustering algorithm that will be defined in the next sections. Let d_min be the minimum distance that an agent must keep to avoid collision with other agents in the virtual space. The swarm virtual space serves as a simplified visualization of the clusters in a 2-dimensional space.

Definition 3 (Agents Position Graph). The agents position graph is a weighted graph whose vertices are the agents and whose edges connect pairs of agents. Let $A^p = [a^p_{ij}]$ be its adjacency matrix. $A^p$ is a matrix of size $n \times n$ such that:

$$a^p_{ij} = \lVert p_i - p_j \rVert \tag{2}$$

where $p_i$ and $p_j$ are the position vectors of agents $i$ and $j$ in the swarm virtual space. The scalar $a^p_{ij}$ represents the distance between agent $i$ and agent $j$ in the swarm virtual space. The agents position graph maintains the positions of the agents in the virtual space at every step of the algorithm and will be used to extract the topology of the clusters generated by the algorithm.

Definition 4 (Feature Similarity Graph). The feature similarity graph maintains the similarity between the entities involved in the clustering process. We define the feature similarity graph to be the weighted graph whose vertices are the data points and whose edges connect pairs of data points. Let $A^f = [a^f_{ij}]$ be its adjacency matrix. $A^f$ is a matrix of size $n \times n$ such that:

$$a^f_{ij} = \rho(f_i, f_j)$$

where $f_i$ and $f_j$ are the feature vectors of data points $i$ and $j$. The function $\rho$ is the distance of the swarm metric space, which defines the similarity between nodes $i$ and $j$. The feature similarity graph drives the movements of the agents in the virtual space.

We define a Swarm Clustering Framework to be the quadruple that consists of a metric space, a position graph, a feature similarity graph, and a swarm virtual space. A flocking algorithm under the framework is a graph transformation process that transforms an ambiguous structure of heterogeneous entities into a partitioned structure.

4 Flock by Leader Clustering Algorithm

We present in this section the Flock by Leader clustering algorithm as an extension of the flocking algorithm known from [2] and [13]. For a better understanding of the work presented in this paper, we refer the reader to the summary of the flocking model, known as the Reynolds model, presented in [2] and [13].

4.1 Enhanced Flocking Model

The enhanced Reynolds model we introduce in this section aims to (1) minimize the moves of the agents in the virtual space, and (2) make the process parameter-free in terms of both the number of iterations and the predefined maximum distance. The enhanced model uses the same flocking rules as the Reynolds model. However, instead of processing every boid and finding its neighbors, the enhanced model analyzes the data and discovers potential leaders.
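To make the baseline concrete, the following is a minimal sketch of one update step of the original Reynolds rules (separation, cohesion, alignment) driven by the pairwise distances of the agents position graph. It is an illustration under stated assumptions, not the implementation used in the paper: the function names `position_graph` and `reynolds_step`, the rule weights, the time step `dt`, and the default values of `d_max` and `d_min` are all hypothetical.

```python
import numpy as np

def position_graph(pos):
    """Pairwise Euclidean distances between agent positions (n x 2 array)."""
    diff = pos[:, None, :] - pos[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def reynolds_step(pos, vel, d_max=5.0, d_min=1.0, weights=(1.0, 0.1, 0.1), dt=1.0):
    """One synchronous update of all boids with the three Reynolds rules.

    d_max, d_min, weights and dt are illustrative values (assumptions),
    not parameters taken from the paper.
    """
    w_sep, w_coh, w_ali = weights
    dist = position_graph(pos)
    new_vel = vel.copy()
    idx = np.arange(len(pos))
    for i in idx:
        # Neighbors are all other boids within the fixed maximum distance.
        neighbors = (dist[i] < d_max) & (idx != i)
        if not neighbors.any():
            continue
        # Cohesion: steer toward the centroid of local flock-mates.
        cohesion = pos[neighbors].mean(axis=0) - pos[i]
        # Alignment: steer toward the average heading of local flock-mates.
        alignment = vel[neighbors].mean(axis=0) - vel[i]
        # Separation: steer away from flock-mates closer than d_min.
        too_close = neighbors & (dist[i] < d_min)
        if too_close.any():
            separation = (pos[i] - pos[too_close]).sum(axis=0)
        else:
            separation = np.zeros_like(pos[i])
        new_vel[i] = vel[i] + w_sep * separation + w_coh * cohesion + w_ali * alignment
    return pos + dt * new_vel, new_vel
```

Note that every boid is processed against every neighbor and that `d_max` is a single global constant; these are the two aspects the enhanced model sets out to change.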
For every leader, the enhanced model finds the corresponding followers that will flock under their leader's directions. In the Reynolds model, the maximum distance is predefined and assigned to all boids. Boids within the maximum distance from a boid are considered its neighbors. In the enhanced model, the maximum distance is relative to the leader. In Figure 1 (right), leader (a) and leader (b) have different distances that define their local neighbors. This observation is inspired by pigeon leadership dynamics, where the leaders' distances differ from one another. In the next sections we explain how to find leaders and associate a maximum distance with them. In Figure 1 (right), a leader and its followers flock following the flocking rules (cohesion, alignment, and separation). The moves are minimized as opposed to the original model shown in Figure 1 (left): instead of moving every boid to every other neighbor, we migrate the neighbors to their corresponding leaders.

Fig. 1. Reynolds Model (left) and Enhanced Model (right)

Flock by Leader Algorithm

In every iteration, the algorithm starts by finding flock leaders. Then, for every flock leader and its associated distance, the algorithm finds the leader's corresponding followers. The method that finds leaders and calculates their corresponding distances is shown in the next section. Once a leader is identified, its follower agents in the virtual space perform a flocking behavior and follow their leader. The followers are then marked as visited in the feature graph and are excluded from the flocking process in subsequent iterations. The leaders of every subflock are sent back to the virtual space as subflock representatives.

Input: the swarm clustering framework
Returns: the new position graph
1. While there are still nodes in the feature graph that have not been visited Do
   1.1 LeadersList = FindFlockLeaders(framework)
   1.2 For each Agent that is a neighbor of a leader L_i in LeadersList, within the distance of L_i
          Flock(Agent, L_i)
          Agent.visited = true
          Agent.leader = L_i
       End for
       Update(position graph)
   End of do while

Remark 1. Figure 2 gives an illustrative example of the aerobatics of agents in the virtual space following the FlockbyLeader algorithm. Every input data point in X (the set of data points to be clustered) is uniquely indexed by an agent in the virtual space. In (a), unvisited (blue) agents are randomly deployed in the virtual space. In (b), six flock leaders (green) are detected, and in (c) their corresponding followers start flocking under their leaders' direction in accordance with the flocking rules (alignment, separation, cohesion). Panel (d) illustrates the beginning of another iteration of the algorithm: agent#1 and agent#5, which were leaders in the previous iteration (c), change roles into followers (yellow), while agent#3 and agent#4, also leaders in (c), become outliers (gray). In (e) the flocking process continues: agent#1 and its followers join the subflock of agent#2 (leader), and agent#5 and its follower join the subflock of agent#6.

Fig. 2. Aerobatics of Agents

It is important to note the following points, as shown in Figure 2. In every iteration a node can be an unvisited node, a leader, a follower, or an outlier. A follower node is set as visited, and its leader will serve as a representative in the next iteration. The visited node will be excluded from the flocking process in the next iteration. A node that was a leader at one iteration might stay a leader at the next iteration, might become a follower of a more highly ranked leader, or might become an outlier, as explained in the next section. The question arises of how to distinguish between a leader, a follower, and an outlier. The following section illustrates our approach.

4.2 Mining Flock Leaders as Initial Cluster Centroids

We rely on neighborhood and reverse-neighborhood analysis to find potential flock leaders. The analysis is similar to the neighborhood and reverse-neighborhood approach described in [6], [10] and [11]. The main difference is that the notion of neighborhood in the swarm framework is dynamic: during each iteration of the flocking process, every agent's neighborhood changes depending on the flocking behavior of previous iterations. Let X be a dataset to be clustered, and let $\rho(x, y)$ be a given distance function between objects $x$ and $y$. Each object $x$ has a set of nearest neighbors at iteration $t$; $x$ is a node in the feature graph and has a corresponding agent deployed in the virtual space. We adopt the definitions from [10] and apply them to the swarm framework as follows:

Definition 5 (Dynamic k-neighborhood, DkNB). The k-neighborhood of $x$ at iteration $t$, denoted $DkNB_t(x)$, is the set of data points that lie within a circle with $x$ as its center and, as its radius, the distance associated with leader $x$ at iteration $t$.

Definition 6 (Dynamic Reverse k-neighborhood, DR-kNB). The reverse k-neighborhood of $x$ at iteration $t$, denoted $DR\text{-}kNB_t(x)$, is the set of data points whose $DkNB_t$ sets contain $x$.

The ratio $|R\text{-}kNB(x)| / |kNB(x)|$ has been widely used in the neighborhood-based clustering literature [10] to determine which points are dense, even, or sparse. Several factors have been introduced, such as the neighborhood density factor (NDF) and the structural role index (SRI), recently introduced in [10] and [11].
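The neighborhood and reverse-neighborhood sets, and the ratio between their sizes, can be sketched as follows. This is a simplified, static illustration and not the paper's dynamic procedure: it takes exactly the k nearest points rather than all points within a leader-specific radius, it uses the Euclidean distance, and the names `k_neighborhoods` and `neighborhood_ratio` are hypothetical. A dynamic variant would recompute these sets over the unvisited agents at every iteration.

```python
import numpy as np

def k_neighborhoods(X, k):
    """Return (knb, rknb): k-nearest-neighbor index sets and their reverse sets.

    X is an (n, d) array with n > k. Simplifying assumption: each point gets
    exactly k neighbors under the Euclidean distance.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    order = np.argsort(dist, axis=1)[:, :k]
    knb = [set(row.tolist()) for row in order]
    # Reverse neighborhood of j: every point i that lists j as a neighbor.
    rknb = [set() for _ in range(n)]
    for i, nbrs in enumerate(knb):
        for j in nbrs:
            rknb[j].add(i)
    return knb, rknb

def neighborhood_ratio(X, k):
    """Ratio |R-kNB(x)| / |kNB(x)| for every point in X."""
    knb, rknb = k_neighborhoods(X, k)
    return np.array([len(r) / len(nb) for nb, r in zip(knb, rknb)])

# Four points on a line: the last one is far from the rest.
points = [[0.0], [1.0], [2.1], [10.0]]
print(neighborhood_ratio(points, k=1))
```

In this toy example the isolated point receives a ratio of zero because no other point counts it as a neighbor, while the central point receives the largest ratio, which matches the intuition that centroid candidates have many reverse neighbors.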
We define the Dynamic Agent Role Factor of an agent at iteration $t$ to be:

(3)

(4)

Intuitively, a centroid of a cluster occupies the center position of a mass of associated data points. The larger the factor, the more objects lie close to the agent. The initial centroid candidates should have the most reverse k nearest neighbors. Specifically, if the factor of an agent meets the leader threshold at iteration $t$, the agent is a flock leader at iteration $t$; otherwise it is a follower. If the factor is close to zero, the agent is an outlier. We extend the agent role factor to introduce a local rank of an agent at iteration $t$:

(5)

where the first quantity is the number of the agent's neighbors at iteration $t$ and the second is the number of unvisited nodes at iteration $t$. The rank is used to sort the list of leaders. A leader of higher rank is given priority and is processed first (that is, its followers are found first).

5 Experiments and Results

5.1 Datasets

Two large datasets were used in our experiments. The first dataset consists of real news articles; details about the dataset can be found in [7]. The dataset consists of 100 news articles collected from cyberspace, which have been categorized by human experts into 12 clusters. We used the KNIME tool [9] to preprocess the news articles and convert the dataset into the keyword-document matrix that the Flock by Leader algorithm takes as input. The second dataset is the iris plant dataset. It contains 150 instances from three classes: Iris-virginica-class-1, Iris-versicolor-class-2, and Iris-setosa-class-3. There are fifty instances in each class. Each instance is described by four attributes. Details about the dataset can be found in [12].

5.2 Evaluation Methodology

We use the F-measure as the quality measure. The F-measure combines the information retrieval precision and recall into a single average. Each cluster is considered as if it were the result of a query, and each class as if it were the desired set of documents for a query. We then calculate the recall and precision of that cluster for each given class. The F-measure of cluster $j$ (retrieved) and class $i$ (known) is defined as follows:

$$\mathrm{Recall}(i, j) = \frac{n_{ij}}{n_i}, \qquad \mathrm{Precision}(i, j) = \frac{n_{ij}}{n_j}, \qquad F(i, j) = \frac{2 \cdot \mathrm{Precision}(i, j) \cdot \mathrm{Recall}(i, j)}{\mathrm{Precision}(i, j) + \mathrm{Recall}(i, j)} \tag{6}$$

where $n_{ij}$ is the number of members of class $i$ in cluster $j$, $n_j$ is the size of cluster $j$, and $n_i$ is the size of class $i$.

5.3 Results

Using the evaluation method described in the previous section, we compare the performance of the FlockbyLeader algorithm against the flocking-based clustering algorithm mentioned in [3] and against K-means. Table 1 presents the results of running the FlockbyLeader algorithm on both the real news articles dataset and the Iris dataset. We compare our results with the results reported in [12] and [2] on the same datasets.
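As a concrete reading of the evaluation methodology in Section 5.2, the sketch below computes precision, recall, and the pairwise F-measure for every class and cluster, then aggregates them into a single score. The aggregation (a class-size-weighted average of each class's best-matching cluster) is the convention commonly used for clustering F-measures and is an assumption here, since the surviving text does not state how the per-pair values are combined; the function name `f_measure` is likewise hypothetical.

```python
from collections import Counter

def f_measure(classes, clusters):
    """Overall F-measure of a clustering against known class labels.

    classes[k] is the known class of item k, clusters[k] its assigned cluster.
    Aggregation over classes is an assumed convention (see lead-in).
    """
    n = len(classes)
    n_i = Counter(classes)                 # class sizes
    n_j = Counter(clusters)                # cluster sizes
    n_ij = Counter(zip(classes, clusters)) # class/cluster overlaps
    total = 0.0
    for i, size_i in n_i.items():
        best = 0.0
        for j, size_j in n_j.items():
            overlap = n_ij.get((i, j), 0)
            if overlap == 0:
                continue
            recall = overlap / size_i
            precision = overlap / size_j
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (size_i / n) * best
    return total

# A perfect clustering and a degenerate single-cluster result.
labels = ["a", "a", "b", "b"]
print(f_measure(labels, [0, 0, 1, 1]))
print(f_measure(labels, [0, 0, 0, 0]))
```

The second call illustrates why precision matters: putting everything in one cluster gives perfect recall for every class but halves the precision, so the score drops well below that of the perfect clustering.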
Table 1 shows that FlockbyLeader has the largest F-measure values compared to both the flocking algorithm and K-means. The algorithm needed 4 iterations to converge. This is a significant improvement on the existing flocking algorithm, where the total number of iterations was 300 [2]. FlockbyLeader achieved a 98.66% reduction in the number of iterations of the flocking process, a 7.5% increase in precision and recall (F-measure) over the existing flocking algorithm, and an average 16.5% increase in precision and recall over K-means on both datasets. Figure 3 shows snapshots of the virtual space on both datasets at initialization and after running the algorithm.

Table 1. F-measure Evaluation Results.

Dataset         Algorithm       Number of Clusters   F-measure
News Articles   Flocking
News Articles   K-means         12 (k=12)
News Articles   FlockbyLeader
Iris Dataset    K-means         3 (k=3)              …7811
Iris Dataset    FlockbyLeader   3                    0.8076

Fig. 3. The process of running the Flock by Leader algorithm on the Iris dataset

7 Conclusion

In this paper we presented a simple, biologically inspired clustering algorithm. FlockbyLeader incorporates a new discovery about pigeons: leadership dynamics. Our algorithm is an enhancement of the existing flocking algorithm. The algorithm outperforms K-means on two large datasets. Our future work will include running the algorithm on different datasets.

8 References

1. Nagy, M., Z. Akos, D. Biro, and T. Vicsek. Hierarchical group dynamics in pigeon flocks. Nature 464 (2010).
2. X. Cui, J. Gao and T. E. Potok, A Flocking Based Algorithm for Document Clustering Analysis, Journal of System Architecture, June 2006.
3. Vladimir G. Red'ko, Artificial Life Evolutionary Models.
4. E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm intelligence: from natural to artificial systems, Oxford University Press.
5. Craig W. Reynolds, Flocks, Herds, and Schools: A Distributed Behavioral Model, Computer Graphics, 21(4), July 1987.
6. S. Zhou, Y. Zhao, J. Guan, and J.Z. Huang, A Neighborhood-Based Clustering Algorithm, in Proc. PAKDD, 2005.
7. Bellaachia, A.; Bari, A., SFLOSCAN: A biologically-inspired data mining framework for community identification in dynamic social networks, Swarm Intelligence (SIS), 2011 IEEE Symposium on, pp. 1-8, April 2011.
8. Bellaachia, A.; Bari, A., A Flocking Based Data Mining Algorithm for Detecting Outliers in Cancer Gene Expression Microarray Data, in Proc. IEEE International Conference on Information Retrieval and Knowledge Management, CAMP12.
9. M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kotter, T. Meinl, P. Ohl, K. Thiel, and B. Wiswedel. KNIME - the Konstanz Information Miner: version 2.0 and beyond. SIGKDD Explor. Newsl., 11(1):26-31.
10. Y. Ye, J.Z. Huang, X. Chen, S. Zhou, G.J. Williams, and X. Xu, "Neighborhood Density Method for Selecting Initial Cluster Centers in K-Means Clustering", in Proc. PAKDD, 2006.
11. J. Ding, R. Ma, J. Yang, and S. Chen, "A tree-structured framework for purifying "complex" clusters with structural roles of individual data", presented at Pattern Recognition, 2010.
12. Guillet, F., G. Ritschard, D.A. Zighed and H. Briand (eds) (2010) Advances in Knowledge Discovery and Management, Series: Studies in Computational Intelligence, Vol. 292, Berlin: Springer.
13. Bellaachia, A.; X. He, An Artificial Life Based Data Mining Algorithm, Swarm Intelligence IEEE, 2006.
