A Comparison of Pattern-Based Spectral Clustering Algorithms in Directed Weighted Network


Sumuya Borjigin

1. School of Economics and Management, Inner Mongolia University, No. 235 West College Road, Hohhot, Inner Mongolia, P.R.C.
2. Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences, No. 55 Zhongguancun East Road, Haidian District, Beijing City, P.R.C.

Abstract. Pattern-based spectral clustering algorithms for directed weighted networks have important applications in many domains, including computer science, electronic commerce, economics and finance. In this paper, we compare several of the best-known algorithms in terms of clustering quality on existing benchmark data sets. The experimental results show that a more general pattern-based spectral clustering algorithm for directed weighted networks is still needed.

Keywords: Pattern-based cluster structure, Spectral clustering, Comparison

1 Introduction

Networks have become ubiquitous, as data from many different disciplines can be naturally mapped to network structures such as technological, biological and social networks. A cluster in a directed weighted network can be considered as a set of nodes that share common or similar characteristics. Clusters in directed weighted networks fall into two categories: density-based clusters and pattern-based clusters. Density-based clusters follow the traditional clustering definition based on edge density: a cluster is a group of nodes with more intra-cluster edges than inter-cluster edges.
Although density-based clusters are the most common and well-studied notion of a cluster in both directed and undirected networks, they cannot capture clustering structures more sophisticated than the classical cohesive groups, where edge density may not be the main clustering criterion. The second category is pattern-based clusters, in which the nodes of a directed weighted network are grouped according to similar connectivity patterns that density criteria alone do not fully capture. They represent structures

with interesting connectivity properties in directed weighted networks. Revealing the underlying pattern-based cluster structure of directed weighted networks has become a crucial and interdisciplinary topic with a plethora of application domains, including computer science, electronic commerce, economics and finance. There are different types of pattern-based cluster structures in directed weighted networks; two interesting and commonly used patterns are citation-based clustering and flow-based clustering [1].

Spectral clustering is effective and easy to implement, and it also offers an optimization framework for the modularity function. For this reason, it has been used to detect pattern-based cluster structure in directed weighted networks. Fan Chung considered Laplacians for directed graphs and examined their eigenvalues, introduced the notion of a circulation in a directed graph and its connection with the Rayleigh quotient, and then defined a Cheeger constant and established the Cheeger inequality for directed graphs. These relations can be used to deal with various problems that often arise in the study of non-reversible Markov chains, including bounding the rate of convergence and deriving comparison theorems [2]. David Gleich examined the generalization of the graph Laplacian due to Fan Chung. He showed that Fan Chung's generalization reduces to examining one particular symmetrization of the adjacency matrix of a directed graph. From this result, the directed Cheeger bounds follow trivially. Additionally, Gleich implemented directed hierarchical spectral clustering and examined its benefits empirically on a data set from Wikipedia. Finally, he examined a set of competing heuristic methods on the same data set [3].
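As a concrete illustration of the directed Laplacian discussed above, the following minimal sketch builds it in the form usually attributed to [2]: L = I - (Φ^{1/2} P Φ^{-1/2} + Φ^{-1/2} P^T Φ^{1/2})/2, where P is the random-walk transition matrix and Φ is the diagonal matrix of its stationary distribution. The function name chung_laplacian and the three-node cycle are illustrative assumptions, not taken from the cited works, and the sketch assumes a strongly connected graph so that the stationary distribution is unique and positive.

```python
import numpy as np


def chung_laplacian(W):
    """Symmetric directed Laplacian of a strongly connected weighted digraph.

    W[i, j] is the weight of edge i -> j. Hypothetical helper for illustration.
    """
    W = np.asarray(W, dtype=float)
    # Row-normalize to obtain the random-walk transition matrix P.
    P = W / W.sum(axis=1, keepdims=True)
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    phi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    phi = phi / phi.sum()  # also fixes an arbitrary sign
    sqrt_phi = np.sqrt(phi)
    # S = Phi^{1/2} P Phi^{-1/2}; the Laplacian uses its symmetric part.
    S = (sqrt_phi[:, None] * P) / sqrt_phi[None, :]
    return np.eye(len(W)) - (S + S.T) / 2.0


# Directed 3-cycle 0 -> 1 -> 2 -> 0 with unit weights (assumed toy input).
W = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
L = chung_laplacian(W)
eigenvalues = np.linalg.eigvalsh(L)
print(np.round(eigenvalues, 6))
```

The term (S + S.T)/2 is a symmetrization of the scaled transition matrix, which is one way to read the observation in [3] that the directed construction reduces to a symmetrized, undirected one.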
Perrault-Joncas and Meila considered the problem of embedding directed graphs in Euclidean space while retaining directional information. They modeled the observed graph as a sample from a manifold endowed with a vector field, and designed an algorithm that separates and recovers the features of this process: the geometry of the manifold, the data density and the vector field. The algorithm was motivated by their analysis of Laplacian-type operators and their continuous limit as generators of diffusions on a manifold. They illustrated the recovery algorithm on both artificially constructed and real data [4]. Bauer considered the normalized Laplace operator for directed graphs with positive and negative edge weights. This generalization of the normalized Laplace operator for undirected graphs was used to characterize directed acyclic graphs. Moreover, Bauer identified certain structural properties of the underlying graph with extremal eigenvalues of the normalized Laplace operator, and proved comparison theorems that establish a relationship between the eigenvalues of directed graphs and those of certain undirected graphs. This relationship was used to derive eigenvalue estimates for directed graphs. Finally, Bauer introduced the concept of neighborhood graphs for directed graphs and used it to obtain further eigenvalue estimates [5]. Given a directed graph in which some of the nodes are labeled, Zhou, Schölkopf and Hofmann investigated how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that end,

they proposed a regularization framework for functions defined over the nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful yet computationally simple classification algorithm was derived within the proposed framework. The experimental evaluation on real-world web classification problems demonstrated encouraging results that validate the approach [6]. Pentney and Meila [7], Huang, Zhu and Schuurmans [8], Meila and Pentney [9], Zhou and Burges [10], Chen, Liu and Tang [11], Mirzal and Furukawa [12], Mantrach et al. [13], Li and Zhang [14], Satuluri and Parthasarathy [15], Long, Wu and Zhang [16], Rosvall and Bergstrom [17] and Lai, Lu and Nardini [18] also studied pattern-based spectral clustering in directed weighted networks.

Despite the large number of papers on spectral clustering methods for pattern-based cluster structures in directed weighted networks, no systematic comparison among the existing algorithms has been published so far. This motivates the comparison presented here. We compare the performance of the existing pattern-based spectral clustering algorithms for directed weighted networks on several kinds of benchmark data sets. We hope to find out which kinds of patterns each of these spectral clustering algorithms is suited to detecting, which kinds it fails to discover, how the informative eigenvectors are chosen, and how the number of clusters relates to the eigenvalues of the spectral clustering matrices.

The paper is organized as follows. Section 2 presents the pattern-based spectral clustering algorithms for directed weighted networks. Section 3 introduces the data sets that contain pattern-based cluster structures. Section 4 reports the experiments and Section 5 concludes.
2 Pattern-based Spectral Clustering Algorithms in Directed Weighted Network

Similar to classical spectral clustering algorithms for undirected networks, the pattern-based spectral clustering algorithms for directed weighted networks considered here consist of the following stages:

The first stage: Construct the spectral clustering matrix from the asymmetric similarity matrix;
The second stage: Map the original data set according to the informative eigenvectors of the spectral clustering matrix;
The third stage: Cluster the mapped data set with a simple clustering algorithm.

The main difference among the pattern-based spectral clustering algorithms for directed weighted networks proposed in related works is the spectral clustering matrix constructed in the first stage. Different spectral clustering matrices lead to different partition results. We compare the performance of the 9 pattern-based spectral clustering algorithms listed in Table 1. Most of the algorithms we selected are the

ones given in the related articles. For completeness of the comparison, we also present some additional pattern-based spectral clustering algorithms for directed weighted networks.

Table 1. Pattern-based spectral clustering algorithms in directed weighted network

No.  Algorithm             Reference     Spectral clustering matrix
1    FC-1                  [2]           I - (Z A F + F A^T Z)/2
2    FC-2                  [this paper]  π - (Z A F + F A^T Z)/2
3    FC-3                  [this paper]  (Z A F + F A^T Z)/2
4    FC-Com                [2]           π - (π A + A^T π)/2
5    Random Walk           [15]          (π A + A^T π)/2
6    Bibliometric          [15]          (A A^T + A^T A)/2
7    Degree-Discounted     [15]          (F A F A^T F + F A^T F A F)/2
8    Asymmetric                          A
9    FC-Com+Bibliometric   [this paper]  π - (π A + A^T π)/2 + (A A^T + A^T A)/2

In Table 1, A is the asymmetric similarity matrix, A^T is its transpose, I is the identity matrix of the same size as A, and π is the diagonal matrix whose diagonal elements are the sums of the corresponding columns of A. For the nonzero elements of π, Z = π^{1/2} and F = π^{-1/2}.

3 Data sets

In this section, the 9 pattern-based spectral clustering algorithms listed in Table 1 are compared on 3 citation-based benchmark data sets and 3 flow-based benchmark data sets, shown in Figures 1 to 6, respectively.

Fig. 1. Citation-based cluster: Cite-4 [1], [2], [15].

The data set in Figure 1 forms two citation-based clusters. The most interesting point in this case is that the nodes of the graph that are clustered together

Fig. 2. Citation-based cluster: Cite-8 [1], [2], [15].
Fig. 3. Citation-based cluster: Cite-Long [16].

do not have an edge between them. In this case, the data set consists of 2 clusters: nodes 1 and 3 form one cluster, and nodes 2 and 4 form the other. Their similarity emanates from co-citation: the two nodes of the right cluster are pointed to by the same group of nodes. Consider a citation network where nodes correspond to scientific papers and a directed edge from paper 1 to paper 2 means that the first paper cites the second. Although papers 1 and 3 do not share an edge, they form a natural cluster since they both cite papers 2 and 4, and they probably belong to the same scientific topic [1].

A similar example of citation-based cluster structure appears in the second data set, shown in Figure 2. In this case, the data set consists of 3 clusters. Nodes 4 and 5 form a cluster, since they have out-links to the same nodes 1, 2 and 3, while at the same time having in-links from the same group of nodes 6, 7 and 8. This structure is a common situation in directed graphs. Here the clustering features correspond to the common neighbors in the graph, and nodes are clustered together if they share common neighbors [1].

The undirected version of the data set Cite-Long was constructed in reference [16]. Here, we add directions to the edges to construct a directed network that contains citation-based cluster structure. The data set contains 4 clusters: nodes 1, 2, 3 and 4 form a cluster, nodes 5, 6, 7 and 8 form a cluster, nodes 9, 10, 11 and 12 form a cluster, and nodes 13, 14, 15 and 16 form a cluster.

Fig. 4. Flow-based cluster: Flow-16 [1], [17], [18].
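The co-citation argument for Cite-4 can be traced through the three stages of Section 2 with a small numerical sketch. It is illustrative only: it assumes the unit-weight edge set 1→2, 1→4, 3→2, 3→4 suggested by the description of Cite-4 (the figure itself is not reproduced here), writes the Bibliometric matrix as (A A^T + A^T A)/2, takes the eigenvectors of the two largest eigenvalues as the informative ones, and replaces the simple clustering algorithm of the third stage with a hypothetical helper, group_rows, that groups identical embedded rows.

```python
import numpy as np

# Assumed Cite-4 edge set (1-based in the text, 0-based here):
# papers 1 and 3 both cite papers 2 and 4, with unit weights.
A = np.zeros((4, 4))
for src, dst in [(0, 1), (0, 3), (2, 1), (2, 3)]:
    A[src, dst] = 1.0

# Stage 1: Bibliometric matrix. A @ A.T counts shared references
# (bibliographic coupling); A.T @ A counts shared citers (co-citation).
B = (A @ A.T + A.T @ A) / 2.0

# Stage 2: embed the nodes with the eigenvectors of the two largest
# eigenvalues. np.linalg.eigh returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(B)
embedding = eigenvectors[:, -2:]


def group_rows(X, tol=1e-8):
    """Hypothetical stand-in for the simple clustering step: a row closer
    than `tol` to an earlier group's representative gets that group's label."""
    labels, representatives = [], []
    for row in X:
        for label, rep in enumerate(representatives):
            if np.linalg.norm(row - rep) < tol:
                labels.append(label)
                break
        else:
            labels.append(len(representatives))
            representatives.append(row)
    return labels


# Stage 3: cluster the embedded points.
labels = group_rows(embedding)
print(labels)
```

In this toy case B links nodes 1 and 3, and nodes 2 and 4, even though A has no edge inside either pair, so the embedded rows of each pair coincide. On real data the third stage would normally use a proper clustering algorithm such as k-means rather than exact row matching.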

Fig. 5. Flow-based cluster: Flow-Small [1], [17], [18].

The main characteristic of the networks in Figures 4 to 6 is that the edges form patterns of flow among nodes. In other words, the local interactions in the network, combined with the edge directionality, induce a flow of information among the entities, and the clustering structure therefore depends on how information flows. A cluster or community in the network then corresponds to a group of nodes in which the flow is larger than the flow leaving the group. For a user performing a random walk on the graph, a flow-based community is a group of nodes in which a random surfer is more likely to be trapped than to move out of the group [1].

The data set in Figure 4 consists of 16 nodes, which can be partitioned into 4 clusters, each forming a circle structure. Nodes 1 to 4 form a cluster, nodes 5 to 8 form a cluster, nodes 9 to 12 form a cluster, and nodes 13 to 16 form a cluster. The data sets in Figure 5 and Figure 6 each consist of 36 nodes, which can be partitioned into 6 clusters, each forming a circle structure. Nodes 1 to 6 form a cluster, nodes 7 to 12 form a cluster, and the remaining 24 nodes form the other four clusters. The difference between the data sets Flow-Small and Flow-Large lies in the edge weights: the weights in Flow-Large are significantly larger than the corresponding weights in Flow-Small.

4 Experiments

4.1 Experimental Results

The detection results for pattern-based cluster structures in directed weighted networks are shown in Table 2. In Table 2, if an algorithm successfully reveals the

corresponding pattern-based cluster structure, the entry is marked "yes"; otherwise it is marked "no". The numbers in parentheses after "yes" are the informative eigenvectors.

Fig. 6. Flow-based cluster: Flow-Large [1], [17], [18].

Table 2. Pattern-based cluster structures detection results in directed weighted networks

Algorithm               Cite-4      Cite-8      Cite-Long    Flow-16      Flow-Small  Flow-Large
1 FC-1                  no          no          no           yes (1:4)    yes (1:6)   no
2 FC-2                  no          no          no           no           no          no
3 FC-3                  no          no          no           no           no          no
4 FC-Com                no          yes (7:8)   no           yes (1:4)    yes (1:6)   no
5 Random-Walk           no          yes (8)     no           yes (13:16)  no          no
6 Bibliometric          yes (3:4)   yes (6:8)   yes (13:16)  no           no          no
7 Degree-Discounted     no          no          no           no           no          no
8 Asymmetric            yes (3:4)   no          no           no           no          no
9 FC-Com+Bibliometric   yes (2&4)   yes (6:8)   no           yes (1:4)    no          no

From Table 2 we find that spectral clustering based on the matrices Bibliometric, Asymmetric and FC-Com+Bibliometric successfully detects the citation-based cluster structure of the data set Cite-4. Spectral clustering based on the matrices FC-Com, Random-Walk, Bibliometric and FC-Com+Bibliometric successfully detects the citation-based cluster structure of the data set Cite-8. The citation-based cluster structure of the data set Cite-Long is detected only with the spectral clustering matrix Bibliometric. Spectral clustering based on the matrices FC-1, FC-Com, Random-Walk and FC-Com+Bibliometric successfully detects the flow-based cluster structure of the data set Flow-16. Spectral clustering based on the matrices FC-1 and FC-Com successfully detects the flow-based cluster structure of the data set Flow-Small. None of the 9 spectral clustering matrices detects the flow-based structure of the data set Flow-Large.

4.2 Discussions

To compare the performance of the existing pattern-based spectral clustering methods, we ran experiments on two categories of benchmark data sets: the first category contains citation-based cluster structure, and the second contains flow-based cluster structure. We reach the following conclusions:

(1) No single spectral clustering matrix detects the pattern-based cluster structure of all the benchmark data sets. FC-1 detects the flow-based cluster structure of the data sets Flow-16 and Flow-Small, and Bibliometric detects the citation-based cluster structure of all 3 citation-based data sets. FC-Com, Random-Walk and FC-Com+Bibliometric each discover both kinds of pattern-based cluster structure, but only in some of the data sets. The algorithms perform differently on the two data sets Flow-Small and Flow-Large, which have similar flow-based cluster structure but different edge weights.

(2) The informative eigenvectors of the spectral clustering methods mentioned above were selected manually. For classical spectral clustering methods in undirected networks, the informative eigenvectors are the k smallest eigenvectors of the normalized Laplacian, or the k largest eigenvectors of the normalized similarity matrix or the transition probability matrix, where k is the number of clusters.
For spectral clustering methods in directed weighted networks, however, the informative eigenvectors do not always consist of the k largest or smallest eigenvectors.

(3) The mechanism of pattern-based spectral clustering methods in directed weighted networks is less clear than that of spectral clustering methods in undirected networks. For undirected networks, optimization theory or the lumpability theorem explains why the original data sets are mapped to a high-dimensional space in the ideal case, and matrix perturbation theory explains why the algorithms extend to the general case. Researchers have made efforts to uncover the mechanism of spectral clustering methods in directed weighted networks, but the mechanism of pattern-based spectral clustering in directed weighted networks has not yet been explained.

5 Conclusions

Based on the above discussion, we conclude that a more general spectral clustering matrix needs to be constructed: one that can detect the pattern-based cluster structure of most of the data sets and for which, given the number of clusters, the informative eigenvectors can be selected automatically. Moreover, uncovering the mechanism of

the pattern-based spectral clustering methods in directed weighted networks is another interesting topic.

Acknowledgments. This work was partly supported by National Natural Science Foundation of China (Grant No ) and Natural Science Foundation of Inner Mongolia (Grant No. 2014BS0706). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

References

1. Malliaros, F., Vazirgiannis, M.: Clustering and Community Detection in Directed Networks: A Survey. Physics Reports, 533, (2013)
2. Chung, F.: Laplacians and the Cheeger Inequality for Directed Graphs. Annals of Combinatorics, 9, 1-19 (2005)
3. Gleich, D.: Hierarchical Directed Spectral Graph Partitioning. Information Networks, Stanford University, Final Project (2005)
4. Perrault-Joncas, D., Meila, M.: Directed Graph Embedding: an Algorithm Based on Continuous Limits of Laplacian Type Operators. In: Advances in Neural Information Processing Systems, pp. 1-9. MIT Press, Cambridge (2011)
5. Bauer, F.: Normalized Graph Laplacians for Directed Graphs. Linear Algebra and its Applications, 436, (2012)
6. Zhou, D., Schölkopf, B., Hofmann, T.: Semi-Supervised Learning on Directed Graphs. In: Advances in Neural Information Processing Systems, pp. 1-8. MIT Press, Cambridge (2005)
7. Pentney, W., Meila, M.: Spectral Clustering of Biological Sequence Data. In: Proceedings of the 20th National Conference on Artificial Intelligence, pp AAAI Press, Pittsburgh (2005)
8. Huang, J., Zhu, T., Schuurmans, D.: Web Communities Identification from Random Walks. In: Joint European Conference on Machine Learning and European Conference on Principles and Practice of Knowledge Discovery in Databases, pp Springer Verlag, Heidelberg (2006)
9. Meila, M., Pentney, W.: Clustering by Weighted Cuts in Directed Graphs. In: Proceedings of the 2007 SIAM International Conference on Data Mining, pp SIAM, Philadelphia (2007)
10. Zhou, D., Burges, C.: Spectral Clustering and Transductive Learning with Multiple Views. In: Proceedings of the 24th International Conference on Machine Learning, pp ACM Press, New York (2007)
11. Chen, M., Liu, J., Tang, X.: Clustering via Random Walk Hitting Time on Directed Graphs. In: Proceedings of the 20th National Conference on Artificial Intelligence, pp AAAI Press, Pittsburgh (2008)
12. Mirzal, A., Furukawa, M.: Eigenvectors for Clustering: Unipartite, Bipartite, and Directed Graph Cases. In: 2010 International Conference on Electronics and Information Engineering, pp IEEE Press, Piscataway (2010)
13. Mantrach, A., Zeebroeck, N., Francq, P., Shimbo, M., Bersini, H., Saerens, M.: Semi-Supervised Classification and Betweenness Computation on Large, Sparse, Directed Graphs. Pattern Recognition, 44, (2011)
14. Li, Y., Zhang, Z.: Digraph Laplacian and the Degree of Asymmetry. Internet Mathematics, 8, (2012)
15. Satuluri, V., Parthasarathy, S.: Symmetrizations for Clustering Directed Graphs. In: Proceedings of the 14th International Conference on Extending Database Technology, pp ACM Press, New York (2011)
16. Long, B., Wu, X., Zhang, Z.: Community Learning by Graph Approximation. In: Seventh IEEE International Conference on Data Mining, pp IEEE Press, Piscataway (2007)
17. Rosvall, M., Bergstrom, C.: Maps of Random Walks on Complex Networks Reveal Community Structure. Proceedings of the National Academy of Sciences of the United States of America, 105, (2008)
18. Lai, D., Lu, H., Nardini, C.: Finding Communities in Directed Networks by PageRank Random Walk Induced Network Embedding. Physica A: Statistical Mechanics and its Applications, 389, (2010)
