
Available online www.jocpr.com

Journal of Chemical and Pharmaceutical Research, 2014, 6(6):2512-2520

Research Article    ISSN: 0975-7384    CODEN(USA): JCPRC5

Community detection model based on incremental EM clustering method

Qiu Li-qing, Liang Yong-quan and Chen Zhuo-yan

College of Information Science and Technology, Shandong University of Science and Technology, Qingdao, Shandong, China

ABSTRACT

Networks are widely used in a variety of different fields and attract more and more researchers. Community detection, one of the research hotspots, can identify salient structure and relations among individuals in a network. Many different solutions have been put forward to detect communities. The EM approach, as a statistical inference method, has received particular attention because of its simple and efficient structure. Unlike many other statistical inference methods, the EM approach assumes no extra information except the network itself and the number of groups. However, the practical usefulness of the EM method is often limited by computational inefficiency: the method makes a pass through all of the available data in every iteration, so if the network is large, every iteration can be computationally intensive. We therefore put forward an incremental EM method, IEM, for community detection. IEM uses the machinery of probabilistic mixture models and the incremental EM algorithm to fit a feasible generative model to the observed network without prior knowledge except the network and the number of groups. Using only a subset rather than the entire network allows significant computational improvements, since far fewer data points need to be evaluated in every iteration. We also argue that one can select the subsets intelligently by appealing to EM's likelihood judgment condition and an increment factor. We perform experimental studies on several datasets to demonstrate that IEM detects communities correctly and efficiently.

Key words: Community Detection, Expectation Maximization, Incremental Expectation Maximization

INTRODUCTION

As a newly emerging discipline, research on networks attracts researchers from a variety of different fields. Studies that can qualitatively and quantitatively characterize networks help to unveil the general laws regulating the different real systems modeled by networks, and are therefore relevant in a number of disciplines (biology, the social sciences, etc.). Community structure is one of the crucial structural characteristics of networks, so accurately analyzing community structure is a very relevant topic [1-6]. Communities are groups of nodes with a high level of intra-group connection [1]. They can be seen as relatively isolated subgroups with few contacts with the rest of the network. Community detection can identify salient structure and relations among individuals in the network. Researchers have put forward many different methods, which mainly detect groups with dense connections within groups but sparser connections between them. To detect more latent structures in real networks, various statistical inference models have been proposed recently; they rest on sound theoretical principles, perform better at identifying structures, and have become the state-of-the-art models [7-10]. These models aim to define a generative process that fits the observed network, transforming the community detection problem into Bayesian inference or maximum likelihood estimation [11-14]. The drawback, shared with many other methods, is that structure detection usually implies a computationally expensive exploration of the solutions maximizing the posterior probability of the likelihood.

More recently, a maximum likelihood method that treats cluster assignment as missing information and handles it with an iterative Expectation Maximization (EM) method has been introduced by Newman and Leicht [2].

The EM method is a simple algorithm that is capable of detecting a broad range of structural signatures in networks, including conventional community structure, bipartite or disassortative structure, fuzzy or overlapping classifications, and many mixed or hybrid structural forms that had not been considered explicitly before. Owing to the simple structure of the EM method, a growing body of work has analyzed the EM algorithm [3-5], and many improvements have since been put forward. However, a common weakness of these studies, as we discuss in detail in the related work, is that the EM method becomes inefficient when the networks are large. The EM method may be adequate when the networks are small or medium sized; more often than not, however, real-world networks are large. In such scenarios, an algorithm like the iterative EM method that evaluates all samples at each step may suffer from high complexity and low efficiency. We therefore argue that a more appropriate approach is to improve the EM method so as to reduce the number of samples evaluated at each step. Consequently, we propose an incremental EM algorithm on a sample subset that converges to optimal solutions under the proposed formulations. We prove the correctness and convergence of our algorithm and show that it has low time complexity when the network data is large.

The rest of the paper is organized as follows: in Section 2 we discuss related work, and in Section 3 the EM method is formally introduced. In Section 4 we describe our generalization, the incremental EM method for community detection. In Section 5 we provide experimental studies. Finally, in Section 6, we give conclusions and future directions.

RELATED WORKS

Community structure has been extensively studied in various research areas such as social network analysis, Web community analysis, and computer vision. In network analysis, an important research topic is to identify cohesive subgroups of individuals within a network; finding such cohesive subgroups is what we call community detection. There is a growing body of literature on community detection, and many approaches, such as clique-based, degree-based, and matrix-perturbation-based methods, have been proposed to extract cohesive subgroups from networks. Approaches to community detection can be characterized as heuristic measure methods or statistical inference methods, according to the basis of the objective function. Heuristic measure methods such as modularity maximization [6] and extremal optimization [7] use a heuristic metric to measure community structure and lack a rigorous theoretical basis. Statistical inference methods such as the planted partition model [8] and the EM method [2] can identify both structural equivalence and regular equivalence, classifying the vertices of a network by fitting a generative process to the observed network. Statistical inference methods have a solid theoretical basis, which distinguishes them from heuristic measure methods, and they have become the state of the art. They have the advantage that, depending on the statistical model used, they can be very general, detecting both structurally equivalent and regularly equivalent sets of individuals. Consequently, much of the recent literature focuses on statistical inference methods, among which the EM approach has received particular attention because of its simple and efficient structure.

Unlike many other statistical inference methods, the EM approach assumes no extra information except the network itself and the number of groups. In contrast to traditional community detection methods, the EM approach is capable of detecting disassortative structure as well as overlapping classifications. There are several recent studies on the EM method for community detection. The EM approach to community detection was first introduced by Newman and Leicht [2]; we denote it by the acronym NL-EM from now on. They use the machinery of probabilistic mixture models and the EM algorithm to fit a feasible model to the observed network without prior knowledge except the network and the number of groups, and they give a number of examples demonstrating how the method can shed light on the properties of real-world networks. In their model, the parameter definition implies that the classification must be such that each class has at least one member with non-zero out-degree; this constraint forces NL-EM to classify a simple bipartite graph. Building on this idea, Ramasco et al. [3] generalize an extension of NL-EM in which they extend the parameter θ; their examples show both numerically and analytically that the generalized EM method is able to recover the process that led to the generation of content-based networks. Mungan et al. [4] use NL-EM to carry out a stability analysis of the grouping that quantifies the extent to which each node influences the group membership of its neighbors. All these studies, however, share a common weak point: the EM method has low efficiency and high complexity when the networks are large. That is, when the EM method is used to detect communities, it evaluates all samples in every iteration, which may result in a low convergence rate and a poor clustering effect. In contrast, our proposed method studies an incremental EM method on a sample subset instead of the whole sample set, which proves more efficient.

NL-EM METHOD

NL-EM detects network structure by relying on the following basic assumptions: (1) the actual connectivity of the network is related to an a priori unknown grouping of the individuals; (2) the presence or absence of a link is independent of the other links of the network.

We begin with a quick summary of NL-EM as applied to graphs. Given a graph G of N nodes with adjacency matrix $A_{ij}$, the NL-EM method searches for a partition of the nodes into K groups such that a certain log-likelihood function for the graph is maximized. Henceforth we refer to the groups into which NL-EM divides the nodes as classes. There are three sets of variables in NL-EM: $\pi_r$, the probability that a randomly selected node is in group r; $\theta_{rj}$, the probability that an edge leaving group r connects to node j; and $q_{ir}$, the probability that node i is assigned to group r. The parameters $\pi_r$ and $\theta_{rj}$ satisfy the normalization conditions:

$$\sum_{r=1}^{K} \pi_r = 1, \qquad \sum_{j=1}^{N} \theta_{rj} = 1 \tag{1}$$

Assuming that the parameters $\pi$ and $\theta$ are given, the probability $\Pr(A, g \mid \pi, \theta)$ of realizing the given graph under a node classification g, where $g_i$ is the group to which node i has been assigned, can be written as:

$$\Pr(A, g \mid \pi, \theta) = \prod_{i} \pi_{g_i} \prod_{j} \theta_{g_i j}^{A_{ij}} \tag{2}$$

$\Pr(A, g \mid \pi, \theta)$ is the likelihood to be maximized, but it turns out to be more convenient to consider its logarithm instead:

$$L(\pi, \theta) = \sum_{i} \Big[ \ln \pi_{g_i} + \sum_{j} A_{ij} \ln \theta_{g_i j} \Big] \tag{3}$$

Treating the a priori unknown group assignment g of the nodes as statistical missing information, one considers the averaged log-likelihood:

$$\bar{L}(\pi, \theta) = \sum_{i} \sum_{r} q_{ir} \Big[ \ln \pi_r + \sum_{j} A_{ij} \ln \theta_{rj} \Big] \tag{4}$$

The final results are:

$$\pi_r = \frac{1}{N} \sum_{i} q_{ir} \tag{5}$$

$$\theta_{rj} = \frac{\sum_{i} A_{ij} q_{ir}}{\sum_{i} k_i q_{ir}} \tag{6}$$

where $k_i$ is the total degree of node i. The still unknown probabilities $q_{ir}$ are then determined a posteriori by noting that:

$$q_{ir} = \Pr(g_i = r \mid A, \pi, \theta) = \frac{\Pr(A, g_i = r \mid \pi, \theta)}{\Pr(A \mid \pi, \theta)} \tag{7}$$

from which one obtains:

$$q_{ir} = \frac{\pi_r \prod_{j} \theta_{rj}^{A_{ij}}}{\sum_{s} \pi_s \prod_{j} \theta_{sj}^{A_{ij}}} \tag{8}$$

Equations (5), (6), and (8) form a set of self-consistent equations that any extremum of the expected log-likelihood must satisfy. Thus, given a graph G, the EM algorithm consists of picking a number of classes K into which the nodes are to be classified and searching for solutions of Equations (5), (6), and (8). These equations are derived by Newman and Leicht [2], who also show that, when applied to diverse types of networks, the resulting $\pi_r$, $\theta_{rj}$, and $q_{ir}$ yield useful information about the internal structure of the network. Note that only a minimal amount of a priori information is supplied: the number of classes K and the network itself.
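
The self-consistent Equations (5), (6), and (8) translate directly into an alternating E-step/M-step loop. The sketch below is a minimal NumPy rendering of that loop, not the authors' code; the function name nl_em, the random initialization, and the tolerance-based stopping rule are our own assumptions.

```python
import numpy as np

def nl_em(A, K, max_iter=200, tol=1e-6, seed=0):
    """Minimal sketch of the NL-EM mixture-model iteration:
    E-step from Eq. (8), M-step from Eqs. (5)-(6), repeated until
    the expected log-likelihood of Eq. (4) stabilizes.
    A: N x N adjacency matrix; K: number of classes."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    k = A.sum(axis=1)                            # node degrees k_i
    pi = np.full(K, 1.0 / K)                     # uniform initial pi_r
    theta = rng.random((K, N))                   # random theta_rj rows,
    theta /= theta.sum(axis=1, keepdims=True)    # normalized as in Eq. (1)
    prev = -np.inf
    for _ in range(max_iter):
        # E-step, Eq. (8): log q_ir = ln pi_r + sum_j A_ij ln theta_rj (+ const)
        logq = np.log(pi + 1e-12)[None, :] + A @ np.log(theta + 1e-12).T
        logq -= logq.max(axis=1, keepdims=True)  # numerical stabilization
        q = np.exp(logq)
        q /= q.sum(axis=1, keepdims=True)
        # M-step, Eq. (5): pi_r = (1/N) sum_i q_ir
        pi = q.mean(axis=0)
        # M-step, Eq. (6): theta_rj = sum_i A_ij q_ir / sum_i k_i q_ir
        theta = (q.T @ A) / ((q.T @ k)[:, None] + 1e-12)
        # expected log-likelihood, Eq. (4), used as the convergence check
        L = np.sum(q * (np.log(pi + 1e-12)[None, :] + A @ np.log(theta + 1e-12).T))
        if abs(L - prev) < tol:
            break
        prev = L
    return pi, theta, q
```

The small epsilons guard against taking logarithms of zero and against empty classes; a production implementation would handle those cases explicitly.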

From whch one obtans: q r = Aj r θ j rj. Aj s s θ j rj (8) Equaton (5), (6), and (8) form a set of self consstent equatons for expected log-lkelhood must satsfy. r, θ rj and qr that any extreme of the Thus, gven a graph G, the EM algorthm conssts of pckng a number of classes K nto whch the nodes are to be classfed and searchng for solutons of Equaton (5), (6), and (8). These equatons are derved by Newman et al. θrj q [2]. They also show that when appled to dverse type of networks the resultng, and r yeld useful nformaton about the nternal structure of the network. Note that only a mnmal amount of a pror nformaton s suppled: the number of classes K and the networks. INCREMENTAL EM METHOD Despte the EM method s wde-spread popularty, practcal usefulness of the EM method s often lmted by computatonal neffcency. The EM method makes a pass through all of the avalable data n every teraton. Thus, f the sze of the networks s large, every teraton can be computatonally ntensve. We ntroduce an ncremental EM algorthm for fast computaton based on random sub-samplng whch s denoted by the acronym IEM from now on. Usng only a subset rather than the entre networks allows for sgnfcant computatonal mprovements snce many fewer data ponts need to be evaluated n every teraton. We also argue that one can select the subsets ntellgently by appealng to EM s hghly-apprecated lkelhood judgment condton and ncrement factor. Gven a graph G of N nodes, we frst select M ( M N ) nodes as the ntal sample subset, and then we wll tran the ntal subset by usng NL-EM method. After the tranng, we wll add d ( d N M ) nodes from the remanng samples to the ntal subsets, and then we wll tran the new formed subsets. The smlar teratve operaton s repeated untl the subset s dentcal to the entre samples. The quanttes n our theory thus fall nto three classes: (1) How to defne parameter M? In other words, how many nodes should be frst chosen as the ntal subsets? (2) How to defne parameter d? That s to say, how many nodes should be complemented after last tranng? (3)When wll d nodes be added to the subset? Namely what condtons should be satsfed when the subset changes? We wll gve some reasonable solutons as followed The defnton of parameter M Parameter M means the number of nodes n the ntal subset. The ntal subset selecton s an mportant part of IEM whch has a great nfluence on the results. Our goal s to select some nodes as the ntal subset whch s most representatve of the entre data, and therefore the selected subset can well descrbe the global features. There s a popular vew n network analyss that the mportant nodes are most representatve of the entre networks. Consequently we wll select the mportant nodes of the whole networks. Centralty analyss provdes answers wth measures that defne the mportance of nodes. There are many classcal and commonly methods used ones [9]: degree centralty, closeness centralty, and betweenness centralty. These centralty measures capture the mportance of nodes n dfferent perspectves. Wth large-scale networks, the computaton of centralty measures can be expensve except for degree centralty. We defne n to be the number of the nodes and m to be the number of edges between nodes. Then we can get tme complexty and space complexty about the centralty measures. Closeness centralty, for nstance, nvolves the computaton of all the parwse shortest paths, wth tme complexty 2 3 2 of O( n ) and space complexty of O( n ) wth the Floyd-Warshall algorthm [10] or O( n log n + nm) tme O( nm) complexty wth Johnson s algorthm [11]. 
The betweenness centralty requres computatonal tme followng [12]. For large-scale networks, effcent computaton of centralty measures s crtcal and requres further research. We propose a new method of measurng the centralty whch s a compromse between complexty and effcency. Now we study degree centralty whch s the smplest measures. For degree centralty, the mportance of a node s determned by the number of nodes adjacent to t. The larger the degree of one node, the more mportant the node s. The degree centralty of node v s defned as: 2515

$$C_D(v) = d / (n - 1) \tag{9}$$

where d is the number of nodes adjacent to v, and n is the number of nodes in the network.

However, this measure alone is not comprehensive enough: some important nodes (e.g., bridge nodes that connect with merely two edges) do not have high degree centrality. We therefore argue that the importance of a node is determined by its connection model as well as by its role in the network, and accordingly we consider two factors: the connection model of a node, described by its degree centrality, and its role in the network, described by its cohesion centrality.

Definition 1: The connectivity of node v is defined as the number of edges among the nodes directly connected with v. The connectivity of a node measures how closely its directly connected neighbors are tied to one another, and reflects the local connection property of the node. Obviously, the connectivity c lies between 0 and $C_D(v)(C_D(v) - 1)/2$.

Definition 2: The cohesion centrality of node v is defined as follows:

$$C_c(v) = \frac{C_D(v)\,(C_D(v) - 1)}{2c} \tag{10}$$

where $C_D(v)$ is the degree centrality of node v and c is the connectivity of node v. According to the relations between the nodes and the edges in the network, the value of $C_c(v)$ satisfies the condition:

$$C_c(v) \ge 1 \tag{11}$$

We find that the larger the connectivity of a node, the less important the node is, because the deletion of a node with larger connectivity affects the network less. Thus, according to Equation (10), the more important a node is, the larger its cohesion centrality; the cohesion centrality is therefore a positive evaluation index of the node. To integrate the two factors (the connection model of a node and its role in the network), an importance function consisting of two parts, a degree centrality and a cohesion centrality, is introduced to measure the importance of a node:

$$I(v) = \alpha\, C_D(v) + (1 - \alpha)\, C_c(v) \tag{12}$$

where α satisfies 0 ≤ α ≤ 1. In this importance function, the degree centrality $C_D(v)$ measures the connection model of node v, and the cohesion centrality $C_c(v)$ measures its role. The parameter α is set by the user to control the level of emphasis on each part of the total importance function. Thus, according to Equation (12), we can select the nodes with high importance values as the initial subset.
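
To make the subset selection concrete, here is a short sketch that scores every node with Equation (12) and returns the top M as the initial subset. It is our own reading of Definitions 1 and 2, not the authors' implementation: we use the raw degree for the neighbor-pair count in Equation (10), and we assign C_c = 0 when a node's neighborhood contains no edges, a case the paper does not specify.

```python
import numpy as np

def importance_scores(A, alpha=0.5):
    """Sketch of Eq. (12): I(v) = alpha*C_D(v) + (1 - alpha)*C_c(v).
    A: symmetric 0/1 adjacency matrix. Assumptions: raw degree is used
    for the pair count of Eq. (10); c = 0 neighborhoods get C_c = 0."""
    n = A.shape[0]
    deg = A.sum(axis=1)                       # raw degree d of each node
    C_D = deg / (n - 1)                       # Eq. (9): degree centrality
    scores = np.zeros(n)
    for v in range(n):
        nbrs = np.flatnonzero(A[v])
        c = A[np.ix_(nbrs, nbrs)].sum() / 2   # Definition 1: edges among neighbors
        C_c = deg[v] * (deg[v] - 1) / (2 * c) if c > 0 else 0.0   # Eq. (10)
        scores[v] = alpha * C_D[v] + (1 - alpha) * C_c            # Eq. (12)
    return scores

def initial_subset(A, M, alpha=0.5):
    """Select the M most important nodes as the IEM initial subset."""
    return np.argsort(importance_scores(A, alpha))[::-1][:M]
```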

The definition of parameter d

Parameter d is the number of nodes added to the subset in each iteration. The definition of parameter d is another crucial question in IEM: the parameter should make the subset fit the real model as closely as possible. Here we propose the concept of an increment factor, based on information entropy, to determine parameter d. According to information theory, entropy measures the uncertainty of a system: the larger the entropy, the more uncertain the system. If the density function values of the nodes in the subset are approximately equal, the uncertainty of the distribution over the entire data is largest (i.e., the subset has maximum entropy). Conversely, if the density function values of the nodes are very asymmetric, the subset has minimal uncertainty. We therefore introduce the concept of density entropy to measure the increment factor.

Definition 3: Given a node set $D = \{x_1, x_2, \ldots, x_N\}$ of N nodes, where the density function value of node $x_i$ is $f(x_i)$, $i = 1, 2, \ldots, N$, and δ is the sample variance, the density entropy is defined as follows:

$$DenEn(\delta) = -\sum_{i=1}^{N} \frac{f(x_i)}{Sum} \ln \frac{f(x_i)}{Sum} \tag{13}$$

where Sum is the normalization factor defined as:

$$Sum = \sum_{i=1}^{N} f(x_i) \tag{14}$$

The density entropy has two properties:

Property 1: $0 \le DenEn(\delta) \le \ln(N)$;

Property 2: $DenEn(\delta) = \ln(N)$ when and only when $f(x_1) = f(x_2) = \cdots = f(x_N)$; therefore $\lim_{\delta \to 0} DenEn(\delta) = \ln(N) = \max(DenEn(\delta))$.

From Property 2, when $DenEn(\delta) = \ln(N)$ the nodes in the subset are consistent with the real distribution, which is the ideal case. As δ increases, the value of $DenEn(\delta)$ decreases and eventually reaches a minimum; afterwards it grows again, reaching the maximum $\ln(N)$ when $\delta \to 0$ and $f(x_1) = f(x_2) = \cdots = f(x_N)$. Since the change of the sample variance δ mirrors that of the density entropy, we take the middle value of $DenEn(\delta)$ and propose the increment factor β as:

$$\beta = DenEn(\delta)/2 = \ln(N)/2 \tag{15}$$

According to Equation (15), the parameter d can be described as follows:

$$d = N / \beta \tag{16}$$

Once parameter d is determined, the iterative process of IEM is carried out as follows: when the samples in the subset fit the real network, d nodes are added to the subset, and then a new fitting process begins. The incremental process ends when the subset equals the entire data; in the process, the subset gradually approaches the entire data. It is worth mentioning that the complementary nodes in every iteration are selected from the entire node set. Hence the number of nodes in the subset becomes:

$$M' = M + d = M + N/\beta \tag{17}$$
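
The increment bookkeeping of Equations (13)-(16) is small enough to show in full. The sketch below computes the density entropy from given density function values f(x_i) and derives the per-round increment d; how the f(x_i) themselves are estimated is left open by the paper, so they are an input here, and rounding d up is our choice.

```python
import numpy as np

def density_entropy(f):
    """Eqs. (13)-(14): entropy of the normalized density values f(x_i).
    f: 1-D array of positive density function values."""
    p = np.asarray(f, dtype=float)
    p = p / p.sum()                        # Eq. (14): Sum normalization
    return float(-np.sum(p * np.log(p)))   # Eq. (13)

def increment_d(N):
    """Eqs. (15)-(16): increment factor beta = ln(N)/2 and the number
    d = N/beta of nodes added per round (rounded up, our choice)."""
    beta = np.log(N) / 2.0                 # Eq. (15)
    return int(np.ceil(N / beta))          # Eq. (16)
```

For a 1000-node network, for example, beta is about 3.45 and d = 290, so the subset grows from its initial size to the whole network in only a few rounds.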

The conditions when the subset changes

From Section 3, the EM method is an iterative procedure for maximizing $\bar{L}(\pi, \theta)$, which we describe in Equation (4). Assume that after the n-th iteration the current estimates for $\pi$ and $\theta$ are $\pi^n$ and $\theta^n$. Since the objective is to maximize $\bar{L}(\pi, \theta)$, we wish to compute updated estimates $\pi^{n+1}$ and $\theta^{n+1}$ such that:

$$\bar{L}(\pi^{n+1}, \theta^{n+1}) > \bar{L}(\pi^{n}, \theta^{n}) \tag{18}$$

Equation (18) is the iteration condition of the EM method: if the updated likelihood is not greater than the current likelihood, the iteration ends. Inspired by Equation (18), we propose the iteration condition of IEM. We define $\bar{L}^{(t)}(\pi, \theta)$ as the maximum likelihood after the t-th iteration and $\bar{L}^{(t+1)}(\pi, \theta)$ as the maximum likelihood after the (t+1)-th iteration. The iteration condition can be defined as:

$$\bar{L}^{(t+1)}(\pi, \theta) > \bar{L}^{(t)}(\pi, \theta) \tag{19}$$

Equivalently, we want to maximize the difference:

$$D = \bar{L}^{(t+1)}(\pi, \theta) - \bar{L}^{(t)}(\pi, \theta) \tag{20}$$

Assume γ is a sufficiently small positive number. If D > γ, we argue that the current estimate is still undesirable and the iteration should go on. If D ≤ γ, then the subset after the t-th iteration fits the model of the real data, and new samples should be complemented to carry out the next iteration. When the subset equals the entire data, the termination condition coincides with that of the EM method.

The determination of α

How to determine α in Equation (12) is a challenging issue. When ground truth is available, standard validation procedures can be used to select an optimal α. In many cases, however, there is no ground truth, and the community detection performance depends on the user's subjective preference. In this respect, the parameter α gives the user a mechanism to push the community detection results of IEM toward his or her preferred outcomes. The problems of whether a correct α exists and how to find the best α automatically when there is no ground truth are beyond the scope of this paper. To simplify the experiments, we set α to 0.5 in the following example applications.
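
Combining the pieces, one plausible shape for the IEM outer loop is sketched below: train NL-EM on the current subset until the likelihood gain D of Equation (20) falls to γ or below, then complement the subset with d more nodes and repeat. The helpers nl_em, importance_scores, and increment_d are the sketches given earlier; ordering the added nodes by importance score is our assumption.

```python
import numpy as np

def iem(A, K, M, gamma=1e-4, alpha=0.5):
    """Sketch of the IEM outer loop. nl_em's stopping rule (likelihood
    gain below its tolerance) plays the role of the D <= gamma test of
    Eq. (20); when it fires, d nodes (Eq. (16)) are complemented, until
    the subset equals the entire data, where the termination condition
    coincides with plain EM."""
    N = A.shape[0]
    d = increment_d(N)                                      # Eqs. (15)-(16)
    order = np.argsort(importance_scores(A, alpha))[::-1]   # important first
    size = M
    while True:
        subset = order[:size]                        # current sample subset
        sub_A = A[np.ix_(subset, subset)]
        pi, theta, q = nl_em(sub_A, K, tol=gamma)    # fit on the subset
        if size >= N:                                # subset == entire data
            return pi, theta, q, subset
        size = min(size + d, N)                      # add d more nodes
```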

EXAMPLE APPLICATIONS

In this section, we use several datasets to study the performance of IEM from different aspects. In Section 5.1 we first verify the correctness of IEM, and in Section 5.2 we compare IEM with the baseline algorithm NL-EM.

First Example Application

We start with the first synthetic dataset, a static network, to illustrate some good properties of IEM. This dataset was first studied by White and Smyth [13] and is shown in Figure 1(a). The network contains 15 nodes which roughly form 3 communities, C1, C2, and C3, where edges tend to occur between nodes in the same community. We first apply our algorithm to the network with various community numbers m, and the resulting Q values are plotted in Figure 1(b). Q values can be interpreted as modularity values, a measure of the deviation between the observed edge-cluster probabilities and what one would predict under an independence model. Newman et al. [14] show that larger Q values are correlated with better graph clustering. In Figure 1(b) we also show the Q′ values reported by White and Smyth. As can be seen from Figure 1(b), both the Q and Q′ modularity values show distinct peaks at m = 3, which corresponds to the correct community number. Our IEM algorithm obtains the higher modularity values, which indicates that IEM classifies the network better.

Next, after our IEM algorithm correctly partitions the nodes into three communities, we illustrate the soft community membership by studying two of the communities, C2 and C3. In Figure 1(a) we use triangles to represent the nodes in C2 and circles to represent the nodes in C3, and we use gray levels to indicate community membership: white shows the degree to which a node belongs to C2, and dark the degree to which it belongs to C3. While the nodes that are white or black have very clear community memberships, the nodes on the boundary between C2 and C3 have rather fuzzy membership. The lighter a node, the more likely it belongs to C2; conversely, the darker a node, the more likely it belongs to C3. In other words, our IEM algorithm is capable of assigning a meaningful soft membership to a node, indicating the degree to which the node belongs to a certain community.

Figure 1. First example application: (a) application of the IEM method; (b) modularity values Q and Q′ under different community numbers

Second Example Application

We next apply our IEM algorithm to a small network, the karate club network [14]. The network contains 34 nodes which roughly form 2 communities, C1 and C2. The network is of particular interest because the club split in two during the course of Zachary's observations as a result of an internal dispute, and Zachary recorded the membership of the two factions after the split. Figure 2 shows the result of our IEM algorithm with the number of clusters set to 2. As in the first example application, we use gray levels to indicate community membership. In Figure 2 we use circles and rectangles to represent the nodes in C1 and C2 respectively, with white illustrating the degree to which a node belongs to C1 and dark the degree to which it belongs to C2. As can be seen, nodes 9, 3, 14, and 20 lie on the boundary between C1 and C2 and have rather fuzzy membership.

Figure 2. Second example application: application of the IEM method

Next, after our IEM algorithm correctly partitions the nodes into two communities, we compare IEM with the baseline algorithm NL-EM. The comparison is shown in Table 1. As we can see from the table, under the same computing environment IEM needs only 0.471 seconds, much less than NL-EM, and only 35 iterations, far fewer than NL-EM. On this dataset, then, IEM outperforms NL-EM: it reaches the neighborhood of the solution faster and is highly efficient because of its fast convergence rate.

Table 1. The comparison between NL-EM and IEM

Dataset      Parameter            NL-EM    IEM
Karate club  Time (s)             0.782    0.471
             Iterations           87       35
             Likelihood estimate  -6.321   -6.435
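
For completeness, Q values like those discussed above can be reproduced from a hard partition with the standard Newman modularity formula; since the paper does not print its exact definition of Q, the sketch below is our assumption of that computation, applied to the soft memberships q returned by nl_em or iem via an argmax.

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q = sum_r (e_rr - a_r^2) for a hard partition.
    A: symmetric adjacency matrix; labels: community index per node."""
    two_m = A.sum()                                   # twice the edge count
    Q = 0.0
    for r in np.unique(labels):
        idx = np.flatnonzero(labels == r)
        e_rr = A[np.ix_(idx, idx)].sum() / two_m      # within-community edge fraction
        a_r = A[idx].sum() / two_m                    # community degree fraction
        Q += e_rr - a_r ** 2
    return Q

# Example: hard labels from the soft memberships, then the Q of the split
# labels = q.argmax(axis=1); print(modularity(A, labels))
```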

CONCLUSION

Community detection is a challenging research problem with broad applications. In this paper we have described an incremental EM method, IEM, for community detection. IEM uses the machinery of probabilistic mixture models and the incremental EM algorithm to fit a feasible model to the observed network without prior knowledge except the network and the number of groups. The method is more efficient than the previous NL-EM, making use of a new incremental approach which comes closer to the optimal solutions. Using only a subset rather than the entire network allows significant computational improvements, since far fewer data points need to be evaluated in every iteration. We also argue that one can select the subsets intelligently by appealing to EM's likelihood judgment condition and the increment factor. We have demonstrated the method with applications to some simple examples, including computer-generated and real-world networks. The method's strength is its efficiency, which leads to a high convergence rate and a good clustering effect.

As part of future work, we plan to extend our framework in two directions. First, our current model applies only to static networks, with no temporal analysis for evolution study; we are applying our model to dynamic networks to detect communities. Second, so far we have considered only link information. In many applications, content information is also very important, and we are investigating how to incorporate content information into our model.

REFERENCES

[1] Girvan M and Newman MEJ. Proceedings of the National Academy of Sciences, 2002, 99(12), 7821-7826.
[2] Newman MEJ and Leicht EA. Proceedings of the National Academy of Sciences, 2007, 104(23), 9564-9569.
[3] Ramasco JJ and Mungan M. Physical Review E, 2008, 77(3), 036122.
[4] Mungan M and Ramasco JJ. Journal of Statistical Mechanics: Theory and Experiment, 2010, 4, 04028.
[5] Vazquez A. Physical Review E, 2008, 77(6), 066106.
[6] Kirkpatrick S, Gelatt CD and Vecchi MP. Science, 1983, 220(4598), 671-680.
[7] Duch J and Arenas A. Physical Review E, 2005, 72(2), 027104.
[8] Condon A and Karp RM. Random Structures and Algorithms, 2001, 18(2), 116-140.
[9] Zhang B, Zhang S and Liu G. Journal of Chemical and Pharmaceutical Research, 2013, 5(9), 256-262.
[10] Zhang B. International Journal of Applied Mathematics and Statistics, 2013, 44(14), 422-430.
[11] Zhang B and Yue H. International Journal of Applied Mathematics and Statistics, 2013, 40(10), 469-476.
[12] Zhang B and Feng Y. International Journal of Applied Mathematics and Statistics, 2013, 40(10), 136-143.
[13] Zhang B. Journal of Chemical and Pharmaceutical Research, 2014, 5(2), 649-659.
[14] Zhang B, Zhang S and Liu G. Journal of Chemical and Pharmaceutical Research, 2013, 5(9), 256-262.