The Random Subgraph Model for the Analysis of an Ecclesiastical Network in Merovingian Gaul


Charles Bouveyron, Laboratoire MAP5, UMR CNRS 8145, Université Paris Descartes.
Joint work with Y. Jernite, P. Latouche, P. Rivera, L. Jegou & S. Lamassé.

Outline
- Introduction
- The random subgraph model (RSM)
- Model inference
- Numerical experiments
- Analysis of an ecclesiastical network
- Conclusion

Introduction
The analysis of networks is a recent but increasingly important field in statistical learning, with applications in domains ranging from biology to history:
- biology: analysis of gene regulation processes,
- social sciences: analysis of political blogs,
- history: visualization of medieval social networks.
Two main problems are currently well addressed: visualization of networks and clustering of network nodes.
Network comparison is a still-emerging problem in statistical learning; it is mainly addressed through graph structure comparison, and limited to binary networks.

Figure: Clustering of network nodes: communities (left) vs. structures with hubs (right).

Introduction
Key works in probabilistic models:
- stochastic block model (SBM) by Nowicki and Snijders (2001),
- latent space model by Hoff, Handcock and Raftery (2002),
- latent cluster model by Handcock, Raftery and Tantrum (2007),
- mixed membership SBM (MMSBM) by Airoldi et al. (2008),
- mixture of experts for LCM by Gormley and Murphy (2010),
- MMSBM for dynamic networks by Xing et al. (2010),
- overlapping SBM (OSBM) by Latouche et al. (2011).
A good overview is given in: M. Salter-Townshend, A. White, I. Gollini and T. B. Murphy, "Review of Statistical Network Analysis: Models, Algorithms, and Software", Statistical Analysis and Data Mining, Vol. 5(4), 2012.

Introduction: the historical problem
Our colleagues from the LAMOP team were interested in answering the following question: was the Church organized in the same way within the different kingdoms of Merovingian Gaul?
To this end, they built a relational database:
- from the written acts of ecclesiastical councils that took place in Gaul during the 6th century (480-614),
- those acts report who attended (bishops, kings, dukes, priests, monks, ...) and what questions (regarding the Church, faith, ...) were discussed,
- they also allow the type of relationship between individuals to be characterized,
- it took 18 months to build the database.

Introduction: the historical problem
The database contains:
- 1331 individuals (mostly clergymen) who participated in ecclesiastical councils in Gaul between 480 and 614,
- 4 types of relationships between individuals (positive, negative, variable or neutral),
- each individual belongs to one of the 5 regions of Gaul: 3 kingdoms (Austrasia, Burgundy and Neustria) and 2 provinces (Aquitaine and Provence),
- additional information is also available: social positions, family relationships, birth and death dates, offices held, council dates, ...

Introduction: the historical problem
Figure: Adjacency matrix of the ecclesiastical network, sorted by regions (Neustria, Provence, Unknown, Aquitaine, Austrasia, Burgundy).

Introduction
Expected difficulties:
- existing approaches cannot analyze networks with categorical edges and a partition into subgraphs,
- the comparison of subgraphs has, to our knowledge, not been addressed in this context,
- a source effect is expected, due to the overrepresentation of some places (Neustria, through the Ten Books of Histories of Gregory of Tours) or individuals (hagiographies).
Our approach:
- we consider directed networks with typed (categorical) edges for which a partition into subgraphs is known,
- we base the comparison on the cluster organization of the subgraphs,
- we propose an extension of SBM which takes typed edges and subgraphs into account,
- subgraph comparison is then possible using the model parameters.

The random subgraph model (RSM)
Before the maths, an example of an RSM network. We observe:
- the partition of the network into S = 2 subgraphs (node shape),
- the presence $A_{ij}$ of directed edges between the N nodes,
- the type $X_{ij} \in \{1, \dots, C\}$ of the edges (C = 3, edge color).
We search for a partition of the nodes into K = 3 groups (node color) which overlaps with the partition into subgraphs.
Figure: Example of an RSM network.

The random subgraph model (RSM)
The network (represented by its adjacency matrix X) is assumed to be generated as follows:
- the presence of an edge between nodes i and j is such that $A_{ij} \sim \mathcal{B}(\gamma_{s_i s_j})$, where $s_i \in \{1, \dots, S\}$ indicates the (observed) subgraph of node i,
- each node i is also associated with an (unobserved) group among K, according to $Z_i \sim \mathcal{M}(\alpha_{s_i})$, where $\alpha_s \in [0, 1]^K$ and $\sum_{k=1}^K \alpha_{sk} = 1$,
- finally, each edge $X_{ij}$ can be of C different (observed) types, such that $X_{ij} \mid A_{ij} Z_{ik} Z_{jl} = 1 \sim \mathcal{M}(\Pi_{kl})$, where $\Pi_{kl} \in [0, 1]^C$ and $\sum_{c=1}^C \Pi_{klc} = 1$.
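
This generative process is easy to simulate. Below is a minimal sketch in base R; the values chosen for $\gamma$, $\alpha$ and $\Pi$ are illustrative assumptions, not values from the paper.

```r
## Simulate an RSM network: subgraph labels s, latent clusters Z,
## presence matrix A and typed adjacency matrix X.
set.seed(1)
N <- 100; S <- 2; K <- 3; C <- 3

s <- sample(1:S, N, replace = TRUE)          # observed subgraph of each node

## Illustrative parameters (assumptions, not the paper's values)
gamma <- matrix(c(0.20, 0.05,
                  0.05, 0.15), S, S, byrow = TRUE)    # edge prob. between subgraphs
alpha <- matrix(c(0.6, 0.3, 0.1,
                  0.1, 0.3, 0.6), S, K, byrow = TRUE) # cluster proportions per subgraph
Pi <- array(1 / C, dim = c(K, K, C))                  # edge-type distributions

Z <- sapply(1:N, function(i) sample(1:K, 1, prob = alpha[s[i], ]))  # latent labels

X <- matrix(0L, N, N)    # X[i, j] = 0: no edge; 1..C: observed edge type
for (i in 1:N) for (j in 1:N) {
  if (i != j && rbinom(1, 1, gamma[s[i], s[j]]) == 1)
    X[i, j] <- sample(1:C, 1, prob = Pi[Z[i], Z[j], ])
}
A <- 1L * (X > 0)        # presence matrix A_ij
```

Inference then amounts to recovering Z, $\alpha$, $\gamma$ and $\Pi$ from (X, A, s) alone.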

The random subgraph model (RSM)
Table: Summary of the notations.
- X: adjacency matrix; $X_{ij} \in \{0, \dots, C\}$ indicates the edge type,
- A: binary matrix; $A_{ij} = 1$ indicates the presence of an edge, i.e. $A_{ij} = 1 \Leftrightarrow X_{ij} \neq 0$,
- Z: binary matrix; $Z_{ik} = 1$ indicates that i belongs to cluster k,
- N: number of vertices in the network,
- K: number of latent clusters,
- S: number of subgraphs,
- C: number of edge types,
- $\alpha$: $\alpha_{sk}$ is the proportion of cluster k in subgraph s,
- $\Pi$: $\Pi_{klc}$ is the probability of having an edge of type c between vertices of clusters k and l,
- $\gamma$: $\gamma_{rs}$ is the probability of having an edge between vertices of subgraphs r and s.

The random subgraph model (RSM)
Remark 1:
- the RSM model separates the roles of the known partition and the latent clusters,
- this was motivated by historical assumptions on the creation of relationships during the 6th century,
- indeed, the possibilities of connection were preponderant over the type of connection, and mainly dependent on geography.
Remark 2:
- an alternative approach would consist in allowing $X_{ij}$ to depend directly on both the latent clusters and the partition,
- however, this would dramatically increase the number of model parameters ($K^2 S^2 (C + 1) + SK$ instead of $S^2 + K^2 C + SK$),
- if S = 6, K = 6 and C = 4, the alternative approach has 6516 parameters while RSM has only 216 (see the check below).
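
The two counts quoted above follow directly from the two formulas; a two-line check in R:

```r
## Parameter counts for S = 6, K = 6, C = 4, using the formulas above.
S <- 6; K <- 6; C <- 4
K^2 * S^2 * (C + 1) + S * K   # alternative model: 6516
S^2 + K^2 * C + S * K         # RSM: 216
```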

The random subgraph model (RSM)
We consider a Bayesian framework. The model is fully defined by its joint distribution:
$p(X, A, Z \mid \alpha, \gamma, \Pi) = p(X \mid A, Z, \Pi)\, p(A \mid \gamma)\, p(Z \mid \alpha)$,
which we complete with conjugate prior distributions for the model parameters:
- the prior distribution for $\gamma$ is $p(\gamma_{rs}) = \mathrm{Beta}(a_{rs}, b_{rs})$,
- the prior distribution for $\alpha$ is $p(\alpha_s) = \mathrm{Dir}(\chi_s)$,
- the prior distribution for $\Pi$ is $p(\Pi_{kl}) = \mathrm{Dir}(\Xi_{kl})$.

The random subgraph model (RSM)
Figure: Graphical representation of the RSM model (hyperparameters $\chi$, $(a, b)$, $\Xi$; observed partition P; latent variables $Z_i$, $Z_j$; observed edges $A_{ij}$, $X_{ij}$).

Model inference
Within the Bayesian framework introduced above:
- we aim at estimating the posterior distribution $p(Z, \alpha, \gamma, \Pi \mid X, A)$, which in turn allows us to compute MAP estimates of Z and $(\alpha, \gamma, \Pi)$,
- as expected, this distribution is not tractable, and approximate inference procedures are required,
- MCMC methods are obviously an option, but they scale poorly with sample size.
We chose variational approaches instead:
- they can deal with large networks (N > 1000),
- recent theoretical results (Celisse et al., 2012; Mariadassou and Matias, 2013) gave new insights into the convergence properties of variational approaches in this context.

The VBEM algorithm for RSM
Variational Bayesian inference in our case: we aim at approximating the posterior distribution $p(Z, \alpha, \gamma, \Pi \mid X, A)$, and therefore search for the approximation $q(Z, \alpha, \gamma, \Pi)$ which maximizes $\mathcal{L}(q)$, where
$\log p(X, A) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p(\cdot \mid X, A))$,
and q is assumed to factorize as
$q(Z, \alpha, \gamma, \Pi) = \prod_{i=1}^N q(Z_i) \prod_{s=1}^S q(\alpha_s) \prod_{r,s} q(\gamma_{rs}) \prod_{k,l} q(\Pi_{kl})$.
The VBEM algorithm for RSM:
- E step: compute the update parameter $\tau_i$ of $q(Z_i)$,
- M step: compute the update parameters $\chi$, $(a, b)$ and $\Xi$ of $q(\alpha_s)$, $q(\gamma_{rs})$ and $q(\Pi_{kl})$ respectively.

Initialization and choice of K
Initialization of the VBEM algorithm:
- VBEM is known to be sensitive to its initialization,
- we propose a strategy based on several k-means runs with the specific distance
$d(i, j) = \sum_{h=1}^N \delta(X_{ih} \neq X_{jh})\, A_{ih} A_{jh} + \sum_{h=1}^N \delta(X_{hi} \neq X_{hj})\, A_{hi} A_{hj}$.
Choice of the number K of groups:
- once the VBEM algorithm has converged, the lower bound $\mathcal{L}(q)$ is a good approximation of the integrated log-likelihood $\log p(X, A)$,
- we can thus use $\mathcal{L}(q)$ as a model selection criterion for choosing K,
- computed right after the M step, the lower bound reduces to
$\mathcal{L}(q) = \sum_{r,s}^{S} \log \frac{B(a_{rs}, b_{rs})}{B(a^0_{rs}, b^0_{rs})} + \sum_{s=1}^S \log \frac{C(\chi_s)}{C(\chi^0_s)} + \sum_{k,l}^{K} \log \frac{C(\Xi_{kl})}{C(\Xi^0_{kl})} - \sum_{i=1}^N \sum_{k=1}^K \tau_{ik} \log \tau_{ik}$.
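
The distance transcribes directly into R, reusing the simulated X and A from above. The authors run k-means variants on this distance; as a stand-in, the sketch below feeds the dissimilarity matrix to k-medoids (cluster::pam), which accepts an arbitrary dissimilarity, so this is a substitute for their procedure, not a reproduction of it.

```r
## Pairwise initialization distance d(i, j) from the slide above.
rsm_dist <- function(X, A) {
  N <- nrow(X)
  D <- matrix(0, N, N)
  for (i in 1:N) for (j in 1:N) {
    D[i, j] <- sum((X[i, ] != X[j, ]) * A[i, ] * A[j, ]) +  # outgoing edges
               sum((X[, i] != X[, j]) * A[, i] * A[, j])    # incoming edges
  }
  D
}

library(cluster)                                  # for pam()
D <- rsm_dist(X, A)
init <- pam(as.dist(D), k = 3, diss = TRUE)$clustering  # initial partition
table(init, Z)                                    # compare with simulated clusters
```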

Experimental setup
We considered 3 different situations:
- S1: network without subgraphs and with a preponderant proportion of edges of type 1,
- S2: network without subgraphs and with balanced proportions of the three edge types,
- S3: network with 3 subgraphs and with balanced proportions of the three edge types.
Global setup:
- in all cases, the number of (unobserved) groups is K = 3 and the network size is N = 100,
- we use the adjusted Rand index (ARI) to evaluate clustering quality (and thus model fit); a small example follows below.
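
A quick way to compute the ARI in R is the implementation in the mclust package, applied here to the simulated labels Z and the initialization from the previous sketch:

```r
## Adjusted Rand index between true and estimated partitions.
library(mclust)             # provides adjustedRandIndex()
adjustedRandIndex(Z, init)  # 1 = perfect recovery, ~0 = chance level
```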

Choice of the number K of groups
First, a model selection study:
- we aim at validating the use of $\mathcal{L}(q)$ as a model selection criterion,
- we simulated 50 RSM networks according to scenario S1 with N = 100, and applied our VBEM algorithm for values of K ranging from 2 to 5,
- the actual value is K = 3.

Choice of the number K of groups
Figure: Lower bound $\mathcal{L}$ (left panel) and ARI (right panel) over 50 networks generated with the parameters of the first scenario.

Comparison with other SBM-based approaches
Second, a comparison with other SBM-based methods:
- binary SBM: the original SBM algorithm applied to a collapsed version of the data (only the presence of edges), using the mixer package,
- binary SBM (type 1, 2 or 3): the original SBM algorithm applied to a collapsed version of the data (only edges of type 1, 2 or 3), using the mixer package,
- typed SBM: we implemented the categorical version of SBM, since it is not available in existing software; this version of SBM will be available in mixer soon.
The studied methods were applied to the three scenarios, and results are averaged over 50 networks.

Comparison with other SBM-based approaches
Table: ARI (mean ± standard deviation) averaged over 50 networks simulated according to the three scenarios, for binary SBM (presence), binary SBM (types 1, 2 and 3), typed SBM and RSM.

The ecclesiastical network
The data:
- 1331 individuals (mostly clergymen) who participated in ecclesiastical councils in Gaul between 480 and 614,
- 4 types of relationships between individuals (positive, negative, variable or neutral),
- each individual belongs to one of the 5 regions (3 kingdoms and 2 provinces).
Our modeling allows a multi-level analysis:
- Z characterizes the clusters found, through the social positions of the individuals,
- the parameter $\Pi$ describes the relations between the clusters,
- the parameter $\gamma$ describes the connections between the subgraphs,
- the parameter $\alpha$ describes the cluster proportions within the subgraphs.

RSM results: the latent clusters
Figure: Characterization of the K = 6 clusters found by RSM (distribution of social positions per cluster: bishop, priest, abbot, earl, duke, monk, deacon, king, queen, archdeacon).

RSM results: the latent clusters
The latent clusters from the historical point of view:
- clusters 1 and 3 correspond to local, provincial or diocesan councils, mostly interested in local issues (e.g. the council of Arles, 554),
- clusters 2 and 6 correspond to councils dedicated to political questions, usually convened by a king (e.g. Orléans, 511),
- clusters 4 and 5 correspond to aristocratic assemblies, where queens, dukes and earls are present (e.g. Orléans, 529).

RSM results: the relationships between clusters
Figure: Characterization of the relationships between clusters (parameter $\Pi$): positive (left) and negative (right) relations.

RSM results: the relationships between clusters
Figure: Characterization of the relationships between clusters (parameter $\Pi$): variable (left) and neutral (right) relations.

RSM results: the relationships between clusters
The cluster relationships from the historical point of view:
- the positive relations between clusters 3, 5 and 6 mainly correspond to personal friendships between bishops (source effect),
- the negative and variable relations between clusters 4, 5 and 6 reflect conflicts within the hierarchy of power,
- the neutral relations between clusters 1, 3 and 6 were expected, because these clusters deal with different issues (local vs. political).

RSM results: the relationships between regions
Figure: Characterization of the relationships between the regions (parameter $\gamma$, in log scale), for Neustria, Provence, Unknown, Aquitaine, Austrasia and Burgundy.

RSM results: comparison of the regions
Figure: Characterization of the regions through their cluster proportions (parameter $\alpha$), for clusters 1 to 6 and for Neustria, Provence, Unknown, Aquitaine, Austrasia, Burgundy, and in total.

RSM results: comparison of the regions
Figure: PCA for compositional data on the parameter $\alpha$ (Aquitaine, Neustria, Burgundy, Unknown, Austrasia and Provence plotted on the first two components).

Conclusion
Our contribution:
- a model for network clustering which takes into account an existing partition of the network into subgraphs,
- this modeling subsequently allows a comparison of the subgraphs,
- inference is done in a Bayesian framework, using a VBEM algorithm,
- our approach has been applied to a complex historical network.
Interesting problems to address:
- temporality of the network (evolution of relations, social positions, ...),
- visualization of this kind of network,
- procedures to test the similarity of subgraphs.
Software: the Rambo package for R is available on CRAN (see the sketch below).
Reference: The Random Subgraph Model for the Analysis of an Ecclesiastical Network in Merovingian Gaul, The Annals of Applied Statistics, Vol. 8(1), 2014.
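
A minimal way to get started with the package; only the generic install and load calls below are standard R, and the fitting function and its arguments should be looked up in the package manual rather than assumed here.

```r
## Install and load the Rambo package from CRAN, then open its manual.
install.packages("Rambo")
library(Rambo)
help(package = "Rambo")   # lists the fitting function and its arguments
```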

The EM, VEM and VBEM algorithms
First, it is necessary to write the log-likelihood as
$\log p(X \mid \theta) = \mathcal{L}(q(Z); \theta) + \mathrm{KL}(q(Z) \,\|\, p(Z \mid X, \theta))$,
where:
- $\mathcal{L}(q(Z); \theta) = \sum_Z q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}$ is a lower bound of the log-likelihood,
- $\mathrm{KL}(q(Z) \,\|\, p(Z \mid X, \theta)) = -\sum_Z q(Z) \log \frac{p(Z \mid X, \theta)}{q(Z)}$ is the KL divergence between $q(Z)$ and $p(Z \mid X, \theta)$.
The EM algorithm:
- E step: $\theta$ is fixed and $\mathcal{L}$ is maximized over q, giving $q^*(Z) = p(Z \mid X, \theta)$,
- M step: $\mathcal{L}(q^*(Z); \theta)$ is then maximized over $\theta$:
$\mathcal{L}(q^*(Z); \theta) = \sum_Z p(Z \mid X, \theta^{\mathrm{old}}) \log \frac{p(X, Z \mid \theta)}{p(Z \mid X, \theta^{\mathrm{old}})} = \mathbb{E}[\log p(X, Z \mid \theta) \mid \theta^{\mathrm{old}}] + c$.
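
As a one-line sanity check of the decomposition, adding the two terms and using $p(X, Z \mid \theta) = p(Z \mid X, \theta)\, p(X \mid \theta)$ makes the sum telescope:

```latex
\mathcal{L}(q;\theta) + \mathrm{KL}\bigl(q \,\|\, p(\cdot \mid X,\theta)\bigr)
  = \sum_Z q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}
    - \sum_Z q(Z) \log \frac{p(Z \mid X, \theta)}{q(Z)}
  = \sum_Z q(Z) \log p(X \mid \theta)
  = \log p(X \mid \theta).
```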

The EM, VEM and VBEM algorithms
The variational approach:
- suppose now that the posterior $p(Z \mid X, \theta)$ is, for some reason, intractable,
- the variational approach restricts the range of functions q so that the problem becomes tractable,
- a popular variational approximation is to assume that q factorizes: $q(Z) = \prod_i q_i(Z_i)$.
The VEM algorithm:
- V-E step: $\theta$ is fixed and $\mathcal{L}$ is maximized over q, giving $\log q_j^*(Z_j) = \mathbb{E}_{i \neq j}[\log p(X, Z \mid \theta)] + c$,
- V-M step: $\mathcal{L}(q^*(Z); \theta)$ is then maximized over $\theta$.

The EM, VEM and VBEM algorithms
We now consider the Bayesian framework:
- we aim at estimating the posterior distribution $p(Z, \theta \mid X)$,
- we have here the relation $\log p(X) = \mathcal{L}(q(Z, \theta)) + \mathrm{KL}(q(Z, \theta) \,\|\, p(Z, \theta \mid X))$,
- we also assume that q factorizes over Z and $\theta$: $q(Z, \theta) = \prod_i q_i(Z_i)\, q_\theta(\theta)$.
The VBEM algorithm:
- VB-E step: $q_\theta(\theta)$ is fixed and $\mathcal{L}$ is maximized over the $q_i$: $\log q_j^*(Z_j) = \mathbb{E}_{i \neq j, \theta}[\log p(X, Z, \theta)] + c$,
- VB-M step: all $q_i(Z_i)$ are fixed and $\mathcal{L}$ is maximized over $q_\theta$: $\log q_\theta^*(\theta) = \mathbb{E}_Z[\log p(X, Z, \theta)] + c$.

The VBEM algorithm for RSM: the M step
The M step of the VBEM algorithm: the VBEM update for the distribution $q(\alpha_s)$ is
$\log q^*(\alpha_s) = \mathbb{E}_{Z, \alpha_{\setminus s}, \gamma, \Pi}[\log p(X, A, Z, \alpha, \gamma, \Pi)] + c = \sum_{k=1}^K \log(\alpha_{sk}) \left\{ \chi^0_{sk} + \sum_{i=1}^N \delta(s_i = s)\, \tau_{ik} - 1 \right\} + c$,
which is the functional form of a Dirichlet distribution:
$q(\alpha_s) = \mathrm{Dir}(\alpha_s; \chi_s)$, $s \in \{1, \dots, S\}$, where $\chi_{sk} = \chi^0_{sk} + \sum_{i=1}^N \delta(s_i = s)\, \tau_{ik}$, $k \in \{1, \dots, K\}$.

The VBEM algorithm for RSM: the M step
The M step of the VBEM algorithm: the VBEM updates for the distributions $q(\alpha_s)$, $q(\gamma_{rs})$ and $q(\Pi_{kl})$ are
- $q(\alpha_s) = \mathrm{Dir}(\alpha_s; \chi_s)$, $s \in \{1, \dots, S\}$,
- $q(\gamma_{rs}) = \mathrm{Beta}(\gamma_{rs}; a_{rs}, b_{rs})$, $(r, s) \in \{1, \dots, S\}^2$,
- $q(\Pi_{kl}) = \mathrm{Dir}(\Pi_{kl}; \Xi_{kl})$, $(k, l) \in \{1, \dots, K\}^2$,
where:
- $\chi_{sk} = \chi^0_{sk} + \sum_{i=1}^N \delta(s_i = s)\, \tau_{ik}$, $k \in \{1, \dots, K\}$,
- $a_{rs} = a^0_{rs} + \sum_{i,j:\, s_i = r,\, s_j = s} A_{ij}$,
- $b_{rs} = b^0_{rs} + \sum_{i,j:\, s_i = r,\, s_j = s} (1 - A_{ij})$,
- $\Xi_{klc} = \Xi^0_{klc} + \sum_{i \neq j} \delta(X_{ij} = c)\, \tau_{ik} \tau_{jl}$, $c \in \{1, \dots, C\}$.
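
These updates translate into a few matrix operations. A sketch in R, assuming responsibilities tau (an N x K matrix), the subgraph labels s, and user-chosen hyperparameters chi0, a0, b0 and Xi0; the helper name m_step is ours, not the package's.

```r
## One VBEM M step for RSM: update the variational hyperparameters.
m_step <- function(tau, s, X, A, chi0, a0, b0, Xi0) {
  S <- nrow(chi0); K <- ncol(chi0); C <- dim(Xi0)[3]
  ## chi_sk = chi0_sk + sum_{i: s_i = s} tau_ik
  chi <- chi0
  for (r in 1:S) chi[r, ] <- chi0[r, ] + colSums(tau[s == r, , drop = FALSE])
  ## Beta parameters from edge counts between subgraphs
  a <- a0; b <- b0
  for (r in 1:S) for (t in 1:S) {
    Art <- A[s == r, s == t, drop = FALSE]
    a[r, t] <- a0[r, t] + sum(Art)
    b[r, t] <- b0[r, t] + sum(1 - Art)   # simplification: also counts i = j when r = t
  }
  ## Xi_klc = Xi0_klc + sum_{i != j} delta(X_ij = c) tau_ik tau_jl
  Xi <- Xi0
  for (c in 1:C) {
    Ic <- 1 * (X == c)                   # diagonal is zero: no self-loops
    Xi[, , c] <- Xi0[, , c] + t(tau) %*% Ic %*% tau
  }
  list(chi = chi, a = a, b = b, Xi = Xi)
}

## Example usage with flat priors and the pam initialization from above,
## resetting the dimensions to those of the simulation.
S <- 2; K <- 3; C <- 3
tau0 <- diag(K)[init, ]                  # one-hot responsibilities
prm <- m_step(tau0, s, X, A,
              chi0 = matrix(1, S, K), a0 = matrix(1, S, S),
              b0 = matrix(1, S, S), Xi0 = array(1, c(K, K, C)))
```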

The VBEM algorithm for RSM: the E step
The E step of the VBEM algorithm: the VBEM update for the distribution $q(Z_i)$ is given by $q(Z_i) = \mathcal{M}(Z_i; 1, \tau_i)$, $i = 1, \dots, N$, with $\log q^*(Z_i) = \mathbb{E}_{Z_{\setminus i}, \alpha, \gamma, \Pi}[\log p(X, A, Z, \alpha, \gamma, \Pi)] + c$, which implies that
$\tau_{ik} \propto \exp\Bigl( \psi(\chi_{s_i k}) - \psi\bigl(\textstyle\sum_{l=1}^K \chi_{s_i l}\bigr) + \sum_{j \neq i} \sum_{c=1}^C \sum_{l=1}^K \delta(X_{ij} = c)\, \tau_{jl} \bigl(\psi(\Xi_{klc}) - \psi\bigl(\textstyle\sum_{u=1}^C \Xi_{klu}\bigr)\bigr) + \sum_{j \neq i} \sum_{c=1}^C \sum_{l=1}^K \delta(X_{ji} = c)\, \tau_{jl} \bigl(\psi(\Xi_{lkc}) - \psi\bigl(\textstyle\sum_{u=1}^C \Xi_{lku}\bigr)\bigr) \Bigr)$.
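
And the matching E step, under the same assumptions as the M-step sketch (digamma is R's $\psi$; e_step is again our helper name). It reads the current tau of the other nodes on the right-hand side, so in practice it is iterated to a fixed point, alternating with m_step.

```r
## One VBEM E step for RSM: update the responsibilities tau.
e_step <- function(tau, s, X, chi, Xi) {
  C <- dim(Xi)[3]
  ## psi(chi_{s_i k}) - psi(sum_l chi_{s_i l}), one row per node
  log_tau <- digamma(chi)[s, , drop = FALSE] - digamma(rowSums(chi))[s]
  for (c in 1:C) {
    M  <- digamma(Xi[, , c]) - digamma(apply(Xi, c(1, 2), sum))  # K x K
    Ic <- 1 * (X == c)
    log_tau <- log_tau + Ic %*% tau %*% t(M)    # edges i -> j (Xi_klc term)
    log_tau <- log_tau + t(Ic) %*% tau %*% M    # edges j -> i (Xi_lkc term)
  }
  tau <- exp(log_tau - apply(log_tau, 1, max))  # subtract row max for stability
  tau / rowSums(tau)                            # normalize each row of tau
}

tau1 <- e_step(tau0, s, X, prm$chi, prm$Xi)     # one E step after the M step above
```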
