Exponential Random Graph (p*) Models for Affiliation Networks

A thesis submitted in partial fulfillment of the requirements of the Postgraduate Diploma in Science (Mathematics and Statistics)
The University of Melbourne
by Peng Wang
Supervisor: Dr Ken Sharpe
October 31, 2006

Abstract

Statistical modeling of social networks as complex systems has always been, and remains, a challenge for social scientists. Exponential family models give us a convenient way of expressing local network structures whose counts are sufficient statistics for the corresponding parameters. Models of this kind, known as Exponential Random Graph Models (ERGMs) or p* models, have been developed since the 1980s. However, because the normalizing constant is intractable, pseudo-likelihood estimation methods have been applied in most studies. Recently, simulation-based MCMC maximum likelihood estimation techniques have been developed. Furthermore, recent advances in ERGM specification provide a much better chance of model convergence for large networks than the traditional Markov models. To date most work on ERGMs has focused on one-mode networks, and little has been done on applying maximum likelihood estimation to affiliation networks with two or more modes. This paper considers the application of MCMC maximum likelihood estimation to affiliation networks, using techniques similar to those in the latest specification for one-mode networks. We investigated features of the model by simulation, and compared the goodness of fit results obtained using the maximum likelihood and pseudolikelihood approaches. Examples used in this paper show that the ERGM with the newly specified statistics is a powerful tool for statistical analysis of affiliation networks.

Key words: exponential random graph (p*) models, affiliation networks, MCMC MLE, realization dependence assumption.

Acknowledgements

I wish to give my sincere thanks to my supervisor Dr Ken Sharpe for guiding me through the two years of my postgraduate diploma study. Thanks to Dr Malcolm Alexander for providing me with the interlocking director data used in this thesis. I would also like to thank Associate Professor Garry Robins and Professor Philippa Pattison. Thank you for introducing me to this field, and for providing me with generous support and valuable advice for my study. It has been an honour and a great pleasure to work with you. Thanks to all my friends in the honors year and the members of the MelNet social network analysis group. Finally, to Mum and Dad, and my beloved partner Yan.

Contents

1 Introduction
  1.1 Network Representations
  1.2 Local measures
    1.2.1 Nodal Degrees
    1.2.2 Local Clustering Coefficient
  1.3 Global measures
    1.3.1 Density
    1.3.2 Global Clustering Coefficient
    1.3.3 Degree Distribution
    1.3.4 Geodesic distribution
2 Exponential Random Graph (p*) Models
  2.1 Unconditional ERGMs
  2.2 Conditional ERGMs
  2.3 Model Interpretations
3 Model Specifications
  3.1 Bernoulli Random Graphs
  3.2 Markov Model
    3.2.1 Markov Model for Non-directed Graphs
    3.2.2 Markov Model for Directed Graphs
    3.2.3 Model for Bipartite Graphs
4 Simulation
5 Estimation
  5.1 Maximum Pseudolikelihood Estimation
  5.2 Markov Chain Monte Carlo Maximum Likelihood Estimation
6 New Specifications
  6.1 Limitations of Markov models
  6.2 Alternating k-stars
    6.2.1 Alternating k-stars for one-mode networks
    6.2.2 Alternating k-stars for Bipartite networks
    6.2.3 Simulation with alternating k-stars
  6.3 Alternating k-triangles
    6.3.1 Simulation with alternating k-triangle
  6.4 Alternating k-two-paths
    6.4.1 Alternating k-two-paths for one-mode networks
    6.4.2 Alternating k-two-paths for bipartite networks
    6.4.3 Simulation with alternating k-two-paths
  6.5 Limitations of the new specifications
7 Goodness of fit
8 Modeling Examples
  8.1 Southern Women
    8.1.1 Pseudo-Likelihood and Maximum-Likelihood Estimation Results
    8.1.2 Model Selection
  8.2 Interlocking directors
    8.2.1 Top 50 Financial Institutions (1996)
    8.2.2 Largest interlocked component from the top 500 (1996)
9 Conclusion & Discussion

1 Introduction

1.1 Network Representations

A network consists of nodes and of ties representing the relationships between the nodes. A pair of nodes that is either linked or not linked by a tie is referred to as a dyad, and a set of three nodes, linked or not linked, is called a triad. Depending on the context, we can have different kinds of nodes and different kinds of relationships defined on them, and therefore different kinds of networks. In a business network, nodes can be suppliers and consumers, and ties can be the purchasing activities; in a computer network, computer terminals and servers can be the nodes, and the network connections can be the ties. We can also have different networks among the same set of nodes, for example a friendship network and an advice network among people working in the same organization.

An affiliation network represents the association between two or more sets of nodes, where each set is a different kind of social entity. For example, in an interlocking director network, one set of nodes is the directors and the other is the companies, with ties representing directors sitting on company boards. The number of distinct sets of nodes in the network is the mode of the network. This paper focuses on two-mode networks, also called bipartite networks.

Networks can be represented by binary adjacency matrices. A one-mode network with $n$ nodes is an $(n, n)$ square matrix $X$; if there is a tie from node $i$ to node $j$, then the cell $X_{ij} = 1$, otherwise $X_{ij} = 0$. The values of cells $X_{ij}$ and $X_{ji}$ indicate the direction of the ties between $i$ and $j$; if the network is non-directed, the matrix is symmetric, i.e. $X_{ij} = X_{ji}$ for all $i, j \in \{1, \dots, n\}$. Ties in non-directed networks are often referred to as edges, and ties in directed networks are often called arcs. Figure 1 shows an example of the matrix representation of a non-directed network with 10 nodes.

Figure 1: Adjacency matrix representation of a network of size 10.

Using the matrix format, a bipartite network with $n$ nodes of type A and $m$ nodes of type B gives an $(n, m)$ rectangular matrix whose numbers of rows and columns equal the numbers of nodes in the two sets. If node $i$ in one set is associated with node $j$ in the other set, $X_{ij} = 1$, otherwise $X_{ij} = 0$. Figure 2 shows an example of the matrix representation of a network with 4 nodes in set A and 6 nodes in set B.

Figure 2: Adjacency matrix representation of a (4, 6) bipartite network.

Two one-mode networks can be derived from a bipartite network. For example, from a club membership bipartite network (where person $i$ is connected with club $j$ if $i$ is a member of $j$), we can derive a person-to-person network in which two people are tied if they are in the same club; a club-to-club network of shared members can be constructed in a similar way. Figure 3 shows an example of transforming the bipartite network in Figure 2 into two one-mode networks with four and six nodes. Since such a transformation can be carried out, one may analyze a bipartite network by looking at the two one-mode networks. However, if we omit the values of the ties when transforming a bipartite network into two one-mode networks, we lose the information about the number of nodes of the other type acting as ties between connected pairs in the one-mode networks.

Figure 3: Transformation of (4, 6) bipartite network to 2 one-mode networks.

One aspect of the statistical analysis of social networks focuses on the formation of network ties, and investigates the impact of local interactive processes on the global network structure. A good statistical model should capture the significance of various kinds of local processes and still be able to reproduce the observed network at the global level. Several local and global network properties that are often used as network measures are listed below. If our model generates a distribution of networks whose measures are consistent with those of the observed network, we may say it is a good model. These network measures can be calculated from the adjacency matrices.
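The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's own code: the membership matrix `B` is made up (the data behind Figure 2 is not reproduced here), and the function names `project` and `binarize` are hypothetical.

```python
def project(b):
    """Return the weighted row-by-row and column-by-column projections of b."""
    n, m = len(b), len(b[0])
    # Entry (i, j) counts the column nodes shared by row nodes i and j.
    rows = [[sum(b[i][k] * b[j][k] for k in range(m)) if i != j else 0
             for j in range(n)] for i in range(n)]
    # Entry (i, j) counts the row nodes shared by column nodes i and j.
    cols = [[sum(b[k][i] * b[k][j] for k in range(n)) if i != j else 0
             for j in range(m)] for i in range(m)]
    return rows, cols


def binarize(w):
    """Drop tie values, keeping only the presence or absence of a tie."""
    return [[1 if v > 0 else 0 for v in row] for row in w]


# Hypothetical (4, 6) membership matrix: rows are people, columns are clubs.
B = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
people, clubs = project(B)
print(people[0][1], binarize(people)[0][1])
```

In this made-up example persons 0 and 1 share two clubs, so the weighted projection records a tie value of 2 while the binarized projection records only 1; this is the information loss noted in the text.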

1.2 Local measures

1.2.1 Nodal Degrees

The degree of a node is the number of ties incident with the node. It ranges from 0 to $(n-1)$ for a network of size $n$. A node with degree 0 is referred to as an isolate. The nodes in a directed graph have both in-degrees and out-degrees. The in-degrees are measures of the popularity or receptivity of a node, and the out-degrees are measures of expansiveness, or the extent to which a node sends out ties. They are also the basic measurements for node degree centrality, i.e. an active node in the center of the network should have high degree. The degree of node $i$, denoted $d(i)$, can be calculated from the adjacency matrix $X$ by taking row or column sums.

For non-directed graphs,

$$d(i) = \sum_{j=1}^{n} x_{ji} = x_{+i} = \sum_{j=1}^{n} x_{ij} = x_{i+} \tag{1.1}$$

For directed graphs,

$$d_{in}(i) = \sum_{j=1}^{n} x_{ji} = x_{+i}, \qquad d_{out}(i) = \sum_{j=1}^{n} x_{ij} = x_{i+} \tag{1.2}$$

For bipartite graphs with $(N, M)$ nodes,

$$d_n(i) = \sum_{j=1}^{M} x_{ij} = x_{i+}, \qquad d_m(j) = \sum_{i=1}^{N} x_{ij} = x_{+j} \tag{1.3}$$

1.2.2 Local Clustering Coefficient

For a node $i$, the local clustering coefficient is defined as the proportion of pairs of nodes connected with node $i$ that are themselves connected. Let $\tau(i)$ denote the number of triangles (complete subgraphs of size three) that node $i$ is involved in, and $s_2(i)$ denote the number of two-stars or two-paths (three nodes connected by two ties) centered on node $i$. The local clustering coefficient $C_l(X)$ for non-directed graphs, averaged over the nodes, is then calculated as

$$C_l(X) = \frac{1}{n} \sum_{i=1}^{n} \frac{\tau(i)}{s_2(i)} \tag{1.4}$$

where

$$\tau(i) = \sum_{j=1}^{n} \sum_{k>j} x_{ij} x_{ik} x_{jk}, \qquad i \neq j,\ i \neq k \tag{1.5}$$

$$s_2(i) = \binom{d(i)}{2}, \qquad d(i) > 1 \tag{1.6}$$

For directed graphs, once the direction of ties is taken into account, a triad can form seven different triangles and six different two-stars, so that different local clustering coefficients can be obtained.
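As a rough illustration of equations (1.1) and (1.4) to (1.6) for a non-directed graph, the following Python sketch computes degrees and the averaged local clustering coefficient. The function names and the example graph are hypothetical, and the sketch assumes that nodes with degree below two, for which $s_2(i)$ is not defined in (1.6), contribute zero to the average.

```python
from math import comb


def degrees(x):
    """Row sums of a symmetric adjacency matrix, as in (1.1)."""
    return [sum(row) for row in x]


def triangles_at(x, i):
    """tau(i): number of triangles that include node i, as in (1.5)."""
    n = len(x)
    return sum(x[i][j] * x[i][k] * x[j][k]
               for j in range(n) for k in range(j + 1, n)
               if j != i and k != i)


def local_clustering(x):
    """Average of tau(i) / s_2(i) over all nodes, as in (1.4)."""
    d = degrees(x)
    total = 0.0
    for i in range(len(x)):
        s2 = comb(d[i], 2)  # number of two-stars centered on i, as in (1.6)
        if s2 > 0:          # assumption: nodes with d(i) < 2 contribute 0
            total += triangles_at(x, i) / s2
    return total / len(x)


# Hypothetical graph: a triangle on nodes 0, 1, 2 plus a pendant node 3.
X = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
print(degrees(X), local_clustering(X))
```

For this graph the per-node ratios are 1, 1, 1/3 and 0, so the average is 7/12.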

In bipartite graphs, the smallest closure that is more than a dyad is a diamond, which is a three-path closed by a tie, i.e. a four-cycle. Let $C_4(i)$ represent the number of four-cycles node $i$ is part of, and $L_3(i)$ be the number of three-paths that $i$ is involved in but not at either end. The local clustering coefficient for bipartite graphs can then be calculated as

$$C_l(X) = \frac{1}{n} \sum_{i=1}^{n} \frac{C_4(i)}{L_3(i)} \tag{1.7}$$

1.3 Global measures

1.3.1 Density

Density is the ratio of the number of ties that are present to the maximum possible number of ties for a network of given size. Let $t$ denote the number of ties in a network $X$ of size $n$. The density $D(X)$ of a non-directed network is

$$D(X) = \frac{t}{n(n-1)/2} = \frac{2t}{n(n-1)} \tag{1.8}$$

If the network is directed, then

$$D(X) = \frac{t}{n(n-1)} \tag{1.9}$$

For a bipartite network $X$ with $(n, m)$ nodes and $t$ ties,

$$D(X) = \frac{t}{nm} \tag{1.10}$$

1.3.2 Global Clustering Coefficient

The global clustering coefficient is a measure of the overall clustering of a network. As shown in expression (1.11), it is defined as the ratio between the number of triangles $\tau(X)$ and the number of two-stars $s_2(X)$. Since there are three two-stars in every triangle, the factor of three gives the coefficient a range of 0 to 1.

$$C_g(X) = \frac{3\,\tau(X)}{s_2(X)} \tag{1.11}$$

For bipartite networks the global clustering coefficient is defined as the ratio between the number of four-cycles $C_4(X)$ and the number of three-paths $L_3(X)$, as shown in expression (1.12). The factor of four plays the same role, since every four-cycle contains four three-paths.

$$C_g(X) = \frac{4\,C_4(X)}{L_3(X)} \tag{1.12}$$

1.3.3 Degree Distribution

For a network of size $n$, the degree distribution is the distribution of the degrees of the nodes in the network over the range $[0, \dots, n-1]$. Directed networks have both in-degree and out-degree distributions. An $(n, m)$ bipartite network has separate degree distributions for the two sets of nodes.
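The global measures in (1.8) and (1.10) to (1.12) can be checked on small examples with a sketch like the one below. The function names and example matrices are illustrative assumptions. The bipartite counts use two standard identities: a four-cycle is a pair of row nodes together with a pair of shared column nodes, and every three-path has exactly one middle tie.

```python
from math import comb


def density(x):
    """Non-directed density (1.8)."""
    n = len(x)
    t = sum(x[i][j] for i in range(n) for j in range(i + 1, n))
    return 2 * t / (n * (n - 1))


def global_clustering(x):
    """3 * triangles / two-stars for a non-directed graph (1.11)."""
    n = len(x)
    triangles = sum(x[i][j] * x[j][h] * x[i][h]
                    for i in range(n) for j in range(i + 1, n)
                    for h in range(j + 1, n))
    two_stars = sum(comb(sum(row), 2) for row in x)
    return 3 * triangles / two_stars if two_stars else 0.0


def bipartite_density(b):
    """Bipartite density (1.10)."""
    return sum(map(sum, b)) / (len(b) * len(b[0]))


def bipartite_global_clustering(b):
    """4 * four-cycles / three-paths for a bipartite graph (1.12)."""
    n, m = len(b), len(b[0])
    row_deg = [sum(row) for row in b]
    col_deg = [sum(b[i][j] for i in range(n)) for j in range(m)]
    # A four-cycle is a pair of rows together with a pair of shared columns.
    c4 = sum(comb(sum(b[i][k] * b[j][k] for k in range(m)), 2)
             for i in range(n) for j in range(i + 1, n))
    # Each three-path has exactly one middle tie (i, j).
    l3 = sum(b[i][j] * (row_deg[i] - 1) * (col_deg[j] - 1)
             for i in range(n) for j in range(m))
    return 4 * c4 / l3 if l3 else 0.0


# Hypothetical examples: a triangle with a pendant node, and K(2, 2).
X = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
K22 = [[1, 1], [1, 1]]
print(density(X), global_clustering(X), bipartite_global_clustering(K22))
```

The complete bipartite graph on 2 + 2 nodes is a single four-cycle containing four three-paths, so its coefficient reaches the upper bound of 1.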

1.3.4 Geodesic distribution

Define a path as a sequence of consecutive ties in a network; the geodesic is then the shortest path between two nodes. The geodesic distribution is the distribution of geodesic lengths over all pairs of nodes in the network; it has range $[1, \dots, \infty)$, where $\infty$ means that the pair of nodes cannot reach each other.

2 Exponential Random Graph (p*) Models

2.1 Unconditional ERGMs

Exponential random graph models, also called p* models, are a class of stochastic models which use local network structures to model the formation of network ties for a network with a fixed number of nodes. We define a network space $\mathcal{X}$, which contains all networks with a given number of nodes $n$. The network with $n$ nodes can then be represented by a random variable $X$, which itself is a set of $n(n-1)$ tie variables $X_{ij}$, i.e. $X = \{X_{ij}\}$. A realization of $X$ is denoted by $x = \{x_{ij}\}$. Given the values of all other tie variables, two network tie variables are defined as neighbours if they are conditionally dependent, i.e. one tie's existence depends on the other tie's existence. A neighbourhood of mutually conditionally dependent tie variables then forms a local network configuration. Various local interaction processes can be represented by these network configurations, based on different tie dependency, or neighbourhood, assumptions. From the Hammersley-Clifford theorem (Besag, 1974)[3], a model for $X$ has a form determined by its neighbourhood. This approach leads to ERGMs, or p* models, introduced by Frank and Strauss (1986)[5] and Wasserman and Pattison (1996)[31]. Depending on the underlying neighbourhood assumptions, an ERGM assigns probabilities to $X$ based on a set of counts of regular local configurations which are sufficient statistics for their parameters. ERGMs have the following general form:

$$\Pr(X = x) = \frac{1}{\kappa(\theta)} \exp\Big\{ \sum_{p} \theta_p z_p(x) \Big\} \tag{2.1}$$

where

$z_p(x)$ is the network statistic of neighbourhood type $p$,
$\theta_p$ is the parameter associated with $z_p(x)$,
$\kappa(\theta) = \sum_{x \in \mathcal{X}} \exp\big\{ \sum_p \theta_p z_p(x) \big\}$ is a normalizing constant.

The network statistic $z_p(x)$ has the typical form $z_p(x) = \sum \prod_{X_{ij} \in p} x_{ij}$, where the product runs over the tie variables of one configuration of type $p$ and the sum runs over all such configurations. The normalizing constant $\kappa(\theta)$ is a sum over the entire graph space $\mathcal{X}$, which contains $2^{n(n-1)}$ possible graphs. Without Monte Carlo simulation, the intractable normalizing constant $\kappa(\theta)$ makes maximum likelihood estimation of the model very difficult, even for networks with a small number of nodes.
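To give a feel for why $\kappa(\theta)$ is intractable, the sketch below evaluates it by brute force for a tiny non-directed graph with an edge and a triangle statistic. The choice of statistics and the function name `kappa` are assumptions made for illustration. For non-directed graphs the space has $2^{n(n-1)/2}$ members rather than the $2^{n(n-1)}$ of the directed case, which is 64 graphs for $n = 4$ but already about $3.5 \times 10^{13}$ for $n = 10$.

```python
from itertools import combinations, product
from math import exp


def kappa(n, theta_edge, theta_triangle):
    """Brute-force normalizing constant over all non-directed graphs on n nodes."""
    dyads = list(combinations(range(n), 2))
    triads = list(combinations(range(n), 3))
    total = 0.0
    # Enumerate every assignment of 0/1 to the n(n-1)/2 dyads.
    for bits in product((0, 1), repeat=len(dyads)):
        edges = {d for d, bit in zip(dyads, bits) if bit}
        z_e = len(edges)
        z_t = sum(1 for a, b, c in triads
                  if (a, b) in edges and (a, c) in edges and (b, c) in edges)
        total += exp(theta_edge * z_e + theta_triangle * z_t)
    return total


print(kappa(4, 0.0, 0.0))  # with all parameters zero, every graph has weight 1
```

With only an edge parameter the sum factorizes into $(1 + e^{\theta})^{n(n-1)/2}$, which gives a simple check on the enumeration; no such factorization is available once dependence terms such as triangles are included.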

2.2 Conditional ERGMs

The probability of a graph under given conditions can be modeled using conditional ERGMs. For example, we may have a network with a fixed number of ties; the nodes may have a maximum degree; or the nodes in the network may be kept fully connected, i.e. the network is kept as one component. These conditions arise from the properties of the network data, or from how the network data are collected. For example, if a friendship network is derived from the survey question "List up to five people whom you consider as your friend", then the maximum out-degree a node can have is five. Let $Q$ denote the condition that the network must satisfy. The conditional ERGM can be expressed as

$$\Pr(X_Q = x_Q) = \Pr(X = x \mid Q) = \frac{1}{\kappa(\theta)} \exp\Big\{ \sum_{p} \theta_p z_p(x) \Big\} \tag{2.2}$$

The graph distribution generated by such a model is a conditional graph distribution. The size of the sample graph space for such a distribution is reduced from $\mathcal{X}$ to $(\mathcal{X} \mid Q)$. With a smaller graph space, it may be easier to obtain the maximum likelihood estimates of the parameters.

2.3 Model Interpretations

For a tie variable $X_{ij}$ in a given network $X$, let $C_{ij}$ denote the complement of $\{X_{ij}\}$, let $x^{+}$ denote the graph with $x_{ij} = 1$, and let $x^{-}$ denote the graph with $x_{ij} = 0$. The conditional distribution of the tie variable $X_{ij}$ is then given by

$$\operatorname{logit}\{\Pr(X_{ij} = 1 \mid C_{ij})\} = \sum_{p} \theta_p u_p(x_{ij}) \tag{2.3}$$

where $u_p(x_{ij})$ is the change statistic of type $p$, i.e. the change in $z_p$ when $x_{ij}$ is changed from 0 to 1 with the rest of the graph held fixed:

$$u_p(x_{ij}) = z_p(x^{+}) - z_p(x^{-}) \tag{2.4}$$

Expression (2.3) gives the log-odds of a tie between nodes $i$ and $j$, conditional on the rest of the network.

3 Model Specifications

Different ERGM specifications have been developed from different neighbourhood assumptions, ranging from the simplest Bernoulli random graph assumption of Holland and Leinhardt (1981)[11] to the most recent realization-dependent random graph assumption of Pattison and Robins (2002)[13].

3.1 Bernoulli Random Graphs

The simplest ERGM is the Bernoulli model, which has only a density effect for a non-directed graph. It is based on the simplest neighbourhood assumption, namely that all tie variables $X_{ij}$ are independent. Ties are equiprobable in a graph; there is only one parameter, for the edge effect, which controls the density of the network. We assume homogeneity, i.e. that the edge effect is the same across the entire network.

$$\Pr(X = x) = \frac{1}{\kappa} \exp\{\theta z_e(x)\} \tag{3.1}$$

where $z_e(x)$ is the total number of edges in the graph, and $\theta$ is the density parameter. Graphs generated by a Bernoulli model are called Bernoulli random graphs; they feature low clustering and short path lengths. Figure 4 shows an example of a Bernoulli graph on 30 nodes in which the probability of an edge being present is $\varphi = \Pr(X_{ij} = 1) = 0.1$.

Figure 4: A Bernoulli graph with $\varphi = 0.1$

The relationship between $\theta$ and $\varphi$ is

$$\theta = \operatorname{logit}(\varphi) \tag{3.2}$$

The ML estimate of $\theta$ can be obtained from the density of the graph $D(x)$:

$$\hat\theta = \log\left( \frac{D(x)}{1 - D(x)} \right) \tag{3.3}$$

A negative value of $\theta$ produces graphs with expected density less than 0.5.

3.2 Markov Model

The Bernoulli model is not particularly interesting and is not adequate for representing social networks, as evidence in the social science literature shows that there is much more to social networks than density. The Markov neighbourhood assumption, under which all ties sharing a node are conditionally dependent on each other, was introduced by Frank and Strauss (1986)[5]. Markov models are based on this assumption.
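Returning briefly to the Bernoulli model of Section 3.1, relations (3.2) and (3.3) are simple enough to verify numerically. The sketch below is illustrative only; the helper names and the example graph are assumptions.

```python
from math import exp, log


def logit(p):
    """Log-odds of a probability p in (0, 1), as in (3.2)."""
    return log(p / (1 - p))


def inverse_logit(theta):
    """Edge probability implied by a density parameter theta."""
    return 1 / (1 + exp(-theta))


def density(x):
    """Non-directed density of a symmetric adjacency matrix."""
    n = len(x)
    t = sum(x[i][j] for i in range(n) for j in range(i + 1, n))
    return 2 * t / (n * (n - 1))


def bernoulli_mle(x):
    """ML estimate of theta for the Bernoulli model, as in (3.3)."""
    return logit(density(x))


# Hypothetical graph with 4 of the 6 possible edges present.
X = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
print(bernoulli_mle(X), logit(0.1))
```

The example graph has density 2/3, so the estimate is $\log 2 > 0$; the edge probability 0.1 used for Figure 4 corresponds to a negative parameter. The estimate is undefined for empty or complete graphs, where the density is 0 or 1.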

3.2.1 Markov Model for Non-directed Graphs

For non-directed graphs, the Markov dependency assumption implies graph statistics that include stars of different sizes and triangles; hence the Markov model has parameters for the configurations shown in Figure 5.

Figure 5: Configurations for Markov Models.

The Markov model for non-directed graphs is

$$\Pr(X = x) = \frac{1}{\kappa(\theta)} \exp\left( \theta z_e(x) + \sum_{k=2}^{n-1} \sigma_k z_{s_k}(x) + \tau z_t(x) \right) \tag{3.4}$$

where $\theta$ is the parameter for the edge statistic

$$z_e(x) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{ij} \tag{3.5}$$

$\sigma_k$ is the parameter for the statistic of stars of size $k$, or k-stars,

$$z_{s_k}(x) = \sum_{i=1}^{n} \binom{x_{i+}}{k} \tag{3.6}$$

and $\tau$ is the parameter for the triangle statistic

$$z_t(x) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \sum_{h=j+1}^{n} x_{ij} x_{jh} x_{ih} \tag{3.7}$$

3.2.2 Markov Model for Directed Graphs

A Markov model for directed graphs would include statistics and corresponding parameters for arcs, reciprocal ties, in-stars and out-stars, and other triad configurations as shown in Figure 6. The labels for the triads in the figure give, in order, the numbers of reciprocal ties, non-reciprocal ties and empty ties. The additional character further distinguishes configurations: T means transitive, C means cyclic, and U or D means upward from the reciprocal tie or downward to the reciprocal tie. For example, 120U stands for triads with one reciprocal tie, two non-reciprocal ties and zero empty ties, in which the triangle points upwards.

Figure 6: Configurations for Directed Markov Models.

With Markov models, one can explicitly capture the tendency to form stars, which relates to the popularity and expansiveness of nodes, as well as the clustering and balance effects of social networks. Simulation studies of Markov models by Robins, Pattison and Woolcock (2005)[22] show that Markov graphs have a much higher clustering effect than Bernoulli graphs when a positive triangle parameter is included. Figure 7 shows an example of a Markov graph on 30 nodes with a positive triangle parameter. Notice that there are many more triangles in this network than in the Bernoulli random network of Figure 4, which has the same density.

Figure 7: A Markov random graph

3.2.3 Model for Bipartite Graphs

As bipartite graphs cannot form triangles, models satisfying the Markov assumption have only density and star configurations. The stars are of two different types, corresponding to the two sets of nodes; we label them $S_P$ for People-Stars and $S_A$ for Association-Stars. The Markov assumption has the limitation that it cannot capture the basic closure in bipartite networks, a four-cycle. A simulation study of interlocking directors by Robins and Alexander (2004)[17] introduced two further configurations, three-paths $L_3$ and four-cycles $C_4$, to reflect features of bipartite networks. Four-cycles are the simplest local closures, and represent the strength of ties after transformation to one-mode networks. That information about tie strength cannot be captured by binary one-mode adjacency matrices; hence $C_4$ should be included in the ERGM for bipartite networks. Three-paths represent a local structure that could potentially be closed by another tie to form a four-cycle. For a bipartite network with given density, more three-paths and fewer four-cycles could shorten the average path length. A typical ERGM for bipartite networks will include the local configurations shown in Figure 8.

Figure 8: Configurations for Bipartite Graph Models.

The four-cycle and three-path configurations satisfy the realization dependence assumption of Pattison and Robins (2004)[14]. This assumption is discussed in detail in Section 6 (New Specifications), as the new specifications for ERGMs are based on it.

4 Simulation

There are many different strategies for simulating exponential random graph models (Snijders 2002)[25]. The strategy used here is based on the Metropolis-Hastings sampling algorithm, conducted as follows:

1. Start with a given graph $x$, which can be any graph within the state space $\mathcal{X}$ of the graph distribution.

2. Select a pair of nodes $i$ and $j$ at random, and add or remove the tie between them to form a candidate graph $x'$, such that $x'_{ij} = 1 - x_{ij}$.

3. Compute the change statistic of each type $p$, as defined in equation (2.4),

$$u_p(x_{ij}) = z_p(x^{+}) - z_p(x^{-}) = \pm\,[z_p(x') - z_p(x)] \tag{4.1}$$

where the sign is positive when the tie is added and negative when it is removed. The candidate graph $x'$ is accepted with probability $\min(1, r)$, where

$$r = \frac{\Pr(X = x')}{\Pr(X = x)} = \exp\Big\{ \pm \sum_{p} \theta_p u_p(x_{ij}) \Big\} \tag{4.2}$$

with the same sign convention as in (4.1).
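The three steps above can be sketched as follows for a non-directed edge-triangle Markov model. This is a minimal sketch under stated assumptions rather than the sampler used in the thesis: the function names, the restriction to two statistics and the empty starting graph are all illustrative choices. The triangle change statistic for dyad $(i, j)$ is the number of nodes tied to both $i$ and $j$.

```python
import random
from math import exp


def change_stats(x, i, j):
    """(edge, triangle) change statistics z(x+) - z(x-) for dyad (i, j)."""
    shared = sum(x[i][h] * x[j][h] for h in range(len(x)) if h != i and h != j)
    return (1, shared)


def simulate(n, theta, steps, seed=0):
    """Run a single-tie-toggle Metropolis-Hastings chain from the empty graph."""
    rng = random.Random(seed)
    x = [[0] * n for _ in range(n)]
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)           # step 2: pick a dyad at random
        u = change_stats(x, i, j)                # step 3: change statistics
        sign = 1 if x[i][j] == 0 else -1         # + for adding, - for removing
        log_r = sign * sum(t * s for t, s in zip(theta, u))
        # Accept the candidate with probability min(1, r), r = exp(log_r).
        if log_r >= 0 or rng.random() < exp(log_r):
            x[i][j] = x[j][i] = 1 - x[i][j]
    return x


graph = simulate(n=10, theta=(-1.0, 0.2), steps=5000)
print(sum(map(sum, graph)) // 2, "edges")
```

Only the ratio $r$ is needed, so the normalizing constant never has to be evaluated, and each step touches one row pair instead of recomputing the statistics of the whole graph.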

The simulation establishes a Markov chain through the state space of graphs with a given number of nodes. This strategy has the advantage that the normalizing constant $\kappa$ cancels in the ratio, and calculating the change statistics $u_p$ consumes much less computing power than recalculating the graph statistics for every candidate graph. To generate a graph distribution that is independent of the starting graph, an initial burn-in time is required, and the graphs generated during the burn-in should be discarded. The length of the burn-in depends on how different the starting graph is from the true graph distribution defined by the model. The simulation method forms the basis for exploring the properties of various model specifications and the effect of a change in parameter values for a specified model. Markov chain Monte Carlo maximum likelihood estimation relies on simulation, and simulation is also used to test the goodness of fit of the model.

5 Estimation

5.1 Maximum Pseudolikelihood Estimation

Maximum likelihood estimation is difficult for exponential random graph models because calculation of the normalizing constant is intractable. To avoid the need to calculate the constant, a pseudolikelihood estimation method was proposed by Strauss and Ikeda (1990)[29]. Instead of maximizing the original likelihood function, a logit model is fitted to each tie variable conditional on the rest of the network, using standard logistic regression methods. The maximum pseudolikelihood estimator (MPE) is the value of $\theta$ that maximizes the pseudolikelihood function

$$PL(\theta) = \prod_{i \neq j} \Pr(x_{ij} \mid C_{ij}) \tag{5.1}$$

where $C_{ij}$ denotes the complement of $x_{ij}$, which includes all $x_{kl}$ such that $(k, l) \neq (i, j)$. By changing each dyad $x_{ij}$ to $(1 - x_{ij})$ in the given network, the logistic regression is performed on the change statistics of the various local configurations included in the model. If our observed graph has size $n$, we will have $n(n-1)$ sets of change statistics.
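As an illustration of (5.1), the sketch below evaluates the log-pseudolikelihood of a non-directed edge-triangle model by applying the logistic form (2.3) to every dyad. The model, the function names and the example graph are assumptions made for illustration. A non-directed graph contributes $n(n-1)/2$ dyads rather than the $n(n-1)$ of the directed case.

```python
from math import exp, log


def change_stats(x, i, j):
    """(edge, triangle) change statistics for dyad (i, j)."""
    shared = sum(x[i][h] * x[j][h] for h in range(len(x)) if h != i and h != j)
    return (1, shared)


def log_pseudolikelihood(x, theta):
    """Sum over dyads of log Pr(x_ij | rest of graph) under the logit model."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            eta = sum(t * u for t, u in zip(theta, change_stats(x, i, j)))
            # log of exp(eta * x_ij) / (1 + exp(eta))
            total += x[i][j] * eta - log(1.0 + exp(eta))
    return total


# Hypothetical graph: a triangle on nodes 0, 1, 2 plus a pendant node 3.
X = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
print(log_pseudolikelihood(X, (log(2), 0.0)))
```

With only the edge term the dyads really are independent, so the pseudolikelihood coincides with the likelihood and is maximized at the logit of the density, here $\log 2$. In practice the maximization over all parameters would be handed to a logistic regression routine with the change statistics as predictors.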
The MPE is the same as the MLE if the dyads of the network are conditionally independent. However, this assumption is rarely satisfied in social networks; hence the standard errors from pseudolikelihood estimation (PLE) do not apply, and the estimates can be quite different from the MLE. This can be assessed by comparing the observed network with the graph distribution simulated from the PLE result. The modeling examples in Section 8 include an example of PLE and its goodness of fit on an observed network.

5.2 Markov Chain Monte Carlo Maximum Likelihood Estimation

Maximum likelihood estimation procedures for exponential random graph (p*) models were proposed by Snijders (2002)[25], based on the stochastic approximation method of Robbins and Monro (1951)[16]. The maximum likelihood estimate (MLE) $\hat\theta$ generates a graph distribution $X$ whose expected graph statistics equal the observed graph statistics:

$$\mu(\hat\theta) = E[z(X) \mid \hat\theta] = z(y) \tag{5.2}$$

where $z(x)$ is a vector of graph statistics, and $y$ is the observed graph. The moment equation (5.2) can be solved using the Newton-Raphson iterative approximation

$$\hat\theta_{n+1} = \hat\theta_n - \operatorname{cov}_\theta^{-1}(\hat\theta_n)\,\big(\mu(\hat\theta_n) - z(y)\big) \tag{5.3}$$

where

$$\operatorname{cov}_\theta(\hat\theta_n) = \operatorname{cov}\big(z(X) \mid \hat\theta_n\big) \tag{5.4}$$

is the covariance matrix of the graph statistics; its inverse is the asymptotic covariance matrix of the ML estimator. Both $\mu(\theta)$ and $\operatorname{cov}_\theta(\hat\theta_n)$, as given in equations (5.2) and (5.4), are intractable for big networks; hence the means of sample statistics from Monte Carlo simulations are used to approximate these values. The original Robbins-Monro algorithm proposed an iterative parameter updating strategy given by

$$\hat\theta_{n+1} = \hat\theta_n - a_n D_{\hat\theta_n}^{-1} \big( Z_{\hat\theta_n} - z(y) \big) \tag{5.5}$$

where $a_n$ is a gain sequence that converges to 0, $Z_{\hat\theta_n}$ is a random variable with the conditional distribution of $Z_\theta$ given $\theta = \hat\theta_n$, and $D_{\hat\theta_n}$ is a consistent estimator of $D_\theta$.

The overall estimation algorithm consists of three phases.

1. Starting with an initial guess $\theta_0$, the first phase simulates a small number $N_1$ of sample graphs. Denote the sample graph distribution by $X_0$. If there are $m$ parameters in the parameter vector $\theta$, then the $m \times m$ variance-covariance derivative matrix $D_{\theta_0}$ can be derived as

$$\hat D_{\theta_0} = \frac{1}{N_1} \big[z(y) - E(z(X_0) \mid \theta_0)\big]\big[z(y) - E(z(X_0) \mid \theta_0)\big]^{T} \tag{5.6}$$

2. The second phase contains $L$ subphases. Within each subphase $l$, parameter values are updated using the formula

$$\hat\theta_{n+1} = \hat\theta_n - a_l \hat D^{-1} \big( Z_{\hat\theta_n} - z(y) \big) \tag{5.7}$$

where $Z_{\hat\theta_n}$ is based on $S$ simulated graph samples with the parameter value $\hat\theta_n$.

Maximum likelihood estimation requires independent samples. To make the simulated samples close to independent, $w$ simulation iterations are run before each sample is collected. This can be computationally costly. From experience, adequate performance is achieved with

$$w = c \, D(y)\,(1 - D(y))\, o^2 \tag{5.8}$$

where $c$ is a constant, $o$ is the number of nodes in the observed network $y$, and $D(y)$ is the density of the network.

Within each subphase, the number of simulated graph samples $S$ must be greater than a lower bound $N_l^{-}$. The subphase is terminated when the sum of successive products $Q_l$ becomes negative, where

$$Q_l = \sum_{s=2}^{S} \big[z(x_s) - z(y)\big]\big[z(x_{s-1}) - z(y)\big] \tag{5.9}$$

A negative $Q_l$ indicates that the model is converging. An upper bound $N_l^{+}$ also enforces subphase termination, since $Q_l$ may never become negative. From experience, the values

$$N_l^{-} = (7 + p)\, 2^{4l/3}, \qquad N_l^{+} = N_l^{-} + 200 \tag{5.10}$$

where $p$ is the number of parameters, have been found to lead to adequate performance.

At the end of each subphase, the mean of all updated parameter values is taken as the starting parameter value for the next subphase $(l + 1)$. The newer subphase simulates more samples, and the gain factor $a_{l+1}$ is reduced.

3. The third phase repeats the simulations of phase one, but based on the final estimated parameters $\hat\theta$ from the second phase. A large number of simulation iterations are carried out to check whether $\hat\theta$ generates the expected graph distribution, centered at the observed network. For each of the statistics, a t-ratio is calculated as

$$t_p = \frac{z_p(y) - \hat E(z_p(X) \mid \hat\theta)}{\hat\sigma(z_p(X) \mid \hat\theta)} \tag{5.11}$$

where $X$ is the graph distribution simulated with parameter $\hat\theta$, and $y$ is the observed network. $\hat\sigma$ is the estimated standard error, calculated as the square root of the corresponding diagonal element of the estimated covariance matrix $D_{\hat\theta}$. If $|t_p| \le 0.1$, the approximation may be considered to have converged.

This estimation algorithm has been implemented in the program SIENA (Snijders, Steglich, Schweinberger and Huisman (2005)[26]) and the program PNet (Wang, Robins and Pattison (2006)[30]). Another program, statnet (Handcock, Butts, Hunter, Goodreau and Morris (2006)[9]), is implemented in the R environment and uses a different algorithm, based on Geyer and Thompson (1992)[7], to estimate similar models. For the purpose of MLE of p* models for bipartite networks, the program BPNet has been implemented as an extension of PNet.
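The phase-three convergence check in (5.11) amounts to a few lines of arithmetic once simulated statistics are available. The sketch below is a hypothetical illustration and is not taken from SIENA, PNet or statnet; it assumes that $\hat\sigma$ is estimated by the sample standard deviation of each simulated statistic, and the numbers are made up.

```python
from statistics import mean, stdev


def t_ratios(observed, simulated):
    """t-ratio (5.11) for each statistic.

    observed:  list of p observed statistics z_p(y)
    simulated: list of p-length lists, one per simulated graph
    """
    ratios = []
    for p, z_obs in enumerate(observed):
        draws = [z[p] for z in simulated]
        # Assumption: sigma-hat is the sample standard deviation of the draws.
        ratios.append((z_obs - mean(draws)) / stdev(draws))
    return ratios


def has_converged(observed, simulated, threshold=0.1):
    """True when every |t_p| is at most the threshold."""
    return all(abs(t) <= threshold for t in t_ratios(observed, simulated))


# Made-up statistics (edges, triangles) from four simulated graphs.
sims = [[9, 3], [11, 5], [10, 4], [10, 4]]
print(has_converged([10, 4], sims), has_converged([11, 4], sims))
```

A statistic that never varies across the simulated graphs would give a zero standard deviation, so a real implementation would need to handle that case separately.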

18 6 New Specifications 6.1 Limitations of Markov models The Markov models described in section 3.2 have problems with achieving convergence. If the parameter associated with the number of triangles and k-stars (k 2) are positive, then changing some of the tie x ij s may lead to large increases in the change statistics for other tie variables x kl. As simulation proceeds, this can lead to a near complete graph with very little probability of getting back to sparse networks. If we change the parameter values to negative, then the model generates graphs that are near empty. To illustrate this behavior, a simulation was carried out on a nondirected network with 50 nodes. We simulated the edge-triangle Markov model with the edge parameter fixed at θ = 3.0, and the triangle parameter τ changed from -1 to 2 in steps of All star parameters in this simulation were kept at 0. For each parameter set (θ, τ), 100, 000 simulated graphs were cut off as the initial burn-in, and every 10,000th sample graph was taken from another 100, 000 simulated graphs, so there were 10 graphs for each set of parameters representing the corresponding graph distribution. The number of edges z e in each simulated graphs is plotted against the triangle parameter in Figure 9. The blue diamonds are from simulations started with an empty graph, and the red crosses are from simulation starting with a complete graph. The plot shows that when τ (0+, 1), and depending on the starting density, the model generates a two-region graph distribution that is close to either empty graphs or complete graphs. Figure 9: Simulation: θ and τ Markov model The model is near degenerate since it puts too much weight on near complete and near empty graphs. Most human social networks are denser than empty networks and sparser than complete networks, and an edge and triangle Markov model is seen to be a poor one for such contexts. Handcock (2003)[10] has a theoretical analysis of this issue, and Robins, Pattison and 17

19 Woolcock(2005) [22] show some simulated degenerate graph examples using Markov models. For bipartite networks, as described in section 3.2.3, the Markov assumption is not capable of capturing three-paths L 3 and the basic closure four-cycles C 4. By including parameters for L 3 and C 4 in the model, we have a model that captures the closure effect. However, this does not solve the degeneracy issue as the model still puts large weight on near complete or near empty bipartite graphs, since the change statistics for C 4, L 3 or large stars can be big. To avoid large changes in triangles and k-star (k 2) statistics, a set of newly specified ERGMs were proposed by Snijders, Pattison, Robins and Handcock (2006)[27]. Robins, Pattison, Kalish and Lusher (2006)[20], Robins, Snijders, Wang, Handcock and Pattison (2006)[23], Hunter (2006)[12] and Goodreau (2006)[8] provide further discussions and modeling examples using the new specifications. The new specifications model all (n 2) parameters for k-stars (k 2) as a function of a single parameter. Since all k-stars up to size (n 1) are modeled by this single parameter, it is effectively a parameter for the degree distribution. The new specification also introduced two new graph statistics k-triangles and k-two-paths based on a more general type of dependence assumption introduced by Pattison and Robins (2002)[13] and further discussed in Pattison and Robins (2004)[14] called the partial conditional dependency assumption, also known as the realization dependence assumption following Baddeley and Möller (1986)[2]. The following sections give a detailed description of the new specifications, and their extensions to bipartite networks. 6.2 Alternating k-stars Alternating k-stars for one-mode networks From the Markov model defined in equation (3.4), one can model stars up to size (n 1). The model puts large weights on big stars, or nodes with high degree, which causes the degeneracy problem. 
The new specification uses a single parameter for the entire degree distribution by introducing a weight parameter $\lambda_s$, $\lambda_s \ge 1$, which dampens the effect of large changes in the statistics of large stars. The weights of stars also have alternating signs, so that the positive weights of even k-stars are balanced by the negative weights of odd k-stars. The new statistic, known as alternating k-stars with parameter $\lambda_s$, can be expressed as

$$
z_s(\lambda_s, x) = z_{s_2}(x) - \frac{z_{s_3}(x)}{\lambda_s} + \frac{z_{s_4}(x)}{\lambda_s^{2}} - \cdots + (-1)^{n-1} \frac{z_{s_{n-1}}(x)}{\lambda_s^{n-3}}
= \sum_{k=2}^{n-1} (-1)^{k} \frac{z_{s_k}(x)}{\lambda_s^{k-2}}
\tag{6.1}
$$

Denote the degree of node i in network x by $d_x(i)$. Then each of the statistics for stars of size k, as defined in equation (3.6), can be expressed as

$$
z_{s_k}(x) = \sum_{i=1}^{n} \binom{x_{i+}}{k} = \sum_{i=1}^{n} \binom{d_x(i)}{k}
\tag{6.2}
$$

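Expression 6.1 can then be written as

$$
\begin{aligned}
z_s(\lambda_s, x) &= \sum_{k=2}^{n-1} (-1)^{k} \frac{z_{s_k}(x)}{\lambda_s^{k-2}}
= \lambda_s^{2} \sum_{k=2}^{n-1} \left(-\frac{1}{\lambda_s}\right)^{k} \sum_{i=1}^{n} \binom{d_x(i)}{k} \\
&= \lambda_s^{2} \sum_{i=1}^{n} \sum_{k=2}^{n-1} \left(-\frac{1}{\lambda_s}\right)^{k} \binom{d_x(i)}{k} \\
&= \lambda_s^{2} \sum_{i=1}^{n} \left\{ \sum_{k=0}^{n-1} \left(-\frac{1}{\lambda_s}\right)^{k} \binom{d_x(i)}{k} - 1 + \frac{d_x(i)}{\lambda_s} \right\}
\end{aligned}
\tag{6.3}
$$

Applying the binomial formula then gives

$$
z_s(\lambda_s, x) = \lambda_s^{2} \sum_{i=1}^{n} \left\{ \left(1 - \frac{1}{\lambda_s}\right)^{d_x(i)} + \frac{d_x(i)}{\lambda_s} - 1 \right\}
\tag{6.4}
$$

When $\lambda_s = 1.0$, expression 6.4 simplifies to

$$
z_s(\lambda_s, x) = 2 z_e(x) - n + \sum_{i=1}^{n} I\{d_x(i) = 0\}
\tag{6.5}
$$

where $z_e(x)$ is the number of edges, and I is a binary indicator function such that

$$
I\{d_x(i) = 0\} =
\begin{cases}
1, & \text{if } d_x(i) = 0 \\
0, & \text{otherwise}
\end{cases}
\tag{6.6}
$$

As defined in equation (2.4), the change statistic for alternating k-stars is calculated based on the number of alternating (k-1)-stars in the reduced graph $x^{-}$ where the tie $x_{ij} = 0$. For node i the change statistic for the alternating k-stars is

$$
u_{s_i}(\lambda_s, x_{ij}) = \sum_{k=1}^{n-2} \left(-\frac{1}{\lambda_s}\right)^{k-1} \binom{d_{x^{-}}(i)}{k}
= \lambda_s \left\{ 1 - \left(1 - \frac{1}{\lambda_s}\right)^{d_{x^{-}}(i)} \right\}
\tag{6.7}
$$

Similarly, for node j,

$$
u_{s_j}(\lambda_s, x_{ij}) = \lambda_s \left\{ 1 - \left(1 - \frac{1}{\lambda_s}\right)^{d_{x^{-}}(j)} \right\}
\tag{6.8}
$$

Combining (6.7) and (6.8) gives the following formula for the alternating k-star change statistic,

$$
u_s(\lambda_s, x_{ij}) = \lambda_s \left\{ 2 - \left(1 - \frac{1}{\lambda_s}\right)^{d_{x^{-}}(i)} - \left(1 - \frac{1}{\lambda_s}\right)^{d_{x^{-}}(j)} \right\}
\tag{6.9}
$$

When $\lambda_s = 1.0$, (6.9) simplifies to

$$
u_s(\lambda_s, x_{ij}) = I\{d_{x^{-}}(i) > 0\} + I\{d_{x^{-}}(j) > 0\},
\tag{6.10}
$$

where I is a binary indicator function.

By assigning alternating signs to the stars, we assume that the parameters for stars of different sizes also have alternating signs. Let $\sigma$ denote the parameter for the alternating k-stars statistic. The parameters for each individual star of size k, denoted by $\sigma_k$, can be derived from $\sigma$ by

$$
\sigma_{k+1} = -\frac{\sigma_k}{\lambda_s}, \quad \text{where } \sigma_2 = \sigma,\ k \ge 2
\tag{6.11}
$$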
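The derivation from (6.1) to (6.4), and the change statistic (6.9), can be checked numerically on a small graph. The following is a minimal sketch rather than the implementation used in the thesis: the function names and the adjacency-matrix representation (a symmetric list of 0/1 lists) are assumptions made for illustration. It computes the alternating k-star statistic both by the direct alternating sum and by the closed form, and evaluates the change statistic from the degrees in the graph with the tie removed.

```python
from math import comb


def degrees(adj):
    """Degree d_x(i) of every node in a symmetric 0/1 adjacency matrix."""
    return [sum(row) for row in adj]


def alt_kstar_direct(adj, lam):
    """Alternating k-star statistic as the alternating sum in (6.1)."""
    n = len(adj)
    deg = degrees(adj)
    total = 0.0
    for k in range(2, n):  # k = 2, ..., n-1
        z_sk = sum(comb(d, k) for d in deg)  # number of k-stars, as in (6.2)
        total += (-1) ** k * z_sk / lam ** (k - 2)
    return total


def alt_kstar_closed(adj, lam):
    """Alternating k-star statistic in the closed form (6.4)."""
    q = 1.0 - 1.0 / lam
    return lam ** 2 * sum(q ** d + d / lam - 1.0 for d in degrees(adj))


def set_tie(adj, i, j, value):
    """Return a copy of adj with the tie between i and j set to value."""
    new = [row[:] for row in adj]
    new[i][j] = new[j][i] = value
    return new


def alt_kstar_change(adj, i, j, lam):
    """Change statistic (6.9); degrees are taken with tie (i, j) removed."""
    q = 1.0 - 1.0 / lam
    d_i = sum(adj[i]) - adj[i][j]
    d_j = sum(adj[j]) - adj[i][j]
    return lam * (2.0 - q ** d_i - q ** d_j)


if __name__ == "__main__":
    # Hypothetical 5-node example: ties 0-1, 0-2, 0-3, 1-2, 3-4.
    adj = [
        [0, 1, 1, 1, 0],
        [1, 0, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 0, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ]
    print(alt_kstar_direct(adj, 2.0), alt_kstar_closed(adj, 2.0))
    print(alt_kstar_change(adj, 0, 1, 2.0))
```

Because Python evaluates `0.0 ** 0` as `1.0`, the same closed form also covers $\lambda_s = 1$, where it reduces to the isolate-counting expression (6.5).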

When $\lambda_s = 1$, the alternating k-star parameter models the number of isolated nodes distinctly. When $\lambda_s = 2$, the difference in the change statistics of 5-stars and 6-stars is less than 0.02, hence the model treats nodes with degree higher than five almost equivalently. As $\lambda_s \to \infty$, the alternating k-star is equivalent to a Markov two-star.

Alternating k-stars for Bipartite networks

For a bipartite network x of size (n, m), since there are two sets of nodes, P and A, two separate alternating k-star statistics are defined as

$$
z_{S_P}(\lambda_s, x) = \sum_{k=2}^{n-1} (-1)^{k} \frac{z_{S_{Pk}}(x)}{\lambda_s^{k-2}}, \qquad
z_{S_A}(\lambda_s, x) = \sum_{k=2}^{m-1} (-1)^{k} \frac{z_{S_{Ak}}(x)}{\lambda_s^{k-2}}
\tag{6.12}
$$

The corresponding change statistics have the same form as for alternating k-stars in one-mode networks, as derived in equation (6.9).

Simulation with alternating k-stars

Simulations comparing the edge and two-star model with the edge and alternating k-star model show that the edge and alternating k-star model gives better coverage of the graph space, and hence a better chance of achieving model convergence in the MCMC MLE. Figure 10 shows simulation plots of an edge L and two-star $S_{P2}$ model that simulates bipartite graphs with (30, 20) nodes. The L parameter is fixed at $\theta = 3.0$, and the parameter $\sigma^{(P)}_2$ changes from -1 to 1 in steps of 0.1. For each $\sigma^{(P)}_2$, every 100,000th simulated graph is picked from 1,000,000 simulated graphs, and the number of ties L for the sample graphs is plotted against the $\sigma^{(P)}_2$ parameter value. The results show that the L and $S_{P2}$ model is more consistent, in that there are no multiple regions for a single set of parameters. However, it still puts too much weight on graphs with very low or very high densities. Results from simulations conducted using the same simulation strategy for an edge and alternating k-star $K\text{-}S_P$ model on the same sized (30, 20) bipartite network are plotted in Figure 11.
From the results we can see that as the alternating k-star parameter increases, the density of the network increases slowly from the empty to the complete graph. There are reasonable numbers of simulated graphs with densities over the entire range of 0 to 1. The edge and alternating k-star model could potentially fit any observed bipartite network of the same size.

6.3 Alternating k-triangles

The Markov assumption models network transitivity by a single triangle parameter. The previous simulations show that the Markov edge and triangle model has the problem of degeneracy, since it only covers the near empty or near complete region of the network space. The Markov

Figure 10: Simulation: L and $\sigma^{(P)}_2$ model

assumption restricts the tie dependence structure such that tie variables must share a node to be considered as conditionally dependent. However, according to the realization dependence assumption described below, ties in a network may well be conditionally dependent even if they do not share a node. The Markov assumption is too restrictive, and a simple single triangle is not sufficient to capture all completed structures involved in human social networks.

The realization dependence assumption expands the dependency structure to subgraphs of four nodes. The assumption states that two edge variables $X_{ij}$ and $X_{kl}$ are conditionally dependent, given the rest of the network, only if one of the two following conditions is satisfied:

1. $X_{ij}$ and $X_{kl}$ share a node, i.e. $\{i, j\} \cap \{k, l\} \neq \emptyset$, which is the condition needed to satisfy the Markov assumption.
2. $x_{ik} = x_{jl} = 1$, i.e. if the tie between nodes i and k and the tie between nodes j and l both exist, then $X_{ij}$ and $X_{kl}$ would be part of a four-cycle, as shown in Figure 12.

Under realization dependence, the formation of a tie $x_{ij}$ is affected not only by other ties that nodes i and j have, but also by other ties that do not directly involve nodes i or j, so that the probability of forming a tie is assumed to depend on whether the dyad is part of a social circuit (four-cycle). Graphs generated from a realization dependence model are called realization dependent graphs.

From experience, triangles in social networks tend to form clique-like structures (a clique is a complete subgraph), where many triangles are formed within a small group of nodes. The new specification proposes a new graph statistic called the k-triangle, which is defined as k triangles sharing a common edge, as shown in Figure 13. A k-triangle is a further specification that satisfies the realization dependence assumption; it represents connected dyads having multiple shared partners.
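The two conditions of the realization dependence assumption stated above can be written as a simple predicate on a pair of dyads. This is only an illustrative sketch: the function name and the adjacency-matrix representation are assumptions, and because dyads in a nondirected graph are unordered, the sketch also checks the alternative labelling $x_{il} = x_{jk} = 1$, which is an interpretation not spelled out in the text.

```python
def conditionally_dependent(adj, i, j, k, l):
    """Check whether edge variables X_ij and X_kl may be conditionally
    dependent under the realization dependence assumption.

    adj is a symmetric 0/1 adjacency matrix for the rest of the network.
    """
    # Condition 1 (Markov dependence): the two dyads share a node.
    if {i, j} & {k, l}:
        return True
    # Condition 2: existing ties i-k and j-l would close a four-cycle.
    # The second pairing (i-l and j-k) is an assumed symmetric variant.
    closes_ik_jl = adj[i][k] == 1 and adj[j][l] == 1
    closes_il_jk = adj[i][l] == 1 and adj[j][k] == 1
    return closes_ik_jl or closes_il_jk


if __name__ == "__main__":
    # Hypothetical 4-node graph with ties 0-2 and 1-3 only.
    adj = [
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 0],
        [0, 1, 0, 0],
    ]
    print(conditionally_dependent(adj, 0, 1, 2, 3))
```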

Figure 11: Simulation: L and $\sigma^{(P)}_k$ model

Figure 12: Realization dependence assumption when a four-cycle is created

Let $L_{2ij}(x)$ denote the number of two-paths between nodes i and j in network x,

$$
L_{2ij}(x) = \sum_{h=1}^{n} x_{ih} x_{jh}, \quad h \neq i, j
\tag{6.13}
$$

the k-triangle statistic for a nondirected graph x of size n can be expressed as

$$
z_{t_k}(x) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{ij} \binom{L_{2ij}(x)}{k}, \quad 2 \le k \le (n-3)
\tag{6.14}
$$

To avoid the problem that the model puts large weight on large k-triangles, in analogy to alternating k-stars, the parameters for all $(n-2)$ k-triangles are modeled as a function of a single parameter $\tau$. The k-triangles also have a weight parameter $\lambda_t$ and alternating signs such that $\tau_k = -\tau_{k-1}/\lambda_t$, which leads to the alternating k-triangle statistic, which can be simplified by the binomial formula. When $\lambda_t > 1$,

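Figure 13: K-triangles

$$
\begin{aligned}
z_t(\lambda_t, x) &= 3 z_{t_1}(x) - \frac{z_{t_2}(x)}{\lambda_t} + \frac{z_{t_3}(x)}{\lambda_t^{2}} - \cdots + (-1)^{n-3} \frac{z_{t_{n-2}}(x)}{\lambda_t^{n-3}} \\
&= \sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{ij} \sum_{k=1}^{n-2} \left(-\frac{1}{\lambda_t}\right)^{k-1} \binom{L_{2ij}(x)}{k} \\
&= \sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{ij} \left\{ (-\lambda_t) \sum_{k=0}^{n-2} \left(-\frac{1}{\lambda_t}\right)^{k} \binom{L_{2ij}(x)}{k} + \lambda_t \right\} \\
&= \lambda_t \sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{ij} \left\{ 1 - \left(1 - \frac{1}{\lambda_t}\right)^{L_{2ij}(x)} \right\}
\end{aligned}
\tag{6.15}
$$

To calculate the change statistic for alternating k-triangles, a tie $x_{ij}$ is removed from the network x. The formula involves two terms, since the tie $x_{ij}$ can either be the base of the alternating k-triangle that closes multiple two-paths, or form part of the multiple two-paths. Let $x^{-}$ denote the graph without the tie $x_{ij}$. The change statistic for $x_{ij}$ as the base is

$$
u_{tb}(\lambda_t, x_{ij}) = \sum_{k=1}^{n-2} \left(-\frac{1}{\lambda_t}\right)^{k-1} \binom{L_{2ij}(x^{-})}{k}
= \lambda_t \left\{ 1 - \left(1 - \frac{1}{\lambda_t}\right)^{L_{2ij}(x^{-})} \right\}
$$

Let h be another node that is connected to both i and j such that $x_{ih} x_{jh} = 1$. For $x_{ij}$ to be part of the multiple two-paths in an alternating k-triangle, the change statistic is the number of (k-1)-two-paths that the dyads $x_{ih}$ and $x_{jh}$ have, for all h:

$$
\begin{aligned}
u_{ts}(\lambda_t, x_{ij}) &= \sum_{h=1}^{n} \left\{ x_{ih} x_{jh} \sum_{k=0}^{n-3} \left(-\frac{1}{\lambda_t}\right)^{k} \binom{L_{2ih}(x^{-})}{k} + x_{ih} x_{jh} \sum_{k=0}^{n-3} \left(-\frac{1}{\lambda_t}\right)^{k} \binom{L_{2jh}(x^{-})}{k} \right\} \\
&= \sum_{h=1}^{n} \left\{ x_{ih} x_{jh} \left(1 - \frac{1}{\lambda_t}\right)^{L_{2ih}(x^{-})} + x_{ih} x_{jh} \left(1 - \frac{1}{\lambda_t}\right)^{L_{2hj}(x^{-})} \right\}
\end{aligned}
\tag{6.16}
$$

Therefore, the change statistic for the alternating k-triangles is

$$
u_t(\lambda_t, x_{ij}) = \lambda_t \left\{ 1 - \left(1 - \frac{1}{\lambda_t}\right)^{L_{2ij}(x^{-})} \right\}
+ \sum_{h=1}^{n} \left\{ x_{ih} x_{jh} \left(1 - \frac{1}{\lambda_t}\right)^{L_{2ih}(x^{-})} + x_{ih} x_{jh} \left(1 - \frac{1}{\lambda_t}\right)^{L_{2hj}(x^{-})} \right\}
\tag{6.17}
$$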
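One way to gain confidence in the change statistic (6.17) is to compare it with the difference of the closed-form statistic (6.15) evaluated with and without the tie. The sketch below does this for a small graph; it is an illustration under assumed names and an assumed adjacency-matrix representation, not code from the thesis.

```python
def two_paths(adj, i, j):
    """L_2ij(x): number of two-paths between i and j, as in (6.13)."""
    return sum(adj[i][h] * adj[j][h] for h in range(len(adj)) if h not in (i, j))


def alt_ktriangle(adj, lam):
    """Alternating k-triangle statistic, last line of (6.15)."""
    n = len(adj)
    q = 1.0 - 1.0 / lam
    return lam * sum(
        adj[i][j] * (1.0 - q ** two_paths(adj, i, j))
        for i in range(n)
        for j in range(i + 1, n)
    )


def set_tie(adj, i, j, value):
    """Return a copy of adj with the tie between i and j set to value."""
    new = [row[:] for row in adj]
    new[i][j] = new[j][i] = value
    return new


def alt_ktriangle_change(adj, i, j, lam):
    """Change statistic (6.17), evaluated on the graph without tie (i, j)."""
    n = len(adj)
    q = 1.0 - 1.0 / lam
    reduced = set_tie(adj, i, j, 0)
    # Tie (i, j) acting as the base that closes existing two-paths.
    base = lam * (1.0 - q ** two_paths(reduced, i, j))
    # Tie (i, j) acting as a side of k-triangles based on (i, h) or (j, h).
    sides = sum(
        reduced[i][h] * reduced[j][h]
        * (q ** two_paths(reduced, i, h) + q ** two_paths(reduced, j, h))
        for h in range(n)
        if h not in (i, j)
    )
    return base + sides


if __name__ == "__main__":
    # Hypothetical 5-node example: ties 0-1, 0-2, 0-3, 1-2, 3-4.
    adj = [
        [0, 1, 1, 1, 0],
        [1, 0, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 0, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ]
    with_tie = set_tie(adj, 1, 3, 1)
    print(alt_ktriangle(with_tie, 2.0) - alt_ktriangle(adj, 2.0))
    print(alt_ktriangle_change(adj, 1, 3, 2.0))
```

The example graph contains a single triangle, so with $\lambda_t = 2$ the statistic equals $3 z_{t_1} = 3$, matching the leading term of (6.15).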

When $\lambda_t = 1.0$ the alternating k-triangle statistic and the corresponding change statistic are

$$
\begin{aligned}
z_t(\lambda_t, x) &= \sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{ij}\, I\{L_{2ij}(x) > 0\}, \\
u_t(\lambda_t, x_{ij}) &= I\{L_{2ij}(x^{-}) > 0\} + \sum_{h=1}^{n} \left\{ x_{ih} x_{jh}\, I\{L_{2ih}(x^{-}) = 0\} + x_{ih} x_{jh}\, I\{L_{2jh}(x^{-}) = 0\} \right\}
\end{aligned}
\tag{6.18}
$$

where I is a binary indicator function.

Simulation with alternating k-triangle

The edge and triangle Markov model can only model networks with either very low density or complete networks. The edge and alternating k-triangle model gives better coverage of the network space, as shown in Figure 14, which is the result of applying the same simulation strategy.

Figure 14: Simulation: θ and $\tau_k$ model

The alternating k-triangle statistic is a measure of clustering based on the dependency between the formation of a tie between two nodes and whether they share multiple partners. A positive value for the alternating k-triangle parameter indicates that people sharing multiple partners are likely to be connected.

In bipartite networks, triangles cannot be formed, as ties within one of the two sets of actors are not defined. Hence the alternating k-triangle statistic does not apply. The smallest local closure is a four-cycle, which is a two-path closed by another two-path. If we apply the same technique as for the alternating k-triangles, we obtain a new statistic, alternating k-two-paths, which is a measure of how likely nodes with multiple shared partners are to be closed by another two-path.

6.4 Alternating k-two-paths

Alternating k-two-paths for one-mode networks

A two-path is the same as a two-star; four nodes with two two-paths form a four-cycle, or a 2-two-path. We define a k-two-path as a structure in which two nodes are connected by k two-paths, as shown in Figure 15. The k-two-path structure also satisfies the realization dependence assumption.

Figure 15: K-two-paths

The number of k-two-paths can be expressed as

$$
z_{v_k}(x) =
\begin{cases}
\displaystyle \sum_{i=1}^{n} \sum_{j=i+1}^{n} \binom{L_{2ij}(x)}{k}, & \text{if } k > 2 \\[2ex]
\displaystyle \frac{1}{2} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \binom{L_{2ij}(x)}{2}, & \text{if } k = 2, \text{ due to symmetry}
\end{cases}
\tag{6.19}
$$

For a tie to be part of a k-triangle, it can either be the base that closes the k-two-paths, or be a side that forms part of a two-path. Inclusion of the k-two-path in the model distinguishes between the effect of closure and the effect of forming the prerequisites for closure. Applying a weight parameter $\lambda_v$ and alternating signs as for k-stars and k-triangles, we form the alternating k-two-path statistic. When $\lambda_v > 1$,

$$
z_v(\lambda_v, x) = z_{v_1}(x) - \frac{2 z_{v_2}(x)}{\lambda_v} + \sum_{k=3}^{n-2} \left(-\frac{1}{\lambda_v}\right)^{k-1} z_{v_k}(x)
= \lambda_v \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left\{ 1 - \left(1 - \frac{1}{\lambda_v}\right)^{L_{2ij}(x)} \right\}
\tag{6.20}
$$

When $\lambda_v = 1$, the statistic reduces to the number of dyads which are indirectly connected by at least one two-path,

$$
z_v(\lambda_v, x) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} I\{L_{2ij}(x) > 0\}
\tag{6.21}
$$

The change statistic for alternating k-two-paths is similar to the formula for the alternating k-triangles, except that there is no base tie involved.

$$
\begin{aligned}
u_v(\lambda_v, x_{ij}) &= \sum_{h=1}^{n} \left\{ x_{jh} \left(1 - \frac{1}{\lambda_v}\right)^{L_{2ih}(x^{-})} + x_{ih} \left(1 - \frac{1}{\lambda_v}\right)^{L_{2jh}(x^{-})} \right\}, & \text{if } \lambda_v > 1 \\
u_v(\lambda_v, x_{ij}) &= \sum_{h=1}^{n} \left\{ x_{jh}\, I\{L_{2ih}(x^{-}) = 0\} + x_{ih}\, I\{L_{2jh}(x^{-}) = 0\} \right\}, & \text{if } \lambda_v = 1
\end{aligned}
\tag{6.22}
$$
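As with the k-triangles, the change statistic (6.22) can be checked against the difference of the closed-form statistic (6.20). The following sketch uses assumed function names and a symmetric 0/1 adjacency matrix; sums over h skip h = i and h = j, following the convention of (6.13).

```python
def two_paths(adj, i, j):
    """L_2ij(x): number of two-paths between i and j, as in (6.13)."""
    return sum(adj[i][h] * adj[j][h] for h in range(len(adj)) if h not in (i, j))


def alt_ktwopath(adj, lam):
    """Alternating k-two-path statistic, closed form in (6.20)."""
    n = len(adj)
    q = 1.0 - 1.0 / lam
    return lam * sum(
        1.0 - q ** two_paths(adj, i, j) for i in range(n) for j in range(i + 1, n)
    )


def set_tie(adj, i, j, value):
    """Return a copy of adj with the tie between i and j set to value."""
    new = [row[:] for row in adj]
    new[i][j] = new[j][i] = value
    return new


def alt_ktwopath_change(adj, i, j, lam):
    """Change statistic (6.22), evaluated on the graph without tie (i, j)."""
    n = len(adj)
    q = 1.0 - 1.0 / lam
    reduced = set_tie(adj, i, j, 0)
    return sum(
        reduced[j][h] * q ** two_paths(reduced, i, h)
        + reduced[i][h] * q ** two_paths(reduced, j, h)
        for h in range(n)
        if h not in (i, j)
    )


if __name__ == "__main__":
    # A path 0-1-2: only the dyad (0, 2) is joined by a two-path.
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(alt_ktwopath(path, 1.0))
```

With $\lambda_v = 1$ the power `0.0 ** L` acts as the indicator $I\{L = 0\}$, so one function covers both branches of (6.22).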

6.4.2 Alternating k-two-paths for bipartite networks

The alternating k-two-paths in bipartite networks can be understood in two different ways. By analogy to triangles in one-mode networks, a four-cycle is the smallest closure that is not a dyad, hence the parameter value for alternating k-two-paths is an indication of the likelihood of forming a social circuit. As bipartite networks have two sets of nodes, P and A, two different k-two-path structures, $k\text{-}C_P$ and $k\text{-}C_A$, can be formed, as shown in Figure 16.

Figure 16: $k\text{-}C_P$s and $k\text{-}C_A$s

The change statistics for the alternating $k\text{-}C_P$ and alternating $k\text{-}C_A$ can be derived in a similar way as for the alternating k-two-paths for one-mode networks expressed in formula (6.22); the only difference is in the definition of $L_{2ij}(x)$. For bipartite networks, $L_{2ij}(x)$ is defined as the number of two-paths between nodes i and j, where both i and j belong to the same set of nodes.

Simulation with alternating k-two-paths

Simulation was carried out on bipartite graphs with (30, 20) nodes, started from both empty and complete graphs. The parameter θ was fixed at $\theta = 3.0$, and the parameter $\beta^{(P)}_k$ for $k\text{-}C_P$ varied from -1 to 10. The result is plotted in Figure 17, and shows good coverage of the graph space.

Figure 17: Simulation: θ and $\beta^{(P)}_k$ model
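The simulation strategy used for the plots in this chapter (a burn-in period followed by taking every s-th graph) can be sketched with a simple Metropolis sampler. The sketch below is not BPNet and does not reproduce the thesis simulations: it uses the simpler bipartite edge and P-node two-star model, single-tie toggle proposals, and hypothetical function and argument names, all of which are assumptions chosen to keep the example short.

```python
import math
import random


def simulate_bipartite(n, m, theta, sigma_p2, burn_in, n_iter, thin,
                       seed=0, start_full=False):
    """Metropolis sampler for an edge + P-two-star bipartite ERGM.

    Returns the number of ties of every `thin`-th graph after burn-in.
    """
    rng = random.Random(seed)
    fill = 1 if start_full else 0
    x = [[fill] * m for _ in range(n)]   # rows: P nodes, columns: A nodes
    deg_p = [m * fill] * n               # degree of each P node
    ties = n * m * fill
    samples = []
    for step in range(burn_in + n_iter):
        i, j = rng.randrange(n), rng.randrange(m)
        # Degree of i without the tie (i, j); adding the tie creates this
        # many new two-stars centred on i and exactly one new edge.
        d_minus = deg_p[i] - x[i][j]
        log_odds = theta * 1 + sigma_p2 * d_minus
        # Log acceptance ratio for toggling the tie.
        delta = log_odds if x[i][j] == 0 else -log_odds
        if delta >= 0 or rng.random() < math.exp(delta):
            x[i][j] = 1 - x[i][j]
            change = 1 if x[i][j] else -1
            deg_p[i] += change
            ties += change
        if step >= burn_in and (step - burn_in + 1) % thin == 0:
            samples.append(ties)
    return samples


if __name__ == "__main__":
    # Illustrative parameter values only, not those used in the thesis.
    for start_full in (False, True):
        counts = simulate_bipartite(30, 20, -1.0, 0.05, 20_000, 50_000, 5_000,
                                    seed=1, start_full=start_full)
        print("complete start" if start_full else "empty start", counts)
```

Running the sampler from both an empty and a complete starting graph, as in the figures, is a cheap check on whether the chain has forgotten its starting point: if the two sets of tie counts occupy clearly different regions, the model is behaving in the near-degenerate way described in section 6.1.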

6.5 Limitations of the new specifications

The new specifications provide a much higher chance of obtaining maximum likelihood parameter estimates. Based on the realization dependence assumption, which is a weaker assumption than the Markov assumption, the model has wider coverage of the graph space and is less likely to be degenerate. However, different combinations of k-stars or k-two-paths of different sizes can produce the same number of alternating k-stars or alternating k-two-paths, hence we may have a converged model that fits the newly specified statistics well, but not each individual k-star or k-two-path, such that the underlying graph distribution would be different from the observed network. Section is an example of such a case. Further investigation of possible dependency assumptions and graph statistics may help resolve this issue.

7 Goodness of fit

The goodness of fit of an ERGM can be assessed by simulation, where various statistics from the observed network are compared with the statistics collected from the simulated network distribution to see whether the simulated graph distribution is centered at the observed network. The statistics examined should not be limited to the ones that are being modeled in the given ERGM, as these should already be very well fitted during the third phase of the MCMC maximum likelihood estimation algorithm, where model convergence is tested. Instead they should include all possible network statistics and other local and global network measurements like the ones described in sections 1.2 and 1.3.

A simple goodness of fit statistic is the t-ratio as defined in equation (5.11). Small t-ratios indicate good model fit. For statistics that are modeled in a given ERGM, the absolute value of the t-ratios should be less than 0.1 to indicate that the model has converged. For other network statistics, t-ratios with absolute value smaller than 2.0 are considered as indicating a good fit.
The t-ratios assess goodness of fit on each network statistic independently. To test the overall fit of the model, we need to take into account correlations among these statistics. The Mahalanobis distance, introduced by P. C. Mahalanobis in 1936, gives us a way of testing how similar the observed network is to a distribution of networks generated by a p* model. Let $Z(x) = [z_1(x), z_2(x), \ldots, z_p(x)]$ be the vector of observed network statistics, $\mu = (\mu_1, \mu_2, \ldots, \mu_p)$ be the vector of means of network statistics from the simulated graph distribution, and $\Sigma$ be the covariance matrix. The Mahalanobis distance $d_M$ is calculated as

$$
d_M = \sqrt{(Z(x) - \mu)^{T}\, \Sigma^{-1}\, (Z(x) - \mu)}
\tag{7.1}
$$

If the distribution of $Z(x)$ is multivariate normal, then $d_M^2$ follows a $\chi^2_{p-k}$-distribution, where k is the number of parameters in the given model. However, there is evidence that in many cases the distributions of graph statistics are not normal, hence $Z(x)$ is not multivariate normal. Appropriate transformations are needed to perform a more valid $\chi^2$-test of model goodness of fit.
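The two goodness of fit measures can be sketched in a few lines. Equation (5.11) is not reproduced in this chapter, so the t-ratio below is assumed to be the observed value minus the simulated mean, divided by the simulated standard deviation; the function names and the use of sample (n - 1) variances are likewise assumptions. The Mahalanobis distance follows (7.1), solving $\Sigma y = Z(x) - \mu$ by Gaussian elimination instead of forming $\Sigma^{-1}$ explicitly.

```python
import math


def t_ratio(observed, simulated):
    """Assumed form of the t-ratio: (observed - mean) / standard deviation."""
    mu = sum(simulated) / len(simulated)
    var = sum((s - mu) ** 2 for s in simulated) / (len(simulated) - 1)
    return (observed - mu) / math.sqrt(var)


def mean_and_covariance(samples):
    """Mean vector and sample covariance matrix of a list of vectors."""
    n, p = len(samples), len(samples[0])
    mu = [sum(s[a] for s in samples) / n for a in range(p)]
    cov = [
        [sum((s[a] - mu[a]) * (s[b] - mu[b]) for s in samples) / (n - 1)
         for b in range(p)]
        for a in range(p)
    ]
    return mu, cov


def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[r]] for r, row in enumerate(matrix)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y


def mahalanobis(observed, samples):
    """Mahalanobis distance (7.1) of the observed statistics vector."""
    mu, cov = mean_and_covariance(samples)
    diff = [o - m for o, m in zip(observed, mu)]
    y = solve(cov, diff)
    return math.sqrt(sum(d * v for d, v in zip(diff, y)))


if __name__ == "__main__":
    # Hypothetical simulated statistics for two graph statistics.
    sims = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
    print(mahalanobis([3.0, 1.0], sims))
```

The sketch assumes the covariance matrix is nonsingular; highly collinear graph statistics would need to be dropped or transformed before the distance is computed.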

Table 1: Transformations of Graph Statistics

z(x)          KS p-value (z)   z' = f(z)              KS p-value (z')
L             >                f(z) = z               >
S_P2                           f(z) = z^(1/4)         >
S_P3          <                f(z) = z^(1/5)         >
S_A2                           f(z) = z^(1/4)         >
S_A3          <                f(z) = z^(1/5)         >
L_3           <                f(z) = z^(1/5)         >
C_4           <                f(z) = z^(1/6)         >
K-S_P         >                f(z) = z               >
K-S_A         >                f(z) = z               >
K-C_P                          f(z) = z^(1/2)         >
K-C_A         >                f(z) = z               >
stddev D_P    >                f(z) = z               >
skew D_P      >                f(z) = z               >
stddev D_A    <                f(z) = z^(1/2)         >
skew D_A      <                f(z) = (z + 2)^(1/2)   >
Clust.Coef.   >                f(z) = z               >

To assess normality, a simulation with a Bernoulli model (7.2) on bipartite networks of size (18, 14) was carried out, and all available graph statistics that are implemented in BPNet were collected from every 1,000th graph of the 1,000,000 simulated graphs. The collected statistics were then examined using Normal Q-Q plots, and Kolmogorov-Smirnov p-values were used to indicate departure from normality.

$$
\Pr(X = x) = \frac{1}{\kappa} \exp\left[-0.605\, z_L(x)\right]
\tag{7.2}
$$

Graph statistics with significant departure from normality (p-value < 0.1) were transformed using various forms of power transformations. The Kolmogorov-Smirnov test was then repeated on the transformed statistics. The transformation functions and p-values are listed in Table 1, where $D_P$ and $D_A$ are the degree distributions for nodes of type P and A. As for each individual graph there are degree distributions associated with each type of node, the means and standard deviations are used as graph statistics for the distribution of graphs representing the degree distributions. Using the graph statistic $C_4$ as an example, before the transformation it is highly skewed, and the Normal Q-Q plot shows a huge departure from normality. However, after a one-sixth power transformation, the transformed statistic passed the normality test, as shown in the figure below.

Figure 18: Normality test of $C_4$ raw and transformed data.
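The effect of a power transformation on a skewed statistic such as $C_4$ can be explored with a small simulation. The sketch below is illustrative only: it draws independent Bernoulli bipartite graphs directly (no MCMC is needed for the Bernoulli model), counts four-cycles, and compares sample skewness before and after a one-sixth power transformation in place of the Kolmogorov-Smirnov test used in the thesis. The function names are hypothetical, and the sign of the Bernoulli parameter (taken as -0.605) is an assumption about the original equation.

```python
import math
import random
from math import comb


def tie_probability(theta):
    """Tie probability implied by a Bernoulli model with edge parameter theta."""
    return 1.0 / (1.0 + math.exp(-theta))


def bernoulli_bipartite(n, m, theta, rng):
    """Draw one n x m bipartite graph with independent ties."""
    p = tie_probability(theta)
    return [[1 if rng.random() < p else 0 for _ in range(m)] for _ in range(n)]


def four_cycles(x):
    """Number of four-cycles C_4 in a bipartite 0/1 matrix (rows P, columns A).

    Each pair of row nodes sharing s column nodes closes comb(s, 2) four-cycles.
    """
    total = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            shared = sum(a * b for a, b in zip(x[i], x[j]))
            total += comb(shared, 2)
    return total


def skewness(values):
    """Sample skewness based on the second and third central moments."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


if __name__ == "__main__":
    rng = random.Random(2006)
    counts = [four_cycles(bernoulli_bipartite(18, 14, -0.605, rng))
              for _ in range(1000)]
    transformed = [c ** (1.0 / 6.0) for c in counts]
    print("skewness of raw C_4:        ", skewness(counts))
    print("skewness of C_4^(1/6):      ", skewness(transformed))
```

Skewness is only a rough proxy for the normality check; a Q-Q plot or a formal test on the transformed values would be the closer analogue of the procedure described above.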

8 Modeling Examples

In this section, two bipartite data sets are analyzed using the newly proposed ERGM for bipartite networks. The first data set, known as the Southern Women data set, is a classic affiliation network data set collected by Davis, Gardner and Gardner (1941)[4]. It records the participation of 18 women in 14 informal social events in Natchez, Mississippi over nine months. The second data set, collected by the Social Networks Research Group at the Netherlands Institute for Advanced Study (NIAS) and analysed by Robins and Alexander (2004)[17], has two affiliation networks describing how directors were interlocked among the top 500 companies in Australia. Various models with different parameters were fitted and assessed using the goodness of fit strategy described earlier.

Southern Women

Since it was first published in the 1940s, the Southern Women data has been analysed using several social network analysis techniques, including some early specifications of ERGMs for affiliation networks by Skvoretz and Faust (1999)[24]. Freeman (2003)[6] gives an overview of various analyses that have been conducted on this data set. A plot of the network is shown in Figure 19, where circles represent women and squares represent events.

Figure 19: Southern Women Data

There are some interesting features of this network: the women painted in yellow attended

Table 2: PLE and MLE of Model 8.1 for the Southern Women Data

Effect                   PLE    MLE (S.E.)    t-ratio*
Choice (L)                         (0.314)
Woman 2-Stars (S_P2)               (0.059)
Event 2-Stars (S_A2)               (0.039)
*t-ratio for convergence

blue and pink events, women in green attended white and pink events, and women in red only attended pink events. From the display of the data, we see that most of the women can reach most of the events within three steps; pink, green and white nodes, or pink, yellow and blue nodes, form many four-cycles, hence one may conclude that three-paths and four-cycles are important local configurations. Are three-paths and four-cycles really the building blocks for this network? We need to fit models to get a statistical answer.

Pseudo-Likelihood and Maximum-Likelihood Estimation Results

Skvoretz and Faust (1999)[24] explored some possible p* models on this data set, including some network statistics that satisfy the Markov assumption. However, the maximum likelihood (ML) estimation method was not available at that time, and pseudo-likelihood (PL) estimation was used. Table 2 shows both the PL estimates (PLE) from Skvoretz and Faust (1999) and the ML estimates (MLE), with estimated standard errors obtained using BPNet, for the same Markov model.

$$
\Pr(X = x) = \frac{1}{\kappa} \exp\left\{ \theta z_L(x) + \sigma_{P2}\, z_{S_{P2}}(x) + \sigma_{A2}\, z_{S_{A2}}(x) \right\}
\tag{8.1}
$$

The t-ratios from the MLE are less than 0.1, indicating good model convergence, and the standard errors for $\theta$ and $\sigma_{A2}$ are less than half their corresponding parameter estimates, suggesting that both parameters are significantly different from 0, whereas $\sigma_{P2}$ has a large standard error, indicating that the parameter is not significant and may be removed from the model.
We keep it here to provide a direct comparison with the PLE; a more detailed discussion of model selection is presented in the Model Selection section below.

Comparing the PLE and MLE, we see that the estimates are similar for the event two-star parameter $\sigma_{A2}$. However, the PL estimate of the choice effect, $\theta_{PLE}$, is less than $\theta_{MLE}$ by more than one standard error, and the woman two-star parameter $\sigma_{P2}$ is over-estimated by more than one standard error in the PL estimation. These differences in parameter estimates will cause large differences in the graph distributions represented by the model. To demonstrate the difference in graph distributions, simulations were carried out and model goodness of fit was assessed using the simulated graph distributions.

In the simulation, the first 100,000 simulated graphs were used as initial burn-in; 1,000 graphs were then taken from another 1,000,000 simulated graphs by selecting every 1,000th graph. The means

Table 3: Goodness of Fit for the PLE and MLE of Model 8.1
Columns: z(x), obs., then Mean, S.D. and t-ratio for the PLE, and Mean, S.D. and t-ratio for the MLE.
Rows: L, S_P2, S_P3, S_A2, S_A3, L_3, C_4, K-S_P, K-S_A, K-C_P, K-C_A, stddev D_P, skew D_P, stddev D_A, skew D_A, Clust.Coef., Mahalanobis Distance.

(µ) of various statistics collected through the simulation were used to test the observed graph statistics (obs); standard deviations (S.D.) and the values of the t-ratios are shown in Table 3, where $S_P$ denotes woman stars, $S_A$ denotes event stars, $K\text{-}C_P$ denotes alternating k-two-paths expanded by two-paths centered on a woman, $K\text{-}C_A$ denotes alternating k-two-paths expanded by two-paths centered on an event, $D_P$ denotes the degree distribution of woman nodes and $D_A$ denotes the degree distribution of event nodes. The differences in the t-ratios of the estimated parameters between Table 2 and Table 3 are due to the randomness of the simulations.

The PLE provide a poor fit to the data, as most of the t-ratios are greater than 2.0. The huge Mahalanobis distance also indicates poor model fit. In contrast, the MLE give a very good fit to each individual network statistic, where the largest t-ratio is for the skewness of the event degree distribution (|t| < 2.0), and the Mahalanobis distance is much smaller. Hence the MLE provides a much better model than the PLE.

The advantage of the PLE is that it will always produce parameter estimates, and quite often the estimated parameters are consistent with the MLE. The PLE, however, can be misleading, as illustrated here by the over-estimation of the parameter for woman two-stars ($\sigma_{P2}$). Therefore, we should consider the MLE, if possible, before using the PLE.

Table 4: Parameter Estimates of Models from (8.2) to (8.7)
MLE (standard error); only the standard errors are shown.

Effect                   Model (8.2)   Model (8.3)   Model (8.4)
Choice (L)               (0.127)       (0.408)       (0.267)
Woman 2-Stars (S_P2)                   (0.084)
Event 2-Stars (S_A2)                                 (0.039)

Effect                   Model (8.5)   Model (8.6)   Model (8.7)
Choice (L)               (0.314)       (0.413)       (0.491)
Woman 2-Stars (S_P2)     (0.059)       (0.176)       (0.175)
Event 2-Stars (S_A2)     (0.039)       (0.131)       (0.160)
3-Paths (L_3)                          (0.018)       (0.019)
Alt. 2-Paths (K-C_P)                                 (0.208)

Model Selection

For an observed network, one may fit several p* models with different numbers of parameters according to the underlying neighborhood assumption. However, not all fitted, or converged, models will give a good fit to the observations. An ideal model should converge, provide a good fit to the original network, and be easy to interpret.

In our case, to find the best model for the Southern Women data, six different p* models were fitted. Starting with the simplest Bernoulli model (8.2), and ending with a model involving alternating two-paths (8.7), they all successfully converged during estimation. The parameter estimates and their estimated standard errors are shown in Table 4.

$$
\Pr(X = x) = \frac{1}{\kappa} \exp\left\{ \theta z_L(x) \right\}
\tag{8.2}
$$

$$
\Pr(X = x) = \frac{1}{\kappa} \exp\left\{ \theta z_L(x) + \sigma_{P2}\, z_{S_{P2}}(x) \right\}
\tag{8.3}
$$

$$
\Pr(X = x) = \frac{1}{\kappa} \exp\left\{ \theta z_L(x) + \sigma_{A2}\, z_{S_{A2}}(x) \right\}
\tag{8.4}
$$

$$
\Pr(X = x) = \frac{1}{\kappa} \exp\left\{ \theta z_L(x) + \sigma_{P2}\, z_{S_{P2}}(x) + \sigma_{A2}\, z_{S_{A2}}(x) \right\}
\tag{8.5}
$$

$$
\Pr(X = x) = \frac{1}{\kappa} \exp\left\{ \theta z_L(x) + \sigma_{P2}\, z_{S_{P2}}(x) + \sigma_{A2}\, z_{S_{A2}}(x) + \alpha z_{L_3}(x) \right\}
\tag{8.6}
$$

$$
\Pr(X = x) = \frac{1}{\kappa} \exp\left\{ \theta z_L(x) + \sigma_{P2}\, z_{S_{P2}}(x) + \sigma_{A2}\, z_{S_{A2}}(x) + \alpha z_{L_3}(x) + \beta_{KCP}\, z_{KCP}(x, \lambda) \right\}, \quad \lambda = 1.1
\tag{8.7}
$$

To select the best model out of the six, the goodness of fit strategy was carried out, where graph distributions are simulated from each of the models and tested against the original data. The goodness of fit involved 100,000 graphs as burn-in, then 1,000 graphs taken from 1,000,000

Table 5: Goodness of Fit of Models from (8.2) to (8.7)
Columns: Statistics, then the t-ratios for Models (8.2), (8.3), (8.4), (8.5), (8.6) and (8.7).
Rows: L, S_P2, S_P3, S_A2, S_A3, L_3, C_4, K-S_P, K-S_A, K-C_P, K-C_A, stddev D_P, skew D_P, stddev D_A, skew D_A, Clust.Coef., d_M.

simulated graphs using a step size of 1,000. The t-ratios of various statistics and the Mahalanobis distances are listed in Table 5. The Mahalanobis distances were calculated after transforming the graph statistics using the transformation functions in Table 1. Note that model (8.5) is the same as model (8.1), which was used for the comparison with the pseudo-likelihood estimates.

For the Bernoulli model (8.2), the MCMC MLE of the parameter, $\theta = -0.605\ (0.127)$, agrees with the MLE obtained from the density $D(x)$ of the graph as in equation (3.3), where for the Southern Women data $\hat\theta = \log\left(\frac{D(x)}{1 - D(x)}\right)$. The model gives a good fit on the density of the network, but not on the event stars ($S_{A2}$ and $S_{A3}$), the event degree distribution ($D_A$), the closure effect ($C_4$) or the global clustering coefficient. The large Mahalanobis distance also indicates that this is not a good model for the data.

Adding the woman two-star effect ($S_{P2}$), as in model (8.3), does not improve the goodness of fit of the model. Also notice that the parameter estimate for $\sigma_{P2}$ is not significantly different from 0.

Both model (8.4) and model (8.5) fitted the data quite well, as all t-ratios in both cases are less than 2.0. In both models, the event two-star parameter $\sigma_{A2}$ is significant. Also notice that $\sigma_{P2}$ is not significant in model (8.5); however, inclusion of $\sigma_{P2}$ in the model gives a better fit

to the data, especially on $C_4$ and the standard deviation of the event degree distribution.

For ERGMs, it is not always the case that the goodness of fit is improved by including more parameters in the model. Model (8.6) fitted the three-paths ($L_3$) explicitly, and the estimation results show that all parameters in this model are significant. However, compared with the simpler model (8.5), the model has a greater Mahalanobis distance, it provides a worse fit on $C_4$, and the clustering coefficient is not fitted well.

The model with a parameter for $C_4$ did not converge due to the degenerate behavior of the model. Instead, model (8.7), with an alternating k-women-two-paths statistic, did converge with a small damping parameter $\lambda = 1.1$. This model gives a reasonable fit to the data, but it is more complicated than model (8.5), which also provided a good fit to the data.

If we compare the squares of the Mahalanobis distances to the $\chi^2$-distribution with the corresponding degrees of freedom, all models have p-values less than 0.01, i.e. the observed network is not in the centre of any of the graph distributions generated from any of the models considered, when correlations between the statistics are taken into account. However, given that we are testing 16 different statistics using models that have fewer than 6 parameters, the Mahalanobis distance test is a very powerful test. The differences in Mahalanobis distance between models give us a good indication as to which model fits the network relatively better. Models (8.4), (8.5) and (8.7) have similar Mahalanobis distances, where model (8.5) has the smallest ($d_M = 8.068$). This result is consistent with the results we get from the t-ratios for the various statistics.

From the discussion above, we conclude that model (8.5) is the best of the models we have considered for the Southern Women data set.
The conditional log-odds of woman i attending event j is given by

2.031u L (x ij ) u SP2 (x ij ) u SA2 (x ij ) (8.8)

The model tells us that the three-path and four-cycle structures all happened just by chance, given the density and two-star effects. However, we know from the Mahalanobis distance that there are correlations between the graph statistics that are not controlled well by the model; further investigation of other graph configurations may result in a better model that gives us more interesting interpretations of this data set.

8.2 Interlocking directors

Robins and Alexander (2004)[17] compared the interlocking company director network structures of the US and Australia in a simulation study. The observed network statistics were compared with simulated random network distributions with the same density as the observed network, and Z-scores were then used as indications of the level of difference between the observed networks and the random network distribution.

The modeling examples we are going to use here are based on the same data source. The first example is the data from the top fifty financial institutions in Australia (1996); the second


is the largest interlocked component from the top 500 companies from both the financial and industrial sectors.

Top 50 Financial Institutions (1996)

From the data collected, there are 366 directors working for the top 50 financial institutions in Australia in 1996. The network plot is shown in Figure 20; the blue squares are companies and the red circles are directors. If a director is sitting on the board of a company, there is a tie between them. There are 395 ties among the 366 × 50 possible director-company pairs.

Figure 20: Top 50 Financial Institutions, Australia (1996)

There are thirty separate components in this network, and the largest interlocked component has fourteen companies and eighty directors. The greatest number of directors one company has is fourteen, and there is one such company. The largest director degree is four, and there is one such director.

The (366, 50) node network is much larger than the (18, 14) node Southern Women data; we use this example to show how the new specifications perform on big networks. Using a Markov model with $L_3$ and $C_4$, it is almost impossible to get convergence. The new specification model (8.9) does converge with a damping factor $\lambda = 2.0$. The estimation results are shown in Table 6; the parameter estimates for Choice and Company Alternating-K-Star are not significantly different from 0. There is a strong negative tendency for forming Director Alternating-K-Stars and Alternating-two-paths centered at a company ($K\text{-}C_A$, where

38 Table 6: Parameter Estimates for Model (8.9) on 366 directors Effect MLE S.E. t-ratio* Choice (L) Director Alternating-K-Star (K-S P ) Company Alternating-K-Star (K-S A ) Alternating-Two-Path (K-C A ) *t-ratio for convergence two directors are linked by one company). Pr(X = x) = 1 κ exp {θz L(x) + σ P z SP (x, λ) + σ A z SA (x, λ) + β KCA z KCA (x, λ)}, λ = 2.0 (8.9) The goodness of fit test used 3,000 out of 5,000,000 simulated graphs as a representation of the underlining graph distribution from the estimated model. The test results are shown in Table 7. From the goodness of fit results we can see that the model gives a good fit on most of the graph statistics, except the director three-star (S P3 ) and the skewness of the director degree distribution (D P ), as indicated by the large t-ratios, and hence the large Mahalanobis distance. This can be explained by looking at the observed network, as mentioned before, there is one director that has degree four, and this director is a four-star that makes up four out of the five observed director three-stars. The model does not include a director four-star parameter, and even if we included such a parameter, it will still be difficult to ask the model to keep a single four-star throughout the distribution of graphs, and very unlikely to achieve model convergence on a network of this size. A way of solving this problem is to treat the two directors (one has degree three; the other has degree four) as special cases, and we can fit a model for the network without them, hence there is no director three stars in the observed network. The model converged with the same set of parameters as in Model 8.9 on this (364, 50) network, and the parameter estimates are listed in Table 8. The new estimation result gives similar interpretations as the result for the 366 directors. Although the parameter estimates for company alternating-k-stars has changed sign from negative to positive, it is still not significant. 
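The goodness of fit summaries discussed above can be sketched as follows. The sketch assumes that the t-ratio of a statistic is (observed − simulated mean) / simulated S.D., and that the Mahalanobis distance is the squared distance of the observed statistic vector from the simulated mean under the simulated covariance matrix; both definitions and all function names are assumptions made for illustration, and the toy numbers are not taken from the directors data.

```python
import statistics

def t_ratios(observed, simulated):
    """Per-statistic t-ratio: (observed - simulated mean) / simulated S.D."""
    ratios = []
    for k, obs in enumerate(observed):
        column = [row[k] for row in simulated]
        ratios.append((obs - statistics.mean(column)) / statistics.stdev(column))
    return ratios

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis_sq(observed, simulated):
    """Squared Mahalanobis distance of observed from the simulated sample."""
    n, p = len(simulated), len(observed)
    means = [statistics.mean(row[k] for row in simulated) for k in range(p)]
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in simulated)
            / (n - 1) for j in range(p)] for i in range(p)]
    diff = [observed[k] - means[k] for k in range(p)]
    return sum(d * v for d, v in zip(diff, solve(cov, diff)))

# Toy example: four simulated graphs, two statistics each (hypothetical).
simulated = [[0, 0], [2, 0], [0, 2], [2, 2]]
observed = [3, 1]
print(t_ratios(observed, simulated))
print(mahalanobis_sq(observed, simulated))
```

Because the distance uses the full covariance matrix, it can be large even when every individual t-ratio is small, which is how correlations between statistics that the model does not capture show up.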
The goodness of fit results for the 364 directors are listed in Table 9. Without the three- and four-star directors, the model fits the data very well, as indicated by the small t-ratios, the small Mahalanobis distance, and a p-value that is not significant at the 0.1 level from the $\chi^2$-test with 12 degrees of freedom. We can conclude that, after removing the two high-degree directors, Model (8.9) fits most of the graph statistics very well, and hence is a good model for this network. A direct interpretation is that the conditional log-odds of director $i$ sitting on the board of company $j$ is given by

\[
-4.806\,u_{L}(x_{ij}) - 6.485\,u_{KS_P}(x_{ij}, \lambda) + 0.610\,u_{KS_A}(x_{ij}, \lambda) - 5.806\,u_{KC_A}(x_{ij}, \lambda), \quad \lambda = 2.0 \tag{8.10}
\]

Given the rest of the model, there are not many popular directors, that is, directors with high degrees, and directors tend not to be linked by multiple companies.

Table 7: Goodness of Fit of Model (8.9) on 366 directors
$z(x)$ | obs. | Mean | S.D. | t-ratio
$L$ | | | |
$S_{P2}$ | | | |
$S_{P3}$ | | | |
$S_{A2}$ | | | |
$S_{A3}$ | | | |
$L_3$ | | | |
$C_4$ | | | |
$K\text{-}S_P$ | | | |
$K\text{-}S_A$ | | | |
$K\text{-}C_P$ | | | |
$K\text{-}C_A$ | | | |
stddev $D_P$ | | | |
skew $D_P$ | | | |
stddev $D_A$ | | | |
skew $D_A$ | | | |
Clust.Coef | | | |
Mahalanobis Distance |

Table 8: Parameter Estimates for Model (8.9) on 364 directors
Effect | MLE | S.E. | t-ratio*
Choice ($\theta$) | | |
Director Alternating-K-Star ($\sigma_P$) | | |
Company Alternating-K-Star ($\sigma_A$) | | |
Alternating-Two-Path $K\text{-}C_A$ ($\beta_{KC_A}$) | | |
*t-ratio for convergence

Table 9: Goodness of Fit of Model (8.9) on 364 directors
$z(x)$ | obs. | Mean | S.D. | t-ratio
$L$ | | | |
$S_{P2}$ | | | |
$S_{P3}$ | | | |
$S_{A2}$ | | | |
$S_{A3}$ | | | |
$L_3$ | | | |
$C_4$ | | | |
$K\text{-}S_P$ | | | |
$K\text{-}S_A$ | | | |
$K\text{-}C_P$ | | | |
$K\text{-}C_A$ | | | |
stddev $D_P$ | | | |
skew $D_P$ | | | |
stddev $D_A$ | | | |
skew $D_A$ | | | |
Clust.Coef | | | |
Mahalanobis Distance |
p-value ($\chi^2_{12}$) |

8.2.2 Largest interlocked component from the top 500 (1996)

The largest interlocked component network, drawn from the top 500 listed companies of both the financial and industrial sectors in Australia in 1996, has 198 companies interlocked by 255 directors with 675 ties. The network is displayed in Figure 21, where squares are companies and circles are directors. We use this large network as an example to show how robust the new specification is in terms of obtaining model convergence. At the same time, it also shows the limitations of the model for large networks.

Figure 21: Largest interlocked component, Australia (1996)

The Markov model with $L_3$ and $C_4$ parameters is far from convergence because of degeneracy. Two different new specification models converged successfully. Model (8.11) has parameters for the choice effect and the alternating company and director k-stars, plus a director alternating two-path parameter, while Model (8.12) uses the same choice and star effects, plus a company alternating two-path parameter. Both models used a damping parameter of $\lambda = 2.0$.

\[
\Pr(X = x) = \frac{1}{\kappa} \exp\{\theta z_{L}(x) + \sigma_{P} z_{S_P}(x, \lambda) + \sigma_{A} z_{S_A}(x, \lambda) + \beta_{KC_P} z_{KC_P}(x, \lambda)\}, \quad \lambda = 2.0 \tag{8.11}
\]

\[
\Pr(X = x) = \frac{1}{\kappa} \exp\{\theta z_{L}(x) + \sigma_{P} z_{S_P}(x, \lambda) + \sigma_{A} z_{S_A}(x, \lambda) + \beta_{KC_A} z_{KC_A}(x, \lambda)\}, \quad \lambda = 2.0 \tag{8.12}
\]
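To make a conditional log-odds expression such as (8.10) concrete, the sketch below converts a log-odds value into a tie probability with the logistic function. The coefficients are those of (8.10), with the choice, director-star and two-path terms taken as negative in line with the negative tendencies described in the text; the change statistics passed in are hypothetical illustrative values, not quantities computed from the directors data, and the function names are likewise assumptions. The same conversion would apply to fitted versions of Models (8.11) and (8.12).

```python
import math

# Coefficients of (8.10) at lambda = 2.0. The negative signs on the choice,
# director-star and two-path terms follow the negative tendencies described
# in the text; treat them as an assumption if the printed signs differ.
COEFFICIENTS = {"L": -4.806, "KS_P": -6.485, "KS_A": 0.610, "KC_A": -5.806}

def conditional_log_odds(change_stats, coefficients=COEFFICIENTS):
    """Weighted sum of change statistics u(x_ij), one per model effect."""
    return sum(coefficients[name] * change_stats[name] for name in coefficients)

def tie_probability(log_odds):
    """Logistic transform of a conditional log-odds value."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical change statistics for two candidate ties (not from the data):
# a director taking a first seat, and a director taking a second seat that
# also creates a shared two-path with another director.
first_seat = {"L": 1.0, "KS_P": 0.0, "KS_A": 1.5, "KC_A": 0.0}
second_seat = {"L": 1.0, "KS_P": 1.0, "KS_A": 1.5, "KC_A": 0.5}

p_first = tie_probability(conditional_log_odds(first_seat))
p_second = tie_probability(conditional_log_odds(second_seat))
print(f"first seat: {p_first:.4f}, second seat: {p_second:.6f}")
```

Under these assumed inputs the second seat is far less probable than the first, which matches the reading that high-degree directors and multiply linked directors are rare given the rest of the model.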
