STATISTICAL METHODS FOR NETWORK ANALYSIS

NME Workshop, Day 2: Statistical Methods for Network Analysis
Martina Morris, Ph.D., Steven M. Goodreau, Ph.D., Samuel M. Jenness, Ph.D.
Supported by the US National Institutes of Health

Today we will cover
Three classes of statistical models for networks:
- Simple null models (morning)
- Generative models for static networks (morning)
- Generative models for dynamic networks (afternoon)
Ending with an example: how we can use these models to answer questions about epidemic dynamics and interventions.

Outline for the morning session
- Basic null hypothesis statistical tests
  - Does your network differ from a simple random graph?
  - Simple null models to test against: CUG and BRG
- ERGMs: generative models to test for multiple structural properties simultaneously in static networks
  - Components of an ERG model
  - Interpretation of coefficients
  - Estimation algorithms (and when they fail)
  - Model diagnostics (estimation and goodness of fit)
  - Simulation from fitted models
  - Network data requirements
This is what we'll use ERGMs for in epidemic modeling.

Note
- We'll cover a lot of ground here, and some of the material and vocabulary may be unfamiliar.
- Don't worry if you don't understand everything. Focus on getting the big picture, not the details.
- EpiModel puts a lot of this behind the curtain, so you don't have to deal with it, for the most part.
- The details do matter when you have a problematic model.
- And don't be afraid to ask questions.

Getting started
Recall the two ways to access statnetweb. From R, run: library(statnetweb); run_sw()
Open statnetweb and load the faux.mesa.high network.

Statistical Testing: Basics
How do you know if your network is significantly different from a simple random graph?

Description vs. inference in statistics
So far we have been using descriptive statistics to explore our network data:
- Density
- Degree and geodesic distributions
- Mixing matrices
- Component size distributions
Next, we might want to compare these statistics to what we would expect by chance.
- What do we mean by chance?
- Is there a natural null hypothesis test in this context?

Recap
Does the structure of our social network differ from a simple random graph?
[Figure: the faux.mesa.high network beside a simple random graph with the same tie probability]
What are some structural differences you can see?

Consider triangles
Suppose kids have a tendency to become friends with their friends' friends, and this is the only generative process occurring. Presumably, you would then observe more triangles in the graph than expected by chance.
How would you test this for a specific network?

A basic statistical test for triangles
- Begin by counting the number of triangles in your network. Call this T, your test statistic.
- Then determine the probability of observing T or more triangles in this network, and see if it is less than 5%.
But how do you determine that probability? For that you need a null hypothesis of some sort.

What is the natural null hypothesis?
It turns out there's more than one. But they all get used the same way when constructing a statistical test:
- Create a sampling distribution consistent with the null
- Compare your observed value to that distribution

Null probability distribution (1)
Unconditional: for a network of this size (size = # of nodes),
- enumerate all possible networks for a fixed number of nodes,
- count the number of triangles in each network, and
- construct the frequency distribution of these counts.
Where does the number of triangles in your network lie in this distribution? Top 5%? Bottom 5%? Near the middle?

Null probability distribution (1)
For example, take a network with 3 nodes.
- How many dyads are there? $\binom{3}{2} = \frac{3!}{2!\,(3-2)!} = 3$
- How many different networks on these dyads? Every dyad has 2 possible values and there are 3 dyads, so the number of possible networks is $2^3 = 8$.
- What is the distribution of triangle counts? 7 networks have 0 triangles; 1 network has 1 triangle.
[Figure: triangle distribution for 3-node networks]
So if your network has 1 triangle, what do you think?
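The 3-node enumeration can be checked by brute force. The following is a minimal plain-Python sketch, not part of statnet; the function name `triangle_distribution` is made up for illustration, and the approach is only practical for a handful of nodes.

```python
from collections import Counter
from itertools import combinations, product


def triangle_distribution(n):
    """Enumerate every undirected graph on n labelled nodes and tally triangle counts."""
    dyads = list(combinations(range(n), 2))
    counts = Counter()
    # Each dyad is either absent (0) or present (1), giving 2 ** len(dyads) graphs.
    for ties in product([0, 1], repeat=len(dyads)):
        present = {d for d, t in zip(dyads, ties) if t}
        triangles = sum(
            1
            for a, b, c in combinations(range(n), 3)
            if (a, b) in present and (a, c) in present and (b, c) in present
        )
        counts[triangles] += 1
    return dict(counts)


# Maps "number of triangles" to "number of networks with that many triangles".
print(triangle_distribution(3))
```

Calling `triangle_distribution(4)` runs the same enumeration over the 64 four-node networks.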

Null probability distribution (1)
One problem with the unconditional null distribution: enumerating all possible networks for a fixed number of nodes is not so easy in practice.
- For 4 nodes: # of dyads is 4*3/2 = 6; # of possible networks = $2^6 = 64$
- For 10 nodes: # of dyads is 10*9/2 = 45; # of possible networks = $2^{45} \approx 35$ trillion
- For 20 nodes: # of dyads is 20*19/2 = 190; # of possible networks = $2^{190} \approx 1.6 \times 10^{57}$
We can solve this problem by sampling from the space of networks.
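To see how quickly the space of networks grows, here is a short plain-Python sketch of the dyad and network counts; `network_space` is a hypothetical helper name used only for this illustration.

```python
from math import comb


def network_space(n):
    """Return (# of dyads, # of possible undirected networks) for n nodes."""
    dyads = comb(n, 2)          # n * (n - 1) / 2
    return dyads, 2 ** dyads    # each dyad is either tied or not


for n in (3, 4, 10, 20):
    dyads, graphs = network_space(n)
    print(f"{n} nodes: {dyads} dyads, {graphs:.3g} possible networks")
```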

Null probability distribution (1)
A more important question for the unconditional null distribution:
- Do you really care about comparing your network to networks with zero ties? Or with all possible ties?
- Or does it make more sense to compare your network to other networks with the same number of ties?
Controlling for density, does your network have more or fewer triangles than expected?

Null probability distribution (2)
Condition on the density, i.e., the number of nodes and links. This is the Conditional Uniform Graph (CUG) test:
- enumerate all possible networks for a fixed number of nodes and links,
- count the number of triangles in each network,
- construct the frequency distribution of the counts, and
- compare the value in your network.
This also reduces the sample space, but it's still a lot of graphs. With n nodes and e links there are
$\binom{\binom{n}{2}}{e} = \frac{\binom{n}{2}!}{e!\,\left(\binom{n}{2} - e\right)!}$
of them, so we will still need to sample from this space in practice.

The CUG is implemented as a permutation test
Since full enumeration is typically not possible, we sample the enumeration space by permutation:
- Randomly choose a tied dyad and a dyad without a tie
- Permute the tie and the non-tie (this preserves the exact density of the network)
- Count the number of triangles in the new network
- Repeat until you have the desired sample size
Permutation tests are often used in statistics when the distribution of a sample statistic is not known.
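A rough sketch of a CUG-style test for triangles, in plain Python rather than statnet. Note one swapped-in simplification: instead of permuting one tie and one non-tie at a time, it draws each simulated network directly as a uniform random set of dyads of the right size, which also holds the density exactly. The function names and the toy network (two disjoint 4-cliques) are assumptions made for illustration.

```python
import random
from itertools import combinations


def count_triangles(n, edges):
    """Count triangles; edges are (i, j) tuples with i < j."""
    edge_set = set(edges)
    return sum(
        1
        for a, b, c in combinations(range(n), 3)
        if (a, b) in edge_set and (a, c) in edge_set and (b, c) in edge_set
    )


def cug_triangle_test(n, observed_edges, n_sim=1000, seed=0):
    """Compare the observed triangle count with networks of the same size and density."""
    rng = random.Random(seed)
    dyads = list(combinations(range(n), 2))
    observed = count_triangles(n, observed_edges)
    # Each simulated network has exactly as many ties as the observed one.
    sims = [
        count_triangles(n, rng.sample(dyads, len(observed_edges)))
        for _ in range(n_sim)
    ]
    # One-sided p-value: share of simulated networks with at least as many triangles.
    p_value = sum(t >= observed for t in sims) / n_sim
    return observed, sims, p_value


# Toy network (hypothetical): two disjoint 4-cliques on 8 nodes.
toy_edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2))
observed, sims, p_value = cug_triangle_test(8, toy_edges)
print(observed, sum(sims) / len(sims), p_value)
```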

Null probability distribution (3)
Condition on the probability of a tie. This is the Bernoulli Random Graph (BRG) test.
- Similar to the CUG, but treats density as a random variable
- Implemented via Markov Chain Monte Carlo (MCMC):
  - Randomly choose a dyad
  - Flip a coin with probability(tie) = density of the network
  - This will not preserve the exact density for each network, but will preserve it on average
  - Repeat many times, then count the number of triangles in the final network
  - Repeat until you have a sample of the desired size
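A comparable sketch for the BRG null, again in plain Python. As a stated simplification, it flips an independent coin for every dyad instead of running an MCMC chain; for the Bernoulli model this draws from the same distribution. The names and the toy network are hypothetical.

```python
import random
from itertools import combinations


def count_triangles(n, edges):
    """Count triangles; edges are (i, j) tuples with i < j."""
    edge_set = set(edges)
    return sum(
        1
        for a, b, c in combinations(range(n), 3)
        if (a, b) in edge_set and (a, c) in edge_set and (b, c) in edge_set
    )


def brg_triangle_test(n, observed_edges, n_sim=1000, seed=0):
    """Compare the observed triangle count with Bernoulli graphs of matching tie probability."""
    rng = random.Random(seed)
    dyads = list(combinations(range(n), 2))
    density = len(observed_edges) / len(dyads)
    observed = count_triangles(n, observed_edges)
    edge_counts, triangle_counts = [], []
    for _ in range(n_sim):
        # Independent coin flip per dyad: density matches only on average.
        edges = [d for d in dyads if rng.random() < density]
        edge_counts.append(len(edges))
        triangle_counts.append(count_triangles(n, edges))
    p_value = sum(t >= observed for t in triangle_counts) / n_sim
    return observed, edge_counts, triangle_counts, p_value


# Toy network (hypothetical): two disjoint 4-cliques on 8 nodes, 12 ties.
toy_edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2))
observed, edge_counts, triangle_counts, p_value = brg_triangle_test(8, toy_edges)
print(sum(edge_counts) / len(edge_counts), p_value)
```

Unlike the CUG samples, the simulated tie counts here vary from network to network around the observed count.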

Null models in statnetweb
- Select a summary measure for the observed data
- Compare it to the distribution simulated from a null model
In statnetweb:
- We can plot null distribution overlays on degree and geodesic distributions
- And plot the CUG and BRG distributions for selected network summary measures

In statnetweb: Degree distribution
Compare the degree distribution in faux.mesa.high to what we would expect by chance.
- Go to Network Descriptives, then Degree Distribution
- Select the CUG and BRG null models
- This overlays the mean and 95% confidence intervals from 100 simulations
What do you see now?

Test for the number of isolates
Compare the number of isolates in faux.mesa.high to what we would expect by chance.
- Go to Network Descriptives, then More, then Conditional uniform graph tests

CUG test for triangles
Are there more triangles in the observed network?
Choose the triangle term from the dropdown menu and run 100 simulations to see how our network compares to the two null models, CUG and BRG.

Indeed.

Yes, the observed triangle count is high
But why? A simple null hypothesis test doesn't provide any insight about that.

Limitations of simple null hypotheses
- If we are only interested in whether the triangle counts differ from what is expected given the density of the graph, we can use these simple null hypothesis tests.
- But if we want to understand the underlying generative process, quantify the impact of each process on our network, and control for other network features, we need a more general statistical modeling framework.

Statistical Testing: Beyond Basics
- Can you control for more than just density?
- What if you want to test more than one network feature?
- And you want a model grounded in generative social theory?
That's when you turn to ERGMs.

Motivation
Why are there so many more triangles? What do you see when color-coding the nodes by their attributes?
[Figure: the faux.mesa.high network beside a simple random graph with the same tie probability]

Friend of a friend, or birds of a feather?
(At least) two theories about the process that generates triangles:
1. Homophily: people tend to choose friends who are like them in terms of grade, race, etc. ("birds of a feather"); triad closure is a by-product.
2. Transitivity: people who have friends in common tend to become friends ("friend of a friend"); triad closure is the key process.
So, for three actors in the same grade, a cycle-closing tie may form due to transitivity, but it may instead be due to homophily.

Transitivity and homophily are partially confounded
But not completely. Any tie may be classified by whether it is within grade and whether it is triangle forming:

Within grade | Triangle forming: Yes | Triangle forming: No
Yes | Both | Homophily
No | Transitivity | Neither

The cells represent how the processes jointly influence that tie, so the distribution of ties in this table is informative. This suggests we should be able to disentangle the two processes statistically.

ERGMs: Basic idea
We want to model the probability of a tie as a function of:
- Nodal attributes (that influence degree and mixing)
- The propensity for certain configurations (like triangles)
The dyads may be dependent:
- Nodal attribute effects do not induce dyad dependence
- But triad closure does
So we model the joint distribution directly.

Exponential Random Graph Model (ERGM)
Probability of observing a graph (set of relationships) y on a fixed set of nodes:
$P(Y = y \mid \theta) = \frac{\exp(\theta' g(y))}{k(\theta)}$
where:
- $g(y)$ = vector of network statistics
- $\theta$ = vector of model parameters
- $k(\theta)$ = the numerator summed over all possible networks on the node set of y
This is an exponential family model:
- Well understood statistical properties: Besag (1974), Frank (1986)
- Very general and flexible

Exponential Random Graph Model (ERGM)
Probability of observing a graph (set of relationships) y on a fixed set of nodes:
$P(Y = y \mid \theta) = \frac{\exp(\theta' g(y))}{k(\theta)}$
If you're not familiar with this kind of compact vector notation, the numerator is just
$\exp(\theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p)$
where $x_1, \dots, x_p$ are the individual network statistics in $g(y)$.
Kind of like a linear model, but a bit different (watch out for this later).

The conditional odds of a tie
The probability of the graph, $P(Y = y \mid \theta) = \exp(\theta' g(y)) / k(\theta)$, can be re-expressed as the conditional log-odds of a specific tie:
$\operatorname{logit} P(Y_{ij} = 1 \mid \text{rest of the graph}) = \log \frac{P(Y_{ij} = 1 \mid \text{rest of the graph})}{P(Y_{ij} = 0 \mid \text{rest of the graph})} = \theta' \partial g(y)$
where $\partial g(y)$ represents the change in $g(y)$ when $Y_{ij}$ is toggled between 0 and 1.
This is an auto-logistic regression ("auto" because of the possible dependence).

ERGM specification: $\theta' g(y)$
The $g(y)$ terms in the model are summary network statistics: counts of network configurations, for example:
1. Edges: $\sum y_{ij}$
2. Within-group ties: $\sum y_{ij}\, I(i \in C, j \in C)$
3. 2-stars: $\sum y_{ij} y_{ik}$
4. 3-cycles: $\sum y_{ij} y_{ik} y_{jk}$
A key distinction in the types of terms:
- Dyad independent (1 and 2 are examples)
- Dyad dependent (3 and 4 are examples)
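To make the statistics and the change statistics concrete, here is a small plain-Python sketch (not the ergm implementation). The helper names, the group `C`, and the 4-node toy network are all assumptions chosen for illustration.

```python
from itertools import combinations


def network_statistics(n, edges, group_c):
    """Edges, within-group ties, 2-stars and triangles for an undirected network."""
    tie = {frozenset(e) for e in edges}

    def y(i, j):
        return 1 if frozenset((i, j)) in tie else 0

    nodes = range(n)
    return {
        "edges": sum(y(i, j) for i, j in combinations(nodes, 2)),
        "within_group": sum(
            y(i, j) for i, j in combinations(nodes, 2) if i in group_c and j in group_c
        ),
        # A 2-star is a node i with ties to two distinct others j and k.
        "two_stars": sum(
            y(i, j) * y(i, k)
            for i in nodes
            for j, k in combinations([v for v in nodes if v != i], 2)
        ),
        "triangles": sum(
            y(i, j) * y(i, k) * y(j, k) for i, j, k in combinations(nodes, 3)
        ),
    }


def change_statistics(n, edges, group_c, i, j):
    """Change in each statistic when the (i, j) tie is toggled from 0 to 1."""
    others = [e for e in edges if frozenset(e) != frozenset((i, j))]
    with_tie = network_statistics(n, others + [(i, j)], group_c)
    without_tie = network_statistics(n, others, group_c)
    return {name: with_tie[name] - without_tie[name] for name in with_tie}


# Toy network (hypothetical): triangle 0-1-2 with a pendant tie 2-3; group C = {0, 1}.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
print(network_statistics(4, toy_edges, {0, 1}))
print(change_statistics(4, toy_edges, {0, 1}, 0, 3))
```

For the dyad independent terms the change never depends on the other ties in the network; for 2-stars and triangles it does, which is what makes those terms dyad dependent.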

ERGM specification: $\theta' g(y)$
Model specification involves:
1. Choosing the set of network statistics $g(y)$
   - From minimal (# of edges) to saturated (one term for every dyad in the network)
   - statnetweb allows you to choose from the list of terms and retrieve documentation for each one
2. Choosing homogeneity constraints on the parameters, for example, with edges:
   - all homogeneous
   - group specific (e.g., sex or age specific)
   - dyad specific

To statnetweb
Let's explore the Florentine marriage network.

Flomarriage: Bernoulli Model
- Load the flomarriage network: the network of marriage ties between families in Renaissance Florence
- On the Fit Model page, look up the documentation on the edges term

Flomarriage: Bernoulli Model
- Step 1: Add edges to the ergm formula
- Step 2: Fit the model
What does this model imply?
- Homogeneous edge probability: every tie is equally likely
- Not a very interesting model

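Interpreting the coefficients
The log-odds of any tie existing is $\theta_{\text{edges}} \times$ (change in # of ties) $= \theta_{\text{edges}}$, because adding a tie always changes the tie count by exactly 1.
The corresponding probability is $\exp(\theta_{\text{edges}}) / (1 + \exp(\theta_{\text{edges}}))$.
You can confirm that this is the density of the network.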
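A quick numerical check of the density claim, as a plain-Python sketch. It assumes the standard flomarriage data, which has 16 families and 20 marriage ties; those counts are not stated on the slide, so treat them as an assumption of this example.

```python
from math import exp, log


def logit(p):
    return log(p / (1 - p))


def expit(x):
    return exp(x) / (1 + exp(x))


# Assumed facts about the flomarriage network (not given on the slide).
n_nodes, n_ties = 16, 20
n_dyads = n_nodes * (n_nodes - 1) // 2

density = n_ties / n_dyads          # observed share of tied dyads
theta_edges = logit(density)        # what an edges-only fit should return

print(round(theta_edges, 3), round(expit(theta_edges), 4), round(density, 4))
```

Under these assumptions the edges coefficient is simply the logit of the density, and back-transforming it recovers the density.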

Flomarriage: Triad Formation
- The triangle term is a measure of clustering
- Read the documentation for the triangle term
- Fit the model edges + triangle
  - Hint: you can just add the triangle term if edges is already in your formula, then click Fit Model
- Triangle is a dyad dependent term, so the estimation algorithm changes to MCMC (more on this later)

Flomarriage: Triad Formation
Note that the triangle coefficient is not significant.
Now how to interpret the coefficients? The conditional log-odds of two actors having a tie is:
(-1.68 × change in the # of ties) + (0.16 × change in the # of triangles)
The change in the # of ties is always 1. How many triangles can one tie change?
- For a tie that will create zero triangles: -1.68
- One triangle: -1.68 + (0.16 × 1) = -1.52
- Two triangles: -1.68 + (0.16 × 2) = -1.36
Still unlikely, but a bit less so.
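The same arithmetic, with the log-odds converted to probabilities, as a plain-Python sketch. The coefficients are the two values quoted on the slide; the function names are made up for illustration.

```python
from math import exp

THETA_EDGES = -1.68      # edges coefficient quoted on the slide
THETA_TRIANGLE = 0.16    # triangle coefficient quoted on the slide


def conditional_log_odds(new_triangles):
    """Log-odds of a tie that would close `new_triangles` triangles."""
    # The tie count always changes by exactly 1.
    return THETA_EDGES * 1 + THETA_TRIANGLE * new_triangles


def probability(log_odds):
    return 1 / (1 + exp(-log_odds))


for k in (0, 1, 2):
    lo = conditional_log_odds(k)
    print(k, round(lo, 2), round(probability(lo), 3))
```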

Flomarriage: Nodal covariates
[Figure: flomarriage network with nodes sized by wealth]
What do you notice?
We can test whether edge probabilities are a function of wealth. This is a quantitative nodal attribute, so we use the ergm term nodecov.

Flomarriage: Nodal covariates
Reset the ergm formula and fit the following model: edges + nodecov("wealth")
There is a significant positive wealth effect on the odds of a tie. What does the positive coefficient mean?
- Not that there is homophily by wealth
- Just that wealthy nodes have more ties
Note that the wealth effect operates on both nodes in a dyad.

Flomarriage: Nodal covariates
The conditional log-odds of a tie between two actors is:
(-2.59 × change in # of ties) + (0.01 × wealth of node 1) + (0.01 × wealth of node 2)
- For a tie between two nodes with minimum wealth (3): -2.59 + 0.01 × (3 + 3) = -2.53
- For a tie between two nodes with maximum wealth (146): -2.59 + 0.01 × (146 + 146) = 0.33
- For a tie between nodes with maximum and minimum wealth: -2.59 + 0.01 × (146 + 3) = -1.1
Note: to specify homophily on wealth, you would use the ergm term absdiff.
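The three cases can be reproduced with a few lines of plain Python. The edges coefficient is the value quoted on the slide; the wealth coefficient of 0.01 is an assumption inferred from the slide's own results, and the function name is hypothetical.

```python
from math import exp

THETA_EDGES = -2.59    # edges coefficient quoted on the slide
THETA_WEALTH = 0.01    # assumed: the value implied by the slide's three results


def tie_log_odds(wealth_i, wealth_j):
    """Conditional log-odds of a tie; the covariate effect applies to both endpoints."""
    return THETA_EDGES + THETA_WEALTH * (wealth_i + wealth_j)


for pair in ((3, 3), (146, 146), (146, 3)):
    lo = tie_log_odds(*pair)
    print(pair, round(lo, 2), round(1 / (1 + exp(-lo)), 3))
```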

Estimation (in one slide)
There is no closed form or analytic solution for the estimated coefficients (as there is in OLS: $\hat{\beta} = (X'X)^{-1} X'Y$).
Instead, we rely on a defining property of Maximum Likelihood Estimates (MLEs) for exponential family models. At the MLE of the coefficients:
expected values of the statistics under the model = the observed statistics
And we find these MLEs using an iterative search algorithm, a Markov Chain Monte Carlo (MCMC) algorithm:
- Start with some initial θ values and simulate a sample of networks from those values
- Compare the means of the simulated statistics to the observed values
- Update the values of θ based on the deviations
- Repeat until |expected - observed| < epsilon
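The moment-matching property can be seen on a network small enough to enumerate. This is a toy plain-Python sketch, not the ergm algorithm: it replaces the MCMC step with exact enumeration of all 64 four-node networks and uses a plain gradient-style update with a fixed step size. The observed statistics (3 ties, 1 triangle), the step size, and all names are assumptions for illustration.

```python
from itertools import combinations, product
from math import exp


def all_statistics(n):
    """(edges, triangles) for every undirected network on n nodes."""
    dyads = list(combinations(range(n), 2))
    triples = list(combinations(range(n), 3))
    stats = []
    for ties in product([0, 1], repeat=len(dyads)):
        present = {d for d, t in zip(dyads, ties) if t}
        triangles = sum(
            1 for a, b, c in triples if {(a, b), (a, c), (b, c)} <= present
        )
        stats.append((len(present), triangles))
    return stats


def expected_statistics(theta, stats):
    """Exact model expectation of (edges, triangles) under exp(theta . g(y)) / k(theta)."""
    weights = [exp(theta[0] * e + theta[1] * t) for e, t in stats]
    k = sum(weights)  # normalizing constant k(theta)
    mean_edges = sum(w * e for w, (e, _) in zip(weights, stats)) / k
    mean_triangles = sum(w * t for w, (_, t) in zip(weights, stats)) / k
    return mean_edges, mean_triangles


def fit_by_moment_matching(observed, stats, step=0.1, tol=1e-6, max_iter=20000):
    """Nudge theta until expected statistics equal the observed statistics."""
    theta = [0.0, 0.0]
    expected = expected_statistics(theta, stats)
    for _ in range(max_iter):
        gap = [o - e for o, e in zip(observed, expected)]
        if max(abs(g) for g in gap) < tol:
            break
        # Raise a coefficient when the model produces too little of its statistic.
        theta = [t + step * g for t, g in zip(theta, gap)]
        expected = expected_statistics(theta, stats)
    return theta, expected


stats4 = all_statistics(4)
theta_hat, fitted = fit_by_moment_matching((3, 1), stats4)
print([round(t, 3) for t in theta_hat], [round(f, 4) for f in fitted])
```

In real problems the expectation cannot be enumerated, which is why it is approximated by the mean over MCMC-simulated networks.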

Estimation (ok, I needed 2 slides)
What does it mean to simulate networks from those values?
- Pick a dyad at random
- Toss a coin to set the tie status
  - The probability of the tie is determined by the model and the details of the MCMC sampling algorithm (Gibbs, Metropolis, Metropolis-Hastings)
- Repeat (many many many times)
This produces a Markov chain of networks.
- Sample from this chain, every 1000th element (say)
- Calculate the mean of the model statistics from this sample
- And compare this mean to the observed network statistics
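Here is a minimal plain-Python sketch of such a sampler for an edges + triangle model, using a Metropolis rule on single-dyad toggles. It illustrates the idea only and is not the ergm sampler; the function name, burn-in, interval and example coefficients are assumptions.

```python
import random
from itertools import combinations
from math import exp


def simulate_networks(n, theta_edges, theta_triangle,
                      n_samples=500, interval=50, burn_in=2000, seed=0):
    """Return sampled (edges, triangles) pairs from a chain of dyad toggles."""
    rng = random.Random(seed)
    dyads = list(combinations(range(n), 2))
    neighbours = {v: set() for v in range(n)}   # start from the empty network
    n_edges = n_triangles = 0
    samples = []
    for step in range(1, burn_in + n_samples * interval + 1):
        i, j = rng.choice(dyads)                           # pick a dyad at random
        shared = len(neighbours[i] & neighbours[j])        # triangles this tie closes
        adding = j not in neighbours[i]
        sign = 1 if adding else -1
        # Change in theta . g(y) if the dyad is toggled.
        log_ratio = sign * (theta_edges + theta_triangle * shared)
        if log_ratio >= 0 or rng.random() < exp(log_ratio):  # Metropolis acceptance
            if adding:
                neighbours[i].add(j)
                neighbours[j].add(i)
            else:
                neighbours[i].discard(j)
                neighbours[j].discard(i)
            n_edges += sign
            n_triangles += sign * shared
        # Keep only every `interval`-th network after the burn-in.
        if step > burn_in and (step - burn_in) % interval == 0:
            samples.append((n_edges, n_triangles))
    return samples


# With a zero triangle coefficient this is a Bernoulli model, so the mean tie
# count should be near (# of dyads) * expit(theta_edges) = 28 * 0.25 = 7.
samples = simulate_networks(8, -1.0986, 0.0, n_samples=1000)
mean_edges = sum(e for e, _ in samples) / len(samples)
print(round(mean_edges, 2))
```

Shortening `interval` makes consecutive samples more alike, which is the autocorrelation that MCMC diagnostics look for.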

Computationally intensive estimation
- Has been key to statistical estimation of complex (i.e., realistic and interesting) models for dependent data, and to the emergence of the field of data science
- In most cases, it works really well, and there is lots of mathematical theory proving it has good convergence properties
- But it can run into trouble, especially if the model you're trying to fit is not a good one for the observed network

Model Degeneracy
Models with dyad dependent terms can behave differently than we expect.
- They look simple, almost like logistic regression
- But they represent effects that cascade through a network via a chain of dependence (this is the "watch out" from earlier)
- Homogeneous triangle and k-star terms turn out to be some of the worst offenders

Model Degeneracy
Technical definition: a model is degenerate when it places almost all probability on a small number of uninteresting graphs.
Most common uninteresting graphs:
- Complete (all links exist)
- Empty
Model degeneracy = misspecification: the model you specified would almost never produce the network you observed.

Model Degeneracy
- Switch to the faux.mesa.high network
- Fit a model with: edges + triangle
What happens? In trying to fit this model, the algorithm heads off into networks that are much more dense than the observed network.
What does this mean? That this model would not have produced this network, for any combination of parameter estimates for the two terms; i.e., this is a model misspecification problem.

Degeneracy plot (for the 2-star model)
[Figure: degeneracy plot, from Mark Handcock's 2003 tech report]
- Only the white area has networks with some interesting variation
- The dark areas are complete graphs or empty graphs (+/- 1 or 2 edges)
This model does not produce many useful networks.

Solution: better network statistics
- Old statistic: t(x) = # of triangles in the graph. Here, every additional 3-cycle has the same impact.
- New statistic: set declining marginal returns for each additional 3-cycle involving the same edge.
  - The specific function we place on this shared partner distribution involves a geometric weighting
  - Hence the name: geometrically weighted edge-wise shared partners, a.k.a. GWESP
  - The parameter that specifies the rate of decline in marginal returns is α
  - The smaller the α, the more rapid the decline

Solution: better network statistics
$\text{gwesp} = e^{\alpha} \sum_{i=1}^{n-2} \left[ 1 - \left(1 - e^{-\alpha}\right)^{i} \right] sp_i$
where $sp_i$ = # of edges with i shared partners.
[Figure: example configuration] This configuration contains 1 edge with 3 shared partners and 6 edges with 1 shared partner.

α | GWESP(α)
0 | 7 (the # of edges with 1+ shared partners)
0.5 | 7.55
1 | 8.03
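The GWESP statistic for this configuration can be computed directly from the formula. This is a plain-Python sketch with a hypothetical `gwesp` helper, not the ergm term itself; the α values below are chosen for illustration.

```python
from math import exp


def gwesp(alpha, sp_counts):
    """GWESP statistic; sp_counts maps i to the # of edges with exactly i shared partners."""
    return exp(alpha) * sum(
        (1 - (1 - exp(-alpha)) ** i) * count for i, count in sp_counts.items()
    )


# The configuration on the slide: 6 edges with 1 shared partner, 1 edge with 3.
config = {1: 6, 3: 1}

for alpha in (0.0, 0.5, 1.0, 10.0):
    print(alpha, round(gwesp(alpha, config), 2))
```

At α = 0 only an edge's first shared partner counts, so the statistic is the number of edges in at least one triangle. As α grows, every shared partner counts almost fully and the value approaches the sum of i × sp_i, which is 3 × the number of triangles.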

Solution: better network statistics
$\text{gwesp} = e^{\alpha} \sum_{i=1}^{n-2} \left[ 1 - \left(1 - e^{-\alpha}\right)^{i} \right] sp_i$
where $sp_i$ = # of edges with i shared partners.
- As α → ∞: the count of edges in each triangle (i.e., # of triangles × 3)
- At α = 0: the count of edges in at least one triangle (because only an edge's first triangle counts)

To statnetweb
Adding a gwesp term to the faux.mesa.high model, and conducting model assessments.

Fitting and diagnosing a model
Convergence is the first assessment:
- Dyad independent models will always converge
- Dyad dependent models may or may not
The next step depends on the model:
- Dyad independent: go straight to goodness of fit assessment (GOF plots)
- Dyad dependent: convergence assessment (MCMC diagnostics), then goodness of fit assessment (GOF plots)

What are MCMC diagnostics?
MCMC diagnostics tell us if the estimation algorithm is mixing well. They are taken from the penultimate MCMC chain, which is stored in the ergm output object.
[Figure: MCMC diagnostic plots] These look pretty good:
- The traceplots on the left show random walks around the target value (a bit of correlation in the sampled networks, but not enough to cause concern)
- The distribution of sampled statistics on the right is centered on the target values

Goodness of Fit (GOF)
Traditional GOF stats can be used: AIC and BIC are included in the model summary.
We also take another approach. We are interested in how well we fit aggregate properties of the network structure that we did not include as model terms. This helps to identify what the model gets wrong.
We use 3 higher order statistics:
- Degree distribution
- Shared partner distribution (non-parametric) (local clustering)
- Geodesic distance distribution (global clustering)

The GOF workflow
1. Data, then model, then estimated coefficients
2. Simulated data (draws from the probability distribution)
3. Higher order graph statistics of the data, compared with higher order graph statistics of the simulated data
4. Goodness of fit of model to data
We'll show how to do this next.

Take a break?

So let's run and compare several models
- These will allow us to examine the evidence for homophily vs. transitivity
- We'll assess the convergence of the different models, as well as the goodness of fit
- And the implications for the generative process of high school friendship patterns in this network

Fit the Bernoulli model to faux.mesa.high
Estimate, and run the default set of GOF terms for this model: faux.mesa.high ~ edges
Is this a dyad independent or dyad dependent model?
- Dyad independent models are not fit with MCMC, so we don't need to check MCMC diagnostics
- We can move directly to GOF

Save the model
This will keep the results so we can compare them later.

Run the GOF for this model
- Go to the Goodness of Fit tab
- Run the default GOF
This will take a moment because GOF is simulating 100 networks from the model and calculating the default summary statistics for each one.

Goodness of fit measure 1: degree distribution
Model: Bernoulli (i.e., edges only)
[Figure: degree distribution GOF plot]
- The black line shows the observed data from faux.mesa.high
- The boxplots show 100 simulations from the Bernoulli model

Goodness of fit measure 2: ESP distribution (local clustering)
Model: Bernoulli (i.e., edges only)
[Figure: ESP distribution GOF plot, with an illustration of an edge that has an ESP value of 3]

Goodness of fit measure 3: geodesic distribution (global clustering)
Model: Bernoulli (i.e., edges only)
[Figure: geodesic distribution GOF plot, with an illustration on nodes A, B and C in which A and B have geodesic distance 2; the value for A and C is not preserved]

Goodness of fit measures assembled
Model: faux.mesa.high ~ edges
[Figure: GOF plots for degree, edgewise shared partner, and geodesic distributions]
Summary: not a good fit to any of the aggregate structural properties observed.

Fit a model with gwesp
Estimate, save and assess this model: faux.mesa.high ~ edges + gwesp(0.25, fixed = TRUE)
Save this model too.
- This is a dyad dependent model
- It converges (unlike the triangle model)
- It is fit with MCMC, so we should check the MCMC diagnostics

Run the MCMC diagnostics for this model
- Go to the MCMC diagnostics tab
- Select Model 2
Looks pretty good.

Run the Goodness of Fit
Model: faux.mesa.high ~ edges + gwesp(0.25, fixed = TRUE)
[Figure: GOF plots for degree, edge-wise shared partners, and minimum geodesic distance]
Much better, though the ESP distribution fit isn't great.

And, a quick eyeball test
[Figure: the observed network beside a network simulated from the model (we'll get to simulation in just a bit)]
The global structure looks kind of similar now, but something is not right.
So, back to our original question: how much of the clustering is due to homophily, and how much to transitivity?

Test this by comparing four models

Model | Network statistics g(y)
Edges | # of edges
Edges + GWESP (transitivity) | # of edges; weighted shared partners
Edges + Attributes (homophily) | # of edges; # of edges for each race, sex, grade; # of edges that are within-race, within-grade, within-sex
Edges + Attributes + GWESP (both) | # of edges; # of edges for each race, sex, grade; # of edges that are within-race, within-grade, within-sex; weighted shared partners

Fitting and saving models
statnetweb allows you to save up to 5 models; we'll fit 4 (you can cut and paste from here to statnetweb). You've already fit and saved the first two.
1. edges
   Fit model, save model, reset formula
2. edges + gwesp(0.25, fixed = TRUE)
   Fit model, save model, reset formula
3. edges + nodefactor("grade") + nodefactor("race") + nodefactor("sex") + nodematch("grade", diff = TRUE) + nodematch("race", diff = FALSE) + nodematch("sex", diff = FALSE)
   Fit model, save model, reset formula
4. edges + nodefactor("grade") + nodefactor("race") + nodefactor("sex") + nodematch("grade", diff = TRUE) + nodematch("race", diff = FALSE) + nodematch("sex", diff = FALSE) + gwesp(0.25, fixed = TRUE)
   Fit model, save model

Model Comparison
- Note how the gwesp estimate changes from model 2 to model 4: it is about 25% smaller. That's the impact of controlling for attribute effects, including homophily.
- Homophily estimates change also, once you control for transitivity.

GOF comparison for all 4 models
This will take some time to run.
1. Edges. AIC: [not shown]
2. Edges + GWESP. AIC: [not shown]
3. Edges + Attributes. AIC: [not shown]
4. Edges + Attributes + GWESP. AIC: 1648

Summary
Both transitivity and homophily play a role in clustering these friendships.
- Homophily
  - Accounts for the distribution of path lengths (geodesics)
- Transitivity (triadic closure)
  - Accounts for the large number of isolates
  - Captures the local clustering (ESP) reasonably well
- ~25% of the transitivity effect is a by-product of homophily
  - The gwesp coefficient drops by ~25% when homophily is added to the model
- The GOF suggests the ESP distribution is still not well fit
  - You could tinker some more, if this were a real research question
  - But we'll move on

Simulating networks from the model
A fitted model describes a probability distribution across all networks of this size.
- The model assigns a probability to every possible network
- The model terms and the estimated coefficients make some networks more likely than others
You can simulate networks from this distribution.
- Using the same MCMC algorithm that was used for estimation
- And the simulated networks will be centered on the network statistics in the original observed network
This, of course, is why these models are really useful for network epidemiology.

Simulations
- Choose one of the models that you have saved and run 100 simulations with the default control settings
  - Choose the model on the Simulations page, next to the ergm formula
- Do you see autocorrelation in the simulation statistics?
- Increase the MCMC interval to 10,000 and re-run the simulations to see how this changes the autocorrelation

Some common statistics in ergms
Example: an undirected network of 10 nodes with a nodal attribute color, with values 1 = black, 2 = red, 3 = green. [Figure: the example network]

Term | Formula | Unit | Value(s)
~edges | # of edges | edges | 8
~nodefactor("color") | Sum of degrees for nodes of each color | nodes/edges* | [8,] 6, 2
~nodefactor("color", base=2) | Sum of degrees for nodes of each color | nodes/edges* | 8, [6,] 2
~nodematch("color") | # of edges between nodes of same color | edges | 6
~nodematch("color", diff = TRUE) | # of edges between nodes of same color, for each color | edges | 3, 2, 1

Bracketed values correspond to the level omitted as the base category.

Some common statistics in ergms
Same example: an undirected network of 10 nodes with a nodal attribute color, with values 1 = black, 2 = red, 3 = green.

Term | Formula | Unit | Value(s)
~nodemix("color", base=1) | # of edges between nodes of each color combo | edges | [3,] 2, 2, 0, 0, 1
~degree(0) | # of nodes of degree 0 | nodes | 2
~degree(2:5) | # of nodes of degrees 2, 3, 4, 5 each | nodes | 1, 2, 1, 0
~concurrent | # of nodes of at least degree 2 | nodes | 4

Some common statistics in ergms
Same example: an undirected network of 10 nodes with a nodal attribute color, with values 1 = black, 2 = red, 3 = green.

Term | Formula | Unit | Value(s)
~triangle | # of triangles (beware!) | triangles | 2
~gwesp(0) | # of edges in at least one triangle | edges | 5
~gwesp(α → ∞) | # of edges in triangles total (= 3 × # of triangles) | triangles | 6

Network Data
Where we will see the other benefit of an MLE-based statistical modeling approach.

Network data: Three main types
- Network census: data on every node and every link. Infeasible in practice.
- Adaptively sampled networks: link tracing designs (e.g., snowball or RDS). Challenging, and requires strong assumptions for limited purposes.
- Egocentrically sampled networks. Feasible, statistically supported and general.
  - Enroll a population sample ("egos")
  - Ask them the usual questions about themselves
  - Ask them for non-identifying information about their partners ("alters"):
    - Timing (start and end of partnership)
    - Alter characteristics (sex, age, race, etc.)
    - Relational characteristics (type, cohabitation, etc.)
    - Pair-specific behaviors (act frequency, condom use, etc.)
  - Optional: ask about alter-alter ties
  - Optional: ask about perceptions of alters' alters more generally

Egocentric data
Egocentrically sampled data allow us to observe:
- Degree
  - Mean degree, which sets density
  - Degree distributions
- Nodal attributes
  - Heterogeneity in degree by nodal attributes
  - Mixing by nodal attributes
- Triads
  - Only if the alter-alter matrix data are collected
- Timing
  - Start/end, duration of active and completed partnerships
Much of the global structure of a network is set by these local properties, and we can use the observed data to estimate the ERGM coefficients.

Egocentric estimation for ERGMs
Why does this work?
MLEs for exponential families
- ERGMs are based in exponential family theory
- One of the properties of MLEs for exponential families is that E(sufficient stats under the model) = observed sufficient stats
- Any graph with the same observed sufficient stats has the same probability under the model
- So we don't need to observe the specific complete network
- We just iterate our way (using MCMC) to finding the coefficients that satisfy E(sufficient stats under the model) = observed sufficient stats
Statistical inference for sampled data
- The sufficient stats are like any other sample statistic (e.g., a sample mean)
- There is a sampling distribution for these statistics
- Which allows the standard errors to be estimated

Egocentric data in ERGMs
These can be handled in the software quite easily.
Recall that with faux.mesa.high above, we fit the ergm by providing:
- A model formula
- A network containing nodes with their attributes and the relations among those nodes
But alternatively, one can pass:
- A model formula
- A network containing nodes with their attributes
- The sufficient statistics for the terms in the model formula for that set of nodes; these are known as the "target stats"

Egocentric data in ERGMs
- Option 1 (network with observed relations): net ~ edges + triangle
- Option 2 (nodes plus target stats): net ~ edges + triangle, with target.stats = c(40, 7)

We'll be using this extensively this week
EpiModel is designed to work with both:
- Complete network data (census)
- Egocentric data with target stat specifications
So you'll get lots of practice during the labs. And we will be reviewing published examples that are based on egocentric data and that address key issues in HIV prevention and care.

Egocentric estimation for ERGMs
There is also a specific package for estimating ERGMs from egocentrically sampled data: ergm.ego
- Automates calculation of the target stats
- Handles survey weighting
- Provides other utilities for egocentric EDA
Available on CRAN
- But it is currently being refactored with a new API
- And is not yet integrated with EpiModel
In the (near) future, this will be an option for EpiModel.

Egocentric data for temporal ERGMs
The same principles apply to estimating temporal ERGMs (TERGMs), for dynamic networks.
- Specify the process of link formation and dissolution
- This requires collecting data on the duration of ties
- You'll learn more about this in the next session (on STERGMs)
And it is the foundation for dynamic, stochastic network-based epidemic simulations. This is what makes the EpiModel framework so powerful:
- Simple data collection requirements
- Robust statistical methodology for estimation and inference
- Simulations rooted in empirical network data

Summary
- Network structure influences transmission dynamics
- Statistical models for networks (ERGMs) provide a way to estimate and evaluate hypotheses about the generative processes that lead to the structures we observe
- And the fully specified models can also be used to simulate networks
  - The expected values of the model statistics from the simulated networks will match the statistics in the observed network
- Of course, the networks we want to simulate need to be dynamic (and that's where we'll go after lunch)

Selected References
- Journal of Statistical Software, volume 24 (2008): eight papers on ERGMs and statnet.
- Goodreau, S., et al. (2009). "Birds of a Feather, or Friend of a Friend? Using Statistical Network Analysis to Investigate Adolescent Social Networks." Demography 46(1).
- Krivitsky, P. N., and Morris, M. (2017). "Inference for social network models from egocentrically sampled data, with application to understanding persistent racial disparities in HIV prevalence in the US." Annals of Applied Statistics 11(1).
