On the parallelization of network diffusion models


University of Iowa
Iowa Research Online
Theses and Dissertations
Summer 2017

On the parallelization of network diffusion models
Patrick Rhomberg, University of Iowa
Copyright 2017 Patrick Rhomberg
This dissertation is available at Iowa Research Online.
Recommended Citation: Rhomberg, Patrick. "On the parallelization of network diffusion models." PhD (Doctor of Philosophy) thesis, University of Iowa, 2017.
Part of the Applied Mathematics Commons

ON THE PARALLELIZATION OF NETWORK DIFFUSION MODELS

by Patrick Rhomberg

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Applied Mathematics and Computational Sciences in the Graduate College of The University of Iowa

August 2017

Thesis Supervisor: Professor Alberto Maria Segre

Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Patrick Rhomberg has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Applied Mathematics and Computational Sciences at the August 2017 graduation.

Committee: Alberto Maria Segre, Thesis Supervisor; Sriram Pemmaraju; Bruce Ayati; Colleen Mitchell; Philip Polgreen

To unopposed bodies in motion.

Imagination is Change.

ABSTRACT

In this thesis, we investigate methods by which discrete event network diffusion simulators may execute without the restriction of lockstep or near-lockstep synchronicity. We develop a discrete event simulator that allows free clock drift between threads, develop a differential equations model to approximate the communication cost of such a simulator, and propose an algorithm by which we leverage information gathered in the natural course of simulation to redistribute agents to parallel threads such that the burden of communication is lowered during future replicates.

PUBLIC ABSTRACT

This study investigates ways to reduce the computational cost of epidemic-like simulations. We model one of the most expensive portions of large simulations, communication between computers, and explore ways to distribute workload across computers so that this communication cost is reduced.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

1 PRELIMINARY WORK
    Introduction
    Statement of the Problem
    Current State of the Art
        EpiSimdemics
        EpiFast
        GSAM
        Frías et al.
        Indemics
    Significance of This Study
    Prerequisite Discussion on Random Graph Generators
        Erdős-Rényi Graphs
        Stochastic Block Model
        Recursive Matrix (R-MAT)
        LFR-Benchmarks
            Test Set: LFR-A
            Test Set: LFR-B
            Test Set: LFR-C
            Test Set: LFR-D
        SNAP Dataset

2 DISCRETE EVENT SIMULATORS
    Development
    Implementation

3 A COMMUNICATION MODEL FOR COMPARTMENTAL ERDŐS-RÉNYI GRAPHS
    Model Definition
        Traditional Compartmental Model
        Preparation of Compartment Subdivision
        Subdivision for Latency

        3.1.4 Subdivision for Threading
        Further Subdivision to Capture Thread Communication
    Estimation of Communication Costs
    Comparison and Conversion Between Models
    Investigation
        Erdős-Rényi Graph
        Stochastic Block Model
        Stochastic Block Model - Worst Case
        Stochastic Block Model - Best Case

4 LOAD REBALANCING PROTOCOLS
    Defining Attraction
    Transfer Set Generation and Redistribution
    Performance of Attraction Functions
        Partition Scoring
        Possible Attraction Functions
    Evaluation
        Stochastic Block Model with Near Worst Case Initial Partitioning
        Stochastic Block Model with Best Case Initial Partitioning
        Hierarchical Community Structure in Recursive Matrix Graphs

5 EMPIRICAL RESULTS
    Benefiting from Parallelization
    Improvement of Performance Via Transfer Protocol
    Crossing Transfers
    Thread Count Scaling
    Community Complexity
    Performance on Real-World Graphs
    Conclusions
    Future Work
    Acknowledgments

REFERENCES
APPENDIX A: AGENT AND EDGE DATA STRUCTURES
APPENDIX B: LFR GENERATION PARAMETERS

LIST OF TABLES

3.1  The compartmental model estimates Erdős-Rényi graph communication behavior with a high degree of accuracy, though it consistently overestimates each of our observed data. Epidemic outcome as measured by total population infected is accurate within 1% of the observed mean. The estimate of transmission and overhead messages is accurate within 5%, and the estimate of updating messages is accurate within 6%.

3.2  While there is a significant difference in the uniformity of the graph between Erdős-Rényi and stochastic block model, very little of that difference is noted in our metrics when agents are assigned according to a near worst case partitioning, both as estimated by the compartmental model and as observed by the discrete event simulator. Indeed, the compartmental model uses the same parameters as it did in the Erdős-Rényi experiment, as aggregate internal and external mixing is constant between these experiments. Still, overall accuracy is relatively high.

3.3  Here we see the great impact of quality partitioning. While epidemic behavior is consistent with the previous experiment, communication costs have been reduced by an order of magnitude across all metrics. Our compartmental model overestimates communication costs and transmissions, and underestimates overhead and updating messages.

4.1  While we frame the edge attributes around our previous consideration of message types, keep in mind that we wish Attr to be meaningful within a single partition as well. Also note that, of the various attributes belonging to an edge uv, only a_uv can be known by a thread without additional communication. While a computational thread certainly knows the number of messages it sends a neighboring thread, the type of message is determined at the target. This communication can be performed in bulk after computation completes, but its cost is non-trivial when many cross-partition edges exist. We also note that any metric driven by u_uv will have attraction biased to favor cross-partition edges, as internal edges only require a Resync in the event of a cascading update. Additionally, an internal overhead message typically requires no action, but will require bookkeeping when associated metrics are used.

4.2  Possible attraction metrics from aggregated simulation data. Data shorthand defined in Table 4.1. Attr_any weighs edges by their message counts. Attr_succ counts only messages that successfully transfer infection. Attr_prop_recv weighs edges by the proportion of target agent infections received through that edge. Attr_prop_succ weighs by the proportion of messages along the edge that yield new infection. Attr_inf_fail opts for an inverse-linear penalty of updating and overhead messages. Comparative performance of these metrics is discussed in the next chapter.

Vertex-object fields used.

Edge-object fields used.

Parameters used in generation of LFR-Benchmark test sets. The parameters changed between test sets are listed last. The other parameters were chosen based on the suggested default use.

LIST OF FIGURES

1.1  An 8-agent example of the R-MAT generator's edge placement. Probabilities a, b, c, d are associated with each quadrant of the connectivity matrix. A random outcome is generated. The associated quadrant is divided into four sub-quadrants, and the process repeats. This process continues until a single entry is identified. (Here, we have used 1-indexing to identify edge (3, 5).)

1.2  A visualization of the different expected degree distributions of the Erdős-Rényi, stochastic block model, R-MAT, and LFR-Benchmark graph generators. The expected degree distributions of Erdős-Rényi and stochastic block model graphs are identical, with the only distinction between them being whether those edges exist within or across community boundaries. An LFR graph follows a power-law distribution by design, and we observe a near-power-law distribution in the R-MAT generator (though this is parameter dependent).

2.1  An example of propagation requiring Resync. Agent infection status is given by color: black when susceptible, green when infectious, and red when recovered. Propagation between agents is indicated by a directed arrow. (A) marks successful transmissions. (B) marks transmissions targeting a non-susceptible agent, which are duly ignored. At (C1), agent A3 would infect agent A2 at timestep t = 5. This message is queued to the outgoing buffer. At (C2), all local infection has resolved in Partition 1. If Partition 1 had messages in its incoming buffer, they would be processed here. After this, messages in the outgoing buffer are sent, received to Partition 0's incoming buffer. At (C3), all local infection has resolved in Partition 0. It now receives the message in its buffer for processing. Because the infection message would resolve earlier than the infection already computed, a Resync phase is required. This process is demonstrated in Figure 2.2.

2.2  A message, marked (A), is received requiring a Resync event. The target's infected behavior (infection duration and offset of outgoing edges) has already been computed. The target's infection emergence time is updated to the new message's infection time, and the target's recovery time is updated to keep the infection duration consistent. These adjustments are marked with (B). Likewise, any outgoing infection is updated to keep the corresponding edge's offset value consistent. The early propagation is marked (C). While the agent A1 was itself unchanged, we note that the propagation marked (D) has changed from a successful transmission to one that fails due to agent status.

2.3  In some instances, the action of Resync requires additional adjustment. In this example, infection durations and edge offset values match those of Figure 2.1; the only change is that agent A3 seeds infection in Partition 1 instead of agent A5. As in Figure 2.1, (A) denotes successful propagation, and (B) denotes those propagation attempts that fail due to target infection status. (C1) marks a message being sent to the outgoing buffer. In (C2), that message is transmitted to the target's incoming buffer. In (C3), the message is received for processing, which is demonstrated in Figure 2.4.

2.4  As in Figure 2.2, (A) denotes the message that requires a Resync phase. (B) marks the adjustment of the target's infection emergence time to the received message and the adjustment of its recovery time to keep infection duration consistent. (C) denotes the early propagation time along one of its edges to keep the associated offset value consistent. In this case, however, this adjusted propagation time would now successfully resolve on the targeted agent A1. Another Resync phase begins. (D) marks the adjustment of emergence and recovery times, akin to the action marked with (B). A1's propagation is also adjusted, marked by (E). Note that this propagation has changed both with respect to timestep and from a successful transmission to a failed one. Likewise, while the agent A0 was itself unchanged, we note that the propagation marked (F) has changed from a successful transmission to one that fails due to agent status.

3.1  The ŜÎR̂ and SILRK population movement diagrams. The Infected compartment is divided into two compartments, but overall individual behavior is unchanged. Also note that transitioning from compartment Ŝ to Î involves the mixing of an individual in Ŝ with an individual in Î or, in the case of the SILRK model, an individual in I or L. All other transitions are independent of individual behavior and are functions only of the respective parameter and of time.

3.2  Epidemic curves for the traditional SIR and our SILRK compartmental models reveal that division of the infected and recovered compartments does not impact epidemic behavior. We observe that compartment S is identical for both models, and that the SIR model's compartments I and R are reproduced in the SILRK model's combined compartments I + L and R + K respectively. These curves were solved computationally, using parameters β = 0.2, γ = 0.1, = 0.05, I(0) =

3.3  The threaded SILRK model is not unlike the unthreaded version. Individuals are wholly partitioned and do not deviate from their threads, as represented by each layer above.

3.4  Figure 3.4a shows the flow of individuals in a single thread of our expanded SILRK model, as given by Equations . An individual owned by thread i infected by an individual owned by thread j flows to I_ij, proceeding to either L_ij or R_ij before arriving to K_ij. Figure 3.4b represents that, as before, individuals within a single thread's compartments will remain within that thread's compartments throughout the model, giving us this stacked visualization.

3.5  Messages are captured by duplicating individuals as they mix, as given by Equations . Green arrows indicate this duplication, as opposed to the flow diagrams underlying and used previously. Figure 3.5a shows the duplication of individuals who become infected during mixing being noted as they transition from the susceptible to the infected state. Figure 3.5b shows similar copying occurring to capture updating messages from the mixing of individuals in either infected state with individuals in either early state, as well as the capturing of overhead messages resulting from the mixing of individuals in either infected state with individuals in either late state. While these equations are refined in Equations , the increased complexity prohibits as clean a visualization.

4.1  Each attraction function performs similarly overall, here operating on graphs generated using the stochastic block model with 750 communities, with an initial partitioning of agents assigned uniformly at random to one of two computational threads. We note that the proportion-successful and any-transmission attraction functions appear to regress in the middle, whereas the inverse-failure and successful-transmission functions perform slightly higher in earlier redistribution phases.

4.2  Here we see the degradation of partition quality under each attraction function when the initial partitioning is the best case partitioning. While no attraction function yields significant reduction in quality, both the proportion-successful and successful-transmission attraction functions do decrease partition quality and never recover from this deviation. The other attraction functions maintain the high quality partitioning throughout.

4.3  Evaluation of each attraction metric using score on a recursive matrix graph. We see that the nested, hierarchical structure of the R-MAT graph causes our transfer set protocol to struggle to find improvement after the first few redistributions. Performance across each metric is similar, with a slight regression in the proportion-successful and successful-transmission metrics.

4.4  Evaluation of each attraction metric using score on a recursive matrix graph. While we might have expected the difference in community sizes to exhibit a difference in scoring, the nested nature of the smaller communities is likely responsible for the similar behavior of both score metrics.

5.1  The required clock seconds of each thread during simulation on the LFR-A graph. The line represents the perfect speed-up case, which is attainable only in the absence of communication costs and with perfect partitioning. The significantly higher frequency of large clock-times in the 16-process experiments is likely due to us reaching the capacity of our computational resources.

5.2  The required clock seconds of each thread during simulation on the LFR-B graph are not significantly different from the LFR-A graph, but are included here in the interest of completeness.

5.3  The required clock seconds of each thread during simulation on the LFR-C graph are not significantly different from the LFR-A or LFR-B graphs, but are included here in the interest of completeness.

5.4  The required clock seconds of each thread during simulation on the LFR-D graph are not significantly different from the previous LFR graphs, but are included here in the interest of completeness.

5.5  Performance over multiple redistributions in stochastic block model graphs, with and without crossing transfers. Counter to our intuition, prohibiting crossing transfers unduly slows partition improvements.

5.6  Performance over multiple redistributions in an R-MAT graph, with and without crossing transfers. In either case, it appears that the nested nature of the communities inhibits partition quality more heavily than the potential trading of attractive agents between two threads.

5.7  Partition quality over redistribution in the 16-community stochastic block model graph. We see that partition quality seems to suffer significantly in the 16-community stochastic block model as we increase the number of computational threads towards the number of communities. Indeed, no improvement is observed in any simulation series when the number of threads matches the number of communities.

5.8  An alternate visualization of Figure 5.7. Here, the score of each plot is scaled by the number of processors, so that the scores in a k-thread simulation will range between 1 and k. In this way, we more clearly see the significant (relative) improvement the 4-thread case displays.

5.9  Partition quality over redistribution in the 750-community stochastic block model graph. Unlike the 16-community experiment, the addition of computational threads to a 750-community stochastic block model yields average partition score improvements consistent across most of our simulation sets. We note, however unsurprisingly, that the increased number of computational threads sees a wider variance in an individual simulation's partition score at a given replicate.

5.10  An alternate visualization of Figure 5.9. Again, the score of each plot is scaled by the number of processors, so that the scores in a k-thread simulation will range between 1 and k. We see here that the minor improvement in the 16-thread case is less significant than the increases in the 8- and 4-thread cases.

5.11  Partition quality over redistribution in stochastic block model graphs with a varying number of stochastic blocks. As the number of communities increases, we observe much more rapid initial improvements in partition score. Overall long-term partition improvement is comparable across each simulation set.

5.12  Here, we have scaled each partition score by the number of threads, so that each may be viewed in a single scale. Overall, the LFR-A test set of disjoint communities performs very well under our transfer set protocol. The improvement with each redistribution appears comparable across each experiment, relative to the number of threads involved.

5.13  The LFR-B test set with very loosely connected communities continues to perform well under our transfer set protocol, but performs less well than the LFR-A test set.

5.14  The LFR-C test set does not perform significantly worse than the LFR-B test set. In this set, we have many more multiple-community agents, but each of these still belongs to only two communities. While more diffuse as a whole, communities have only a small amount of intermixing.

5.15  The LFR-D test set sees a significant decline in transfer set protocol performance. In this test set, each multi-community agent belongs to ten communities, blurring the lines between communities a great deal.

5.16  Simulation clocks over redistribution in the Enron network. While subject to stochastic noise, we see clocks are slowly but steadily reduced over redistribution, with the least squares line possessing a slope of m.

5.17  Simulation clocks over redistribution in the DBLP network. We again see clocks slowly but consistently reduced, with the least-squares line possessing a slope of m.

5.18  Simulation clocks over redistribution in the YouTube network. With poorly-defined community structure, simulation clocks in the YouTube network do not exhibit improvement. The least squares line possesses a slope of m, an increase in clocks, but five orders of magnitude smaller than the average runtime.

CHAPTER 1
PRELIMINARY WORK

1.1 Introduction

The modeling of complex systems is the primary method of investigation for those systems which cannot be tested experimentally. Most models may be partitioned into one of two categories: mathematical models and computer simulations. While both mathematical models and computer simulations attempt to replicate the same behaviors, they do so with typically different underlying assumptions. Of primary interest here is the phenomenon of network diffusion, which takes place on discrete graphs where statuses associated with vertices are propagated along edges to neighboring vertices. Network diffusion is observed in epidemic propagation, information or rumor diffusion, and innovation adoption. For consistency in language, we will frame our investigation in the epidemiological context.

The mathematical models of interest here are often built in the form of differential equations intended to mirror observed or expected phenomena. Particularly in the application of epidemiology, the most common structure of this sort is compartmental modeling. This de facto standard epidemiological model was introduced in 1927 by Kermack and McKendrick [30, 31, 32], and variations of this model have been explored extensively in [1, 8, 17, 15, 23, 24, 22], to name only a few. In Kermack and McKendrick's model, an epidemic spreads through a uniformly mixing population. Individuals typically have one of three internal states: susceptible, infectious, or removed. The removed state includes any agent who no longer participates in infection propagation due to immunity conferred upon recovery, death, or quarantine from which the individual does not return during modeling. Models that do not consider or do not distinguish these latter possibilities may refer to this final state as recovered. In this thesis, we will favor the term recovered to avoid potential confusion with algorithmic interpretations of removed and removal.
The majority of individuals begin susceptible to infection, with only a small

proportion initially infected. When mixing with an infectious individual, there exists some possibility that a susceptible individual acquires the infection, transitioning from susceptible to infectious. Similarly, an infectious individual has some (other) possibility of removal from simulation (e.g., recovery, death, quarantine). We represent the proportions of the population that have the status susceptible, infectious, and recovered with the functions S, I, and R. While each is a function of time, we suppress the parameterization (t) for simplicity and clarity. Allowing ḟ to denote the derivative of f with respect to time, the standard compartmental differential equations modeling a susceptible-infectious-recovered infection under uniform mixing can be given by:

    Ṡ = -βSI          (1.1)
    İ = βSI - γI      (1.2)
    Ṙ = γI            (1.3)

where the transmission rate β represents both inter-compartment mixing and probability of infection, and the recovery rate γ represents the speed with which individuals recover. Note that Kermack and McKendrick's model here makes an implicit mass action assumption, that is, that the behavior of the system is governed by the proportions of the three populations in the model rather than the behavior of any individual agent.

While the contribution of Kermack and McKendrick has been influential, the mass-action assumption is highly restrictive. In small or moderately sized populations, the behavior of the individual plays a larger role relative to the average, aggregate behavior. As such, these models break down at these smaller scales. This loss of accuracy can be mitigated somewhat by sub-compartmentalization, extending each population compartment to distinguish populations that behave differently [27]. Alternatively, a model may be extended to include additional information, such as an

individual's location [26].

Extending the concept of sub-compartmentalization to the extreme case, we turn to discrete event simulation. That is, we consider compartments consisting of a single individual and consider every individual's connections and behavior as distinct. The assumptions that drive continuous models deteriorate at this level. En masse, one can consider a fraction of a compartment as a proportion of a population, and as such, it is acceptable to use fractional values when representing the state of the population. However, it is not sensible to consider a population of one individual to have an infected population consisting of a fractional amount (ignoring any probabilistic interpretation).

In a discrete event simulation, variables (infection state, time) and events (infection propagation, recovery) can be represented as discrete entities. Unlike the continuous time of the above differential equations, there exists some discrete treatment of time that is still granular enough to capture phenomena of interest. Each point in this discretized timeframe is called a timestep. Additionally, the population itself is discretized. Where differential equations may, under mass action assumptions, consider compartments of the population as a whole, discrete event simulations will typically represent each individual in a population as a vertex in a discrete graph, using weighted edges to denote the possibility of transmitting infection between one vertex and another. These graphs may be randomly generated [2], snapshots of known or inferred networks [27], or a combination of networks generated to fit observed metrics [13]. Absent assumptions about uniform behavior, discrete event simulations may also develop in-depth agent behavior models [7]. This may include activity scheduling, response to the infection status of the individual or its neighbors, or the evolution of graph edges or associated edge-weights over time. These concepts and more are examined in depth in, e.g., Epstein's Agent Zero [15].
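Both modeling styles discussed in this section can be made concrete with a short sketch: forward-Euler integration of the compartmental equations (1.1)-(1.3), alongside a per-individual simulation on a small graph (the extreme sub-compartmentalization described above). This is an illustrative sketch only; the Euler step size, the ring topology, and all rates are invented for the example, and this is not the simulator developed in later chapters.

```python
import random

def euler_sir(beta, gamma, i0, dt=0.01, t_end=200.0):
    """Forward-Euler integration of the compartmental SIR equations."""
    s, i, r = 1.0 - i0, i0, 0.0
    for _ in range(int(t_end / dt)):
        ds = -beta * s * i               # equation (1.1)
        di = beta * s * i - gamma * i    # equation (1.2)
        dr = gamma * i                   # equation (1.3)
        s, i, r = s + dt * ds, i + dt * di, r + dt * dr
    return s, i, r

def network_sir(adj, beta, gamma, seeds, max_steps=10_000, rng=None):
    """Per-individual SIR on a graph: every vertex is its own compartment.

    adj maps each vertex to its neighbor list; beta is a per-edge,
    per-timestep transmission probability and gamma a per-timestep
    recovery probability (edge weights are elided for brevity).
    """
    rng = rng or random.Random(0)
    state = {v: "S" for v in adj}
    for v in seeds:
        state[v] = "I"
    for step in range(max_steps):
        infectious = [v for v in adj if state[v] == "I"]
        if not infectious:
            return step, state           # the epidemic has died out
        newly_infected, recovered = set(), set()
        for v in infectious:
            for u in adj[v]:
                if state[u] == "S" and rng.random() < beta:
                    newly_infected.add(u)
            if rng.random() < gamma:
                recovered.add(v)
        for u in newly_infected:
            state[u] = "I"
        for v in recovered:
            state[v] = "R"
    return max_steps, state

# Continuous model: fractions of a uniformly mixing population.
s, i, r = euler_sir(beta=0.2, gamma=0.1, i0=0.01)

# Discrete model: 20 individuals on a toy ring graph.
n = 20
ring = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
steps, final = network_sir(ring, beta=0.5, gamma=0.2, seeds=[0])
```

Note the distinction drawn above: the continuous model returns population fractions that may take any value in [0, 1], while every vertex in the discrete simulation is in exactly one of the states S, I, or R at each timestep.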

1.2 Statement of the Problem

While discrete event simulations may model agent behavior to any desired degree, they are significantly more expensive than differential equation models. Development time for discrete event simulations is typically much longer, as a model must be created incorporating many possible parameters and then implemented, often from scratch, to yield agents displaying all the characteristics and behavior of interest. Moreover, these characteristics and behaviors are themselves difficult to validate against observed data, especially as agents become more complex, so it is often hard to determine if agent behavior is in line with the desired course of study. Not least of all, discrete event simulations are also significantly more computationally costly. Since every individual of a population is modeled as distinct, large-scale discrete event simulations require both processing power and memory orders of magnitude larger than most differential equation solvers. Additionally, differential equation models may adjust the flow of simulation time, such as with a Runge-Kutta-Fehlberg solver, or may be analyzed or solved outside of a temporal scale. Conversely, discrete event simulators are typically bound to simulate every timestep, with cost increasing in proportion to simulation length.

Large-scale discrete event simulation carries with it computational and memory costs that are too large to be processed by a single machine. In the most common parallelization paradigm of discrete event simulators, the application is broken into multiple threads. Threads are autonomous sequences of computation, with information shared between threads via message passing. A single physical machine may support multiple threads; this is commonly referred to as multithreading, in which case message passing may be done using shared memory. Otherwise, computation that utilizes multiple threads existing across multiple machines is referred to as distributed computing.
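The multithreading case just described can be sketched with shared-memory queues. The following is a hypothetical, minimal illustration of lockstep synchrony, in which each thread messages every peer and then blocks until it has heard from all of them before advancing its clock; the thread count, round count, and message payloads are invented for the example.

```python
import queue
import threading

def lockstep_worker(tid, n_threads, rounds, inboxes, log):
    """One computational thread in a lockstep-synchronous simulation."""
    for step in range(rounds):
        payload = (tid, step)                 # stand-in for one timestep of local work
        for other in range(n_threads):
            if other != tid:
                inboxes[other].put(payload)   # shared-memory message passing
        # Simple count-based barrier: block until n_threads - 1 messages
        # arrive, so threads advance in near lockstep (a real simulator
        # would match messages to timesteps).
        for _ in range(n_threads - 1):
            inboxes[tid].get()
        log.append((tid, step))

n_threads, rounds = 3, 4
inboxes = [queue.Queue() for _ in range(n_threads)]
log = []                                      # list.append is thread-safe in CPython
threads = [threading.Thread(target=lockstep_worker,
                            args=(t, n_threads, rounds, inboxes, log))
           for t in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every thread completes every round: n_threads * rounds log entries.
```

Relaxing exactly this kind of barrier, so that thread clocks may drift freely, is the subject of Chapter 2; in the distributed setting, the queues are replaced by network communication at far greater cost.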
Message passing in distributed computing is achieved with network communication. Threads may be assigned to machines in any fashion that physical resources allow. Message passing via shared memory is significantly different from network communication; the computational cost of shared-memory message

passing is trivial in comparison to that of network communication. In distributed computing, it is common to refer to a physical machine as a node. However, we will eschew this term to avoid confusion between a machine in distributed computing and a vertex in a discrete graph. We will instead use the term agent to refer to a vertex in the discrete graph and explicitly refer to a thread or machine as the point of computation, noting that a single machine can support multiple threads.

1.3 Current State of the Art

Large-scale agent-based simulation is an active field of study. Representative of this work within the last ten years are the simulators EpiSimdemics (2008) [3], EpiFast (2009) [6], GSAM (2011) [37], a simulator proposed by Frías et al. (2011) [18], and Indemics (2014) [5].

EpiSimdemics

EpiSimdemics [3] is a distributed agent simulator meant to model epidemics and specifically the impact of epidemic intervention strategies. Agent behavior is moderately complex and may change based on the agent's infection status or environmental variables. Three insights drive EpiSimdemics. First, simulation is divided between agents and locations to produce a bipartite contact graph of agents and the locations they visit. Simulation is driven by each agent's schedule, a process by which an agent is assigned to a location for some number of timesteps. This allows locations to aggregate co-local agents and abstract agent-to-agent mixing, which consequently allows simpler computation of the transmission of infection; the location, knowing the level of infection present and the mixing rate between co-local agents, determines the expected number of new infections, selecting and notifying connected agents of their infection.

Second, EpiSimdemics uses an expanded epidemic framework wherein agent state proceeds along one of three paths. In the first path, an infected agent enters

a latent period before symptoms emerge. Upon symptom emergence, they enter an infectious state, after which they transition to the removed state. In the second path, an infected agent never exhibits symptoms. The agent first enters an incubating state, transitioning then to the asymptomatic state and finally the removed state. In the last path, an agent might seek preventative treatment (i.e., vaccination) and proceed directly from the uninfected to the removed state.

When EpiSimdemics is parallelized, agent behavior must be communicated between processing threads for those interactions that cross partitions. In a simple approach, every thread would communicate with every other thread between each simulation timestep, proceeding in lockstep synchrony. However, due to this disease framework, there exists some amount of time in which an agent might leave the susceptible state but not be aware of having done so, regardless of which infection path the agent takes. Because the agent is not aware of its infection, its behavior in this period is no different than when the agent is susceptible. As a result, synchronization in EpiSimdemics may occur less frequently, and simulation instead proceeds in near-lockstep synchrony. This reduction of synchronization frequency represents a significant savings in communication costs.

The third significant insight of EpiSimdemics relates to communication optimization. Much of the computation that surrounds the packing, sending, routing, receipt, and unpacking of network communication is itself orthogonal to the computation related to agent simulation. Instead, a dedicated message broker acts like a blackboard on which other components (agents or locations) can write messages, trusting that the message will eventually be delivered to the relevant component in a timely fashion.
Moreover, a message broker can perform some basic preprocessing of messages as it aggregates them, allowing each simulating component to see only those messages that are relevant to its simulation. For instance, because EpiSimdemics proceeds in near-lockstep synchrony, some messages may not be needed until the next synchronization cycle, potentially several timesteps in the future. This allows redundant messages (e.g., multiple infection messages targeting the same agent) to be simplified as they

are aggregated.

As the size of the simulation grows and additional computational threads are added to the distributed simulation, communication between threads becomes a dominant cost, even when improved via message brokers. As a result, a more careful partitioning of the bipartite contact graph is required to reduce the number of messages that need to be sent. These improvements are examined in [43], showing viable scalability from hundreds of processors in EpiSimdemics to tens of thousands of processors belonging to the Blue Waters supercomputer cluster. However, this scalability comes at the significant preprocessing cost of partitioning. EpiSimdemics uses the METIS partitioner [29] to determine partitioning, which has theoretical complexity of O((n + m) log k) on a graph of n agents with m edges to be partitioned into k partitions.

EpiFast

EpiFast [6] is another high-performance distributed agent simulator with an active focus on disease interventions. It follows a structure similar to a traditional distributed agent-based simulator, not unlike the simulator we present in Chapter 2 (Algorithm ). In this simulator, agents are represented by members of a contact network and, at each timestep, there exists some possibility that infection is spread between neighboring agents. When distributed across multiple processing threads, each thread computes the local outcome and synchronizes with other threads at the end of each timestep, progressing in lockstep synchrony.

Two key insights drive EpiFast. First, EpiFast uses a Master/Slave structure as the basis for distributed computation. Slaves perform simulation on a partition of the contact network for one timestep and report the outcome to the Master. All cross-partition infection messages are sent through the Master. When all threads have finished the timestep's computation, the Master aggregates messages and forwards them to the appropriate Slave thread.
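The Master's per-timestep gather-and-route step might be sketched as follows (a hypothetical data layout of our own devising; not EpiFast code):

```python
def master_timestep(slave_reports, partition_of):
    """Aggregate the cross-partition infection messages reported by each
    Slave for one timestep, then route each message to the Slave that
    owns the target agent. (Illustrative sketch, not EpiFast code.)

    slave_reports: {slave_id: [(source_agent, target_agent), ...]}
    partition_of:  {agent_id: slave_id owning that agent}
    """
    outbox = {s: [] for s in slave_reports}
    for messages in slave_reports.values():
        for source, target in messages:
            owner = partition_of[target]
            outbox.setdefault(owner, []).append((source, target))
    return outbox  # forwarded to Slaves, which then begin the next timestep

# Two slaves; agents 0-1 live on slave 0, agents 2-3 on slave 1.
partition = {0: 0, 1: 0, 2: 1, 3: 1}
reports = {0: [(1, 2)], 1: [(3, 0)]}        # both infections cross partitions
print(master_timestep(reports, partition))   # {0: [(3, 0)], 1: [(1, 2)]}
```

The barrier is implicit: the Master only routes once every Slave's report for the timestep has arrived.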
Upon receiving the infection messages (if any) from the Master, Slave processes continue to the next timestep. Computation proceeds in this

way until an upper bound timestep parameter is reached and computation halts. The use of a Master in this way is similar to the message brokers used by EpiSimdemics, as if every message broker were consolidated into the single Master. Consequently, this Slave/Master structure provides a significant reduction in communication costs, similar to EpiSimdemics. Also similar to EpiSimdemics, communication costs are significantly dependent on the structure of the network being simulated and, more importantly, the partitioning of that network in its assignment to processors. As with EpiSimdemics, this problem is addressed by preprocessing via the METIS partitioning scheme [29].

Secondly, while not unique to EpiFast, its presentation formalizes the problem of epidemic simulation in such a way as to explicitly reveal the underlying structure of infection. That is, it recognizes that the transmission network produces a directed acyclic graph within the contact network. This transmission network consists of the set of initial infections at level 0, with level i consisting of each agent that became infected at time i, and with a directed edge (u, v) representing transmission of disease from u to v. Much of the work presented in Chapter 2 is based on a similar recognition of the transmission network.

Because of its relative simplicity in implementation, EpiFast lends itself very well to extension. In [28], two such extensions are offered. In the first implementation, to address the importance of efficient partitioning in large-scale networks, the problem space is reduced via MapReduce [14], a method by which components of a graph are abstracted and a smaller meta-graph is considered in its place. Additionally, a high-availability, fault-tolerant distributed database is used to drive the core of simulation messaging. In the second implementation offered in [28], orthogonal computation is decoupled, similar to the decoupling of messaging from simulation in EpiSimdemics.
However, this implementation decouples various aspects of an agent s behavior. For instance, an agent s schedule may be independent of infection state computations, and as such, these simulation components may be parallelized. By decoupling these

workloads, the aggregate of agent behavior may be computed more quickly, and these savings are compounded across the many agents being simulated.

GSAM

GSAM [37], the Global-Scale Agent Model, was designed with the objective of simulating populations of several billion agents. When simulating populations at this scale, every possible memory- and computation-saving technique must be considered. GSAM saves on memory and computation by first dividing the population into geographically-contiguous regions called MemoryBlocks (MBs). An MB might represent a city block, a zip code, a census tract, or some other region as defined by the input data. Each MB is simulated by a separate thread.

Even within the division of MBs, the population simulated on a single thread may be very large. To reduce computational complexity, a process ignores all agents that are not active. An active agent is any agent that is currently symptomatic and/or infectious. A non-active agent does not influence the epidemic outcome of simulation, except in those moments when it acquires infection and becomes active. This spread of infection occurs in two parts: first, the active agent is determined to have contacted some inactive agent to spread infection; second, the inactive agent is identified and assigned to the active set. The first of these steps is determined by simulation parameters. Contact rates are determined by agent and environment parameters. Infectivity or susceptibility rates are determined by infection parameters. Once it is determined that contact will be made, the type and target of contact are determined. Contacts are divided into two categories: frequent contacts and random contacts. Frequent contacts represent contacts between agents that occur regularly, such as interaction between agents that share a family, workplace, etc. Random contacts can be considered the fleeting contact between strangers, for instance those agents sharing a bus or train.
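The active-set mechanics can be sketched as follows (an illustrative Python sketch with hypothetical parameter names such as p_frequent and p_random; not GSAM code):

```python
import random

def spread_step(active, social_net, mb_population, p_frequent, p_random, rng):
    """One infection-spread step over a MemoryBlock's active set.
    Each active agent may infect a frequent contact (sampled from its
    social network) and/or a random contact (sampled from the whole
    MemoryBlock population); newly infected agents join the active set.
    Inactive agents are never iterated over. (Illustrative sketch with
    hypothetical parameter names; not GSAM code.)"""
    newly_active = set()
    for agent in active:
        # Frequent contact: regular interaction (family, workplace, ...).
        contacts = social_net.get(agent, [])
        if contacts and rng.random() < p_frequent:
            newly_active.add(rng.choice(contacts))
        # Random contact: a fleeting encounter (shared bus, train, ...).
        if rng.random() < p_random:
            newly_active.add(rng.choice(mb_population))
    return active | newly_active

rng = random.Random(0)
social = {0: [1, 2], 5: [6]}
active = spread_step({0, 5}, social, list(range(10)), 0.9, 0.2, rng)
```

The key property is that the loop touches only the active set; the rest of the MB's population is pure background until sampled as a target.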
Frequent contacts are known to the agent and might be represented by a

social network underlying the simulation. When a frequent contact is the target for the spread of infection, the inactive agent is sampled from this social network. Conversely, if a random contact is made, an agent within the MB is selected at random according to some probability distribution function governing random agent mixing. The selected agent is then added to the active set. In this way, new infections can be introduced to the active set without requiring a full computation of every agent's behavior. Those agents that are not active to the propagation of infection are simply background.

Some infections will naturally cross MB boundaries, and so synchronization is required. This synchronization is performed periodically in bulk to reduce the overhead inherent to message passing between processes. Because GSAM is implemented in Java, this bulk communication can be performed easily and efficiently via Java's Remote Method Invocations. By reducing events to an easily serialized array of Java primitive types, RMIs can transmit millions of inter-MB events in seconds. Lastly, and apparently a point of pride for the authors, it is notable that GSAM is Java-based. As Java is a platform-agnostic language, GSAM can be run on nearly any computational architecture.

Frías et al

We mention [18] primarily for their novel application of data. In this simulator, Frías et al create realistic contact networks and agent mobility patterns from the positioning data present in Call Detail Records (CDRs). CDRs are created whenever a mobile phone makes or receives a phone call or uses a service (e.g., a text message). Given the ubiquity of mobile devices today and the frequency with which these devices are used, this dataset is a rich and high-resolution source of agent location and mobility data. Moreover, this data allows the authors to infer a social network between the users of devices that communicate with each other.
The agent model of Frías et al makes use of both of these sources of data, developing a mobility user model and a social user model. The mobility model

divides the landscape according to coverage of the Base Transceiver Stations (BTS) that connect cellular phones to the network. Each CDR is associated to a particular BTS, allowing the location of the agent to be known to be within the region of that BTS. While networks naturally have overlaps in BTS coverage, these regions are approximated as disjoint regions via Voronoi tessellation.

The social user model is gathered naturally from the set of CDRs. Agents communicate with those agents in their social network. As such, any pair of agents that both send and receive some type of communication are considered to share a social network. Frías et al distinguish weekday and weekend social networks. It is then assumed that any two agents that are co-located in some BTS region are more likely to be in contact with one another if they share a social network. This difference in probability drives the social-contact aspect of disease transfer.

The simulator used by Frías et al is a traditional discrete event simulator, modeling each agent in the population and proceeding in timesteps. At each timestep, the simulator (1) determines in which BTS region each agent is located, (2) identifies which BTS regions contain at least one infective agent, (3) determines, for each infective agent, which agents are nearby by selecting agents in the same BTS region and in the infective agent's social network with some (higher) probability and selecting agents in the same BTS region but not in the infective agent's social network with some (lower) probability, (4) transmits infection to each selected agent according to disease parameters, and finally (5) updates agent infection states and removes those agents in the Removed state from simulation.

Indemics

Lastly, EpiFast has recently been extended to the simulator Indemics [5], the Interactive Epidemics Simulation.
Indemics is a user-facing implementation that also reaches for improvements to performance via the decoupling of orthogonal computational components. In this case, the authors separate computations involving (a) epidemic intervention and behavior adaptation, (b) infection state assessment,

and (c) disease transmission. These various components are separated into a system architecture consisting of an Indemics Client (IC), an Intervention Simulation and Assessment Engine (ISAE), an Epidemic Propagation Simulation Engine (EPSE), and the Middleware Platform (MP) that connects the three components.

The primary contribution of Indemics is its user interface. As the name emphasizes, Indemics is meant to be interactive. The IC allows a user to control simulation parameters, control the simulation itself (e.g., to pause, rollback, or resume), assess the current state of simulation, and respond with epidemic intervention measures, all in real time. The EPSE provides the core of the actual simulation processing, implementing a form of EpiFast that has been modified to support external interventions. The ISAE stores simulation data to an Oracle relational database, which comes with built-in error handling, query optimization, synchronization, and fault tolerance. The MP connects each of these components, translating information between them as required by each component's architecture. The MP relays information requests between the IC and the ISAE, intervention specifications between the IC and the EPSE, and simulation data from the EPSE to the long-term storage of the ISAE.

While it may seem that Indemics is simply an application of EpiFast at scale, we recognize the difficulty of developing a meaningful user interface that can interact with a complex simulation backend. In particular, Indemics is a simulation framework that does not require significant familiarity with computer systems to use. It is meant to be used by those people who may indeed enact epidemic intervention strategies, for instance in the work of Marathe et al investigating intervention strategies in schools and the impact this has on epidemic behavior in the population at large.

1.4 Significance of This Study

The previous simulators cast a daunting shadow.
They process enormous populations, some with dynamic contact networks, process location entities or are spatially aware, and employ every cost-savings technique available to improve their

scalability. In this thesis, we work on static graphs, with agents defined by their contacts and not by any scheduling behavior. Agents are themselves very static, with contacts between agents persisting for the entirety of simulation and no new contacts developed during simulation. However, one feature shared by all these state-of-the-art simulators is the need for frequent synchronization across the distributed computational resources. In this research, we propose a simulation framework that explicitly defies this requirement, allowing computation of a subset of a population to execute to completion and updating simulation history to resolve conflicts that may arise between computational threads.

We also note that many of the themes of the previously mentioned state-of-the-art methods are not mutually exclusive with our own proposed simulator and optimizations. In particular, the application of a high-availability database, the use of communication managers and proxies, and the novel selection of data could all be adopted into our own proposed simulator. At the same time, much of our own work shares themes with the state-of-the-art. The redistribution protocol we propose in Chapter 4 aims to achieve the runtime improvement that EpiSimdemics, EpiFast, and Indemics realize through METIS preprocessing. This redistribution focuses on the actual infection outcomes, not the network in its entirety, to determine which agents should be co-located to a processing thread. In this way, we adopt some of the active-set ideals of GSAM.

Both the work proposed in this thesis and the simulators EpiSimdemics, EpiFast, and Indemics (through their reliance on METIS) require that the underlying graph structure remain static to ensure that partitioning improvements are meaningful.
This reliance on METIS represents a costly preprocessing step, required to determine an advantageous assignment of agents to processors that slows the rate at which communication costs come to dominate total simulation cost as additional processors are added. We eschew this preprocessing. Instead, we propose an adaptive workload rebalancing protocol that leverages regularities implicit in the underlying model's computation to improve, over the course of these many replicates, the distribution of

agents to threads in a way that respects community structure and reduces overall inter-thread communication. Because discrete event simulators are typically based on some stochastic action, they must be run over many instances and replicates for meaningful patterns to emerge from random noise, from the model's own sensitivity to poorly chosen initial parameters (e.g., disease transmission probability), or from uncertainty inherent to the individual agent models. This allows our improvements in communication cost to be realized for every subsequent simulation replicate, for a benefit that increases with the workload expected.

The remainder of this thesis is structured as follows: In Chapter 2, we develop a parallelized discrete event simulator that does not require lockstep or near-lockstep synchronization, instead relying on a more flexible method of synchronization. In Chapter 3, we develop a system of differential equations to model the expected communication costs of a parallelized discrete event simulator and examine the effect of partitioning. In Chapter 4, we discuss metrics under which edge strength and vertex community may be inferred from data generated as part of parallelized discrete event simulation. We explore how this data can be leveraged to improve the assignment of network vertices to processing threads. In Chapter 5, we discuss the empirical effects of vertex redistribution on the computational cost of discrete event simulation.

1.5 Prerequisite Discussion on Random Graph Generators

Before we begin our investigation in earnest, we must address the graphs that we will use for testing and validation. For much of our validation, the graphs we use will require community structure to be known as a ground truth. Due to this requirement and the paucity of real-world network data with significant community coverage, we rely predominantly on randomly-generated graphs.
In order from least to most representative of real-world contact networks, we make use of Erdős-Rényi, Stochastic Block Model, Recursive Matrix, and Lancichinetti-Fortunato-Radicchi Benchmark graph generation algorithms. Each generation algorithm is examined here. In particular, we examine whenever possible the degree distributions, diameter, and clustering

of each graph generation algorithm.

Erdős-Rényi Graphs

Erdős-Rényi graph generation is developed fully in [16]. Graphs are generated by two parameters, a number of agents n and a probability p of an edge existing between any two agents. For every possible edge, a random process determines if that edge exists, with probability p. An alternative formulation uses parameters n and a number of edges m, where m of the possible n(n − 1) edges are chosen to exist uniformly at random. The two formulations are similar, but it is important to note that while setting p = m/(n(n − 1)) yields the same result in expectation, the realized instances of graphs generated will differ.

Agents in Erdős-Rényi graphs are homogeneous. Every edge exists independently at random, and so no particular difference in local network structure can be expected before generation occurs. As such, there is no expectation of community structure, since every agent is equally likely to neighbor any other agent. Alternatively, we can consider Erdős-Rényi graphs to contain exactly one community, consisting of the entire graph. Erdős-Rényi graphs have a binomial expected degree distribution, with

P(deg(v) = k) = \binom{n-1}{k} p^k (1 - p)^{n-1-k}

Because every agent is equally likely to be connected to any other, Erdős-Rényi graphs tend to have small diameter, with the diameter of the graph increasing logarithmically with the number of agents. Also due to the uniform randomness of connection, Erdős-Rényi graphs tend to have a very low clustering coefficient. Nothing drives neighboring agents to share their neighbors with each other beyond stochastic noise.
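Both formulations are straightforward to realize directly; the following is a minimal sketch for the directed case (function names are ours):

```python
import random

def erdos_renyi(n, p, rng):
    """G(n, p): each of the n(n-1) possible directed edges exists
    independently with probability p."""
    return [(u, v) for u in range(n) for v in range(n)
            if u != v and rng.random() < p]

def erdos_renyi_gnm(n, m, rng):
    """G(n, m): m of the n(n-1) possible directed edges, chosen
    uniformly at random without replacement."""
    possible = [(u, v) for u in range(n) for v in range(n) if u != v]
    return rng.sample(possible, m)

rng = random.Random(42)
g = erdos_renyi(100, 0.05, rng)      # expected |E| = 100 * 99 * 0.05 = 495
h = erdos_renyi_gnm(100, 495, rng)   # exactly 495 edges
```

Note that g has 495 edges only in expectation, while h has exactly 495, illustrating the distinction between the two formulations.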

Stochastic Block Model

The Stochastic Block Model is developed fully in [25]. It may be considered a generalization of Erdős-Rényi graphs. Graphs are generated by two parameters, a partitioning of agents \mathcal{P} = \{V_1, \ldots, V_k\} and a probability matrix P. Similar to Erdős-Rényi graphs, edges between agents exist independently at random. Rather than a single probability p, however, an edge between u ∈ V_i and v ∈ V_j exists with probability P_{ij}. We note that an Erdős-Rényi graph with edge probability p is equivalent to the stochastic block model special case \mathcal{P} = \{V\}, P = [p]. Indeed, each partition V_k is locally an Erdős-Rényi graph, connected with probability P_{kk}. If P_{ii} > P_{ij} for all i ≠ j, the model is weakly assortative, and if \min_i(P_{ii}) > \max_{j ≠ k}(P_{jk}), the model is strongly assortative.

We note that assortative stochastic block models exhibit community structure in their partitioning, with agents being more likely to share an edge within their partition than with an agent belonging to another partition. Due to our interest in community structure, we focus our use of stochastic block model graphs on those that are strongly assortative. The degree distribution of stochastic block model graphs is the sum of binomial distributions, one for each partition. Let deg(v) = \mathbf{k} denote the compartment-wise degree of v, where k_j is the number of edges from v to any agent in compartment j. It follows that

P(deg(v ∈ V_i) = \mathbf{k}) = \prod_j \binom{n-1}{k_j} P_{ij}^{k_j} (1 - P_{ij})^{n-1-k_j}

Because stochastic block model graphs are fundamentally similar to Erdős-Rényi graphs, the diameter of stochastic block model graphs also tends to increase logarithmically with the size. Likewise, the clustering coefficient in stochastic block model graphs tends to be low, as only stochastic noise is responsible for any two neighboring agents sharing their respective neighbors.
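Generating such a graph is a direct extension of Erdős-Rényi generation; a minimal sketch (function name and example parameters are ours):

```python
import random

def stochastic_block_model(partition, P, rng):
    """partition: list of agent-id lists (V_1, ..., V_k).
    P: k-by-k probability matrix; a directed edge (u, v) with u in V_i
    and v in V_j exists independently with probability P[i][j]."""
    edges = []
    for i, Vi in enumerate(partition):
        for j, Vj in enumerate(partition):
            for u in Vi:
                for v in Vj:
                    if u != v and rng.random() < P[i][j]:
                        edges.append((u, v))
    return edges

# Strongly assortative example: min diagonal 0.10 > max off-diagonal 0.01.
blocks = [list(range(0, 50)), list(range(50, 100))]
P = [[0.10, 0.01],
     [0.01, 0.10]]
g = stochastic_block_model(blocks, P, random.Random(1))
```

With these parameters, within-block edges outnumber cross-block edges by roughly an order of magnitude, which is precisely the community structure the assortativity conditions describe.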

Recursive Matrix (R-MAT)

The Recursive Matrix, or R-MAT, model is developed fully in [9]. Its focus is to develop a network with community structure that also incorporates the hierarchical nature of communities and subcommunities. For instance, a student may have a community of their friends, all of whom belong to a community of their classmates, all of whom belong to the community of the school at large. R-MAT graph generation is based on six parameters: a number of agents n, a number of edges m, and relative probabilities a, b, c, d, with a + b + c + d = 1. For ease of bookkeeping, n is typically restricted to be a power of 2. Typically, a ≥ b, a ≥ c, and a > d. The authors suggest a/b ≈ a/c ≈ 3.

These parameters are used to create a network's connectivity matrix C, in which C_{uv} > 0 when there exists an edge from u to v, and C_{uv} = 0 when there is not. In some applications, C_{uv} ∈ [0, ∞) to denote an edge weight of the edge (u, v). Alternatively, it may be a simple indicator of edge existence, with C_{uv} ∈ \{0, 1\}. R-MAT uses the probabilities a, b, c, d to generate the connectivity matrix C in the following way. The connectivity matrix C is visually partitioned into four equal quadrants, with the quadrant probabilities arranged as the matrix

[ a  b ]
[ c  d ]

For each of the m edges to be added, a random process determines in which quadrant of C the edge will belong, according to the relative probabilities a, b, c, d. That quadrant itself is then subdivided into four quadrants, and the process is repeated until an individual element C_{uv} is chosen. An example progression of R-MAT edge placement is given in Figure 1.1.

R-MAT generation does not inherently impose symmetry, may produce the same edge multiple times, and may include self-loops (i.e., some edge (u, u)). However, it may be adapted to include or prohibit these qualities as desired, most simply by disregarding undesired outcomes and enforcing symmetry after an edge is chosen. We note that community structure is generated by these binary quadrants.
(If n is not a power of 2, additional bookkeeping is required to ensure that every possible edge is reached by a unique and fairly distributed sequence of outcomes.)

[Figure: successive quadrant selections b, then c, then a narrow the 8-by-8 connectivity matrix to the single entry (3, 5).]

Figure 1.1: An 8-agent example of the R-MAT generator's edge placement. Probabilities a, b, c, d are associated with each quadrant of the connectivity matrix. A random outcome is generated. The associated quadrant is divided into four subquadrants, and the process repeats. This process continues until a single entry is identified. (Here, we have used 1-indexing to identify edge (3, 5).)
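The quadrant descent that Figure 1.1 illustrates can be written compactly; the following is an illustrative sketch (assuming n is a power of 2; not code from [9]):

```python
import random

def rmat_edge(n, a, b, c, d, rng):
    """Place one edge by recursive quadrant descent of the n-by-n
    connectivity matrix (n a power of 2). Quadrant probabilities:
        [a b]
        [c d]
    Returns a 0-indexed (row, col) = (source, target) pair."""
    row, col, size = 0, 0, n
    while size > 1:
        size //= 2
        r = rng.random()
        if r < a:               # upper-left quadrant: no offset
            pass
        elif r < a + b:         # upper-right quadrant
            col += size
        elif r < a + b + c:     # lower-left quadrant
            row += size
        else:                   # lower-right quadrant
            row += size
            col += size
    return row, col

rng = random.Random(0)
edges = [rmat_edge(8, 0.45, 0.15, 0.15, 0.25, rng) for _ in range(100)]
```

Repeating the call m times yields the m edges of the graph; with a dominating the other probabilities, placements skew toward the upper-left quadrant at every level of the recursion, producing the hierarchical community structure described above.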

In this way, we can interpret parameters a and d as the community-internal connection probabilities and parameters b and c as the community-external connection probabilities. The resulting community structure resembles a binary tree, where agents share a community when they share a parent root agent, with communities ranging in size from 2 to 2^j when considering a tree depth of j.

C_1 = \{\{1, 2\}, \{3, 4\}, \ldots, \{n-3, n-2\}, \{n-1, n\}\}
C_2 = \{\{1, 2, 3, 4\}, \{5, 6, 7, 8\}, \ldots, \{n-7, n-6, n-5, n-4\}, \{n-3, n-2, n-1, n\}\}
\vdots
C_{\log_2(n)-2} = \{\{1, \ldots, n/4\}, \{n/4 + 1, \ldots, n/2\}, \{n/2 + 1, \ldots, 3n/4\}, \{3n/4 + 1, \ldots, n\}\}
C_{\log_2(n)-1} = \{\{1, \ldots, n/2\}, \{n/2 + 1, \ldots, n\}\}
C = \bigcup_{j=1}^{\log_2(n)-1} C_j

We note also that each agent appears exactly once in each C_j and, as a result, belongs to \log_2(n) - 1 communities in total. Alternatively, the start and end indices used to form C could be adjusted to restrict very small and/or very large communities.

As we traverse the binary-tree-like structure of the connectivity matrix during R-MAT generation, we see that the R-MAT model can be considered a binomial cascade in two dimensions. Consequently, we can calculate the expected number of agents with out-degree k by the following:

E(|\{v : deg(v) = k\}|) = \binom{E}{k} \sum_{i=0}^{n} \binom{n}{i} \left[ p^{n-i} (1 - p)^i \right]^k \left[ 1 - p^{n-i} (1 - p)^i \right]^{E-k}

where 2^n is the number of agents, E is the number of edges, and p = a + b.
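This expectation can be evaluated directly in log space (to avoid overflow in \binom{E}{k}); summing the expected counts over all k must recover the number of agents 2^n, which serves as a sanity check. The parameter values below (n = 8, E = 1024, a + b = 0.7) are ours, chosen only for illustration:

```python
import math

def rmat_expected_count(k, n, E, p):
    """Expected number of agents with out-degree k in an R-MAT graph
    with 2**n agents, E edges, and p = a + b (for in-degrees, p = a + c).
    Computed in log space to avoid overflow in binom(E, k)."""
    log_binom_Ek = math.lgamma(E + 1) - math.lgamma(k + 1) - math.lgamma(E - k + 1)
    total = 0.0
    for i in range(n + 1):
        q = p ** (n - i) * (1 - p) ** i   # per-leaf edge-landing probability
        log_term = log_binom_Ek + k * math.log(q) + (E - k) * math.log1p(-q)
        total += math.comb(n, i) * math.exp(log_term)
    return total

n, E, p = 8, 1024, 0.7                   # 256 agents; p = a + b (illustrative)
counts = [rmat_expected_count(k, n, E, p) for k in range(E + 1)]
# Sanity check: summing over all k recovers the number of agents, 2**n.
print(round(sum(counts)))                # 256
```

The check works because, for each of the \binom{n}{i} leaves with landing probability q_i, the bracketed terms form a Binomial(E, q_i) mass function that sums to 1 over k.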

This may be modified to calculate the distribution of agent in-degrees, using instead the value p = a + c. It should be noted that R-MAT graphs have a tendency to generate isolated agents, and that this tendency will vary significantly based on construction parameters. This and other R-MAT attributes are investigated in depth in [40], finding that the number of isolated agents can be approximated, assuming

n = 2^l  (1.4)
D = m/n  (1.5)
σ = (a + b) - 1/2  (1.6)
τ = (1 + 2σ)/(1 - 2σ)  (1.7)
λ = D(1 - 4σ^2)^{l/2}  (1.8)

by

isolated ≈ \sum_{r=-l/2}^{l/2} \binom{l}{l/2 + r} \exp(-2λτ^{2r})  (1.9)

For our purposes, there is no compelling reason to maintain isolated agents in our simulation. We therefore remove all isolated agents from any graph generated via R-MAT generation. After this pruning, some communities may consist of very few agents or may even be empty. Such communities are also removed from the set of communities. Because R-MAT graphs are based around a binary tree of community structure, the diameter of the graph can be expected to be logarithmic with respect to the size. The clustering coefficient will be largely parameter dependent, increasing with the internal-connection parameters a and d.

LFR-Benchmarks

We make use of some graphs generated with the Lancichinetti-Fortunato-Radicchi Benchmark graph generation suites [34, 33]. These suites are available for

download. Of the five generators available, we make use of the Binary generator. These generators create graphs with a power-law degree distribution, with the exact distribution given as a parameter. Moreover, parameters are accepted to generate agents within communities of a size bounded above and below. Communities are generated so as to be disjoint. Among the possible input parameters (detailed in Appendix B), the LFR accepts a parameter om. Agents are assigned to belong to either one or om communities.

Generation proceeds in five steps. First, the degree of each agent is determined, respecting the power-law distribution parameters as well as the minimum and maximum degree parameters. Agents are given edge stubs to later connect to other agents. Second, the mixing parameter determines what proportion of these stubs will attach to agents within the community and what proportion will connect with other agents in the network. Third, the sizes of communities are determined, again satisfying the minimum, maximum, and power-law distribution parameters given, while also satisfying that the sum of community sizes equals the number of agents. Next, all agents begin homeless but are iteratively assigned to a community at random. If the community size exceeds the agent's previously determined number of required community-internal edges, the agent is added to the community; otherwise it remains homeless. This process repeats until all agents are assigned. If an agent cannot be assigned to a community (e.g., it would require more internal edges than the community has agents), a randomly selected agent is ejected from its community. Lastly, rewiring of edges is performed so that the mixing parameter is respected.

We employ four LFR test sets. We discuss our motivations for these test sets below and provide explicit parameter usages in Appendix B. For each test set, we generate ten graphs using the same parameters. In each, we

generate ten thousand agents belonging to communities spanning in size from 20 to 200. By design of the LFR-Benchmarks, both agent degree and community size are power-law distributed. We distinguish each test set by the clarity of division between its communities.

Test Set: LFR-A

We begin with a very clean graph, where each community is fully disjoint from each other community. No agents belong to two communities. We expect this test set to provide behavior similar to the stochastic block models, where community structure is concise, but to be more representative of real-world networks both in terms of degree distribution and clustering.

Test Set: LFR-B

We proceed to a neat set of graphs, similar to the LFR-A set. However, rather than fully disjoint communities, we now have 1% of the population belonging to two communities. The community structure is likely still very concise, but we expect to no longer have pockets of the population fully isolated from the rest of the population.

Test Set: LFR-C

We continue to dirty community lines. In this test set, we again have some of the population belonging to two communities, but increase this proportion from 1% in LFR-B to 10%. Community lines will likely still be present, but less easily detected than in the previous test sets.

Test Set: LFR-D

For our final test set, we fully muddy the water. As in LFR-C, we have 10% of the population belonging to multiple communities, while the rest belong to only one community. However, in this test set, that 10% belong to ten communities instead of only two. While community structure is still present, we expect it to be very difficult

to detect in such a graph.

Figure 1.2: A visualization of the different expected degree distributions of the Erdős-Rényi, stochastic block model, R-MAT, and LFR-Benchmark graph generators. The expected degree distributions of Erdős-Rényi and stochastic block model graphs are identical, with the only distinction between them being whether those edges exist within or across community boundaries. An LFR graph follows a power-law distribution by design, and we observe a near-power-law distribution in the R-MAT generator (though this is parameter dependent).

SNAP Dataset

Lastly, we make use of the Stanford Large Network Dataset Collection for empirical testing on various sizes of real-world networks. In particular, we make use of the DBLP Collaboration Network, the YouTube network, and the Enron network. Refer to [35] for full details on these and other publicly available large contact networks.

In order of increasing size, the Enron network consists of 36,692 agents with 183,831 edges, across a graph diameter of 11. The DBLP network consists of 317,080 agents with 1,049,866 edges, across a graph diameter of 21. The YouTube network contains 1,134,890 agents with 2,987,624 edges, across a graph diameter of 20. While community ground truth is available for some of these graphs, it is often the case that only a small fraction of the network as a whole belongs to any community. Instead, we will use these graphs to examine actual runtime performance in experiments that assume some underlying community structure exists.

CHAPTER 2
DISCRETE EVENT SIMULATORS

2.1 Development

A contact network is represented by the discrete, directed graph G = (V, E), where each v ∈ V represents a single agent and each edge e = (i, j) ∈ E represents contact (and thus the possibility of transmission) from agent i to agent j. Let N(v) denote the neighborhood of v, with N(v) = \{n | (v, n) ∈ E\}. Infection begins with some (typically small) subset of agents I_0 ⊆ V seeding the infection. These agents begin in the infected state, while every other agent typically begins in the susceptible state (although exceptions may occur, such as when a part of the population is vaccinated but not pruned from simulation).

Time is discretized into timesteps. At each timestep, each infected agent can, according to some stochastic process, infect any of its susceptible neighbors. Beginning at the next timestep, these agents may in turn infect their neighbors, and so on. We represent this possible propagation from an agent v to a neighbor n with the function Propagate(v, n, β), where β denotes this probability per timestep. Propagate also performs clerical duties, such as verifying that v is in the infected state and that n is in the susceptible state.

Agents may also recover from infection, gain immunity, or be otherwise removed from propagation (via death, quarantine, etc.). For conciseness, we refer to all these outcomes as an agent's recovery. This recovery process may occur according to a stochastic process, or may be based on the number of timesteps elapsed since the onset of infection. We represent the possibility of removal of an agent v from the infected state at each timestep with the function Recover(v, γ), where γ denotes this probability per timestep.

The evolution of an agent's state from susceptible to infectious to recovered is broadly referred to as an SIR infection model. While this model is common, many other models exist. For illnesses that do not confer immunity, an infected individual may return to the susceptible state instead of the recovered state, yielding an SIS model. For illnesses with temporary immunity or which are seasonal, an individual may eventually transition from recovered to susceptible, yielding an SIRS model. We may wish to consider a period of time wherein an agent is infected but does not yet display symptoms (i.e., is exposed), yielding an SEIR model. Some may consider the decay of immunity and extend to multiple subsequent recovered states, as in [23]. For our application, however, we consider exclusively an SIR model of infection. (While some models may explicitly incorporate death, quarantine, or other propagation-removing effects, we will consider all of these effects to be encompassed by the Recovered compartment in an SIR model. Throughout this thesis, we will use the term recover as a catch-all that indicates transition from the infected to some removed state (and analogous states, as we develop in Chapter 3), regardless of what that transition may, in a more literal sense, represent.)

We will adopt Object-Oriented notation, where edges and vertices can act as containers of multiple values, denoting each with object.value notation. A Traditional Discrete Event Simulator for an SIR model of infection is given in Algorithm .

Algorithm 2.1 Secondary Functions for Traditional Discrete Event Simulator

1: function Propagate(Source v, Target n, Transmission rate β) : Boolean
2:   r ← random(), with r ∈ [0, 1]
3:   return (r < β and v.state = I and n.state = S)

1: function Recover(Agent v, Recovery rate γ) : Boolean
2:   r ← random(), with r ∈ [0, 1]
3:   return (r < γ)

Algorithm 2.1: Two functions, Propagate and Recover, handle stochastic state change and validation, returning True on a state change. Propagate(v, n, β) handles infection spread from an infected agent v to a susceptible agent n, and Recover(v, γ) determines whether an agent v recovers from infection. Recover encapsulates all transitions out of the infected state and does not distinguish healthy recovery, quarantine, death, or any other event that would remove an agent from the infected state. Some models may include additional information (e.g., temporal parameters, a transmission rate conditional on neighbor infection status, etc.); we ignore these for simplicity.
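As a concreteness check, the simulator that these functions support can be sketched in a few lines of Python. This is an illustrative translation of the pseudocode, not the thesis implementation; the graph representation (a dictionary of out-neighbor lists) and function names are assumptions made for the sketch.

```python
# Minimal sketch of the traditional discrete event simulator (Algorithm 2.2).
# neighbors: dict mapping each agent to its list of out-neighbors.
import random

def simulate(neighbors, seeds, beta, gamma, rng=random.Random(0)):
    state = {v: "S" for v in neighbors}
    for v in seeds:
        state[v] = "I"
    t = 0
    while any(s == "I" for s in state.values()):
        t += 1
        # The infected set is fixed at the start of each timestep.
        current = [v for v, s in state.items() if s == "I"]
        for v in current:
            for n in neighbors[v]:
                # Propagate: stochastic draw plus state validation.
                if rng.random() < beta and state[n] == "S":
                    state[n] = "I"
            # Recover: stochastic draw per timestep.
            if rng.random() < gamma:
                state[v] = "R"
    return state, t
```

With β = γ = 1, behavior is deterministic: infection sweeps outward one hop per timestep while each infected agent recovers after one timestep, so a path graph seeded at one end finishes with every agent recovered.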

Algorithm 2.2 Traditional Discrete Event Simulator

1: function Simulate(Graph G, Seeds I_0, Transmission rate β, Recovery rate γ)
2:   t ← 0
3:   v.state ← I for v ∈ I_0
4:   v.state ← S for v ∉ I_0
5:   while {v | v.state = I} ≠ ∅ do
6:     t ← t + 1
7:     current ← {v | v.state = I}
8:     for all v ∈ current do
9:       for all n ∈ N(v) do
10:        if Propagate(v, n, β) then
11:          n.state ← I
12:      if Recover(v, γ) then
13:        v.state ← R

Algorithm 2.2: A (typically small) set of agents I_0 begins in the infectious state. Infectious agents spread infection to their neighbors according to the Propagate function and recover from infection according to the Recover function. Given the SIR model of infection, computation terminates when no agent remains in the infectious state.

Note that each timestep begins by determining which agents are currently infected. This set is static for a given timestep. Were it not, agents could become infected and immediately recover, resulting in an apparent transition directly from susceptible to recovered. Similarly problematic in such an implementation, an agent's neighbors might become infected in the same timestep that the infecting agent itself becomes infected.

A fundamental limitation of traditional discrete event simulators is scaling. While the above is computationally tractable for reasonably-sized networks, a network consisting of many millions of agents is limited both in time, by the sequential nature of the algorithm's computation, and in resources, by the prohibitive demand for memory. For large-scale graphs, it becomes necessary to distribute both computation and memory across multiple resources.

We consider the partitioning of the graph G = (V, E) by assigning each v ∈ V to one of h subsets, and then assigning each subset to a distinct computational thread. While the intent is to assign partitions to separate computational resources, it is not an inherent requirement that each machine process a single thread. Threads may be

local to the same processing machine (communicating via shared memory) or on distinct machines (communicating via message passing). Equitable graph partitioning and load balancing are themselves complex problems, which we explore in greater depth in Chapter 4. For the duration of this chapter, we assume that a partitioning P = {V_1, V_2, ..., V_h} of agents to h computational threads is given, with V = V_1 ∪ V_2 ∪ ... ∪ V_h. We say that a vertex v is owned by k if it belongs to V_k, and define the function owner(v) = k ⟺ v ∈ V_k. The thread to which V_k is assigned shares its indexing; we refer to this thread as having rank k. Likewise, we may refer to the thread as owning partition V_k, as well as all vertices therein. Additionally, an edge is owned by the thread that owns that edge's source: for e = (u, v), owner(e) = owner(u) = k ⟺ u ∈ V_k. Note that, while we often consider (though do not restrict ourselves to) symmetric graphs, we do not consider the edges (u, v) and (v, u) to be the same edge. Indeed, network diffusion is itself a directed process, even when the underlying contact network is undirected.

When memory and computation are distributed but are not autonomous and disjoint, communication between threads is required. For our current purposes, communication encompasses (1) sending and receiving infection transmissions between threads, (2) announcing readiness to proceed to the next timestep, and (3) announcing readiness to terminate computation. Let Send(recipient, messageType, data) and Recv(sender, messageType) encapsulate all message passing. Messages sent via Send will be accessible in the recipient thread's incoming message buffer, B_in, possibly delayed by communication latency. For simplicity, we aggregate all incoming messages to a single buffer, regardless of message source; however, every thread has its own separate incoming message buffer. Messages are recovered from B_in with Recv.

While message buffers are First In, First Out (FIFO), receiving messages in the order in which they arrive, the buffer can be scanned for messages with a particular messageType tag, from a particular
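The ownership bookkeeping above is mechanical, and the edges whose endpoints fall in different partitions are exactly those that will generate infection messages between threads. A small illustrative Python sketch (names assumed, not from the thesis):

```python
# Given a partitioning P = [V_1, ..., V_h] (a list of vertex sets), build the
# owner map and identify the directed cross-partition edges -- the only edges
# whose transmissions must travel through the message-passing layer.
def owner_map(partitions):
    return {v: k for k, part in enumerate(partitions) for v in part}

def cross_edges(edges, partitions):
    owner = owner_map(partitions)
    return [(u, v) for (u, v) in edges if owner[u] != owner[v]]
```

Note that (u, v) and (v, u) are tracked separately, matching the directed view of diffusion above: each direction of a symmetric contact is its own potential message.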

sender, or both. The buffer remains FIFO with respect to tag and sender, but a newer message may be received even when an older message exists in the buffer, if the older message does not match the message type provided to Recv. Lastly, a message buffer may be probed for the existence of a message with a particular message type and/or sender, via B_in.probe(messageType, sender). If the source of the message is irrelevant, we use a special-case sender, anySender, which matches any thread's rank.

In addition to communication, some coordination is required to allow computation to be performed in parallel. In the traditional discrete event simulator described previously, computation terminates when no agent remains in state I. In a distributed simulator, a thread does not know (without communication) the state of any agent beyond those it owns itself. Moreover, there is no need for every thread to attempt to determine the algorithm's global state; instead, threads may coordinate with a master thread to determine whether computation may end. A master thread may be an existing thread that is assigned an additional duty as coordinator, or it may exist separately from the simulating threads and dedicate all of its resources to coordination. We assume the former, acknowledging a dedicated master thread with rank master as the special case where the coordinating thread is assigned an empty partition, V_master = E_master = ∅.

Discrete event simulators have a natural sense of progress, moving forward in timesteps. When distributing a discrete event simulator, it is typical that each thread synchronizes at the end of each timestep before continuing to the next. When each thread is guaranteed to be simulating the same timestep, we say that the threads work in lockstep synchronicity. In some applications, lockstep synchronicity is relaxed somewhat and threads are allowed to drift partially, perhaps performing synchronization after a specified number of timesteps.
This is referred to as near lockstep synchronicity.

In this first step of development, we avoid the complications that arise from allowing time to drift between threads: each thread proceeds in lockstep synchronicity. In the function Lockstep, each thread announces to the master thread its own readiness to proceed. Additionally, it communicates whether any infection persists locally or whether there exist incoming infection messages in its buffer. The master thread aggregates these messages, idling as necessary until all threads have sent a message. If all threads announce that local infection has resolved and that their buffers contain no incoming messages, then the master thread determines that infection has fully resolved globally and sends a termination command. Pseudocode for this Lockstep Parallelized Discrete Event Simulator is given in Algorithms 2.3 through 2.7.

Algorithm 2.3 Secondary Functions for Lockstep Parallelized Discrete Event Simulator

1: function Lockstep(Local Infected I, Message Buffer B_in, Local rank l, Master rank master) : Boolean
2:   localState ← (I ≠ nil or B_in ≠ nil)
3:   Send(master, lockstepReady, localState)
4:   if l = master then
5:     for all thread do
6:       ready[thread] ← Recv(thread, lockstepReady)
7:     response ← (True ∈ ready)
8:     for all thread do
9:       Send(thread, lockstepContinue, response)
10:  return Recv(master, lockstepContinue)

Algorithm 2.3: All threads identify locally whether infection persists or the incoming message buffer is non-empty. In either of these cases, additional computation is required. Threads then announce to the master thread the need for additional computation (True) or readiness to exit (False). The master thread aggregates these. If any thread has announced the need for additional computation, the master thread instructs all threads to continue to the next timestep (replying with True). If no thread requires additional computation, the master thread indicates readiness to exit (replying with False). This value is returned to be used in the simulation's loop control.

Algorithm 2.4 Secondary Functions for Lockstep Parallelized Discrete Event Simulator, Cont.

1: function Propagate(Source v, Target n, Transmission rate β, Local rank l) : Boolean
2:   r ← random(), with r ∈ [0, 1]
3:   return (r < β and (owner(v) ≠ l or v.state = I) and (owner(n) ≠ l or n.state = S))

Algorithm 2.4: As in the single-threaded simulator, Propagate performs the stochastic element of propagation and verifies agent states. However, if the transmission crosses partitions, the local thread will only know the state of one of the involved agents. As such, Propagate will be called twice. The first call, during the source's propagation phase, performs the stochastic element and verifies the source's state. The second call, during the target's ReceiveInfectionMessages phase, verifies the target's state, but does not meaningfully repeat the stochastic element.

Algorithm 2.5 Secondary Functions for Lockstep Parallelized Discrete Event Simulator, Cont.

1: function SendInfectionMessages(Message buffer B_out)
2:   for all (msgType, msgTargetRank, msgContent) ∈ B_out do
3:     Send(msgTargetRank, msgType, msgContent)
4:   B_out ← nil

Algorithm 2.5: The message buffer B_out contains all the off-thread messages enqueued in the primary computation loop (Algorithm 2.7, line 19). Message content consists of the infection's source and target agents, which will be processed in the target thread's ReceiveInfectionMessages.

Algorithm 2.6 Secondary Functions for Lockstep Parallelized Discrete Event Simulator, Cont.

1: function ReceiveInfectionMessages(Message buffer B_in)
2:   while B_in.probe(infectionMessage, anySender) do
3:     (source, target) ← Recv(anySender, infectionMessage)
4:     if Propagate(source, target, 1) then
5:       target.state ← I

Algorithm 2.6: If an infection message waits in the buffer (detected by B_in.probe), it is received from the buffer and its contents are passed to Propagate to verify the target agent's state and, if applicable, update its state to infected. Since the stochastic element was already performed by the infection's source to determine whether the message should be sent, we pass 1 as the infection rate to guarantee infection of a susceptible target.

Algorithm 2.7 Lockstep Parallelized Discrete Event Simulator

1: function Simulate(Graph G, Partitioning P = {V_1, V_2, ..., V_h}, Seeds I_0, Transmission rate β, Recovery rate γ, Local rank l, Master rank master, Message Buffers B_out, B_in)
2:   t ← 0
3:   v.state ← I for v ∈ I_0 ∩ V_l
4:   v.state ← S for v ∈ V_l \ I_0
5:   current ← I_0 ∩ V_l
6:   while Lockstep(current, B_in, l, master) do
7:     ReceiveInfectionMessages(B_in)
8:     t ← t + 1
9:     current ← {v | v.state = I}
10:    for all v ∈ current do
11:      for all n ∈ N(v) do
12:        if Propagate(v, n, β, l) then
13:          if owner(n) = l then
14:            n.state ← I
15:          else
16:            msgType ← infectionMessage
17:            msgTargetRank ← owner(n)
18:            msgContent ← (v, n)
19:            B_out.enqueue(msgType, msgTargetRank, msgContent)
20:      if Recover(v, γ) then
21:        v.state ← R
22:    SendInfectionMessages(B_out)

Algorithm 2.7: Unlike the single-threaded traditional simulator, the main simulation loop is now controlled by Lockstep, which coordinates synchronization of the computed timestep as well as termination detection. The core loop now begins with ReceiveInfectionMessages, which receives and processes the last round's off-thread infection messages. Propagation logic is largely unchanged, except that transmissions targeting off-thread agents are queued and sent at the end of the timestep, in SendInfectionMessages.

The distribution of memory and computation allows us to handle very large graphs. However, distribution introduces potentially large communication costs. Additionally, while computational burden per thread is likely reduced, an inequitable load balancing may result in a few threads bearing a high relative burden, while other threads

are largely idle. The combination of communication overhead and idle time may yield significantly diminishing returns with respect to real-world time savings as additional threads are added.

We now turn our investigation to a reduction of idle time, by reframing the core discrete event simulator's propagation algorithm. First, we recognize that while propagation is often considered an action of the agent, it is most accurately an action belonging to the edge that would carry infection between agents. Second, in an SIR simulation, infection traverses any single edge at most once. Given these facts, we may perform all stochastic elements involving propagation during initialization, allowing for deterministic behavior during the rest of the algorithm.

During initialization, edges are assigned a random number, which we call the edge's seed and store on the edge e at e.seed. This seed will be used to determine (1) whether infection along that edge will occur, given that the source agent is infected, and if so, (2) the time at which infection traverses the edge, relative to the source agent's infection time. Similarly, agents are assigned a random number during initialization, which we call the agent's seed and store on the agent v at v.seed. This seed will be used to determine the duration of the agent's infection, as recovery is an action belonging to the agent itself.

Using these seeds, we simplify the stochastic elements of Propagate and Recover from multiple Bernoulli random variables to geometric random variables. For recovery, this is a natural process; at each timestep after infection, there is some probability of recovery. For propagation, however, infection duration imposes a natural upper bound on the delay between the onset of infection and the propagation to a neighbor, truncating the geometric distribution. This additionally allows us a cost-effective way to encapsulate an absence of propagation.
If our geometric random variable returns a value greater than the duration of infection, then no infection spreads. Conversely, if it returns a value less than or equal to the infection duration, the result is an infection attempt that many timesteps after the agent's own infection. (Note that in any model wherein an agent can return to susceptibility, e.g., SIS or SIRS, a single edge may be an avenue of infection propagation multiple times in a single simulation; the once-per-edge property we exploit here is specific to SIR.) We formalize this below.

We know that a geometric random variable has probability mass function P(i) = p(1 − p)^(i−1). Using P, we define Geom(seed, p) with a look-up table:

Geom(seed, p) =
  1, if seed ∈ [0, P(1))
  2, if seed ∈ [P(1), P(1) + P(2))
  3, if seed ∈ [P(1) + P(2), P(1) + P(2) + P(3))
  ...
  d, if seed ∈ [Σ_{i=1}^{d−1} P(i), Σ_{i=1}^{d} P(i))

Combining the above concepts and definitions, we provide pseudocode for a Quasi-deterministic Lockstep Parallelized Discrete Event Simulator in Algorithms 2.8 through 2.10.
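The look-up table need not be stored explicitly: because the cumulative sum of P(1) through P(d) is 1 − (1 − p)^d, a simple loop over the running cumulative probability suffices. A minimal Python sketch of Geom (illustrative, not the thesis code):

```python
# Geom(seed, p): return the smallest d such that
# seed < P(1) + ... + P(d), where P(i) = p * (1 - p)**(i - 1).
# The running cumulative sum plays the role of the look-up table.
def geom(seed, p):
    d = 1
    cumulative = p                          # P(1)
    while seed >= cumulative:
        d += 1
        cumulative += p * (1 - p) ** (d - 1)  # add P(d)
    return d
```

For example, with p = 0.5 the intervals are [0, 0.5), [0.5, 0.75), [0.75, 0.875), ..., so seeds 0.3, 0.6, and 0.8 map to 1, 2, and 3 respectively.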

Algorithm 2.8 Secondary Functions for Quasi-deterministic Lockstep Parallelized D.E.S.

1: function Precompute(Agent v, Infection timestep t, Transmission rate β, Recovery rate γ)
2:   if v.emerge ≠ nil then    ▷ This agent has already been precomputed
3:     return
4:   v.emerge ← t
5:   v.duration ← Geom(v.seed, γ)
6:   v.recover ← v.emerge + v.duration
7:   for all n ∈ N(v) do
8:     e ← (v, n)
9:     e.offset ← Geom(e.seed, β)
10:    if v.emerge + e.offset ≤ v.recover then
11:      e.transmit ← v.emerge + e.offset
12:    else
13:      e.transmit ← nil

Algorithm 2.8: Here, we precompute an agent's infection profile. Infection duration is computed from a geometric distribution via Geom, and this value, the timestep of emergence, and the timestep of recovery are stored on the agent. Likewise, for each of the agent's outgoing edges, Geom is used to determine how many timesteps will pass until the neighboring agent would become infected. If this value is greater than the duration of the agent's infection, no infection occurs. Both the offset length and the transmission time (if applicable) are stored on the edge, the former to retain the stochastic outcome in a future iteration and the latter for ease of reference during propagation simulation. As in Propagate, we verify that the targeted agent has not yet been infected. Note, however, that since propagation now belongs to the edge, only the state of the targeted agent is of consequence.
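The profile computation can be sketched directly in Python. This is an illustrative stand-alone version of Algorithm 2.8: the dictionaries agent_seed and edge_seed stand in for the per-agent and per-edge seeds, and the inline geom loop is the look-up of the previous section.

```python
# Sketch of Precompute (Algorithm 2.8), assuming seeds in [0, 1).
def geom(seed, p):
    d, cumulative = 1, p
    while seed >= cumulative:
        d += 1
        cumulative += p * (1 - p) ** (d - 1)
    return d

def precompute(v, t, beta, gamma, agent_seed, edge_seed, neighbors):
    """Return (emerge, recover, transmit) for agent v infected at timestep t.
    transmit[n] is the timestep infection crosses edge (v, n), or None when
    the sampled offset exceeds the infection duration (truncation)."""
    emerge = t
    recover = emerge + geom(agent_seed[v], gamma)
    transmit = {}
    for n in neighbors[v]:
        offset = geom(edge_seed[(v, n)], beta)
        # Truncation: no transmission can occur after the agent recovers.
        transmit[n] = emerge + offset if emerge + offset <= recover else None
    return emerge, recover, transmit
```

Because the seeds are fixed at initialization, calling precompute for the same agent with a different infection timestep shifts every stored time by a constant, which is exactly the property the Resync Phase later exploits.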

Algorithm 2.9 Secondary Functions for Quasi-deterministic Lockstep Parallelized D.E.S., Cont.

1: function ReceiveInfectionMessages(Message buffer B_in, Current timestep t, Transmission rate β, Recovery rate γ)
2:   while B_in.probe(infectionMessage, anySender) do
3:     (source, target) ← Recv(anySender, infectionMessage)
4:     Propagate(source, target, 1)
5:     Precompute(target, t, β, γ)

Algorithm 2.9: Minor adjustments are made to ReceiveInfectionMessages to use Precompute in place of Propagate, setting the outcomes of the various propagation variables in a computationally inexpensive manner. As in the previous models, Precompute assumes the additional clerical duty of verifying that the targeted agent has not yet been infected.

Algorithm 2.10 Quasi-deterministic Lockstep Parallelized D.E.S.

1: function Simulate(Graph G, Partitioning P = {V_1, V_2, ..., V_h}, Seeds I_0, Transmission rate β, Recovery rate γ, Local rank l, Master rank master, Message Buffers B_out, B_in)
2:   t ← 0
3:   v.state ← I for v ∈ I_0 ∩ V_l
4:   v.state ← S for v ∈ V_l \ I_0
5:   (e.offset, e.transmit) ← (nil, nil) for all e ∈ E with owner(e) = l
6:   e.seed ← random() for all e ∈ E with owner(e) = l
7:   (v.emerge, v.duration, v.recover) ← (nil, nil, nil) for all v ∈ V_l
8:   v.seed ← random() for all v ∈ V_l
9:   for all v ∈ I_0 ∩ V_l do
10:    Precompute(v, 0, β, γ)
11:  current ← I_0 ∩ V_l
12:  while Lockstep({v | v.recover > t}, B_in, l, master) do
13:    ReceiveInfectionMessages(B_in, t, β, γ)
14:    t ← t + 1
15:    current ← {v | v.emerge ≤ t < v.recover}
16:    for all v ∈ current do
17:      for all n ∈ N(v) do
18:        e ← (v, n)
19:        if e.transmit = t then
20:          if owner(n) = l then
21:            Precompute(n, t, β, γ)
22:          else
23:            msgType ← infectionMessage
24:            msgTargetRank ← owner(n)
25:            msgContent ← (v, n)
26:            B_out.enqueue(msgType, msgTargetRank, msgContent)
27:      if Recover(v, γ) then
28:        v.state ← R
29:    SendInfectionMessages(B_out)

Algorithm 2.10: Focus has now shifted almost entirely from agents to edges. We no longer explicitly track agent state, but rather the time at which infection emerges and the time at which recovery occurs. In the same timestep that infection spreads to an agent, we precompute the whole of its infection cycle and all propagation along incident edges. Note that the emergence time of the targeted agent is not known if that agent is not owned by the local thread; as in the previous lockstep synchronous simulator, infection messages are queued for targets that may not be susceptible to infection.

We must take a moment to acknowledge the current formulation of our simulator. Notably, by precomputing the delay between the timestep of infection onset and the timestep(s) at which infection transmission occurs, we have seemingly reduced the problem of epidemic simulation to the well-studied Single Source Shortest Paths (SSSP) problem. In this interpretation, the value of each edge's e.offset can be interpreted as an edge weight (with e.offset = nil interpreted as an arbitrarily large cost or a pruned edge). Because the Susceptible-Infected-Recovered framework favors the earliest infection to arrive, an agent's infection time can be reframed as the shortest path to that agent from the infection seed(s). The most well-known solution to the SSSP problem is Dijkstra's algorithm [41], which has seen parallelization in a number of extensions [36, 12].

However, we do not proceed with an application of SSSP due to one major consideration: an application of SSSP would require that every edge weight (or, in this case, infection delay) be known a priori. In our own framing, an edge's e.offset is only evaluated when the source agent computes its own infection. For a smooth transition to an SSSP application, we would require that every agent's infection be precomputed, as opposed to only those agents that actually yield infection. While our own infection computation functions are intentionally simple, our proposed simulator is meant to be a framework from which other simulators can be extended. As such, the full a priori computation required by SSSP would likely be prohibitively expensive to perform for every simulation replicate.

We would now like to escape the requirement of lockstep progression. By allowing individual threads to run ahead, we hope to reduce the amount of time a thread spends idle. However, if threads are allowed to proceed without restriction, some computations may become inconsistent and require correction.
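The SSSP observation is easy to check on a toy instance: treating each non-nil e.offset as an edge weight and pruning nil edges, a textbook Dijkstra computation from the seed set yields the same earliest infection times the simulator would produce. The sketch below is illustrative, not the thesis code.

```python
# Dijkstra over precomputed offsets: offset maps (u, v) -> delay, with None
# marking a pruned edge (the sampled offset exceeded the source's duration).
# Returns earliest infection times for every reachable agent.
import heapq

def infection_times(offset, seeds):
    dist = {s: 0 for s in seeds}
    heap = [(0, s) for s in seeds]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for (a, b), w in offset.items():
            if a == u and w is not None and d + w < dist.get(b, float("inf")):
                dist[b] = d + w
                heapq.heappush(heap, (d + w, b))
    return dist
```

The catch described above is visible in the signature: the entire offset map must exist before the computation starts, whereas the simulator only evaluates offsets for agents that actually become infected.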
That is, an infection message may arrive intending to infect an agent at a timestep prior to the current local timestep. In particular, the targeted agent may or may not have been infected by another agent, in either a past or even a future timestep, relative to the incoming message. In the event that the incoming message arrives at an earlier timestep than

any already-computed infection, or if the message targets an agent that is still in the susceptible state, the receiving thread must roll back to that timestep and insert the incoming infection into its own history.

In this second reframing, lockstep synchronicity is replaced with a system consisting of phases. Threads perform a local and independent Propagation Phase, computing infection to quiescence and caching outbound interpartition infection messages as necessary. Pending inbound infection messages from other threads are then received in a Resolution Phase. If necessary, corrections to previous computation are made in a Resync Phase. Cached outbound infection messages are updated to reflect the Resync Phase if necessary, and are then sent to the appropriate threads in a Communication Phase. Lastly, a thread reaches an Idle Phase, announcing local completion to the master thread and awaiting either communication from the master thread to terminate or new incoming infection messages.

We identify the three possible scenarios involving an incoming infection message when local timesteps are allowed to proceed asynchronously. In the following, let v.emerge denote the time at which the targeted agent would become infected, and let e.transmit represent the intended arrival time along some interpartition edge targeting v. Then one of the three following scenarios will occur:

v.emerge = nil: Infection targets an agent who does not yet have a recorded infection. Because the agent was susceptible at timestep e.transmit, infection will resolve normally. However, if the targeted agent would become infected earlier than the current local timestep, this infection may create additional edges to consider; these edges, too, will resolve in one of these three scenarios. We call this type of message a transmission message.

v.emerge ≤ e.transmit: Infection targets an agent whose already-computed infection would occur no later than the arriving infection. As such, the incoming infection would not be targeting a susceptible agent, and this infection message may be safely ignored. We call this type of message an overhead message.

v.emerge > e.transmit: Infection targets an agent whose infection has already been computed, but at a timestep later than the incoming infection. While the agent's infection profile (duration and offsets along outgoing edges) will remain unchanged, v.emerge must be updated to e.transmit. This in turn requires updates to the agent's recovery timestep and to each non-nil outgoing edge's transmit time. If any of these edge updates yields a transmit time earlier than that of the agent it targets, the update will cascade to the neighboring agent, which may or may not be local to the operating thread. This is discussed in greater detail below. All local updates occur during the Resync Phase. We call this type of message an updating message.

While the core propagation and communication will be largely unchanged, asynchrony introduces the need for a Resync Phase if and when updates to infection emergence are required. A simple example of communication and Resync is given in Figures 2.1 and 2.2.
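The three-way decision above is a pure function of the target's recorded emergence time and the message's intended arrival time, and can be sketched directly (names are illustrative, not the thesis code):

```python
# Classify an incoming infection message against the target's recorded state.
def classify(v_emerge, e_transmit):
    """v_emerge: the target's recorded infection timestep, or None (nil)."""
    if v_emerge is None:
        return "transmission"   # target still susceptible: resolve normally
    if v_emerge <= e_transmit:
        return "overhead"       # already infected no later: safely ignored
    return "updating"           # earlier infection: Resync Phase required
```

Note that equality falls into the overhead case: an infection arriving at exactly the recorded emergence timestep changes nothing.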

[Figure: agents A0–A5 split across Partition 0 and Partition 1, with infection propagation arrows over timesteps and each partition's incoming and outgoing message buffers.]

Figure 2.1: An example of propagation requiring Resync. Agent infection status is given by color: black when susceptible, green when infectious, and red when recovered. Propagation between agents is indicated by a directed arrow. (A) marks successful transmissions. (B) marks transmissions targeting a nonsusceptible agent, which are duly ignored. At (C1), agent A3 would infect agent A2 at timestep t = 5. This message is queued to the outgoing buffer. At (C2), all local infection has resolved in Partition 1. If Partition 1 had messages in its incoming buffer, they would be processed here. After this, messages in the outgoing buffer are sent and received into Partition 0's incoming buffer. At (C3), all local infection has resolved in Partition 0. It now receives the message in its buffer for processing. Because the infection message would resolve earlier than the infection already computed, a Resync phase is required. This process is demonstrated in Figure 2.2.

[Figure: the same partitions after the Resync event, with the target's emergence, recovery, and outgoing propagation times shifted earlier.]

Figure 2.2: A message, marked (A), is received requiring a Resync event. The target's infected behavior (infection duration and offsets of outgoing edges) has already been computed. The target's infection emergence time is updated to the new message's infection time, and the target's recovery time is updated to keep the infection duration consistent. These adjustments are marked with (B). Likewise, any outgoing infection is updated to keep the corresponding edge's offset value consistent. The early propagation is marked (C). While the agent A1 was itself unchanged, we note that the propagation marked (D) has changed from a successful transmission to one that fails due to agent status.

[Figure: the same contact structure as Figure 2.1, with infection in Partition 1 instead seeded at agent A3, and each partition's incoming and outgoing message buffers.]

Figure 2.3: In some instances, the action of Resync requires additional adjustment. In this example, infection durations and edge offset values match those of Figure 2.1; the only change is that agent A3 seeds infection in Partition 1 instead of agent A5. As in Figure 2.1, (A) denotes successful propagation, and (B) denotes those propagation attempts that fail due to target infection status. (C1) marks a message being sent to the outgoing buffer. In (C2), that message is transmitted to the target's incoming buffer. In (C3), the message is received for processing, which is demonstrated in Figure 2.4.

[Figure: the partitions of Figure 2.3 after a cascading Resync, with adjusted emergence, recovery, and propagation times.]

Figure 2.4: As in Figure 2.2, (A) denotes the message that requires a Resync phase. (B) marks the adjustment of the target's infection emergence time to the received message and the adjustment of its recovery time to keep infection duration consistent. (C) denotes the early propagation time along one of its edges to keep the associated offset value consistent. In this case, however, the adjusted propagation time now successfully resolves on the targeted agent A1. Another Resync phase begins. (D) marks the adjustment of emergence and recovery times, akin to the action marked with (B). A1's propagation is also adjusted, marked by (E). Note that this propagation has changed both with respect to timestep and from a successful transmission to a failed one. Likewise, while the agent A0 was itself unchanged, we note that the propagation marked (F) has changed from a successful transmission to one that fails due to agent status.

It is important to note that the Resync Phase can be performed efficiently only through the precomputation of the infection profile. If infection behavior were to rely on temporal aspects or on either the local or global state of the network, the Resync Phase would need to fully recompute the agent's behavior, as these variables are unlikely to be the same at different timesteps. We make the assumption that

agent behavior is not so affected, allowing the Resync Phase to consist only of minor arithmetic and value reassignment. While such an assumption may seem overly restrictive, there are many aspects of network diffusion where the infection status of an agent or its neighbors is not likely to change interactive behavior. For instance, interaction is largely static when examining information diffusion through social media [38, 27]. Indeed, the core framework of an SIR compartmentalization assumes a single, undifferentiated infected state. Pseudocode for a Quasi-deterministic Asynchronously Parallelized Discrete Event Simulator is given in Algorithms 2.11 through 2.14.

Algorithm 2.11 Secondary Functions for Quasi-deterministic Asynchronous Parallelized D.E.S.

1: function ReceiveInfectionMessages(Message buffer B_in, Transmission rate β, Recovery rate γ)
2:   while B_in.probe(infectionMessage, anySender) do
3:     (source, target, arrival) ← Recv(anySender, infectionMessage)
4:     Precompute(target, arrival, β, γ)

Algorithm 2.11: Because local timesteps may now differ, the infection arrival time is included in infection messages and passed appropriately to Precompute; the current timestep is no longer required as an input parameter.

Algorithm 2.12 Secondary Functions for Quasi-deterministic Asynchronous Parallelized D.E.S., Cont.

1: function Precompute(Agent v, Infection timestep t, Transmission rate β, Recovery rate γ)
2:   if v.emerge ≠ nil then    ▷ This agent has already been precomputed...
3:     if v.emerge ≤ t then
4:       return    ▷ ... and needs no update.
5:     else
6:       v.updateTo ← min({v.updateTo, t})
7:       return    ▷ ... and requires an update during the Resync Phase.
8:   v.emerge ← t
9:   v.duration ← Geom(v.seed, γ)
10:  v.recover ← v.emerge + v.duration
11:  for all n ∈ N(v) do
12:    e ← (v, n)
13:    e.offset ← Geom(e.seed, β)
14:    if v.emerge + e.offset ≤ v.recover then
15:      e.transmit ← v.emerge + e.offset
16:    else
17:      e.transmit ← nil

Algorithm 2.12: Precompute is updated to flag appropriate agents as needing attention during the Resync Phase. In line 6, we resolve the possibility that the same agent is targeted by multiple incoming infections, giving attention only to the earliest-timestep infection. (Here, we let nil > t for all t.)
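Because duration and per-edge offsets are fixed at precompute time, the Resync arithmetic is a constant shift of every stored time on the agent and its out-edges. A minimal Python sketch of that update (illustrative names; `profile` stands in for the values Algorithm 2.8 stores on the agent and its edges):

```python
# Resync an agent's precomputed profile to an earlier emergence time.
def resync(profile, new_emerge):
    """profile: dict with 'emerge', 'recover', 'transmit' (n -> time or None).
    Returns the shifted profile and the neighbors whose own profiles may need
    a cascading update, since their incoming transmit times moved earlier."""
    delta = profile["emerge"] - new_emerge   # > 0 for an updating message
    shifted = {
        "emerge": new_emerge,
        "recover": profile["recover"] - delta,
        "transmit": {n: (t - delta if t is not None else None)
                     for n, t in profile["transmit"].items()},
    }
    cascade = [n for n, t in shifted["transmit"].items() if t is not None]
    return shifted, cascade
```

Whether each candidate in the cascade actually triggers a further Resync depends on that neighbor's own recorded emergence time, following the same three-way classification as any incoming message.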

Algorithm 2.13 Secondary Functions for Quasi-deterministic Asynchronous Parallelized D.E.S., Cont.
1: function Coordinate(Local rank l, Master rank master, Message Buffer B_in) : Boolean
2:   static entryCounter ← 0
3:   entryCounter ← entryCounter + 1
4:   Send(stateAnnouncement, master, entryCounter)
5:   while True do
6:     if l = master then
7:       CoordinateAsMaster(B_in)
8:     if B_in.probe(confirmationRequest, master) then
9:       Recv(master, confirmationRequest)
10:      Send(stateAnnouncement, master, entryCounter)
11:    else if B_in.probe(infectionMessage, anySender) then
12:      Send(stateAnnouncement, master, 0)
13:      return True
14:    else if B_in.probe(terminationCommand, master) then
15:      Recv(master, terminationCommand)
16:      return False
17:    else
18:      sleep()

Algorithm 2.13: Lockstep is replaced by Coordinate, where threads perform their Idle Phase. Threads announce their entrance and exit to the master thread for coordination of termination. Threads periodically check the incoming message buffers, returning to computation if infection messages are present. When termination is possible, the master thread will poll for confirmation, to which threads respond without leaving the Idle Phase. This message, as well as the Idle Phase announcement message, uses a static counter to verify that a thread does not exit to process infection before responding to a confirmation message. When the master thread issues the termination command, threads exit. If no message is present, threads remain and idle.

Algorithm 2.14 Secondary Functions for Quasi-deterministic Asynchronous Parallelized D.E.S., Cont.
1: function CoordinateAsMaster(Message Buffer B_in)
2:   static states ← [0, 0, ..., 0]
3:   while B_in.probe(stateAnnouncement, anySender) do
4:     for all thread do
5:       if B_in.probe(stateAnnouncement, thread) then
6:         states[thread] ← Recv(thread, stateAnnouncement)
7:   ready ← [0, 0, ..., 0]
8:   if states[thread] > 0 for all thread then
9:     for all thread do
10:      Send(thread, confirmationRequest, nil)
11:    for all thread do
12:      response ← Recv(thread, stateAnnouncement)
13:      if states[thread] = response then
14:        ready[thread] ← 1
15:      else
16:        states[thread] ← response
17:  if ready[thread] = 1 for all thread then
18:    for all thread do
19:      Send(thread, confirmationRequest, nil)
20:    for all thread do
21:      response ← Recv(thread, stateAnnouncement)
22:      if states[thread] = response then
23:        ready[thread] ← 2
24:      else
25:        states[thread] ← response
26:  if B_in.probe(infectionMessage, anySource) then
27:    return  ▷ The master thread verifies its own buffers one last time
28:  if ready[thread] = 2 for all thread then
29:    for all thread do
30:      Send(terminationCommand, thread, nil)

Algorithm 2.14: All state announcements are received, possibly including those from previous confirmation requests. If no thread has indicated a return to computation, then the master thread requests confirmation from each thread twice. Because communication is FIFO, this ensures that no additional message has been sent or received since idling began. Should a thread return to computation in this time, the master thread is prepared to receive a state 0 in place of a confirmation. Before announcing readiness to terminate, the master thread must verify its own state again. If all threads, including the master thread, are now idle and empty of messages, termination commands are sent to all threads.
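The counter-based confirmation check can be illustrated with a sequential sketch. All names here are hypothetical, and real message passing, FIFO channels, and the master's own buffers are abstracted away; each thread is reduced to its announced idle-entry counter and a pending-message inbox. The master would call this once per confirmation round, so twice before issuing the termination command.

```python
def confirmation_round_passes(threads, announced):
    # threads: list of dicts with 'counter' (idle-entry count) and 'inbox'.
    # announced: the counter each thread reported in its last
    # stateAnnouncement. A thread that woke to process an infection message
    # re-enters idle with an incremented counter, so a moved counter or a
    # nonempty inbox vetoes termination for this round.
    for t, a in zip(threads, announced):
        if t['inbox'] or t['counter'] != a:
            return False
    return True
```

The essential invariant is that a thread cannot exit and re-enter the Idle Phase without its counter changing, so two consecutive passing rounds over FIFO channels imply no message was in flight.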

Algorithm 2.15 Secondary Functions for Quasi-deterministic Asynchronous Parallelized D.E.S., Cont.
1: function Resync(Graph G = (V, E), Message Buffer B_out, Local rank l)
2:   t ← min({v.updateTo | v ∈ V})
3:   while {v | v.updateTo ≥ t} ≠ ∅ do
4:     for all v ∈ {v | v.updateTo = t} do
5:       v.emerge ← v.updateTo
6:       v.updateTo ← nil
7:       for all n ∈ N(v) do
8:         e ← (v, n)
9:         if e.transmit ≠ nil then
10:          e.transmit ← v.emerge + e.offset
11:          if owner(n) = l then
12:            if e.transmit < n.emerge then
13:              n.updateTo ← min({e.transmit, n.updateTo})
14:          else
15:            msgType ← infectionMessage
16:            msgTargetRank ← owner(n)
17:            msgContent ← (v, n, e.transmit)
18:            B_out.enqueue(msgType, msgTargetRank, msgContent)
19:    t ← t + 1

Algorithm 2.15: During Resync, each agent's updateTo value indicates which agents need to be updated. Because a nontrivial outgoing edge's offset is always at least 1, we process agents in order of increasing updateTo. In this way, we know that any cascading changes to an agent's updateTo value will not require us to undo updates already performed. Note also that, because a thread does not know if a cascading update is necessary when the edge in question crosses the partitioning, all such updates must be forwarded to the appropriate thread during the Communication Phase.

Algorithm 2.16 Quasi-deterministic Asynchronous Parallelized D.E.S.
1: function Simulate(Graph G, Partitioning P = {V₁, V₂, ..., Vₕ}, Seeds I₀, Transmission rate β, Recovery rate γ, Local rank l, Master rank master, Message Buffers B_out, B_in)
2:   t ← 0
3:   for all e ∈ E do
4:     e.offset, e.transmit ← nil, nil
5:     e.seed ← random()
6:   for all v ∈ V do
7:     v.emerge, v.duration, v.recover, v.updateTo ← nil, nil, nil, nil
8:     v.seed ← random()
9:   for all v ∈ I₀ ∩ V_l do
10:    Precompute(v, 0, β, γ)
11:  current ← I₀ ∩ V_l
12:  while Coordinate(l, master, B_in) do
13:    t ← 0
14:    ReceiveInfectionMessages(B_in, t, β, γ)
15:    while {e | t < e.transmit} ≠ ∅ do
16:      t ← t + 1
17:      current ← {e | e.transmit = t}
18:      for all e ∈ current do
19:        v, n ← e.source, e.target
20:        if owner(n) = l then
21:          Precompute(n, t, β, γ)
22:        else
23:          msgType ← infectionMessage
24:          msgTargetRank ← owner(n)
25:          msgContent ← (v, n, t)
26:          B_out.enqueue(msgType, msgTargetRank, msgContent)
27:    Resync(G, B_out, l)
28:    SendInfectionMessages(B_out)

Algorithm 2.16: The main propagation loop now focuses entirely on graph edges. Messages are received, and edge transmission values are updated and processed in timestep order, with new edge transmission values updated from nil to a future timestep during Precompute. Resync is performed for those edges that require updating. After Resync, any interpartition messages are sent, now including the infection's arrival time. Threads then idle in Coordinate until detecting the need for additional computation or receiving the termination command from the master thread.

Implementation

Much of the above ignores practical concerns for presentation purposes. In this section, we discuss improvements in implementation and provide a final version of the pseudocode combining these concepts. As the networks that we aim to simulate consist of many agents and edges, any improvement to the way these elements and their associated data are stored will yield marked savings in memory. To this end, stochastically-determined values such as v.duration and e.offset can be removed from memory and efficiently reproduced at each use. Likewise, v.recover may be removed from caching and instead be recomputed as the sum v.emerge + v.duration at each use. While infection should be processed in order according to timestep, a thread is likely to have many timesteps' worth of infection to process at any given moment. While it is equivalent in the abstract to iterate over {e | e.transmit = t}, performing this slice at every pass would become expensive. Instead, we add the targeted agents to a priority queue (ordered by increasing time) when edge transmission is determined. The same priority queue receives new infections originating from incoming propagation messages. The Resync Phase likewise uses a priority queue to resolve earliest-occurring Resync steps first, which eliminates the possibility that a single agent would require multiple adjustments in a single Resync Phase. The communication of infection between machines is typically much more expensive than the local computation of infection, especially considering the steps we have taken with precomputing an agent's infection profile. As such, any significant reduction of communication can yield noticeable practical speed-up. As we are using an SIR infection model, only the infection arriving at the earliest timestep is used to determine the outcome. Consequently, the outbound message buffer may be screened to remove any message targeting an agent for which an earlier infection time is known.
If two or more messages would target the same agent, only the message that would arrive first is actually sent. If multiple messages would arrive at the same timestep, only one message is sent, chosen at random, by index, or according to some other tiebreaker. These important messages may themselves be cached and compared against in future communication rounds to avoid redundant messaging. Also motivated by the great potential cost of communication, infection messages are not sent until total local quiescence, lest the messages sent become immediately out of date. Conversely, because incoming messages may necessitate additional propagation or Resync, messages are received as soon as possible. As such, we place each of the five simulation phases in order of decreasing priority: ReceiveCommunication, Propagation, Resync, SendCommunication, and Idle. Our implementation uses OpenMPI [20] for message passing and parallelization in conjunction with the Boost Parallel Graph Library (BPGL) [21] for distributed memory management of the agent network. Using the BPGL, we implement a distributed graph where each vertex and edge possesses a number of attributes. These attributes are stored by, and only accessible (without message passing) to, the owner of the corresponding vertex or edge. Such values are also transferred with the edge or vertex if ownership changes during redistribution. Explicit details of agent and edge variables are included in Appendix A. Our code is available online. Our finalized pseudocode is given below.

Algorithm 2.17 Finalized Quasi-deterministic Asynchronous Parallelized D.E.S.
1: function Simulate(Graph G, Partitioning P, Seeds I₀, Infection rate β, Recovery rate γ, Local rank l, Master rank master, Message Buffers B_out, B_in)
2:   infQueue, recQueue, commQueue ← [ ], [ ], [ ]
3:   Initialize(G, infQueue)
4:   while Coordinate(l, master, B_in, infQueue) do
5:     ReceiveInfectionMessages(G, B_in, infQueue, recQueue)
6:     Propagation(infQueue, recQueue, commQueue, l)
7:     if B_in.probe(infectionMessage, anySource) then
8:       continue
9:     Recovery(G, β, recQueue, commQueue)
10:    if B_in.probe(infectionMessage, anySource) then
11:      continue
12:    SendInfectionMessages(commQueue)

Algorithm 2.17: Primary loop for our Quasi-deterministic Asynchronous Parallelized Discrete Event Simulator. Priority is given to receiving incoming infection messages as soon as possible, both to minimize their effects on other phases and to minimize the number of updates that would require another message in SendInfectionMessages. The Idle Phase occurs during Coordinate, if and when all queues and message buffers are empty.

Algorithm 2.18 Secondary Functions for Finalized Quasi-deterministic Asynchronous Parallelized D.E.S.
1: function Coordinate(Local rank l, Master rank master, Message Buffer B_in, Infection Queue infQueue) : Boolean
2:   static entryCounter ← 0
3:   if infQueue ≠ [ ] or B_in.probe(infectionMessage, anySource) then
4:     return True
5:   entryCounter ← entryCounter + 1
6:   Send(stateAnnouncement, master, entryCounter)
7:   while True do
8:     if l = master then
9:       CoordinateAsMaster(B_in)
10:    if B_in.probe(confirmationRequest, master) then
11:      Recv(master, confirmationRequest)
12:      Send(stateAnnouncement, master, entryCounter)
13:    else if B_in.probe(infectionMessage, anySource) then
14:      Send(stateAnnouncement, master, 0)
15:      return True
16:    else if B_in.probe(terminationCommand, master) then
17:      Recv(master, terminationCommand)
18:      return False
19:    else
20:      sleep()

Algorithm 2.18: Coordinate is relatively unchanged from its previous form, except for the addition of an immediate exit if the infection queue or message buffer is not empty.

Algorithm 2.19 Secondary Functions for Finalized Quasi-deterministic Asynchronous Parallelized D.E.S.
1: function CoordinateAsMaster(Message Buffer B_in)
2:   static states ← [0, 0, ..., 0]
3:   while B_in.probe(stateAnnouncement, anySource) do
4:     for all thread do
5:       if B_in.probe(stateAnnouncement, thread) then
6:         states[thread] ← Recv(thread, stateAnnouncement)
7:   ready ← [0, 0, ..., 0]
8:   if states[thread] > 0 for all thread then
9:     for all thread do
10:      Send(thread, confirmationRequest, nil)
11:    for all thread do
12:      response ← Recv(thread, stateAnnouncement)
13:      if states[thread] = response then
14:        ready[thread] ← 1
15:      else
16:        states[thread] ← response
17:  if ready[thread] = 1 for all thread then
18:    for all thread do
19:      Send(thread, confirmationRequest, nil)
20:    for all thread do
21:      response ← Recv(thread, stateAnnouncement)
22:      if states[thread] = response then
23:        ready[thread] ← 2
24:      else
25:        states[thread] ← response
26:  if B_in.probe(infectionMessage, anySource) then
27:    return  ▷ The master thread verifies its own buffers first
28:  if ready[thread] = 2 for all thread then
29:    for all thread do
30:      Send(terminationCommand, thread, nil)

Algorithm 2.19: CoordinateAsMaster is unchanged from its previous form but is reproduced here for completeness.

Algorithm 2.20 Secondary Functions for Finalized Quasi-deterministic Asynchronous Parallelized D.E.S.
1: function Initialize(Graph G = (V, E), Infection queue infQueue)
2:   for all e ∈ E do
3:     e.transmit ← nil
4:     e.seed ← random()
5:   for all v ∈ V do
6:     v.emerge ← nil
7:     v.seed ← random()
8:   for all v ∈ I₀ do
9:     infQueue.enqueue(v, 0)

Algorithm 2.20: Initialize prepares those edge and vertex values that remain and seeds the infection queue with those agents that will begin infected.

Algorithm 2.21 Secondary Functions for Finalized Quasi-deterministic Asynchronous Parallelized D.E.S.
1: function ReceiveInfectionMessages(Graph G, Message Buffer B_in, Infection queue infQueue, Resync queue recQueue)
2:   while B_in.probe(infectionMessage, anySender) do
3:     target, arrival ← Recv(anySender, infectionMessage)
4:     if target.emerge = nil then
5:       infQueue.enqueue(target, arrival)
6:     else if target.emerge > arrival then
7:       recQueue.enqueue(target, arrival)

Algorithm 2.21: Messages are received and are sent to the infection queue if the target is still susceptible, sent to the Resync queue if an update is required, or ignored if arriving after an infection that has already been initiated.

Algorithm 2.22 Secondary Functions for Finalized Quasi-deterministic Asynchronous Parallelized D.E.S.
1: function Propagation(Infection queue infQueue, Resync queue recQueue, Communication queue commQueue, Local rank l)
2:   while infQueue ≠ [ ] do
3:     target, time ← infQueue.pop()
4:     if owner(target) = l then
5:       if target.emerge = nil then
6:         Precompute(target, time, β, γ, infQueue, recQueue, commQueue)
7:       else if target.emerge > time then
8:         recQueue.enqueue(target, time)
9:     else
10:      commQueue.enqueue(target, time)

Algorithm 2.22: Propagation pops each item in the infection queue, precomputing those agents that are local and not yet resolved. Precompute may add agents to the infection queue, as well as to the Resync and communication queues.
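The queue discipline Propagation relies on can be sketched with Python's heapq module. The names here are hypothetical; the point is that heap ordering yields timestep order, and that duplicate enqueues of the same agent are harmless because late entries are skipped.

```python
import heapq

def drain_in_time_order(inf_queue, emerge):
    # inf_queue: list of (time, agent) pairs; emerge: maps agents already
    # infected to their emergence timestep. Entries pop in increasing time
    # order; an entry arriving no earlier than a known emergence is skipped,
    # so the same agent may safely appear in the queue multiple times.
    heapq.heapify(inf_queue)
    processed = []
    while inf_queue:
        time, agent = heapq.heappop(inf_queue)
        if agent in emerge and emerge[agent] <= time:
            continue  # redundant entry: an earlier infection already holds
        emerge[agent] = time
        processed.append((agent, time))
    return processed
```
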

Algorithm 2.23 Secondary Functions for Finalized Quasi-deterministic Asynchronous Parallelized D.E.S., Cont.
1: function Precompute(Agent target, Timestep time, Infection rate β, Recovery rate γ, Infection queue infQueue, Resync queue recQueue, Communication queue commQueue)
2:   if target.emerge ≠ nil then  ▷ This agent has already been precomputed...
3:     if target.emerge ≤ time then
4:       return  ▷ ...and needs no update.
5:     else
6:       recQueue.enqueue(target, time)
7:       return  ▷ ...and requires update during the Resync Phase.
8:   target.emerge ← time
9:   duration ← Geom(target.seed, γ)
10:  for all neighbor ∈ N(target) do
11:    e ← (target, neighbor)
12:    offset ← Geom(e.seed, β)
13:    if offset ≤ duration then
14:      e.transmit ← target.emerge + offset
15:      if owner(neighbor) = l then
16:        infQueue.enqueue(neighbor, time + offset)
17:      else
18:        commQueue.enqueue(neighbor, time + offset)

Algorithm 2.23: Precompute sends infections to the appropriate queues; offset and duration values are computed from Geom as needed instead of being saved to their respective edges and agents. It should be noted that the same agent may be entered in the same queue multiple times; care must be taken both in Propagation's processing of the infection queue and in SendInfectionMessages' processing of the communication queue that redundant computation is avoided.

Algorithm 2.24 Secondary Functions for Finalized Quasi-deterministic Asynchronous Parallelized D.E.S.
1: function Recovery(Graph G, Infection rate β, Resync queue recQueue, Communication queue commQueue)
2:   while recQueue ≠ [ ] do
3:     target, time ← recQueue.pop()
4:     if time ≥ target.emerge then
5:       continue
6:     target.emerge ← time
7:     for all neighbor ∈ N(target) do
8:       e ← (target, neighbor)
9:       if e.transmit ≠ nil then
10:        if owner(neighbor) = l then
11:          recQueue.enqueue(neighbor, time + Geom(e.seed, β))
12:        else
13:          commQueue.enqueue(neighbor, time + Geom(e.seed, β))

Algorithm 2.24: It is probable that the Resync queue contains multiple potential timesteps of recovery for each agent. As such, we first verify the intended updated infection time against that currently noted on the agent; if it is not earlier, this entry of the queue is ignored (lines 4-5). The agent's infection time is updated (line 6), and possible cascading is queued either locally or as communication, as appropriate (lines 7-13).

Algorithm 2.25 Secondary Functions for Finalized Quasi-deterministic Asynchronous Parallelized D.E.S.
1: function SendInfectionMessages(Communication queue commQueue)
2:   static commHash ← [ ]
3:   msgBuffers ← [[ ], [ ], ..., [ ]]
4:   while commQueue ≠ [ ] do
5:     target, time ← commQueue.pop()
6:     if target ∉ commHash or time < commHash[target] then
7:       commHash[target] ← time
8:       msgBuffers[owner(target)].enqueue(target, time)
9:   for all thread do
10:    if msgBuffers[thread] ≠ [ ] then
11:      Send(infectionMessage, thread, msgBuffers[thread])

Algorithm 2.25: The communication queue may contain a significant number of messages which are locally known to be redundant. To reduce communication costs, we create a hash table cache, mapping targeted agents to known infection attempt times. This time provides an upper bound on the agent's actual infection time: the target's actual infection is no later than the cached attempt. As such, no message that would arrive after the cached time can possibly require action, and may be dropped from the message queue without consequence. The communication queue is processed, selecting those items in the queue that occur before the associated cache value (when it exists), which are then added to an array according to the owning thread (lines 4-8). Once the queue is empty, each buffered group of messages is sent to the corresponding thread, where the messages will be sorted to the infection queue, the Resync queue, or ignored, as appropriate (lines 9-11). Infection messages are now sent as arrays (as opposed to individual infections as in previous models) to reduce message overhead, which can be significant.
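The screening in Algorithm 2.25 can be sketched as follows. This is a hypothetical sketch: `owner` is a mapping from agent to owning rank, and `comm_hash` persists across calls, as the static cache does.

```python
def screen_outbound(comm_queue, comm_hash, owner):
    # Keep only messages that improve on the earliest arrival time known
    # for their target; batch survivors by owning rank. A message enqueued
    # before a better one in the same round may still be sent, as in the
    # algorithm; the receiving thread discards it as stale.
    buffers = {}
    for target, time in comm_queue:
        if target not in comm_hash or time < comm_hash[target]:
            comm_hash[target] = time
            buffers.setdefault(owner[target], []).append((target, time))
    return buffers
```

Because the cache persists, a message that loses to an arrival time announced in an earlier round is also suppressed, which is the cross-round deduplication the caption describes.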

CHAPTER 3
A COMMUNICATION MODEL FOR COMPARTMENTAL ERDŐS-RÉNYI GRAPHS

3.1 Model Definition

While we are excited at the prospect of the potential reduction of idle time by asynchronous computation in the discrete event simulator proposed in the previous chapter, we must acknowledge some introduced costs that would not be borne by an algorithm that proceeds in synchrony. A synchronous algorithm has the number of infection messages bounded above by |E|, since each edge can be used to transmit infection meaningfully at most once. In our simulator, the number of cross-partition infection messages has no such clean bound, since an agent may have to reiterate its (updated) infection messages after each Resync event. This potential increase in communication cost is not the only possible increase in computation: we have also introduced the cost of our Resync Phase as well as some additional bookkeeping. However, network communication is often orders of magnitude more expensive than local computation. In this chapter, we hope to gain insight into how the communication of the system from the previous chapter truly behaves, a nontrivial task given that the outcomes of such complex systems do not always clearly reveal why they happen the way that they do. To this end, we develop a compartmental model consisting of differential equations to estimate these communication costs. We base epidemic behavior on Kermack and McKendrick's model, iteratively expanding that system of differential equations to account for communication latency, partitioning, and bookkeeping for cross-partition mixing. We then evaluate our approximation of realized communication costs against empirically measured communication costs to validate our communication model.

Traditional Compartmental Model

We first present Kermack and McKendrick's system of differential equations for modeling an SIR infection under the mass action assumptions. Compartments are

functions of time, and compartments Ŝ, Î, R̂ represent the proportion of a population in the susceptible, infected, and recovered states respectively. Parameters β, γ, and δ represent a normalized probability of transmission, recovery, and inclusion in the initial set of infected, respectively. Allowing ḟ to denote the derivative of f(t) with respect to time, the Kermack and McKendrick equations are as follows.

Ŝ̇ = −βŜÎ (3.1)
Î̇ = βŜÎ − γÎ (3.2)
R̂̇ = γÎ (3.3)

with initial conditions

Ŝ(0) = 1 − δ (3.4)
Î(0) = δ (3.5)
R̂(0) = 0 (3.6)

All compartments Ŝ, Î, and R̂ have been normalized with respect to the total population; the value of each at a given moment in time represents the proportion of the population currently in the associated state.

Preparation of Compartment Subdivision

In the discrete event simulator proposed in the previous chapter, a critical distinction exists between those infection messages that arrive at a thread timestamped before the targeted individual has become infected and those messages that would arrive with a timestamp after. We would like to incorporate this concept of latency into our compartmental model. We use hat notation for the compartments above to avoid confusion with the correlating but significantly different compartments S, I, and R that we develop below.
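Equations 3.1-3.6 can be integrated numerically with a simple forward-Euler loop. This is a sketch; the step size and time horizon are illustrative choices, not values from the thesis.

```python
def simulate_sir(beta, gamma, delta, dt=0.01, t_end=200.0):
    # Forward-Euler integration of Equations 3.1-3.6. Because the
    # right-hand sides sum to zero, S + I + R = 1 is preserved exactly
    # (up to floating-point roundoff) at every step.
    S, I, R = 1.0 - delta, delta, 0.0
    for _ in range(int(t_end / dt)):
        dS = -beta * S * I
        dI = beta * S * I - gamma * I
        dR = gamma * I
        S, I, R = S + dt * dS, I + dt * dI, R + dt * dR
    return S, I, R
```
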

However, a homogeneous compartmental model is inherently time-synchronous. All equations are deterministic functions of a temporal variable t which does not vary between compartments. As such, there can be no Resync phase here to adjust previous values as there is in a discrete event simulator, if we wish to maintain homogeneity. Our ultimate motivation in this chapter remains obtaining a reasonable mathematical estimate of communication costs, which includes distinguishing messages requiring Resync from ignored overhead messages. Without any Resync phase in our current context of differential equations, we instead consider how events unfold from the perspective of an omniscient observer. We consider the local and global awareness of each partition with respect to an agent's infection during the course of simulation. When an agent becomes infected, this information is only readily available to the thread owning that agent. At some point in the future, this information is communicated as necessary to neighboring partitions. We must also be sure to recognize in this model the real-time latency between the transition of infection awareness from being exclusive to the owning thread to a global awareness across all threads. We model this awareness at the agent level. An agent becomes infected, known only to the local partition. We call this state of temporary isolation of awareness early. At some subsequent time, this infection is announced to the rest of the simulation, transitioning the agent from early to late according to some communication rate. This communication rate will (inversely) encapsulate our intuition of communication latency, both in terms of actual communication transmission time as well as the delay to finish computation before transmission occurs. This notion of communicated infection is independent of the agent's actual infection state.
An agent might transition to recovered either before or after the agent has transitioned from early to late states. These states are orthogonal. The communication of infection does not change an infection s outcome, and agent behavior is identical in early and late stages. Indeed, the agent knows its own state and acts accordingly, not basing its actions on the awareness of other threads.

This initial division of compartments is rather arbitrary. As there is only a single population, there can be no meaningful concept of communication latency between partitions. However, this division of compartments provides the framework for subsequent expansion of our model for threading and our ultimate goal, a viable communication model.

Subdivision for Latency

We divide the Infected compartment Î into early and late infections, denoted by the compartments I and L respectively. Likewise, we divide the Recovered compartment R̂ into early and late compartments, denoted by R and K respectively, which receive individuals recovering from compartments I and L respectively. To reiterate, this distinction of early and late eventually will come to encapsulate communication latency between processing threads, but this system does not as yet model any partitioning. Individuals transition from early to late compartments (I to L and R to K) according to a communication rate ε. This transition is wholly independent from any epidemic behavior, relying only on the internals of communication during simulation. Because behavior is unchanged, we recognize that new infections in this model will be the result of mixing between a susceptible individual and either an early or late infected individual. This emergence of new infection is governed under the same parameter β as was used in Equations 3.1-3.6. Likewise, individuals continue to exit the infected state as before, according to recovery rate γ, except that individuals now transition from the early or late infected compartment to the corresponding early or late recovered state. Incorporating these compartment changes yields the following equations.

Ṡ = −βS(I + L) (3.7)
İ = βS(I + L) − γI − εI (3.8)
L̇ = −γL + εI (3.9)
Ṙ = γI − εR (3.10)
K̇ = γL + εR (3.11)

with initial conditions

S(0) = 1 − δ (3.12)
I(0) = δ (3.13)
L(0) = 0 (3.14)
R(0) = 0 (3.15)
K(0) = 0 (3.16)

A compartmental flow diagram of Equations 3.7-3.16 as it compares to Equations 3.1-3.6 is given in Figure 3.1. We note that the epidemic behavior of Equations 3.7-3.16 is unchanged compared to that of the SIR model presented in Equations 3.1-3.6. Indeed, when ε = 0, the solution curves of Ŝ, Î, and R̂ are identical to S, I, and R. General equivalency in epidemic behavior is demonstrated in Figure 3.2 and shown explicitly below.

Lemma 1. The epidemic outcome modeled by Equations 3.7-3.16 is equivalent to that modeled by Equations 3.1-3.6.
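The equivalence argued below can be previewed numerically: under forward Euler with matched parameters, the SILRK aggregates S, I + L, and R + K reproduce the SIR curves step for step. This is a sketch; ε is written eps and the parameter values are illustrative.

```python
def step_sir(state, beta, gamma, dt):
    # One forward-Euler step of Equations 3.1-3.3.
    S, I, R = state
    return (S - dt * beta * S * I,
            I + dt * (beta * S * I - gamma * I),
            R + dt * gamma * I)

def step_silrk(state, beta, gamma, eps, dt):
    # One forward-Euler step of Equations 3.7-3.11.
    S, I, L, R, K = state
    return (S - dt * beta * S * (I + L),
            I + dt * (beta * S * (I + L) - gamma * I - eps * I),
            L + dt * (eps * I - gamma * L),
            R + dt * (gamma * I - eps * R),
            K + dt * (gamma * L + eps * R))
```

Summing the I and L updates cancels the ±εI terms, so the aggregate obeys the SIR update exactly; the same cancellation holds for R + K.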

(a) SIR (b) SILRK

Figure 3.1: ŜÎR̂ and SILRK population movement diagrams. The Infected compartment is divided into two compartments, but overall individual behavior is unchanged. Also note that transitioning from compartment Ŝ to Î involves the mixing of an individual in Ŝ with an individual in Î or, in the case of the SILRK model, an individual in I or L. All other transitions are independent of individual behavior and are functions only of the respective parameter and of time.

Proof. Given Equations 3.7-3.16, suppose

Ŝ = S (3.17)
Î = I + L (3.18)
R̂ = R + K (3.19)

It follows that

Ŝ̇ = Ṡ (3.20)
  = −βS(I + L) (3.21)
  = −βŜÎ (3.22)

Î̇ = İ + L̇ (3.23)
  = (βS(I + L) − γI − εI) + (−γL + εI) (3.24)
  = βS(I + L) − γ(I + L) (3.25)
  = βŜÎ − γÎ (3.26)

R̂̇ = Ṙ + K̇ (3.27)
  = (γI − εR) + (γL + εR) (3.28)
  = γ(I + L) (3.29)
  = γÎ (3.30)

yielding identity to Equations 3.1-3.3, as desired.

It should also be noted that memorylessness is inherent to the design of compartmental models. Consequently, recovery from I to R and from L to K as parameterized by γ results in an individual's infection being exponentially distributed with expected duration 1/γ. Likewise, we acknowledge that our design of transitioning early to late compartments according to the similarly structured rate ε will in turn result in an exponentially distributed duration in the early state before an individual transitions to the late state. As a result, we assume that infection messages in this model will have an exponentially distributed latency, with expected latency 1/ε. We note that latency in this context does not refer only to the time to communicate between threads, but also encapsulates the time it takes for a thread to reach quiescence, much as β encapsulates both the rate of mixing between individuals as well as epidemic parameters (e.g., infectivity, etc.).

Figure 3.2: Epidemic curves for the traditional SIR and our SILRK compartmental models reveal that division of the infected and recovered compartments does not impact epidemic behavior. We observe that compartment S is identical for both models, and that the SIR model's compartments I and R are reproduced in the SILRK model's combined compartments I + L and R + K respectively. These curves were solved computationally, using parameters β = 0.2, γ = 0.1, ε = 0.05, I(0) = 0.01.
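The latency claim can be checked with a seeded Monte Carlo draw. This is a sketch; the function name and sample size are illustrative.

```python
import random

def mean_early_duration(eps, n=200_000, seed=42):
    # The time spent in an early compartment before its rate-eps transition
    # fires is Exp(eps)-distributed, so the sample mean approaches 1/eps.
    rng = random.Random(seed)
    return sum(rng.expovariate(eps) for _ in range(n)) / n
```
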

Subdivision for Threading

We now extend the previous set of equations to model a partitioning of agents across computational threads. Given h threads, we divide each compartment into h subcompartments, assigning one subcompartment to each thread and indexing according to the thread's rank. This yields the following equations.

Ṡᵢ = −Σⱼ βᵢⱼ Sᵢ (Iⱼ + Lⱼ) (3.31)
İᵢ = Σⱼ βᵢⱼ Sᵢ (Iⱼ + Lⱼ) − γᵢIᵢ − εᵢIᵢ (3.32)
L̇ᵢ = −γᵢLᵢ + εᵢIᵢ (3.33)
Ṙᵢ = γᵢIᵢ − εᵢRᵢ (3.34)
K̇ᵢ = γᵢLᵢ + εᵢRᵢ (3.35)

These equations are visualized in Figure 3.3. We prove that this compartmental model, too, models an epidemic equivalent to the previous systems.

Lemma 2. Equations 3.7-3.16 can be considered a special case of Equations 3.31-3.35.

Proof. Given Equations 3.31-3.35, suppose βᵢⱼ = βνⱼ, where νⱼ denotes the proportion of the total population belonging to partition j, with γᵢ = γ and εᵢ = ε for all i.

Suppose also

S = Σᵢ νᵢSᵢ (3.36)
I = Σᵢ νᵢIᵢ (3.37)
L = Σᵢ νᵢLᵢ (3.38)
R = Σᵢ νᵢRᵢ (3.39)
K = Σᵢ νᵢKᵢ (3.40)

It follows that

Ṡ = Σᵢ νᵢṠᵢ (3.41)
  = Σᵢ νᵢ [−Σⱼ βνⱼ Sᵢ (Iⱼ + Lⱼ)] (3.42)
  = −β (Σᵢ νᵢSᵢ) ((Σⱼ νⱼIⱼ) + (Σⱼ νⱼLⱼ)) (3.43)
  = −βS(I + L) (3.44)

İ = Σᵢ νᵢİᵢ (3.45)
  = Σᵢ νᵢ [(Σⱼ βνⱼ Sᵢ (Iⱼ + Lⱼ)) − γIᵢ − εIᵢ] (3.46)
  = β (Σᵢ νᵢSᵢ) ((Σⱼ νⱼIⱼ) + (Σⱼ νⱼLⱼ)) − γ (Σᵢ νᵢIᵢ) − ε (Σᵢ νᵢIᵢ) (3.47)
  = βS(I + L) − γI − εI (3.48)

L̇ = Σᵢ νᵢL̇ᵢ (3.49)
  = Σᵢ νᵢ [−γLᵢ + εIᵢ] (3.50)
  = −γ (Σᵢ νᵢLᵢ) + ε (Σᵢ νᵢIᵢ) (3.51)
  = −γL + εI (3.52)

Ṙ = Σᵢ νᵢṘᵢ (3.53)
  = Σᵢ νᵢ [γIᵢ − εRᵢ] (3.54)
  = γ (Σᵢ νᵢIᵢ) − ε (Σᵢ νᵢRᵢ) (3.55)
  = γI − εR (3.56)

K̇ = Σᵢ νᵢK̇ᵢ (3.57)
  = Σᵢ νᵢ [γLᵢ + εRᵢ] (3.58)
  = γ (Σᵢ νᵢLᵢ) + ε (Σᵢ νᵢRᵢ) (3.59)
  = γL + εR (3.60)

yielding identity to Equations 3.7-3.11, as desired.

Corollary 1. Because Equations 3.7-3.16 are a special case of Equations 3.31-3.35, the epidemic behavior modeled by Equations 3.31-3.35 can be made to be equivalent to that of Equations 3.1-3.6.

For consistency with the previous chapter, we refer to the subcompartments Sᵢ, etc., as belonging to partition i. Note that each subcompartment is normalized with respect to its local population size, not the global population size, as given by the parameter νᵢ. By expanding our equations to include partitioning, we have additionally relaxed from a single infection parameter β to an infection parameter βᵢⱼ which may differ between agent source and target partitions. Additionally, the recovery rate of compartment i (given by γᵢ) and the communication rate (given by εᵢ) need not be identical across partitions. While by no means necessary for a meaningful investigation, the freedom to vary parameters may be useful for the modeling of communities with significantly different behavior, especially with respect to internal or cross-community mixing rates or the differing health profiles of different demographics. Furthermore, the latency of interpartition communication is very likely to differ significantly between threads, depending heavily on the physical layout of the computation network.

Figure 3.3: The threaded SILRK model is not unlike the unthreaded version. Individuals are wholly partitioned and do not deviate from their threads, as represented by each layer above.
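The specialization in Lemma 2 can likewise be checked numerically: with βᵢⱼ = βνⱼ and shared γ and ε, the ν-weighted sums of the threaded compartments evolve exactly as the unthreaded SILRK system under forward Euler. This is a sketch with illustrative parameter values; ε is written eps.

```python
def step_silrk(state, beta, gamma, eps, dt):
    # One forward-Euler step of the unthreaded Equations 3.7-3.11.
    S, I, L, R, K = state
    return (S - dt * beta * S * (I + L),
            I + dt * (beta * S * (I + L) - gamma * I - eps * I),
            L + dt * (eps * I - gamma * L),
            R + dt * (gamma * I - eps * R),
            K + dt * (gamma * L + eps * R))

def step_threaded(state, beta_ij, gamma, eps, dt):
    # One forward-Euler step of Equations 3.31-3.35 for h threads.
    # state[i] = (S_i, I_i, L_i, R_i, K_i); beta_ij is an h-by-h matrix;
    # gamma and eps are per-thread lists.
    h = len(state)
    new = []
    for i in range(h):
        S, I, L, R, K = state[i]
        force = sum(beta_ij[i][j] * (state[j][1] + state[j][2])
                    for j in range(h))
        new.append((S - dt * force * S,
                    I + dt * (force * S - gamma[i] * I - eps[i] * I),
                    L + dt * (eps[i] * I - gamma[i] * L),
                    R + dt * (gamma[i] * I - eps[i] * R),
                    K + dt * (gamma[i] * L + eps[i] * R)))
    return new
```
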

Further Subdivision to Capture Thread Communication

Our ultimate goal is to model the communication behavior of our distributed discrete event simulator. Therefore, in addition to a model of the spread of infection, we must additionally estimate the amount of inter- and cross-partition diffusion taking place during simulation. We divide I_i into subcompartments I_{ij} to denote those infected individuals in partition i who became infected via an individual in partition j. Likewise, these individuals transition into disjoint subcompartments, dividing the compartments L_i, R_i, and K_i into subcompartments L_{ij}, R_{ij}, and K_{ij}. This yields the following equations.

    \dot{S}_i = -\sum_{j,k} \beta_{ij} S_i (I_{jk} + L_{jk})    (3.61)
    \dot{I}_{ij} = \Big( \sum_k \beta_{ij} S_i (I_{jk} + L_{jk}) \Big) - \gamma_i I_{ij} - \lambda_i I_{ij}    (3.62)
    \dot{L}_{ij} = -\gamma_i L_{ij} + \lambda_i I_{ij}    (3.63)
    \dot{R}_{ij} = \gamma_i I_{ij} - \lambda_i R_{ij}    (3.64)
    \dot{K}_{ij} = \gamma_i L_{ij} + \lambda_i R_{ij}    (3.65)

These equations are visualized in Figure 3.4. We prove that this compartmental model, too, models an epidemic equivalent to the previous systems.

Lemma 3. The partitioned equations of the previous section can be considered a special case of Equations (3.61)–(3.65).

[Figure 3.4: (a) The flow of individuals for a single thread i. (b) The flow of all individuals across all threads. Figure 3.4a shows the flow of individuals in a single thread of our expanded SILRK model, as given by Equations (3.61)–(3.65). An individual owned by thread i infected by an individual owned by thread j flows to I_{ij}, proceeding to either L_{ij} or R_{ij} before arriving at K_{ij}. Figure 3.4b represents that, as before, individuals within a single thread's compartments will remain within that thread's compartments throughout the model, giving us this stacked visualization.]

Proof. Given Equations (3.61)–(3.65), suppose

    S_i = S_i    (3.66)
    I_i = \sum_j I_{ij}    (3.67)
    L_i = \sum_j L_{ij}    (3.68)
    R_i = \sum_j R_{ij}    (3.69)
    K_i = \sum_j K_{ij}    (3.70)

It follows that

    \dot{S}_i = -\sum_{j,k} \beta_{ij} S_i (I_{jk} + L_{jk})    (3.71)
              = -\sum_j \beta_{ij} S_i \Big( \Big( \sum_k I_{jk} \Big) + \Big( \sum_k L_{jk} \Big) \Big)    (3.72)
              = -\sum_j \beta_{ij} S_i (I_j + L_j)    (3.73)

    \dot{I}_i = \sum_j \dot{I}_{ij}    (3.74)
              = \sum_j \Big[ \Big( \sum_k \beta_{ij} S_i (I_{jk} + L_{jk}) \Big) - \gamma_i I_{ij} - \lambda_i I_{ij} \Big]    (3.75)
              = \sum_j \Big[ \beta_{ij} S_i \Big( \Big( \sum_k I_{jk} \Big) + \Big( \sum_k L_{jk} \Big) \Big) - \gamma_i I_{ij} - \lambda_i I_{ij} \Big]    (3.76)
              = \Big( \sum_j \beta_{ij} S_i \Big( \Big( \sum_k I_{jk} \Big) + \Big( \sum_k L_{jk} \Big) \Big) \Big) - \gamma_i \Big( \sum_j I_{ij} \Big) - \lambda_i \Big( \sum_j I_{ij} \Big)    (3.77)
              = \Big( \sum_j \beta_{ij} S_i (I_j + L_j) \Big) - \gamma_i I_i - \lambda_i I_i    (3.78)

    \dot{L}_i = \sum_j \dot{L}_{ij}    (3.79)
              = \sum_j [ -\gamma_i L_{ij} + \lambda_i I_{ij} ]    (3.80)
              = -\gamma_i \Big( \sum_j L_{ij} \Big) + \lambda_i \Big( \sum_j I_{ij} \Big)    (3.81)
              = -\gamma_i L_i + \lambda_i I_i    (3.82)

    \dot{R}_i = \sum_j \dot{R}_{ij}    (3.83)
              = \sum_j [ \gamma_i I_{ij} - \lambda_i R_{ij} ]    (3.84)
              = \gamma_i \Big( \sum_j I_{ij} \Big) - \lambda_i \Big( \sum_j R_{ij} \Big)    (3.85)
              = \gamma_i I_i - \lambda_i R_i    (3.86)

    \dot{K}_i = \sum_j \dot{K}_{ij}    (3.87)
              = \sum_j [ \gamma_i L_{ij} + \lambda_i R_{ij} ]    (3.88)
              = \gamma_i \Big( \sum_j L_{ij} \Big) + \lambda_i \Big( \sum_j R_{ij} \Big)    (3.89)
              = \gamma_i L_i + \lambda_i R_i    (3.90)

yielding identity to the partitioned equations, as desired.

Corollary 2. As a special case of Equations (3.61)–(3.65), the epidemic behavior modeled by the partitioned system can be made to be equivalent.

Note that this subcompartmentalization does not introduce a new normalization term for populations; while we may think of partitions as distinct entities warranting normalization, all compartments I_{ij} belong to partition i. In this way, we

consider each compartment I_{ij}, etc., to be the proportion of partition i currently in the given state. By expanding our equations to include both partition ownership and infection source information, we may now make our estimates of communication costs.

Estimation of Communication Costs

As previously mentioned, there are three possible outcomes of interpartition diffusion. First, interpartition diffusion may target a susceptible agent, resulting in new infection. We have termed this transmission. Second, interpartition diffusion may target an individual whose infection at some earlier timestep has already been computed. We have termed this overhead, as these messages are immediately ignored. Lastly, interpartition diffusion may target an individual whose infection had already been noted, but reveals that the infection occurred at a timestep earlier than that previously computed. We have termed this updating.

The modeling of transmission diffusion is conveniently already present in the above differential equations. The population of susceptible individuals infected in partition i by some agent owned by partition k is ultimately aggregated to K_{ik}. We reproduce it below, however, in the interest of completeness and for future development, and we note that T_{ij} will differ from K_{ij} only by the initial condition of K_{ii}(0).

To determine which of the remaining communications between partitions are updating and which are overhead, we return to our concept of awareness, discussed when introducing the early and late compartments. We consider the following: an omniscient observer witnesses two agents from separate partitions interacting. Infection passes from one agent to the other, but the target's partition is unaware even that the other agent is infected. When the infecting agent's partition subsequently notifies the other partition, it will require the targeted partition to reincorporate this infection into its own history, requiring a Resync event.
Conversely, mixing with agents whose infection announcements have already occurred will not require additional correction. As such, infection messages that result

from mixing with agents in the late states result in overhead messages. We then define compartments for the initial estimates of transmission, updating, and overhead to i from j by \hat{T}_{ij}, \hat{U}_{ij}, and \hat{O}_{ij}, respectively.

    \dot{\hat{T}}_{ij} = \sum_l \beta_{ij} S_i (I_{jl} + L_{jl})    (3.91)
    \dot{\hat{U}}_{ij} = \sum_{k,l} \beta_{ij} (I_{ik} + R_{ik}) (I_{jl} + L_{jl})    (3.92)
    \dot{\hat{O}}_{ij} = \sum_{k,l} \beta_{ij} (L_{ik} + K_{ik}) (I_{jl} + L_{jl})    (3.93)

Again, we use hat notation to distinguish these compartments from the refinement below. These equations are visualized in Figure 3.5.
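The transmission/updating/overhead taxonomy can be made concrete with a small sketch of the receiving thread's logic. The function below is an illustrative reconstruction, not the simulator's actual code; it assumes each thread records a per-agent infection timestep, with `None` meaning the agent is still believed susceptible.

```python
# Sketch of how a receiving thread might classify an incoming infection
# message, mirroring the transmission / updating / overhead taxonomy.
# Illustrative reconstruction only, not the simulator's own code.

def classify_message(known_infection_time, message_time):
    """Classify an infection message arriving for a locally owned agent.

    known_infection_time: timestep at which this thread already believes
        the target became infected, or None if the target is susceptible.
    message_time: infection timestep carried by the incoming message.
    """
    if known_infection_time is None:
        return "transmission"   # new infection
    if message_time < known_infection_time:
        return "updating"       # earlier infection revealed: requires a Resync
    return "overhead"           # already known: message is immediately ignored

print(classify_message(None, 12))  # → transmission
print(classify_message(10, 12))    # → overhead
print(classify_message(10, 7))     # → updating
```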

[Figure 3.5: (a) Duplication flow for transmission messages. (b) Duplication flows for updating and overhead messages. Messages are captured by duplicating individuals as they mix, as given by Equations (3.91)–(3.93). Green arrows indicate this duplication, as opposed to the flow arrows used previously. Figure 3.5a shows the duplication of individuals who become infected during mixing being noted as they transition from the susceptible to the infected state. Figure 3.5b shows similar copying occurring to capture updating messages from the mixing of individuals in either infected state with individuals in either early state, as well as the capturing of overhead messages resulting from the mixing of individuals in either infected state with individuals in either late state. While these equations are refined in Equations (3.94)–(3.98), the increased complexity prohibits as clean a visualization.]

These equations capture the majority of communication over the course of a single modeling of infection. However, when an agent is updated, all of its outgoing infection information must be repeated with its updated values. Therefore we modify \hat{U}_{ij} to mimic recent infections, transitioning to a new compartment A_{ij} that aggregates these update messages. Additionally, we capture feedback F_{ij}, that is, updating messages that result from other updating messages. These changes yield the following communication estimators:

    \dot{T}_{ij} = \sum_l \beta_{ij} S_i (I_{jl} + L_{jl})    (3.94)
    \dot{U}_{ij} = \Big( \sum_{k,l} \beta_{ij} (I_{ik} + R_{ik}) (I_{jl} + L_{jl} + U_{jl}) \Big) - \gamma_i U_{ij}    (3.95)
    \dot{O}_{ij} = \sum_{k,l} \beta_{ij} (L_{ik} + K_{ik}) (I_{jl} + L_{jl} + U_{jl})    (3.96)
    \dot{F}_{ij} = \sum_{k,l} \beta_{ij} (I_{ik} + R_{ik}) U_{jl}    (3.97)
    \dot{A}_{ij} = \gamma_i U_{ij}    (3.98)

One final consideration enters into our estimate of communication costs. The cost of communication between partitions i and j is not constant for every pair of partitions. If partitions i and j represent two threads on a single machine, or when i = j and communication is within a partition, the cost of communication is trivial. We introduce a cost of communication \mu_{ij} as the amortized cost per message to partition i from partition j, asserting that \mu_{ii} \ll \mu_{ij} for i \neq j.

Recall that each compartment represents the proportion of associated individuals or messages, normalized with respect to the local partition population rather than the total population. If considered in aggregate, each must in turn be reduced according to the associated partition's relative population size. We then define aggregate costs for transmission, updating, and overhead messages with the following weighted sums, acknowledging that F represents a subset of the updating messages captured in A.

    T = \sum_{i,j} \nu_i \mu_{ij} T_{ij}    (3.99)
    A = \sum_{i,j} \nu_i \mu_{ij} A_{ij}    (3.100)
    O = \sum_{i,j} \nu_i \mu_{ij} O_{ij}    (3.101)
    F = \sum_{i,j} \nu_i \mu_{ij} F_{ij}    (3.102)

We present a finalized form of the SILRK model with communication cost estimations below.

    \dot{S}_i = -\sum_{j,k} \beta_{ij} S_i (I_{jk} + L_{jk})    (3.103)
    \dot{I}_{ij} = \Big( \sum_k \beta_{ij} S_i (I_{jk} + L_{jk}) \Big) - \gamma_i I_{ij} - \lambda_i I_{ij}    (3.104)
    \dot{L}_{ij} = -\gamma_i L_{ij} + \lambda_i I_{ij}    (3.105)
    \dot{R}_{ij} = \gamma_i I_{ij} - \lambda_i R_{ij}    (3.106)
    \dot{K}_{ij} = \gamma_i L_{ij} + \lambda_i R_{ij}    (3.107)
    \dot{T}_{ij} = \sum_l \beta_{ij} S_i (I_{jl} + L_{jl})    (3.108)
    \dot{U}_{ij} = \Big( \sum_{k,l} \beta_{ij} (I_{ik} + R_{ik}) (I_{jl} + L_{jl} + U_{jl}) \Big) - \gamma_i U_{ij}    (3.109)
    \dot{O}_{ij} = \sum_{k,l} \beta_{ij} (L_{ik} + K_{ik}) (I_{jl} + L_{jl} + U_{jl})    (3.110)
    \dot{F}_{ij} = \sum_{k,l} \beta_{ij} (I_{ik} + R_{ik}) U_{jl}    (3.111)
    \dot{A}_{ij} = \gamma_i U_{ij}    (3.112)
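Equations (3.99)–(3.102) are each a doubly indexed weighted sum over the per-pair message compartments. A minimal sketch of that computation follows; the values of \nu, \mu, and the T compartments are illustrative placeholders, not results from the thesis's experiments.

```python
# Sketch of the aggregate-cost weighted sums of Equations (3.99)-(3.102):
# given per-pair message compartments X[i][j], relative partition sizes
# nu[i], and amortized per-message costs mu[i][j], compute the
# population-weighted aggregate cost. All values are illustrative.

def aggregate_cost(X, nu, mu):
    n = len(nu)
    return sum(nu[i] * mu[i][j] * X[i][j] for i in range(n) for j in range(n))

nu = [0.5, 0.5]
mu = [[0.0, 1.0],      # mu[i][i] is negligible: intra-thread messages are cheap
      [1.0, 0.0]]
T_ij = [[0.30, 0.10],  # hypothetical per capita transmission messages to i from j
        [0.12, 0.28]]

print(aggregate_cost(T_ij, nu, mu))  # only cross-partition terms contribute
```

With \mu_{ii} set to zero, only the off-diagonal (cross-thread) compartments contribute to the aggregate, which is exactly the behavior the \mu_{ii} \ll \mu_{ij} assumption is meant to capture.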

with initial conditions

    S_i(0) = 1 - \delta_i    (3.113)
    I_{ii}(0) = \delta_i    (3.114)
    I_{ij}(0) = 0 \quad \forall\, i \neq j    (3.115)
    L_{ij}(0) = 0    (3.116)
    R_{ij}(0) = 0    (3.117)
    K_{ij}(0) = 0    (3.118)
    T_{ij}(0) = 0    (3.119)
    U_{ij}(0) = 0    (3.120)
    O_{ij}(0) = 0    (3.121)
    F_{ij}(0) = 0    (3.122)
    A_{ij}(0) = 0    (3.123)

3.2 Comparison and Conversion Between Models

Before we begin our investigation into this communication model's estimate quality, we must discuss how the models differ at fundamental levels. At the highest level, we have one continuous model approximating a discrete model. Some scenarios possible in a discrete model are unlikely or impossible in the continuous model. For instance, multiple infection messages might arrive at an agent at the same timestep in a discrete model, but this possibility is meaningless in the continuous one. In a discrete model, infection duration might follow a geometric distribution, whereas a continuous model of infection durations will follow an exponential distribution. While these may be considered analogues of each other, the parameterizations of these distributions differ greatly. Most notably, the \beta used in the discrete event simulator represents only the probability of transmission between an infected and susceptible agent given that mixing has occurred, but the \beta in the compartmental model also represents the level of agent mixing and a partition size normalization factor.

We present now the mechanisms by which the parameters of the discrete event simulator may be related to the compartmental model presented above. Consider a discrete event simulation's agent, infected with a probability of recovery \gamma at each timestep. For the duration of the agent's infection, transmission occurs with probability \beta along each of its k outgoing edges. We wish to estimate this infection with a continuous model with parameters \bar{\beta}, \bar{\gamma}, and \bar{\lambda}.

We know that both the discrete geometric distribution and the continuous exponential distribution of infection duration have an expectation of 1/\gamma. While the distributions themselves are of course very different, we can conveniently set \bar{\gamma} = \gamma.

Determining an appropriate value for \bar{\beta} is slightly more complicated. In the continuous interpretation, \bar{\beta} answers this question: of an infected individual's susceptible neighbors, what proportion do we expect will become infected in one timestep? The expectation of this value can be explicitly computed from the discrete model's parameters.

    E(duration of infection) = d = 1/\gamma    (3.124)
    P(transmission per timestep along one edge) = \beta    (3.125)
    P(any transmission along one edge) = 1 - (1 - \beta)^d    (3.126)
    E(number of transmissions along all edges) = \big( 1 - (1 - \beta)^d \big) k    (3.127)
    Effective infection rate = \bar{\beta} = (number infected) / (duration)    (3.128)
                             = \big( 1 - (1 - \beta)^d \big) k / d    (3.129)

Lastly, due to the large potential variance of \bar{\lambda} arising from the communication structure of the threads involved in computation, our estimation of \bar{\lambda} is largely empirical, accepting in our experiments \bar{\lambda} = \tfrac{5}{3} \gamma.
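The conversion from discrete to continuous parameters in Equations (3.124)–(3.129) is a one-line computation; the sketch below evaluates it for parameters matching the Erdős-Rényi experiment of the next section (\beta = 0.02, expected duration 10 timesteps, expected degree 11). The function name is our own, not the thesis's.

```python
# Effective continuous infection rate from discrete simulator parameters,
# following Equations (3.124)-(3.129).

def effective_beta(beta, gamma, k):
    """beta: per-timestep, per-edge transmission probability;
    gamma: per-timestep recovery probability; k: number of outgoing edges."""
    d = 1.0 / gamma                    # expected infection duration (3.124)
    p_edge = 1.0 - (1.0 - beta) ** d   # P(any transmission along one edge) (3.126)
    return p_edge * k / d              # expected infections per timestep (3.129)

# Parameters of the Erdős-Rényi experiment: beta = 0.02, expected
# duration 10 timesteps (gamma = 0.1), expected degree k = 11.
print(round(effective_beta(0.02, 0.1, 11), 4))  # → 0.2012
```

Note that for small \beta the result is close to \beta k, but the (1 - \beta)^d term correctly discounts repeated attempts along the same edge.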

Investigation

We will now investigate the communication model given by the compartmental equations above. We will use both the compartmental model incorporating feedback and the compartmental model without feedback to emphasize the importance of Resync phase feedback. We will compare both general epidemic behavior (in the form of the proportion of population infected) and the expected number of each type of infection message against those same metrics as observed from a discrete event simulator of the type discussed in Chapter 2. We will use randomly generated graphs as the underlying structure for the observed discrete event simulator. Refer to Section 1.5 for details about these generators.

Erdős-Rényi Graph

As mentioned, compartmental models inherently possess the mass-action assumption, under which both structure and behavior within a compartment are homogeneous. Because of this, we begin our investigation with agent-based networks of similar structure. We first examine performance against an Erdős-Rényi graph, that is, one in which two agents are connected by an edge with fixed and uniform probability. Agents are assigned evenly across threads. A full development of Erdős-Rényi graph generation can be found in Section 1.5.

For our investigation, we use a discrete event simulator operating on an Erdős-Rényi graph consisting of 1,000 agents and a probability of connection selected such that the expected degree of an agent is 11. Agents are assigned to one of two partitions uniformly at random. The probability of transmission between neighbors is \beta = 0.02, with an expected infection duration of 10 timesteps. The discrete event simulator is run over 10 instances of generated graphs, each simulated over 100 replicates. Four agents are selected uniformly at random to seed the infection. The number of per capita infected and the number of transmission, updating, and overhead messages as measured by both models is presented in Table 3.1.
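The thesis generates its graphs with the generators of Section 1.5; as a minimal stand-in, the sketch below builds an Erdős-Rényi graph with the experimental parameters (1,000 agents, expected degree 11, two uniformly random partitions) and reports the quantities the setup relies on. The generator shown is a generic textbook construction, an assumption rather than the thesis's own code.

```python
import random

# Minimal Erdős-Rényi stand-in for the experimental setup: n agents,
# connection probability chosen so the expected degree is 11, agents
# assigned to one of two partitions uniformly at random. Illustrative
# only; the thesis uses its own generators (Section 1.5).

def erdos_renyi(n, expected_degree, rng):
    p = expected_degree / (n - 1)
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < p]

rng = random.Random(0)
n = 1000
edges = erdos_renyi(n, 11, rng)
partition = [rng.randrange(2) for _ in range(n)]

mean_degree = 2 * len(edges) / n
cross = sum(1 for u, v in edges if partition[u] != partition[v])
print(mean_degree)         # concentrates near 11
print(cross / len(edges))  # roughly half of all edges cross threads
```

With a uniform random assignment to two threads, about half of an agent's neighbors land off-thread, which is why the compartmental model expects roughly half of infections to originate in the other partition.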
In both models, the population is distributed fairly across two threads. Because of the structural homogeneity of Erdős-Rényi graphs, the correlation between our compartmental and discrete models is very high. The compartmental models estimated the proportion of the population infected at 0.6% above the observed mean. Of these infections, the compartmental models estimated that roughly half originated in the other thread, as we would expect with a compartmental model with a uniform \beta. The discrete model observed closer to 40% of infections originating off thread. However, we observe that this discrepancy is almost entirely due to the discrete nature of the simulator. In Table 3.1a, we also observe colliding transmission messages: incoming infections that target the same agent at the same timestep. In the discrete model, these messages might be pruned (if they originate from the same partition); the compartmental model's infinite granularity, however, does not distinguish these collisions. If we consider both of these colliding transmission messages as transmissions, as opposed to one transmission and some overhead, then the compartmental model's estimate of transmission messages is 3.8% above the observed mean.

While feedback has no impact on the compartmental model's estimate of infection spread or transmission messages, it is critically important for the accurate estimation of overhead and updating messages, as feedback can generate a significant amount of additional communication in poorly partitioned graphs. Estimates of both overhead and updating messages rely heavily on the communication rate. A sufficiently large communication rate would represent a latency-free transfer of information, which would greatly reduce both overhead and Resync events. We use \bar{\lambda} = \tfrac{5}{3} \gamma, which we determined empirically.

Unsurprisingly in such a uniform graph, the compartmental model estimates that roughly half of both overhead and updating messages are the result of feedback. Both message types are estimated with fair accuracy.
Overhead messages are estimated at 4.6% above the observed mean (not counting the colliding transmission messages mentioned above as overhead messages). The number of updating messages is estimated at 5.7% above the observed mean.

[Table 3.1: Observed and estimated metrics in an Erdős-Rényi graph. (a) Message and infection rates observed in the discrete event simulator, reported as min, 25-quartile, median, 75-quartile, max, and mean for: proportion of agents infected; effective total transmission messages; transmission messages; colliding transmission messages; overhead messages; Resync events; Resync messages; Resync event cascades. (b) Message and infection rates estimated by the compartmental model, with and without feedback: proportion of individuals infected; estimated per capita transmission, overhead, and Resync messages; estimated per capita internal Resync cascades. The compartmental model estimates Erdős-Rényi graph communication behavior with a high degree of accuracy, though it consistently overestimates each of our observed data. Epidemic outcome as measured by total population infected is accurate within 1% of the observed mean. The estimates of transmission and overhead messages are accurate within 5%, and the estimate of updating messages is accurate within 6%.]

Stochastic Block Model

While structurally similar to the mass-action assumptions inherent to compartmental models, Erdős-Rényi graphs offer very little in the way of scalable graphs similar to observed real-world networks. To address our desire for community structure, we turn to the stochastic block model. The stochastic block model is a generalization of Erdős-Rényi graphs, where agents are grouped into subsets B = {B_1, ..., B_h}, and the edge between u \in B_i and v \in B_j exists with probability p_{ij}. A full development of the stochastic block model can be found in Section 1.5. To avoid overloading the term partition, we will refer to the partitioning of agents as used by the stochastic block model as stochastic blocks, and continue to use partitioning to refer to the assignment of agents to computational threads.

We highlight some of the more important attributes of the stochastic block model. First, the stochastic block model is a generalization of Erdős-Rényi graphs. An Erdős-Rényi graph with edge probability p is equivalent to the stochastic block model special case with one stochastic block, that is, V = B_1 and p_{11} = p. Indeed, in an arbitrary stochastic block model, each block B_k is locally an Erdős-Rényi graph, connected with probability p_{kk}. We focus our attention on stochastic block graphs that are strongly assortative, satisfying \min_i p_{ii} > \max_{i \neq j} p_{ij}, as these graphs will have the stronger community ties within each stochastic block that are of interest to us.

Now that we are examining a graph that is not globally uniform, partitioning of agents becomes important. Fortunately, with the natural delineation of community available to assortative stochastic block models, we have an immediate sense of partition quality. That is, the assignment of agents to computational threads is optimal when the discrete event simulator's partitioning P = {V_1, ..., V_h} is along the lines of the model's stochastic blocks B = {B_1, ..., B_h}, such that either V_i \cap B_j = B_j or V_i \cap B_j = \emptyset. We call this a stochastic block model's best case partitioning. Conversely, an assignment of agents to computational threads can be thought to be maximally suboptimal if each set in the discrete event simulator's partitioning included equal portions of each stochastic block. That is, a maximally suboptimal

assignment might satisfy |V_i \cap B_j| = \tfrac{1}{h^2} |V| for each V_i \in P, B_j \in B. We call this a stochastic block model's worst case partitioning. We note that the simplest way to achieve a near worst case partitioning is to assign agents to computational threads uniformly at random, though stochastic noise will see the resulting partitioning slightly better than a true worst-case partitioning.

Stochastic Block Model - Worst Case

We consider the worst case partitioning of a simple two-block stochastic block model, partitioned across two computational threads. We use parameters similar to the Erdős-Rényi graph in the previous section, except in those parameters that determine network connectivity. Where in the previous section the probability of connection was chosen such that each agent had expected degree 11, we now select probabilities p_{11} and p_{22} such that each agent has expected degree 10 to agents within their stochastic block, and select p_{12} and p_{21} such that each agent has expected degree 1 to agents outside their stochastic block. As such, the expected degree of any agent remains consistent with the previous experiment while simultaneously providing a definition of community. All other parameters are held consistent: \beta = 0.02, infection has an expected duration of 10 timesteps, and data is aggregated over 10 generated graphs simulating 100 replicates.

To parameterize this for the compartmental model, a worst-case partitioning is indistinguishable from the Erdős-Rényi graph. Because we expect an individual to have the same number (in this case 5.5) of neighbors within and outside the partitioning, the compartmental model's mixing and transmission parameter, \beta_{ij}, is constant for all i, j. As a result, estimates for this worst-case partitioning are identical to the previously estimated values. Nevertheless, the two models do indeed perform very similarly, as shown in Table 3.2.
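The difference between the two partitionings can be seen in a small arithmetic sketch of the expected number of cross-thread neighbors per agent in the two-block model above (expected degree 10 inside the agent's block, 1 outside). The variable names are our own illustrative choices.

```python
# Expected cross-thread neighbors per agent in the two-block stochastic
# block model (expected degree 10 within the agent's block, 1 outside),
# under the best and worst case partitionings. Illustrative arithmetic.

deg_in_block, deg_out_block = 10, 1

# Best case: threads coincide with blocks, so only block-crossing edges
# cross threads.
best_cross_degree = deg_out_block

# Worst case: each thread holds half of each block, so half of an agent's
# neighbors (from either block) land on the other thread.
worst_cross_degree = 0.5 * (deg_in_block + deg_out_block)

print(best_cross_degree, worst_cross_degree)  # → 1 5.5
```

The worst-case value of 5.5 cross-thread neighbors matches the 5.5 off-thread neighbors of the uniformly partitioned Erdős-Rényi experiment, which is why the compartmental parameterizations of the two are indistinguishable.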
All metrics are observed within a few percent of the value observed in the Erdős-Rényi graph simulations. This is not wholly unexpected, as this worst case partitioning would lead us to believe that there is a great deal of homogeneity

throughout our graph. As we see in the next experiment, however, this is not truly the case.

[Table 3.2: Observed and estimated metrics in a worst case partitioning of a stochastic block model graph. (a) Message and infection rates observed in the discrete event simulator. (b) Message and infection rates estimated by the compartmental model, with and without feedback. While there is a significant difference in the uniformity of the graph between the Erdős-Rényi and stochastic block models, very little of that difference is noted in our metrics when agents are assigned according to a near worst case partitioning, both as estimated by the compartmental model and as observed by the discrete event simulator. Indeed, the compartmental model uses the same parameters as it did in the Erdős-Rényi experiment, as aggregate internal and external mixing is constant between these experiments. Still, overall accuracy is relatively high.]

Stochastic Block Model - Best Case

The change in our metrics is pronounced when moving between the best and worst case partitionings. While the worst case partitioning saw no meaningful change in communication costs, we see here the potential benefit of assigning communities to single computational threads. Note that, at the same time, all other parameters are maintained and, as a result, epidemic outcome is the same as in the previous experiment. On the other hand, between the best and worst case partitionings, all communication decreased by nearly an order of magnitude.

Our compartmental model performs somewhat more poorly in this well partitioned experiment. Our estimated per capita transmission message rate is approximately 128% above the observed mean. At the same time, however, the actual infected rate is only overestimated by 0.6%. We underestimate per capita overhead message rates by 2.4%, but underestimate updating messages by over a factor of three. Raw observed values are reported in Table 3.3.

While our individual message estimates perform poorly, aggregate communication estimates are not as far from the observed. We estimate a total per capita communication cost of 0.446 messages, an underestimate of approximately 3.5% relative to the observed total. While the compartmental model may misdiagnose some message types, meaningful communication rates are still captured within stochastic noise.

[Table 3.3: Observed and estimated metrics in a best case partitioning of a stochastic block model graph. (a) Message and infection rates observed in the discrete event simulator. (b) Message and infection rates estimated by the compartmental model, with and without feedback. Here we see the great impact of quality partitioning. While epidemic behavior is consistent with the previous experiment, communication costs have been reduced by an order of magnitude across all metrics. Our compartmental model overestimates transmission messages and underestimates overhead and updating messages.]

CHAPTER 4
LOAD REBALANCING PROTOCOLS

We have seen in the previous chapter that, unsurprisingly, partitioning along community structure drastically reduces the communication load for all three types of communication. This is especially important for the asynchronous model we presented in Chapter 2, as additional updating communication can in turn generate additional feedback. This chapter focuses on ways to perform on-the-fly redistribution of agents to threads over the course of many replicates, such that community structure is increasingly respected by the partitioning of agents to computational threads.

Formally, a community is a set of agents that are more densely connected to agents within that set than to agents in the rest of the network. Communities do not necessarily form a partitioning; an agent may belong to multiple communities or to no community. Many real-world networks exhibit community structure, particularly those networks that mirror personal interaction. Physical communities, social networks (both online and in-person), and communication networks all exhibit community structure. Community structure is also found in impersonal contexts. It is evident in collaboration networks (where individuals share an edge with collaborators and field of study delineates community) [19], as well as in the bipartite graph of Amazon products and purchasers (where users share an edge with products purchased and communities are found in co-purchasing) [42].

We are interested in a form of community detection insofar as it may be used as a proxy for high-quality assignments of agents to computational threads, as a means to lower communication costs during large-scale asynchronous simulation. Community detection is itself a rich field of research. However, due to the complexity of the problem, an a priori community detection is likely to be prohibitively expensive on an arbitrary graph. The gains in runtime that result from a well-partitioned network

may well be lost to the cost of detecting these high-performing partitions. There do exist linear and sublinear-time algorithms for some special cases, particularly in applications of label propagation [39] and averaging dynamics [4], but we focus our work on a general approach that should yield a reduction in communication costs for an arbitrary network. To this end, we instead look toward incremental improvement that reduces communication costs while remaining computationally inexpensive.

We also note that a full detection of community structure far exceeds our goal. We are assigning agents to computational threads, and will likely wish to allow multiple communities to be assigned to the same thread. It is of no consequence here that two randomly selected agents on that thread might not share a community. Our primary concern is that agents in separate threads do not frequently share communities. In effect, we are identifying supersets of agents that are likely to contain the majority of a given community, thereby reducing edge crossings and ultimately leading to lower communication costs during large-scale asynchronous simulation.

During simulation, an infected agent may transmit infection to its neighbors. This action is inherently local to the agent's neighborhood. Even if one neighboring agent does not become infected, it may soon become infected by a shared neighbor. The more neighbors shared, the more opportunities an agent has to acquire infection, even when infection is not transferred directly from the originally infected agent. Indeed, communities are the natural delineation of infection, which often blooms within a community's abundance of internal pathways while struggling to cross the few avenues that enter neighboring communities. Infection pathways and density can therefore expose communities in a natural way.

Simulations are typically performed over many replicates so that meaningful patterns may emerge. Over these many replicates, we aim to gather community information as inferred from aggregated epidemic behavior. This community information is then used periodically between replicates to refine the assignment of agents to computational threads in a way that reduces the overall communication cost in the

remaining replicates.

The remainder of this chapter is structured as follows: we begin by developing the concept of attraction between agents, where stronger attraction suggests a higher likelihood that agents share a community. We then use attraction to collect agents into transfer sets, sets of agents to be moved between partitions. We conclude with a comparison of the performance of several possible definitions of the attraction function.

4.1 Defining Attraction

Through the natural course of epidemic simulation, a great deal of data is gathered regarding network structure. For instance, we might expect those individuals who are more frequently infected to have higher network centrality measures [10]. We wish to leverage this sort of information to create an efficient heuristic for identifying community structure between agents. Key among our concerns is that the information used require little or no additional computational cost beyond that which is inherent to simulation in a discrete event simulator. For instance, we may readily use simple counters, such as the number of times an agent is infected or requires a Resync event over the course of multiple simulations. Alternatively, network features such as agent degree may be quickly calculated or retrieved from the underlying network's structure. Similar counters may be attached to edges rather than agents, e.g., tracking the total number of attempted or successful transmissions between two agents, or the number of updating or overhead messages between agents.

We use these data to quantify the attraction of one agent u to another agent v as an edge weight on the edge (u, v), given by some function Attr(u, v), intending larger values of Attr(u, v) to indicate a greater probability that u belongs to some community containing v. Wishing to avoid compounding communication costs, we design Attr(u, v) such that its computation can be local to the thread owning u.
Thus, because agents u and v may be assigned to separate threads, and as such the attributes of one agent may not be available to the thread that owns the other, we do not assume that attraction is symmetric. We summarize possible parameters of Attr and assign

Possible Attraction Metric Information

    Belonging to    Shorthand    Attribute
    Vertex v        v_n          Number of times infected
    Vertex v        v_u          Number of times updated
    Vertex v        v_d          Agent degree
    Edge (u, v)     uv_a         Number of attempted transmissions
    Edge (u, v)     uv_t         Number of successful transmissions
    Edge (u, v)     uv_u         Number of Resync events along (u, v)
    Edge (u, v)     uv_o         Number of unsuccessful transmissions

Table 4.1: While we frame the edge attributes around our previous consideration of message types, keep in mind that we wish Attr to be meaningful within a single partition as well. Also note that, of the various attributes belonging to an edge (u, v), only uv_a can be known by a thread without additional communication. While a computational thread certainly knows the number of messages it sends a neighboring thread, the type of message is determined at the target. This communication can be performed in bulk after computation completes, but its cost is non-trivial when many cross-partition edges exist. We also note that any metric driven by uv_u will have attraction biased to favor cross-partition edges, as internal edges only require a Resync in the event of a cascading update. Additionally, an internal overhead message typically requires no action, but will require bookkeeping when associated metrics are used.

notation for these parameters in Table 4.1. While much of our framing of potentially useful edge attributes surrounds our previous consideration of message types, we keep in mind that we wish Attr to be meaningful within a partition as well as across partition boundaries. We also note that, of the various attributes belonging to an edge (u, v), only the total number of transmission attempts along an edge (i.e., uv_a) can be known by a thread without additional communication when u and v are not owned by the same thread.
That is, while a computational thread certainly knows the number of messages it sends to a neighboring thread along any given edge, whether each message is transmitting, updating, or overhead must be determined at the target. If any of the other proposed edge attributes is to be used by Attr, the outcome of each message must be

Proposed Attraction Functions

    Attr_any(u, v)       = uv_a
    Attr_succ(u, v)      = uv_t
    Attr_prop-succ(u, v) = 0 if uv_a = 0; uv_t / uv_a otherwise
    Attr_prop-recv(u, v) = uv_t / v_n
    Attr_inv-fail(u, v)  = 0 if uv_a = 0; 1 if uv_t > 0 and uv_u + uv_o = 0; 1/(uv_u + uv_o) otherwise

Table 4.2: Possible attraction metrics from aggregated simulation data. Data shorthand is defined in Table 4.1. Attr_any weighs edges by their message counts. Attr_succ counts only messages that successfully transfer infection. Attr_prop-recv weighs edges by the proportion of the target agent's infections received through that edge. Attr_prop-succ weighs by the proportion of messages along the edge that yield new infection. Attr_inv-fail opts for an inverse-linear penalty on updating and overhead messages. Comparative performance of these metrics is discussed in the next chapter.

reported. This communication can be performed in bulk after the replicate's computation completes, but its cost could be non-trivial when many cross-partition edges exist. We also note that any attraction function driven significantly by the number of Resync events along an edge (i.e., uv_u) will have Attr biased to favor cross-partition edges, as edges internal to a thread require a Resync event only when that event has cascaded from an externally caused Resync event. These potentially useful edge attributes may produce many possible Attr functions. While far from an exhaustive list, we propose a number of Attr heuristics in Table 4.2.
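As a concrete illustration, the per-edge counters of Table 4.1 and the candidate functions of Table 4.2 might be realized as in the following sketch. This is hypothetical Python, not the simulator's implementation; note that the piecewise definition of Attr_inv-fail leaves the case uv_a > 0, uv_t = 0, uv_u + uv_o = 0 open, so the sketch guards it by returning 0.

```python
from dataclasses import dataclass

@dataclass
class EdgeStats:
    """Counters aggregated for a directed edge (u, v) during simulation."""
    attempted: int = 0   # uv_a: attempted transmissions
    succeeded: int = 0   # uv_t: successful transmissions
    resyncs: int = 0     # uv_u: Resync (updating) events along the edge
    overhead: int = 0    # uv_o: overhead / unsuccessful messages

def attr_any(e: EdgeStats) -> float:
    return float(e.attempted)

def attr_succ(e: EdgeStats) -> float:
    return float(e.succeeded)

def attr_prop_succ(e: EdgeStats) -> float:
    return 0.0 if e.attempted == 0 else e.succeeded / e.attempted

def attr_prop_recv(e: EdgeStats, v_infections: int) -> float:
    # v_n: total number of times the target agent v was infected
    return 0.0 if v_infections == 0 else e.succeeded / v_infections

def attr_inv_fail(e: EdgeStats) -> float:
    if e.attempted == 0:
        return 0.0
    failures = e.resyncs + e.overhead
    if e.succeeded > 0 and failures == 0:
        return 1.0
    # Guard the case left open by the piecewise definition.
    return 0.0 if failures == 0 else 1.0 / failures
```

Because the counters are simple increments attached to events the simulator already processes, the added cost per message is constant.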

Attraction between agents is implicit in network structure, and our heuristics are gathered naturally during the course of simulation. However, our ultimate goal remains to determine an assignment of agents to threads. We expand our concept of attraction from an edge-weight between two agents to an aggregated force between sets. Formally, we let N(u) denote the neighborhood around an agent u, with

    N(u) = { v : (u, v) ∈ E }

Because of the importance of partition assignment, we define the partition-neighborhood N_t(u) to denote the neighbors of u within partition t, that is,

    N_t(u) = { v ∈ N(u) : owner(v) = t }

Likewise, for a set U, we define

    N(U) = ( ⋃_{u ∈ U} N(u) ) \ U

and

    N_t(U) = ( ⋃_{u ∈ U} N_t(u) ) \ U

We adopt a natural extension of Attr for set-wise definitions, allowing Attr(u, V) and Attr(U, V) as the sum of all edge attractions, with

    Attr(u, V) = Σ_{v ∈ V} Attr(u, v)    and    Attr(U, V) = Σ_{u ∈ U} Σ_{v ∈ V} Attr(u, v)

Lastly, we define the draw of an agent to a thread as the sum of the agent's attraction to any neighbor on that thread. That is,

    Draw(u, t) = Attr(u, N_t(u))    and    Draw(U, t) = Attr(U, N_t(U))
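These set-wise definitions translate directly into code. The sketch below (hypothetical names; `adj` is an adjacency map, `owner` maps agents to thread ranks, and `attr` is any edge-weight function such as those in Table 4.2) assumes Attr(u, v) is zero for non-edges.

```python
def partition_neighborhood(adj, owner, u, t):
    """N_t(u): neighbors of u owned by thread t."""
    return {v for v in adj[u] if owner[v] == t}

def draw(adj, owner, attr, u, t):
    """Draw(u, t) = Attr(u, N_t(u)): summed attraction of u toward thread t."""
    return sum(attr(u, v) for v in partition_neighborhood(adj, owner, u, t))

def set_draw(adj, owner, attr, U, t):
    """Draw(U, t) over the set-neighborhood N_t(U) = (union of N_t(u)) minus U."""
    neighborhood = set().union(
        *(partition_neighborhood(adj, owner, u, t) for u in U)) - set(U)
    # Only actual edges contribute; non-edges have no Attr value.
    return sum(attr(u, v) for u in U for v in neighborhood if v in adj[u])
```

Each evaluation touches only u's (or U's) incident edges, so Draw remains local to the owning thread, as required.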

Transfer Set Generation and Redistribution

With some Attr function chosen and Draw thus defined, we may now determine which agents to redistribute between threads. Redistribution consists of three phases. First, each thread generates possible transfer sets of agents. Second, a selection of transfer sets is approved for redistribution. Lastly, redistribution is performed and ownership of agents in transfer sets is transitioned between threads. These steps occur periodically between simulation replicates.

Because we aim for iterative improvement, we assume that some initial partitioning is given to begin simulation, even if that partitioning may only be a worst-case assignment of agents to threads, i.e., uniformly at random. After several replicates have executed, we may hope that Attr reflects more than stochastic noise and that our redistribution is meaningful.

Generation of transfer sets, like all our considerations in redistribution, seeks to be as computationally inexpensive as possible. We propose a greedy algorithm for transfer set generation. First, we identify the agent and corresponding neighboring thread t with the greatest Draw to seed the transfer set TS. Iteratively, we add neighboring agents to the transfer set based on three values: the attraction to the transfer set, the draw to thread t, and the attraction to local agents not in the transfer set. If the neighboring agent is less attracted to local agents than to the sum of the others, it is included in the transfer set. This algorithm is formalized in Algorithms 4.1 and 4.2.

Algorithm 4.1 Transfer Set Generation Scoring Functions

1: function AgentTransferScore(Transfer set TS, Agent v, Graph G, Local rank l, Targeted rank t)
2:     agentSet ← TS.agents
3:     return Attr(v, agentSet) + Draw(v, t) − Attr(v, N_l(v) \ agentSet)

1: function TransferSetScore(Transfer set TS, Graph G, Local rank l, Targeted rank t)
2:     agentSet ← TS.agents
3:     return Draw(agentSet, t) − Attr(agentSet, N_l(agentSet))

Algorithm 4.1: We adopt some utility functions that compare the attraction and draw of the local thread, the targeted thread, and those agents that might be relocated to the targeted thread. AgentTransferScore returns a positive value when an agent not in a transfer set is more attracted to members of the transfer set, or to any neighbors it may have in the targeted thread, than it is to its own thread-local neighbors. Such an agent is a viable candidate to add to the growing transfer set. Likewise, TransferSetScore returns a positive value when the transfer set, as a whole, is more attracted to the targeted thread than it is to those agents it would leave behind.

Algorithm 4.2 Transfer Set Generation Protocol

1: function AggregateTransferSet(Agent v, Graph G, Local rank l, Targeted rank t)
2:     TS.agentSet ← {v}
3:     TS.target ← t
4:     while max_{n ∈ N(TS)} AgentTransferScore(TS, n, G, l, t) > 0 and TransferSetScore(TS, G, l, t) > 0 do
5:         nextAgent ← arg max_{n ∈ N(TS)} AgentTransferScore(TS, n, G, l, t)
6:         TS.agentSet ← TS.agentSet ∪ {nextAgent}
7:         if TransferSetScore(TS, G, l, t) ≤ 0 then
8:             TS.agentSet ← TS.agentSet \ {nextAgent}
9:             break
10:    return TS

1: function GenerateTransferSets(Graph G, Partitioning P = {V_1, V_2, ..., V_h}, Local rank l)
2:     ProposedTransferSets ← [ ]
3:     Candidates ← V_l
4:     while |Candidates| > 0 and max_{t ∈ {1,...,h}, v ∈ Candidates} Draw(v, t) > 0 do
5:         seedingAgent, targetThread ← arg max_{t ∈ {1,...,h}, v ∈ Candidates} Draw(v, t)
6:         TS ← AggregateTransferSet(seedingAgent, G, l, targetThread)
7:         ProposedTransferSets.append(TS)
8:         Candidates ← Candidates \ TS.agentSet
9:     return ProposedTransferSets

Algorithm 4.2: While not presented here, additional considerations may be included in the generation of transfer sets. For instance, a maximum number of proposed sets could be added to the condition for generation in GenerateTransferSets at line 4, or a maximum transfer set size could be imposed in AggregateTransferSet at line 4. To avoid possible assignment conflict and to increase computation speed, we remove an agent from further consideration once it has been included in any transfer set, reducing the size of Candidates in GenerateTransferSets at line 8.

As with our discrete event simulator itself, the above pseudocode can be optimized in a number of ways. Iterating through a priority queue is an ideal way to handle the many max and arg max evaluations. Likewise, caching and updating AgentTransferScore values significantly reduces the algorithmic complexity of
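A direct, unoptimized rendering of the scoring functions and the greedy growth loop of Algorithm 4.1 and AggregateTransferSet might look as follows (hypothetical Python names; `adj`, `owner`, and `attr` as before, without the priority-queue and caching optimizations discussed above):

```python
def agent_transfer_score(adj, owner, attr, ts, v, l, t):
    """Positive when v prefers the transfer set and target thread t
    over its remaining neighbors on local thread l."""
    in_ts = sum(attr(v, n) for n in adj[v] if n in ts)
    to_target = sum(attr(v, n) for n in adj[v] if owner[n] == t)
    stay_home = sum(attr(v, n) for n in adj[v]
                    if owner[n] == l and n not in ts)
    return in_ts + to_target - stay_home

def transfer_set_score(adj, owner, attr, ts, l, t):
    """Positive when the set as a whole prefers thread t over the
    local agents it would leave behind."""
    gain = sum(attr(u, n) for u in ts for n in adj[u] if owner[n] == t)
    loss = sum(attr(u, n) for u in ts for n in adj[u]
               if owner[n] == l and n not in ts)
    return gain - loss

def aggregate_transfer_set(adj, owner, attr, seed, l, t):
    """Greedily grow a transfer set from `seed` toward thread t."""
    ts = {seed}
    while True:
        frontier = {n for u in ts for n in adj[u]
                    if owner[n] == l and n not in ts}
        if not frontier:
            break
        best = max(frontier, key=lambda n:
                   agent_transfer_score(adj, owner, attr, ts, n, l, t))
        if agent_transfer_score(adj, owner, attr, ts, best, l, t) <= 0:
            break
        ts.add(best)
        # Undo the addition if the set as a whole no longer benefits.
        if transfer_set_score(adj, owner, attr, ts, l, t) <= 0:
            ts.discard(best)
            break
    return ts
```

Recomputing scores over the frontier each iteration is quadratic in the worst case; the priority-queue and cached-score refinements noted above remove that overhead.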

transfer set generation.

Once transfer sets are generated by a thread, they must be approved or rejected for redistribution. Again, many mechanisms for this process may exist. We propose a lightweight, coordinated redistribution in which each thread proposes its transfer sets to a master thread and the master thread approves those transfer sets that have the highest TransferSetScore. Some coordination is made to keep workloads balanced, without which the algorithm could converge to a scenario with every agent assigned to one thread. While this would indeed minimize communication cost to zero, it trivializes the problem itself to absurdity.

When communicating transfer set requests to the coordinating master thread, it is of course prohibitively expensive to include the full membership or edge information with the request. To keep communication costs reasonable, requests are sent with a summary of each transfer set. This summary consists only of an identifying index, the value of the locally computed TransferSetScore, the size of the transfer set, and the intended target. Once the master thread has collected the summaries of all proposed transfer sets, it greedily selects those with the highest TransferSetScore for approval. A limit is imposed on approval such that none of the h partitions may deviate by more than 30% from a fair load, that is,

    h · |V_i| / |V| ∈ [0.7, 1.3]    for all i ∈ {1, ..., h}.

Threads are awarded transfer set approval cyclically, so that any thread that has reached its limit may be revisited after it has, possibly, been scheduled to receive another thread's transfer set. Pseudocode for this protocol is given in Algorithm 4.3.

Algorithm 4.3 Transfer Set Redistribution Protocol

1: function PerformRedistribution(Graph G, Partitioning P = {V_1, V_2, ..., V_h}, Local rank l, Master rank master)
2:     myProposed ← GenerateTransferSets(G, P, l)
3:     CommunicateSummaries(G, myProposed, l, master)
4:     if l = master then
5:         ApproveTransferSets(G, P)
6:     myApprovedTransferSets ← Recv(master, approvalList)
7:     Redistribute()

1: function CommunicateSummaries(Graph G, List of transfer sets myProposed, Local rank l, Master rank master)
2:     Send(transferSetCount, master, myProposed.size())
3:     for all i, TS ∈ enumerate(myProposed) do
4:         summary.i ← i
5:         summary.score ← TransferSetScore(TS, G, l, TS.target)
6:         summary.size ← TS.agents.size()
7:         summary.target ← TS.target
8:         Send(transferSetSummary, master, summary)

Algorithm 4.3: To perform redistribution, threads generate transfer sets and report their summaries to the master thread. The master thread determines which transfer sets to approve while aiming to maintain a reasonably balanced workload across all threads (detailed in Algorithm 4.4). The master thread reports which transfer sets are approved, and redistribution may take place. The utility function enumerate in CommunicateSummaries at line 3 allows both index and element reference of an array at the same time. The actual reassignment of agents to threads, as represented by the function call Redistribute() at line 7 of PerformRedistribution, is not presented here, as it is in its near entirety handled by the Boost Parallel Graph Library.

Algorithm 4.4 Transfer Set Approval Protocol, Master Thread

1: function ApproveTransferSets(Graph G, Partitioning P = {V_1, V_2, ..., V_h})
2:     TSPriorityQueues ← [[ ], [ ], ..., [ ]]
3:     sizes ← [|V_1|, |V_2|, ..., |V_h|]
4:     ApprovedTSLists ← [[ ], [ ], ..., [ ]]
5:     for all thread ∈ {1, ..., h} do
6:         numberProposed ← Recv(thread, transferSetCount)
7:         for all i ∈ {1, ..., numberProposed} do
8:             summary ← Recv(thread, transferSetSummary)
9:             TSPriorityQueues[thread].push(summary)
10:    threadsLeftToTry ← h
11:    while threadsLeftToTry > 0 do
12:        for all sourceThread ∈ {1, ..., h} do
13:            proposedTS ← TSPriorityQueues[sourceThread].top()
14:            transferSize ← proposedTS.size
15:            targetThread ← proposedTS.target
16:            if h · (sizes[sourceThread] − transferSize) / |V| ∈ [0.7, 1.3] and h · (sizes[targetThread] + transferSize) / |V| ∈ [0.7, 1.3] then
17:                TSPriorityQueues[sourceThread].pop()
18:                ApprovedTSLists[sourceThread].append(proposedTS.i)
19:                sizes[sourceThread] ← sizes[sourceThread] − transferSize
20:                sizes[targetThread] ← sizes[targetThread] + transferSize
21:                threadsLeftToTry ← h
22:            else
23:                threadsLeftToTry ← threadsLeftToTry − 1
24:    for all thread ∈ {1, ..., h} do
25:        Send(approvalList, thread, ApprovedTSLists[thread])

Algorithm 4.4: The master thread allocates and prepares priority queues, partition sizes, and an array of indices indicating approved transfer sets in lines 2-4. The master thread receives summaries from all threads, which are pushed into a priority queue, one per thread, according to the transfer set's score (lines 5-9). Though not presented here, this score could be modified to favor transfer sets originating from larger partitions or targeting smaller partitions. The master thread then passes through each thread's priority queue, approving the transfer set at the top of the queue if that approval would not overly unbalance the distribution of agents to threads. If a full pass through each thread's priority queue yields no additional approved transfer sets, threadsLeftToTry will reach zero and the approval process ends.
The master thread communicates approval, at which point reassignment of agents to threads may occur.
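A single-process sketch of the master's approval loop in Algorithm 4.4 follows; the Send/Recv message passing is replaced with in-memory lists, and the summary tuples `(score, index, size, target)` and the ±30% slack parameter are illustrative assumptions.

```python
def approve_transfer_sets(summaries, sizes, h, slack=0.3):
    """Round-robin greedy approval under a fair-load constraint.

    summaries: per-thread lists of (score, ts_index, ts_size, target),
    sizes: current agent count per thread (mutated in place).
    Returns per-thread lists of approved transfer set indices.
    """
    total = sum(sizes)
    lo, hi = (1 - slack) * total / h, (1 + slack) * total / h
    # One queue per thread, best score first (stand-in for a priority queue).
    queues = [sorted(s, reverse=True) for s in summaries]
    approved = [[] for _ in range(h)]
    tries_left = h
    while tries_left > 0:
        for src in range(h):
            if not queues[src]:
                tries_left -= 1
            else:
                score, idx, size, dst = queues[src][0]
                # Approve only if both partitions stay within fair-load bounds.
                if sizes[src] - size >= lo and sizes[dst] + size <= hi:
                    queues[src].pop(0)
                    approved[src].append(idx)
                    sizes[src] -= size
                    sizes[dst] += size
                    tries_left = h   # progress made; restart the countdown
                else:
                    tries_left -= 1
            if tries_left == 0:
                break
    return approved
```

As in the pseudocode, any approval resets the countdown, and a full fruitless pass over all threads terminates the loop.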

Performance of Attraction Functions

Partition Scoring

Identification of agent communities is essentially a labeling problem. As a result, many applications of agent partitioning commonly use the performance-evaluation functions common in labeling. Most widely used are the average F1 score, the Omega Index [11], and Normalized Mutual Information. None of these scoring functions fit our needs, however. Again, our aim is not ultimately to identify communities, but rather to identify beneficial assignments of agents to computational threads. An individual thread will likely contain many communities that our method does not explicitly, individually expose. As a result, such scoring metrics do not rightfully apply to our method. The communities we aim to detect are not the true communities of the network, but rather a superset likely to contain multiple communities. Our hope is that edges between two partitioned threads are few, but density within a thread is relevant only with respect to the change in cross-partition edges between redistributions.

We must then develop our own scoring functions to evaluate a partitioning of agents to computational threads against a ground-truth set of communities. We hope to assign as much of any particular community as possible to a single computational thread, and as such, we build our scoring function around this value. For simplicity, we will focus only on the maximal capture of a community across all partitions, choosing not to distinguish, for instance, a 0.8/0.2/0.0 division of a community from a 0.8/0.1/0.1 division. Once these maximal captured proportions are found, they can be averaged to form an overall score for a partitioning against a ground-truth community set, which we denote by score. Note that, by design, such a score will be bounded above by 1 but bounded below, not by 0, but by 1/h, given h partitions.

While such a scoring function is suitable for many graphs, it gives equal weight to all communities, regardless of size.
We propose a second scoring metric that weighs the score of a community by its size. We denote this weighted scoring function by scorē.

Formally, let C be a collection of k ground-truth communities, with C = {C_1, ..., C_k}. No assumption is made regarding agent inclusion in C; agents may belong to zero, one, or many communities. We will continue to use P = {V_1, ..., V_h} to denote the partitioning of agents as assigned to computational threads. We then define the following functions:

    proportion(V_i, C_j) = |V_i ∩ C_j| / |C_j|                                    (4.1)

    captured(P, C_j) = max_{V_i ∈ P} proportion(V_i, C_j)                         (4.2)

    score(P, C) = (1 / |C|) Σ_{j=1}^{k} captured(P, C_j)                          (4.3)

    scorē(P, C) = ( Σ_{j=1}^{k} |C_j| · captured(P, C_j) ) / ( Σ_{i=1}^{k} |C_i| )  (4.4)

Possible Attraction Functions

Several possible definitions for the function Attr were proposed earlier in this chapter. We will examine the following functions, using the attribute notation definitions given in Table 4.1.
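Equations 4.1 through 4.4 are straightforward to compute; a minimal sketch (hypothetical names; partitions and communities as Python sets) follows.

```python
def proportion(part, comm):
    """|V_i ∩ C_j| / |C_j| (Eq. 4.1)."""
    return len(part & comm) / len(comm)

def captured(partitioning, comm):
    """Largest proportion of comm held by any one partition (Eq. 4.2)."""
    return max(proportion(p, comm) for p in partitioning)

def score(partitioning, communities):
    """Unweighted mean of captured proportions (Eq. 4.3)."""
    return (sum(captured(partitioning, c) for c in communities)
            / len(communities))

def weighted_score(partitioning, communities):
    """Size-weighted mean of captured proportions (Eq. 4.4)."""
    total = sum(len(c) for c in communities)
    return (sum(len(c) * captured(partitioning, c) for c in communities)
            / total)
```

For example, a partitioning that captures one four-agent community entirely but splits a two-agent community evenly scores (1.0 + 0.5) / 2 = 0.75 unweighted, and (4 · 1.0 + 2 · 0.5) / 6 ≈ 0.83 weighted.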

    Attr_inv-fail(u, v) = 0                   if uv_a = 0
                        = 1                   if uv_t > 0 and uv_u + uv_o = 0      (4.5)
                        = 1 / (uv_u + uv_o)   otherwise

    Attr_prop-succ(u, v) = 0             if uv_a = 0                               (4.6)
                         = uv_t / uv_a   otherwise

    Attr_any(u, v) = uv_a                                                          (4.7)

    Attr_succ(u, v) = uv_t                                                         (4.8)

    Attr_1(u, v) = 1                                                               (4.9)

We refer to the attraction functions given by Equations 4.5 through 4.9 by the names inverse-failure, proportion-successful, any-transmission, successful-transmission, and constant-edgeweight, respectively. We include the constant-edgeweight attraction function as a comparative baseline. However, because of the uniformity and saturation of nontrivial attraction values when using the constant-edgeweight function, transfer set generation often takes much longer using this attraction function than with the others.

We acknowledge that this is far from an exhaustive list of possible attraction functions. While the choice of attraction function may itself be worthy of significant study, we do not mean to investigate this aspect deeply. We examine our proposed attraction functions across a few experiments to select which will be used for the remainder of this thesis.

Evaluation

We test our attraction functions across three different scenarios. First, we examine their performance on a stochastic block model graph under a worst-case initial partitioning, measuring the iterative improvement in partition score with each redistribution. Second, we examine the performance, or rather performance degradation, of each attraction function on the same stochastic block model graphs, this time under the best-case initial partitioning, asserting that an ideal attraction function should not significantly deviate from an ideal partitioning. Last, we examine the performance of each attraction function on a graph with more complex community structure, employing a recursive matrix graph with hierarchical communities.

Stochastic Block Model with Near Worst Case Initial Partitioning

To begin, we examine a stochastic block model under a worst-case initial partitioning. Graphs are generated consisting of 10,000 agents forming 750 communities. Agents have an expected in-community degree of 10 and an expected out-community degree of 1. Simulation parameters are similar to previous experiments, with β = 0.02 and an expected infection duration of 10 timesteps. Simulation runs for 200 replicates, with redistribution occurring every 10 replicates. Results are aggregated over 10 instances, each with their own iterative redistribution. Computation occurs across two computational threads.

The score of each repartitioning over time can be seen in Figure 4.1. Note that, because we score only on the largest proportion of each community captured, the initial worst-case partitioning has a score greater than the 0.5 one might expect. As mentioned in the previous chapter, stochastic noise will see that only very rarely is a community divided exactly in half.
As such, most communities are distributed slightly more favorably when considered by any metric that scores only the largest captured population, compared to the performance of a manually constructed worst-case partitioning, which can be engineered to yield exactly 0.5 given evenly sized communities. None of our proposed attraction functions significantly outperforms the others

Figure 4.1: Each attraction function performs similarly overall, here operating on graphs generated using the stochastic block model with 750 communities and an initial partitioning of agents assigned uniformly at random to one of two computational threads. We note that the proportion-successful and any-transmission attraction functions appear to regress in the middle, whereas the inverse-failure and successful-transmission functions perform slightly better in earlier redistribution phases.

in this experiment. In the long term, all attraction functions approach, on average, a final score within 1% of the other attraction functions. We note, however, that the inverse-failure and successful-transmission attraction functions have greater performance in the earlier redistributions. Both the any-transmission and proportion-successful attraction functions appear to have a higher likelihood of a redistribution decreasing the partitioning score, notably in Figure 4.1 in the middle range of replicates.

Stochastic Block Model with Best Case Initial Partitioning

While it is important to be able to escape local optima in favor of greater long-term improvement, we do not wish our attraction function to favor score decreases unduly. In this experiment, we construct graphs and simulate epidemics using the same parameters as in the previous experiment. However, we begin each graph's instance with a best-case initial partitioning. We would desire that our attraction function not favor a significant or long-term deviation from this best-case partitioning. Performance is visualized in Figure 4.2.

First, we must note the scale of the graphs in Figure 4.2. While both the proportion-successful and successful-transmission attraction functions see a comparatively significant deviation from the best-case initial partitioning, the score degradation is under 3% at its lowest. More concerning, however, is that these attraction functions never recover from the degradation of partition quality. Conversely, the inverse-failure, any-transmission, and constant-edgeweight attraction functions all maintain partition quality when given the best-case initial partitioning.

Hierarchical Community Structure in Recursive Matrix Graphs

We would lastly like to examine the performance of redistribution on a graph with more complex community structure. For this, we turn to the Recursive Matrix, or R-MAT, method of generation.
A full development of recursive matrix graph generation is given in an earlier section. The most notable quality of graphs generated by the recursive matrix algorithm is that communities are formed in a hierarchical

Figure 4.2: Here we see the degradation of partition quality under each attraction function when the initial partitioning is the best case partitioning. While no attraction function yields significant reduction in quality, both the proportion-successful and successful-transmission attraction functions do decrease partition quality and never recover from this deviation. The other attraction functions maintain the high quality partitioning throughout.

way. Much like a binary tree, the entire graph is considered one community, composed of two large communities. Each of these communities is itself composed of two communities, as are each of those communities, and so on.

As with the previous experiments, simulation is executed over the course of 200 replicates on 10 instances of the graph, using β = 0.02 and an expected infection duration of 10 timesteps. Simulation is performed by two computational threads. We generate a graph of 4,096 agents and 20,480 edges. After pruning isolated agents, we are left with 2,931 agents, yielding an expected degree of 13.97, slightly higher than our expected degree of 11 in the previous experiments. Because of the increased complexity in community structure, community in-degree and out-degree cannot be meaningfully reported.

Because the recursive matrix generates a graph wherein agents belong to multiple, nested communities, we present performance over multiple repartitionings using both score and scorē, in Figures 4.3 and 4.4 respectively. We observe a lower overall long-term partition quality than we observed in the stochastic block model experiment. Again, all attraction functions perform similarly overall, though the constant-edgeweight attraction function sees an overall benefit of 1-2% over the others. This is mitigated, however, by notably longer redistribution times.

Two artifacts are most worthy of note. First, in the recursive matrix graph, every attraction function saw rapid convergence to a relatively stable partition score, though again the inverse-failure and any-transmission attraction functions realized this somewhat more quickly. Second, we observe no significant difference between evaluation using score and scorē. This suggests that the redistribution protocol used may not favor small communities over larger communities with comparable attraction; small and large communities alike remain sub-optimally distributed between threads.

Figure 4.3: Evaluation of each attraction metric using score on a recursive matrix graph. We see that the nested, hierarchical structure of the R-MAT graph causes our transfer set protocol to struggle to find improvement after the first few redistributions. Performance across each metric is similar, with a slight regression in the proportion-successful and successful-transmission metrics.

Figure 4.4: Evaluation of each attraction metric using scorē on a recursive matrix graph. While we might have expected the difference in community sizes to exhibit a difference in scoring, the nested nature of the smaller communities is likely responsible for the similar behavior of both the score and scorē metrics.

CHAPTER 5
EMPIRICAL RESULTS

In this chapter, we evaluate the performance of the transfer-set protocol proposed in the previous chapter as we vary parameters of the approval protocol as outlined in Algorithm 4.4. Throughout, we will use the inverse-failure attraction function, defined in Equation 4.5 and reproduced here:

    Attr(u, v) = Attr_inv-fail(u, v) = 0                   if uv_a = 0
                                     = 1                   if uv_t > 0 and uv_u + uv_o = 0
                                     = 1 / (uv_u + uv_o)   otherwise

This choice is driven by three major motivations. First, the inverse-failure attraction function appears quick to realize improvements in partition quality. Second, it does not appear to suffer the quality degradation to which some other attraction functions are susceptible. Lastly, while not explicitly investigated in the previous chapter, the inverse-failure attraction function exhibits faster transfer set generation relative to the other attraction functions that satisfy the previous two motivations.

Our investigation is driven by three major questions. First, does parallelization actually help a simulator of the type developed in Chapter 2? Second, if parallelization improves performance, to what extent can this improvement be increased by quality workload distribution via the transfer set protocol discussed in Chapter 4? Lastly, to what extent can these improvements be realized on real-world graphs?

Throughout this chapter, we perform simulation on a private cluster consisting of four machines, three of which have 2 CPUs and 8 GB of memory and one of which has 8 CPUs and 48 GB of memory. Threads are assigned as fairly as possible to CPUs. Additionally, we present our results as scored using the score function defined in the previous chapter. While not presented here, performance was not qualitatively

different using scorē.

5.1 Benefiting from Parallelization

For our initial investigation of parallelization performance, we use the Lancichinetti-Fortunato-Radicchi benchmark graphs discussed earlier and explicitly detailed in Appendix B. Each graph is of moderate size. Community structure in the LFR-A graph is explicit and disjoint, with community boundaries becoming less well defined in each of the subsequent LFR-B, LFR-C, and LFR-D graphs. Each simulation is executed using the same parameters and without any redistribution of workload. We use an infectivity of β = 0.02 and an expected infection duration of 10 timesteps, seeding infection among 1% of the population.

On all graphs, the addition of more threads exhibits per-processor speed-up that is sublinear but very close to the optimal speed-up in processor usage. The relative performance of each is shown in Figures 5.1 through 5.4. Each figure includes a Perfect Speedup plot that is normalized against the 2-thread case. Speed-up on each graph is similar, although we note a significantly larger distribution of high clock-time simulations in each of the 16-thread experiments. We believe this to be an artifact of reaching the capacity of our computational resources.

5.2 Improvement of Performance Via Transfer Protocol

We do indeed see improvement when computation is distributed across multiple processors, but to what extent can this improvement be maximized by our transfer set protocol? In this section, we examine the impact of several parameters and graph structures on the improvements offered by the protocol discussed in the previous chapter.

Crossing Transfers

The transfer set approval protocol proposed in Chapter 4 relies on a central coordinator to approve transfers between computational threads. One of our earliest concerns over our transfer set protocol was that, knowing only the summary and

Figure 5.1: The required clock seconds of each thread during simulation on the LFR-A graph. The line represents the perfect speed-up case, which is attainable only in the absence of communication costs and with perfect partitioning. The significantly higher frequency of large clock-times in the 16-process experiments is likely due to us reaching the capacity of our computational resources.

Figure 5.2: The required clock seconds of each thread during simulation on the LFR-B graph is not significantly different from that on the LFR-A graph, but is included here in the interest of completeness.

Figure 5.3: The required clock seconds of each thread during simulation on the LFR-C graph is not significantly different from that on the LFR-A or LFR-B graphs, but is included here in the interest of completeness.

Figure 5.4: The required clock seconds of each thread during simulation on the LFR-D graph is not significantly different from that on the previous LFR graphs, but is included here in the interest of completeness.

not the contents of the transfer sets, one thread may risk transferring away the very agents to which another thread found attraction. In this experiment, we impose the additional requirement that no thread may both send and receive transfer sets. Threads submit their transfer set summaries to the master thread as normal, and the master thread greedily approves the transfer set with the highest score. The thread targeted by this transfer set is then blacklisted from receiving any transfer set approval. Beyond this, approval proceeds normally.

Simulation is consistent with previous experiments: infection spreads with β = 0.02 and expected duration 10 timesteps, and results are aggregated over 200 replicates for each of 10 graph instances, with redistribution occurring every 10 simulation replicates. We examine both the stochastic block model and the recursive matrix graphs. The stochastic block model graphs are generated with 10,000 agents, 10 of which will seed infection, forming 750 communities. The recursive matrix graph was generated over 4,096 agents and 20,480 edges, using parameters a = 0.6, b = c = 0.15, and d = 0.1. After pruning isolated agents, the graph contains 2,931 agents, yielding an expected degree of 13.97.

Partition scoring can be seen for the stochastic block model in Figure 5.5 and for the recursive matrix graph in Figure 5.6. Counter to our intuition, we see a significant reduction in partition quality over the course of redistribution when crossing transfers are prohibited in the stochastic block model. However, in the recursive matrix graph, we see an improvement in long-term partition quality when prohibiting crossing transfers.
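The no-crossing restriction amounts to a small change to greedy approval: once a thread appears on either side of an approved transfer, it may not appear on the other side. A sketch (hypothetical names; summaries as `(score, source, target, index)` tuples, ignoring the load-balance constraint for brevity):

```python
def approve_no_crossing(summaries):
    """Greedily approve transfer sets so that no thread both sends
    and receives agents in the same redistribution round."""
    approved = []
    senders, receivers = set(), set()
    for score, src, dst, idx in sorted(summaries, reverse=True):
        # Skip any transfer that would put a thread on both sides.
        if src in receivers or dst in senders:
            continue
        approved.append(idx)
        senders.add(src)
        receivers.add(dst)
    return approved
```

With three threads proposing a cycle of transfers, only the highest-scoring transfer survives, which illustrates how aggressively this restriction can prune approvals.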
This may suggest that in heavily connected networks, where agents form communities more freely and belong to multiple communities, crossing transfers are more likely to occur. Conversely, in graphs such as the stochastic block model's, agents form very tight communities that do not interact (significantly) with other communities. In such a case, the community will attract more heavily to one thread and

Figure 5.5: Performance over multiple redistributions in stochastic block model graphs, with and without crossing transfers. Counter to our intuition, prohibiting crossing transfers unduly slows partition improvements.

Figure 5.6: Performance over multiple redistributions in an R-Mat graph, with and without crossing transfers. In either case, it appears that the nested nature of the communities inhibits partition quality more heavily than the potential trading of attractive agents between two threads.

aggregation of communities is more likely to occur.

Thread Count

In this experiment, we consider the effect of increasing the number of computational threads on partitioning quality, examining partition score when using 2, 4, 8, and 16 computational threads. We consider two styles of stochastic block models. The first has 16 communities, so that we may examine partition quality as the number of computational threads approaches the number of stochastic blocks. The second style is consistent with previous experiments, possessing 750 communities. As with previous experiments, we set β = 0.02 and expected infection duration at 10 timesteps. For all 16-community and 750-community stochastic block graphs, agents have expected in-community degree 10 and expected out-community degree 1. Results are aggregated over 200 replicates for each of 10 graph instances, with redistribution occurring every 10 simulation replicates.

Partition score over each redistribution can be seen in Figures 5.7 and 5.9 for the 16-community and 750-community stochastic block model graphs respectively, with Figures 5.8 and 5.10 visualizing the same data scaled to a single plot. First, we must draw attention to the drastic difference in partition score scaling between the experiments. Because each experiment begins with a near-worst-case partitioning, agents having been assigned to partitions uniformly at random, we consequently begin each experiment of h computational threads with a partition score near 1/h. Again, because our scoring function evaluates only the largest captured proportion of a community, and because stochastic noise makes it unlikely that agents distributed uniformly at random will be distributed evenly, we observe an initial score slightly higher than the 1/h of a truly worst-case partitioning.

The evolution of partition quality is drastically different between our two experiments.
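The largest-captured-proportion score discussed above can be made concrete with a short sketch. The function names, the stride-based communities, and the equal-weight average over communities are illustrative assumptions; the thesis's scorer may weight communities differently.

```python
from collections import Counter
import random

def partition_score(assignment, communities):
    """assignment: dict agent -> thread; communities: list of agent lists.
    Each community contributes only the largest proportion of its agents
    held by any single thread; scores are averaged over communities."""
    total = 0.0
    for members in communities:
        counts = Counter(assignment[a] for a in members)
        total += max(counts.values()) / len(members)
    return total / len(communities)

# 10,000 agents in 750 (stride-based, illustrative) communities, assigned
# uniformly at random to h = 4 threads: the score lands above the
# worst-case 1/h = 0.25, and the smaller the communities, the larger the
# random-noise bump above 1/h.
random.seed(1)
agents = range(10_000)
communities = [list(range(i, 10_000, 750)) for i in range(750)]
assignment = {a: random.randrange(4) for a in agents}
print(round(partition_score(assignment, communities), 3))
```

A perfect partitioning, with every community wholly captured by one thread, scores exactly 1 regardless of the thread count.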
In the 16-community experiment, the 2- and 4-thread simulations saw significant improvement over redistribution, the former seeing a 56% increase in partition score (from to 0.814) and the latter realizing a 78% increase (from 0.273

Figure 5.7: Partition quality over redistribution in the 16-community stochastic block model graph. We see that partition quality seems to suffer significantly in the 16-community stochastic block model as we increase the number of computational threads towards the number of communities. Indeed, no improvement is observed in any simulation series when the number of threads matches the number of communities.

Figure 5.8: An alternate visualization of Figure 5.7. Here, the score of each plot is scaled by the number of processors, so that the scores in a k-thread simulation will range between 1 and k. In this way, we more clearly see the significant (relative) improvement the 4-thread case displays.

Figure 5.9: Partition quality over redistribution in the 750-community stochastic block model graph. Unlike the 16-community experiment, the addition of computational threads to a 750-community stochastic block model yields average partition score improvements consistent across most of our simulation sets. We note, though unsurprisingly, that the increased number of computational threads brings a wider variance in an individual simulation's partition score at a given replicate.

Figure 5.10: An alternate visualization of Figure 5.9. Again, the score of each plot is scaled by the number of processors, so that the scores in a k-thread simulation will range between 1 and k. We see here that the minor improvement in the 16-thread case is less significant than the increases in the 8- and 4-thread cases.

to 0.352). Conversely, the 8- and 16-thread simulations saw little or no increase in partition quality. Indeed, as shown in the background of Figure 5.7, each individual simulation saw essentially constant partition quality despite redistributions. However, the 750-community stochastic block model's graphs saw significantly improved partitioning over redistribution in all four cases. While, again, the differing number of threads brings with it a significant difference of scale in scoring, we observe a 29% increase in partition score in the 2-thread simulations, a 21% increase in the 4-thread simulations, a 30% increase in the 8-thread simulations, and an 8% increase in the 16-thread simulations.

These two experiments in conjunction suggest that either our scoring function or, more likely, our transfer protocol performs much better when there is an abundance of communities relative to the number of computational threads. However, when the number of computational threads approaches the number of communities in a stochastic block model graph, it is possible that the noise of our attraction function makes it increasingly difficult to isolate and aggregate communities to one or few threads. It may also be the case that the 16-community graph does not exhibit the strong community ties within its few communities that the 750-community graph does, given that both graphs have the same expected degree within and between communities. That is, an expected degree of 10 is much more diffuse in the 16-community graph (where communities are much larger) than in the 750-community graph.

Scaling

In this experiment, we examine partition score over redistribution as a function of the number of communities belonging to the stochastic block model graphs.
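Fixing the expected in-community degree at 10 and out-community degree at 1 (as in the previous experiments) while varying the block count pins down the generator's edge probabilities. The sketch below assumes equal-sized blocks and independent Bernoulli edges, a standard stochastic block model construction; the thesis's generator may differ in detail.

```python
def sbm_probabilities(n, k, d_in=10.0, d_out=1.0):
    """Edge probabilities of an equal-block SBM on n agents in k blocks,
    chosen so expected in- and out-community degrees are d_in and d_out."""
    size = n / k                      # agents per community
    p_in = d_in / (size - 1)          # within-community edge probability
    p_out = d_out / (n - size)        # between-community edge probability
    return p_in, p_out

# With n = 10,000 agents: more blocks means smaller and denser blocks,
# which is why communities grow more cliquish as the block count rises.
for k in (2, 8, 64, 512):
    p_in, p_out = sbm_probabilities(10_000, k)
    print(f"k={k:3d}  p_in={p_in:.4f}  p_out={p_out:.6f}")
```

At k = 512 the within-community probability already exceeds 0.5, so each block is more than half-complete as a clique, while at k = 2 it is on the order of 0.002.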
While we maintain, as in previous experiments, an expected in-community degree of 10 and out-community degree of 1, we vary the number of communities, examining stochastic block model graphs containing 2, 8, 64, and 512 blocks. As

a result, as we increase communities, each community becomes much more densely connected and cliquish. Each simulation set is computed across 2 threads. As with previous experiments, we set β = 0.02 and expected infection duration at 10 timesteps. Results are aggregated over 200 replicates for each of 10 graph instances, with redistribution occurring every 10 simulation replicates.

Partition score over each redistribution can be seen in Figure 5.11. Slightly at odds with what we observed in the previous two experiments, we note that our initial near-worst-case partition score increases slightly as we increase the number of communities, as stochastic noise yields a more pronounced deviation from a uniform distribution of a community's agents across each thread. More interestingly, we note a sharp increase in the speed at which partition score initially improves as we increase the number of communities. While this may be an artifact of our scoring function's ignorance of all but the largest captured proportion of a community, we believe it is more likely that, as community count increases and communities grow more densely connected, attraction within a community becomes more pronounced. This leads to more meaningful transfer set generation and a sharper initial increase in partition quality. Indeed, we observe in the 2-community simulation an almost logistic curve, with initial redistributions doing very little to improve partition score. Overall quality improvement is not significantly affected, with a slight overall decrease in long-term partition score as the number of communities increases.

Community Complexity

We return now to the LFR benchmark graphs used in the previous section. To reiterate, each graph is generated to be extremely similar, with only the distinction between communities deteriorating as we progress from the LFR-A test set to the LFR-B, LFR-C, and LFR-D test sets. Refer again to Section for more information on each test set.
Each simulation is run under the same conditions as in the previous section, i.e., with infectivity β = 0.02, expected infection duration of 10 timesteps, and 1%

Figure 5.11: Partition quality over redistribution in stochastic block model graphs with a varying number of stochastic blocks. As the number of communities increases, we observe much more rapid initial improvements in partition score. Overall long-term partition improvement is comparable across each simulation set.

of the population seeding infection. Unlike the previous section, we redistribute every ten replicates. The improvement in community scoring is visualized in Figures 5.12 through 5.15. As we might expect, exceedingly well-defined communities allow the transfer set protocol to perform very well. While the transfer set protocol finds improvements at each redistribution in each LFR test set, we see a marked decline in overall improvement as the definition between communities deteriorates. However, even in the most diffuse case of the LFR-D dataset, our protocol performs encouragingly well.

5.3 Performance on Real-World Graphs

In the previous section, we observed that high performance in our transfer set protocol relies heavily on strongly defined community structure. This is shown most notably in the shocking disparity in performance between the SBM and R-Mat graphs in Section 5.2.1, and explicitly examined in Section . The real world is hardly so rigidly defined. In this section, we explore performance on real-world, observed social networks, as measured by reduced computation costs over the course of multiple redistributions. As mentioned in Section 1.5.5, we make use of the Stanford Large Network Dataset Collection [35], a part of the Stanford Network Analysis Project (SNAP). We examine real-world social networks of varying sizes: the Enron network, the DBLP co-authorship network, and the YouTube friend network. As with previous iterations, redistribution occurs every ten replicates.

Observed clock times are displayed in Figures 5.16 through 5.18. In these graphs, we highlight the average simulation time for the given replicate as well as draw the least-squares fitted line. Both the Enron network and the DBLP collaboration network show slight but steady improvement with redistribution, while the YouTube graph shows no improvement and perhaps even a slight degradation in quality over multiple redistributions. Given our previous experiments, we suspect this is due to both the Enron

Figure 5.12: Here, we have scaled each partition score by the number of threads, so that all may be viewed on a single scale. Overall, the LFR-A test set of disjoint communities performs very well under our transfer set protocol. The improvement with each redistribution appears comparable across each experiment, relative to the number of threads involved.

Figure 5.13: The LFR-B test set with very loosely connected communities continues to perform well under our transfer set protocol, but performs less well than the LFR-A test set.

Figure 5.14: The LFR-C test set does not perform significantly worse than the LFR-B test set. In this set, we have many more multiple-community agents, but each of these still belongs to only two communities. While more diffuse as a whole, communities have only a small amount of intermixing.

Figure 5.15: The LFR-D test set sees a significant decline in transfer set protocol performance. In this test set, each multi-community agent belongs to ten communities, blurring the lines between communities a great deal.

Figure 5.16: Simulation clocks over redistribution in the Enron network. While subject to stochastic noise, we see clocks are slowly but steadily reduced over redistribution, with the least-squares line possessing a slope of m .

network and the DBLP networks being based on physical, personal networks, whereas the YouTube network is based on a virtual social network. It may be the case that these physical networks tend to exhibit more sharply defined communities than the online communities. Indeed, examination of these graphs reveals a clustering coefficient of in the Enron network and in the DBLP network, whereas there is only a clustering coefficient of in the YouTube network. However, there are also a significant number of degree-one agents in the YouTube network, so this may be a misleading indicator of community delineation.
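The clustering coefficients compared above can be computed as the average local clustering coefficient: for each agent, the fraction of pairs of its neighbors that are themselves adjacent, averaged over all agents. A plain-Python sketch over an adjacency-set representation (loading the SNAP edge lists into `adj` is assumed, not shown):

```python
from itertools import combinations

def avg_clustering(adj):
    """adj: dict node -> set of neighbors (undirected graph)."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # degree-one agents contribute 0, as noted above
        # Count edges among v's neighbors.
        links = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# A triangle plus one pendant vertex: the two degree-2 triangle vertices
# score 1, the hub scores 1/3, and the pendant contributes 0.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(avg_clustering(adj))  # ≈ 0.583 (7/12)
```

This convention counts low-degree agents as zeros in the average, which is exactly why the many degree-one agents in the YouTube network can depress its coefficient and make it a misleading indicator.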

Figure 5.17: Simulation clocks over redistribution in the DBLP network. We see again clocks slowly but consistently reduced, with the least-squares line possessing a slope of m .

Figure 5.18: Simulation clocks over redistribution in the YouTube network. With poorly-defined community structure, simulation clocks in the YouTube network do not exhibit improvement. The least-squares line possesses a slope of m , an increase in clocks, but five orders of magnitude smaller than the average runtime.
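The least-squares slopes quoted in Figures 5.16 through 5.18 summarize the trend of mean simulation clock time against replicate index. A minimal sketch of that fit; the clock values below are made-up placeholders drifting gently downward, not thesis data:

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# 200 replicates with runtimes shrinking by one millisecond per replicate,
# loosely mimicking the slow, steady Enron/DBLP improvement.
replicates = list(range(200))
clocks = [10.0 - 0.001 * r for r in replicates]
print(least_squares_slope(replicates, clocks))  # slope ≈ -0.001
```

A negative slope indicates clocks falling with redistribution, as in the Enron and DBLP fits; a small positive slope, as in the YouTube fit, indicates the mild degradation noted above.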


More information

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery

Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Solace JMS Broker Delivers Highest Throughput for Persistent and Non-Persistent Delivery Java Message Service (JMS) is a standardized messaging interface that has become a pervasive part of the IT landscape

More information

Chapter 9. Software Testing

Chapter 9. Software Testing Chapter 9. Software Testing Table of Contents Objectives... 1 Introduction to software testing... 1 The testers... 2 The developers... 2 An independent testing team... 2 The customer... 2 Principles of

More information

packet-switched networks. For example, multimedia applications which process

packet-switched networks. For example, multimedia applications which process Chapter 1 Introduction There are applications which require distributed clock synchronization over packet-switched networks. For example, multimedia applications which process time-sensitive information

More information

The coupling effect on VRTP of SIR epidemics in Scale- Free Networks

The coupling effect on VRTP of SIR epidemics in Scale- Free Networks The coupling effect on VRTP of SIR epidemics in Scale- Free Networks Kiseong Kim iames@kaist.ac.kr Sangyeon Lee lsy5518@kaist.ac.kr Kwang-Hyung Lee* khlee@kaist.ac.kr Doheon Lee* dhlee@kaist.ac.kr ABSTRACT

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

6.UAP Final Report: Replication in H-Store

6.UAP Final Report: Replication in H-Store 6.UAP Final Report: Replication in H-Store Kathryn Siegel May 14, 2015 This paper describes my research efforts implementing replication in H- Store. I first provide general background on the H-Store project,

More information

Modeling and Simulating Social Systems with MATLAB

Modeling and Simulating Social Systems with MATLAB Modeling and Simulating Social Systems with MATLAB Lecture 4 Cellular Automata Olivia Woolley, Tobias Kuhn, Dario Biasini, Dirk Helbing Chair of Sociology, in particular of Modeling and Simulation ETH

More information

BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE

BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE E-Guide BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE SearchServer Virtualization P art 1 of this series explores how trends in buying server hardware have been influenced by the scale-up

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A

More information

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane

More information

PERSONAL communications service (PCS) provides

PERSONAL communications service (PCS) provides 646 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 5, NO. 5, OCTOBER 1997 Dynamic Hierarchical Database Architecture for Location Management in PCS Networks Joseph S. M. Ho, Member, IEEE, and Ian F. Akyildiz,

More information

CELLULAR AUTOMATA IN MATHEMATICAL MODELING JOSH KANTOR. 1. History

CELLULAR AUTOMATA IN MATHEMATICAL MODELING JOSH KANTOR. 1. History CELLULAR AUTOMATA IN MATHEMATICAL MODELING JOSH KANTOR 1. History Cellular automata were initially conceived of in 1948 by John von Neumann who was searching for ways of modeling evolution. He was trying

More information

Supplementary material to Epidemic spreading on complex networks with community structures

Supplementary material to Epidemic spreading on complex networks with community structures Supplementary material to Epidemic spreading on complex networks with community structures Clara Stegehuis, Remco van der Hofstad, Johan S. H. van Leeuwaarden Supplementary otes Supplementary ote etwork

More information

Cluster quality assessment by the modified Renyi-ClipX algorithm

Cluster quality assessment by the modified Renyi-ClipX algorithm Issue 3, Volume 4, 2010 51 Cluster quality assessment by the modified Renyi-ClipX algorithm Dalia Baziuk, Aleksas Narščius Abstract This paper presents the modified Renyi-CLIPx clustering algorithm and

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

MULTI-FINGER PENETRATION RATE AND ROC VARIABILITY FOR AUTOMATIC FINGERPRINT IDENTIFICATION SYSTEMS

MULTI-FINGER PENETRATION RATE AND ROC VARIABILITY FOR AUTOMATIC FINGERPRINT IDENTIFICATION SYSTEMS MULTI-FINGER PENETRATION RATE AND ROC VARIABILITY FOR AUTOMATIC FINGERPRINT IDENTIFICATION SYSTEMS I. Introduction James L. Wayman, Director U.S. National Biometric Test Center College of Engineering San

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Lecture (08, 09) Routing in Switched Networks

Lecture (08, 09) Routing in Switched Networks Agenda Lecture (08, 09) Routing in Switched Networks Dr. Ahmed ElShafee Routing protocols Fixed Flooding Random Adaptive ARPANET Routing Strategies ١ Dr. Ahmed ElShafee, ACU Fall 2011, Networks I ٢ Dr.

More information

Modelling Structures in Data Mining Techniques

Modelling Structures in Data Mining Techniques Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc.

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. Table of Contents Section I: The Need for Warm Standby...2 The Business Problem...2 Section II:

More information

Computation of Multiple Node Disjoint Paths

Computation of Multiple Node Disjoint Paths Chapter 5 Computation of Multiple Node Disjoint Paths 5.1 Introduction In recent years, on demand routing protocols have attained more attention in mobile Ad Hoc networks as compared to other routing schemes

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

Determining the Number of CPUs for Query Processing

Determining the Number of CPUs for Query Processing Determining the Number of CPUs for Query Processing Fatemah Panahi Elizabeth Soechting CS747 Advanced Computer Systems Analysis Techniques The University of Wisconsin-Madison fatemeh@cs.wisc.edu, eas@cs.wisc.edu

More information

1 Introduction RHIT UNDERGRAD. MATH. J., VOL. 17, NO. 1 PAGE 159

1 Introduction RHIT UNDERGRAD. MATH. J., VOL. 17, NO. 1 PAGE 159 RHIT UNDERGRAD. MATH. J., VOL. 17, NO. 1 PAGE 159 1 Introduction Kidney transplantation is widely accepted as the preferred treatment for the majority of patients with end stage renal disease [11]. Patients

More information

Comparing Implementations of Optimal Binary Search Trees

Comparing Implementations of Optimal Binary Search Trees Introduction Comparing Implementations of Optimal Binary Search Trees Corianna Jacoby and Alex King Tufts University May 2017 In this paper we sought to put together a practical comparison of the optimality

More information

Chapter 7. Conclusions and Future Work

Chapter 7. Conclusions and Future Work Chapter 7 Conclusions and Future Work In this dissertation, we have presented a new way of analyzing a basic building block in computer graphics rendering algorithms the computational interaction between

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed

More information

Specifying and Proving Broadcast Properties with TLA

Specifying and Proving Broadcast Properties with TLA Specifying and Proving Broadcast Properties with TLA William Hipschman Department of Computer Science The University of North Carolina at Chapel Hill Abstract Although group communication is vitally important

More information

Milind Kulkarni Research Statement

Milind Kulkarni Research Statement Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

ROUTING ALGORITHMS Part 2: Data centric and hierarchical protocols

ROUTING ALGORITHMS Part 2: Data centric and hierarchical protocols ROUTING ALGORITHMS Part 2: Data centric and hierarchical protocols 1 Negative Reinforcement Time out Explicitly degrade the path by re-sending interest with lower data rate. Source Gradient New Data Path

More information

Cooperating Mobile Agents for Mapping Networks

Cooperating Mobile Agents for Mapping Networks Cooperating Mobile Agents for Mapping Networks Nelson Minar, Kwindla Hultman Kramer, and Pattie Maes MIT Media Lab, E15-305 20 Ames St, Cambridge MA 02139, USA (nelson,khkramer,pattie)@media.mit.edu http://www.media.mit.edu/

More information

A Comparison of Three Document Clustering Algorithms: TreeCluster, Word Intersection GQF, and Word Intersection Hierarchical Agglomerative Clustering

A Comparison of Three Document Clustering Algorithms: TreeCluster, Word Intersection GQF, and Word Intersection Hierarchical Agglomerative Clustering A Comparison of Three Document Clustering Algorithms:, Word Intersection GQF, and Word Intersection Hierarchical Agglomerative Clustering Abstract Kenrick Mock 9/23/1998 Business Applications Intel Architecture

More information

VOID FILL ACCURACY MEASUREMENT AND PREDICTION USING LINEAR REGRESSION VOID FILLING METHOD

VOID FILL ACCURACY MEASUREMENT AND PREDICTION USING LINEAR REGRESSION VOID FILLING METHOD VOID FILL ACCURACY MEASUREMENT AND PREDICTION USING LINEAR REGRESSION J. Harlan Yates, Mark Rahmes, Patrick Kelley, Jay Hackett Harris Corporation Government Communications Systems Division Melbourne,

More information

Analysis and optimization methods of graph based meta-models for data flow simulation

Analysis and optimization methods of graph based meta-models for data flow simulation Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 8-1-2010 Analysis and optimization methods of graph based meta-models for data flow simulation Jeffrey Harrison

More information

Benefits of Programming Graphically in NI LabVIEW

Benefits of Programming Graphically in NI LabVIEW Benefits of Programming Graphically in NI LabVIEW Publish Date: Jun 14, 2013 0 Ratings 0.00 out of 5 Overview For more than 20 years, NI LabVIEW has been used by millions of engineers and scientists to

More information

An Experiment in Visual Clustering Using Star Glyph Displays

An Experiment in Visual Clustering Using Star Glyph Displays An Experiment in Visual Clustering Using Star Glyph Displays by Hanna Kazhamiaka A Research Paper presented to the University of Waterloo in partial fulfillment of the requirements for the degree of Master

More information

Recommendation System for Location-based Social Network CS224W Project Report

Recommendation System for Location-based Social Network CS224W Project Report Recommendation System for Location-based Social Network CS224W Project Report Group 42, Yiying Cheng, Yangru Fang, Yongqing Yuan 1 Introduction With the rapid development of mobile devices and wireless

More information

Promoting Component Reuse by Separating Transmission Policy from Implementation

Promoting Component Reuse by Separating Transmission Policy from Implementation Promoting Component Reuse by Separating Transmission Policy from Implementation Scott M. Walker scott@dcs.st-and.ac.uk Graham N. C. Kirby graham@dcs.st-and.ac.uk Alan Dearle al@dcs.st-and.ac.uk Stuart

More information

Benefits of Programming Graphically in NI LabVIEW

Benefits of Programming Graphically in NI LabVIEW 1 of 8 12/24/2013 2:22 PM Benefits of Programming Graphically in NI LabVIEW Publish Date: Jun 14, 2013 0 Ratings 0.00 out of 5 Overview For more than 20 years, NI LabVIEW has been used by millions of engineers

More information

2.3 Algorithms Using Map-Reduce

2.3 Algorithms Using Map-Reduce 28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information