Sampling Strategies for Epidemic-Style Information Dissemination

This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE INFOCOM 2008 proceedings.

Milan Vojnović, Varun Gupta, Thomas Karagiannis, and Christos Gkantsidis
Microsoft Research, Carnegie Mellon University

Abstract

We consider epidemic-style information dissemination strategies that leverage the nonuniformity of host distribution over subnets (e.g., IP subnets) to optimize the information spread. Such epidemic-style strategies are based on random sampling of target hosts according to a sampling rule. In this paper, the objective is to optimize the information spread with respect to minimizing the total number of samplings to reach a target fraction of the host population. This is of general interest for the design of efficient information dissemination systems and, more specifically, to identify requirements for the containment of worms that use subnet preference scanning strategies. We first identify the optimum number of samplings to reach a target fraction of hosts, given global information about the host distribution over subnets. We show that the optimum can be achieved by either a dynamic strategy for which the per host sampling rate over subnets is allowed to vary over time or by a static strategy for which the sampling over subnets is fixed. These results appear to be novel and are informative about (a) what best possible performance is achievable and (b) what factors determine the performance gain over oblivious strategies such as uniform random scanning. We then consider several simple, online sampling strategies that require only local knowledge, where each host biases sampling based on its observed sampling outcomes and keeps only O(1) state at any point in time. Using real datasets from several large-scale Internet measurements, we evaluate the significance of the factors revealed by our analytical results on the sampling efficiency.

I. INTRODUCTION

We consider the problem of reaching a target fraction of an initially unknown population of hosts by using epidemic-style information dissemination. We make the assumption that the nodes of the population take identities from a large space (such as the space of IPv4 addresses), and that two nodes communicate only when one of them discovers the other by random probing. Initially, the information to be disseminated exists in few nodes which use random probing to discover more nodes. Discovered nodes also initiate a random probing process to further propagate the information. In this work, we describe optimum static and dynamic strategies, as well as local sub-optimal strategies to efficiently disseminate the information to a given fraction of the population.

If this information corresponds to worm-like malicious software, the problem transforms to characterizing the performance of worm propagation. In this case, we are interested in characterizing the minimum number of scans that a worm needs to infect a target fraction of susceptible machines; such knowledge is of value to designers of worm countermeasures.

In general, however, epidemic-style information dissemination is of interest in a plethora of areas, including web-service membership management [1], database maintenance [2], and streaming broadcasting [3]. In such systems, information dissemination is assisted by the underlying structure of the network (e.g., p2p overlays). However, it is of theoretical interest and practical value to understand whether the information in such systems can be efficiently disseminated without the support of an underlying network, when the population of the system and its distribution across the dissemination network are a-priori unknown.

Fig. 1. The setting under consideration: strategies for efficient sampling of hosts that are distributed over different groups.
This is consistent for example with trying to reach hosts, a large fraction of which connect to the Internet through dynamic IPs [4], ergo resulting in high population variability over time and IP space.

Specifically, to design efficient information dissemination strategies we take advantage of the non-uniformity of host distribution over subnets. In particular, we assume that hosts are partitioned into groups or subnets such as IP address blocks or subnets defined by Autonomous Systems (ASes) (see for example Fig. 1). Our goal is then to minimize the samplings required to reach a predetermined fraction of hosts without any prior knowledge of their group partitioning, and only requiring the minimum amount of state possible.

We first examine the case where perfect global knowledge of the host distribution over subnets exists. We find that the optimum in this context may be achieved by either a static or a dynamic strategy, where surprisingly both strategies can be described by the same formula. Our analysis provides insights about the best possible performance that can be achieved. We show that while the selection of the right target set of subnets is crucial to achieve the optimum, further sampling optimization within this target set only bears minimal benefits.

While it may be argued that the knowledge required by the optimum strategy (i.e., the distribution of hosts over subnets) can be estimated through various sources such as, for example, intrusion detection systems, BGP tables or even reports of infected hosts from earlier worm attacks, we present evidence that such an inference might not be as straightforward. Using a set of diverse real Internet measurements, we highlight that different datasets provide distinct viewpoints of the Internet topology, stressing the need for sampling strategies that require no prior knowledge of subnet partitioning, but are instead able to deduce this information online. Motivated by this finding, we propose simple strategies that bias their sampling based on previous sampling attempts using just O(1) state.

Our contributions can be summarized as follows:

- We describe the optimum static and dynamic strategies to minimize the number of samplings required to reach a target fraction of the host population (Sec. III-A and IV-A). Note that these appear to be novel results.
- We propose and evaluate simple local strategies (Sec. V) that require no prior knowledge of the host distribution over subnets, and yet significantly outperform uniform random scanning as well as the local-subnet preference strategy.
- Using Internet measurement datasets, we evaluate and provide insights regarding the factors that determine the performance of the optimal and the sub-optimal strategies. We further highlight that different datasets provide distinct perspectives of the Internet topology both across the measurement traces, and also across time within the same trace (Sec. VI).

To the best of our knowledge, our work is the first to fully describe the optimum with respect to minimizing the number of samplings required to reach a fraction of hosts. While subnet preference sampling has been analyzed earlier [5] and the objective of minimizing samplings has been identified as important also in previous work [6], the authors there restricted themselves to studying sub-optimal strategies.

II. ASSUMPTIONS AND NOTATION

We first introduce the notation that we use in the remainder of the paper. We denote the size of the total address space with $\Omega$ (e.g., for IPv4, $\Omega = 2^{32}$). The address space is divided into $J$ subnets with subnet $j$ having an address space of size $\Omega_j$.
We will refer to the hosts that received the information item of our dissemination system as infected hosts and to those that are interested in receiving the information item but have not yet received it as susceptible hosts. This nomenclature is commonplace in the literature. The number of hosts in a subnet $j$ that are interested in the information item is denoted with $N_j$, with $N$ defined as the total number of hosts (i.e., $N = \sum_{j=1}^{J} N_j$). We denote with $I_j(t)$ the number of infected hosts in subnet $j$ at time $t$. Similarly, we let $S_j(t) = N_j - I_j(t)$ be the number of susceptible hosts in subnet $j$ at time $t$. The quantities $n_j$, $i_j(t)$ and $s_j(t)$ are the normalized versions defined as follows: $n_j = N_j/N$, $i_j(t) = I_j(t)/N$, and $s_j(t) = S_j(t)/N$. We denote with $i(t)$ and $s(t)$ the total fraction of infected and susceptible hosts, respectively, i.e. $i(t) = \sum_{j=1}^{J} i_j(t)$ and $s(t) = \sum_{j=1}^{J} s_j(t)$. Also, let $\omega_j = \Omega_j/\Omega$ be the fraction of the total address space occupied by the $j$th subnet. Without loss of generality, we assume that subnets are enumerated such that

$$\frac{S_1(0)}{\Omega_1} \ge \frac{S_2(0)}{\Omega_2} \ge \cdots \ge \frac{S_J(0)}{\Omega_J} \ge 0 \tag{1}$$

and $S_1(0)/\Omega_1 > 0$.

We let $\eta$ be the rate at which an infected host samples the address space for a susceptible host. Without loss of generality, we assume that $\eta = 1$. In the most general setting, at time $t$, an infected host in a subnet $i$ decides to sample a node in a subnet $j$ with probability $p_{ij}(t)$. Once it chooses subnet $j$, it samples an address lying in subnet $j$'s address space uniformly at random and then initiates a contact to this address. Let $\beta$ be the density of hosts, i.e., $\beta = N/\Omega$. Define $\beta_{ij}(t) = \beta\,\omega_j^{-1} p_{ij}(t)$.

We consider the many host limit where the total number of hosts $N$ tends to be large while the following parameters are held fixed: $N/\Omega$, $\Omega_j/\Omega$, and $i_j(0)$, $j = 1, \ldots, J$. Under the assumption that each host initiates samplings at instances of a Poisson process with rate 1, the infected host population is described by a Markov process indexed with $N$.
The host population frequencies converge with $N$ uniformly on any compact time interval to the solution of the following system of ordinary differential equations (e.g., see Kurtz [7])

$$\frac{d}{dt} i_j(t) = \sum_{i=1}^{J} \beta_{ij}(t)\, i_i(t)\, s_j(t) \tag{2}$$

with $s_j(t) = n_j - i_j(t)$, for $j = 1, \ldots, J$.

Given the density of hosts over subnets $n_j$ and initial placement of infected hosts $i_j(0)$, seeking an optimal sampling strategy is equivalent to finding the functions $\beta_{ij}(t)$, $t \ge 0$. The optimal strategy $\beta$ would depend on the imposed constraints, such as, for example, for static subnet preference that $\beta$ is time invariant. $\beta_{ij}$ will depend only on $j$ if one does not allow the nodes to scan their subnet faster or have privileged information about their subnet. In that case, all the subnets have the same subnet sampling bias.

In the sequel, we will denote with $u(t)$ the total number of samplings per host by time $t$, i.e.

$$u(t) = \int_0^t i(x)\, dx. \tag{3}$$

III. STATIC SUBNET PREFERENTIAL SAMPLING

We consider a class of sampling strategies for which the subnet preference probability distribution $p$ is fixed in time. For this subset of sampling strategies, Eq. (2) boils down to:

$$\frac{d}{dt} i_j(t) = \beta_j\, i(t)\, s_j(t) \tag{4}$$

where $\beta_j := \beta p_j/\omega_j$, $j = 1, \ldots, J$. From Eq. (4), it follows

$$i_j(u) = n_j - s_j(0)\, e^{-\beta_j u}, \quad j = 1, \ldots, J \tag{5}$$

where $u(t)$, $t \ge 0$, is given by $u(0) = 0$ and

$$\frac{d}{dt} u(t) = 1 - \sum_{j=1}^{J} s_j(0)\, e^{-\beta_j u(t)}, \quad t \ge 0. \tag{6}$$

Note that for any static sampling strategy, the time dynamics of the infected host population over subnets is entirely described by Eq. (5) and (6).

We briefly revisit uniform random sampling, the well known S-I epidemics, which we will use recurrently as a reference as it is a commonplace sampling rule in practice. Note that with uniform random sampling $p_j = \omega_j$, i.e. a subnet $j$ is sampled proportional to the address space size of the subnet $j$. Uniform random sampling is easy to analyze; for example, note that

$$s(u) = s(0)\, e^{-\beta u}, \quad u \ge 0. \tag{7}$$
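Eqs. (5)-(7) can be evaluated directly. The following is a minimal Python sketch, not taken from the paper: the function names, the three-subnet example and the value of $\beta$ are illustrative assumptions. It computes the susceptible fraction after $u$ samplings per host under a static preference $p$, and integrates Eq. (6) with a simple forward-Euler step to map time to samplings.

```python
import math

def susceptible_static(u, s0, omega, p, beta):
    """Total susceptible fraction after u samplings per host, from Eq. (5):
    s_j(u) = s_j(0) * exp(-beta_j * u) with beta_j = beta * p_j / omega_j."""
    return sum(s * math.exp(-beta * pj * u / w)
               for s, w, pj in zip(s0, omega, p))

def samplings_over_time(t_end, s0, omega, p, beta, dt=1e-3):
    """Forward-Euler integration of Eq. (6): du/dt = 1 - sum_j s_j(0) exp(-beta_j u).
    Assumes the n_j sum to 1, so that i(t) = 1 - s(t)."""
    u, t = 0.0, 0.0
    while t < t_end:
        u += dt * (1.0 - susceptible_static(u, s0, omega, p, beta))
        t += dt
    return u

# Hypothetical example: three subnets, 1% of hosts initially infected.
omega = [0.1, 0.3, 0.6]    # address-space shares (sum to 1)
s0 = [0.5, 0.3, 0.19]      # initial susceptible fractions s_j(0)
beta = 0.5                 # illustrative host density, not a measured value

uniform = susceptible_static(1.0, s0, omega, omega, beta)         # p_j = omega_j
biased = susceptible_static(1.0, s0, omega, [0.8, 0.1, 0.1], beta)
```

With $p_j = \omega_j$ the first function reduces to Eq. (7). In this example, biasing toward the densest subnet leaves fewer susceptible hosts than uniform sampling at $u = 1$.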

We will see later that for real-life distributions of hosts over subnets, uniform random sampling is grossly less efficient than an optimal strategy that minimises the samplings for a given target fraction of infected hosts.

The following preliminary result provides a comparison of static sampling strategies and uniform random sampling; proof in Appendix [8].

Proposition 1 The set of static sampling strategies is characterised as follows:
(a) For any static sampling strategy specified by the subnet preference distribution $p$ such that

$$\sum_{j=1}^{J} p_j \frac{S_j(0)}{\Omega_j} < \frac{S(0)}{\Omega}$$

the total fraction of infected hosts is smaller than under uniform random sampling, for any given fraction of samplings, i.e. $i_p(u) < i_\omega(u)$ for all $u > 0$, where $i_p(u)$ and $i_\omega(u)$ are the total fractions of infected hosts under static sampling $p$ and uniform random sampling $\omega$, respectively.
(b) For any subnet preference distribution $p$ such that $p_j < \omega_j$ for some subnet $j$ with $s_j(0) > 0$, $i_p(u) < i_\omega(u)$ for some $u > 0$.

Item (a) identifies a sufficient condition under which a static sampling strategy $p$ is less efficient than uniform random sampling. These are sampling strategies that, in an average sense, bias sampling toward sparsely populated subnets. Item (b) entails that any static sampling strategy other than uniform random sampling over a set of subnets, each containing susceptible hosts, is worse than uniform random sampling over this set of subnets, for some total fraction of samplings $u$. In the next section, we identify the optimal static sampling strategy that, in the case of nonuniformly dense subnets, requires a smaller number of samplings than uniform random sampling, for a given target fraction of infected hosts.

A. Optimal Static Strategy

In this section, we identify the optimal static sampling strategy that minimises the total number of samplings to reach a given fraction of infected hosts $i_0$. We show that the static strategy OPT-STATIC, specified by the following subnet preference probabilities, is optimal:

OPT-STATIC: Each infected host scans a subnet $j$ with probability

$$p_j = \begin{cases} \alpha\, \omega_j \log\left( \dfrac{s_j^A(0)}{\omega_j^A} \cdot \dfrac{1}{1 - \frac{i_0 - i(0)}{\sum_{k \in A} s_k(0)}} \right), & j \in A \\ 0, & j \notin A \end{cases} \tag{8}$$

where $\alpha$ is the normalization constant and $A$ is the set of subnets $\{1, 2, \ldots, J_0\}$ with

$$J_0 = \max\left\{ j : \frac{s_j(0)}{\omega_j} > \frac{\sum_{k=1}^{j} s_k(0) - (i_0 - i(0))}{\sum_{k=1}^{j} \omega_k} \right\}.$$

Here we use the following notation: $s_j^A(0) := s_j(0) / \sum_{k \in A} s_k(0)$ and $\omega_j^A := \omega_j / \sum_{k \in A} \omega_k$, $j \in A$.

The strategy specifies to sample a set $A$ of initially densest subnets. The necessary condition for a set $A$ to be optimal is that the initial density of susceptibles in every subnet in $A$ must be larger than the final density of susceptibles in $A$. Indeed, if we target the set $A$ then after the target infection is reached, the final density of susceptibles in $A$ is given by¹

$$\frac{\sum_{j \in A} S_j(t_0)}{\sum_{j \in A} \Omega_j} = \frac{\sum_{j \in A} S_j(0) - (I_0 - I(0))}{\sum_{j \in A} \Omega_j}$$

where $t_0$ is the time when the fraction of infected hosts $i_0$ is reached. Furthermore, note that if a subnet $j$ is in the target set $A$, all subnets whose initial density of susceptibles is larger than the initial density of susceptibles in $j$ are also in $A$. Lastly, note that the strategy does not necessarily target the smallest densest set of subnets. One may need to target a larger set because, even though this may slow the initial phase, the density of susceptibles will still be sufficient as the infection reaches the target. Finally, note that in Eq. (8) the subnet preference probability for a subnet that is in the target set $A$ is an expression containing a term logarithmic in the initial density of this subnet.

The next result establishes that OPT-STATIC is optimal over all static sampling strategies.
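The OPT-STATIC rule can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `opt_static`, its return shape, and the example values are assumptions. Subnets must already be sorted by decreasing initial density, as in Eq. (1).

```python
import math

def opt_static(s0, omega, i0, i_init):
    """Target-set size J0, preference probabilities p (Eq. (8)) and the
    normalization constant alpha. Inputs are per-subnet susceptible fractions
    s0 and address shares omega, sorted by decreasing s0[j] / omega[j]."""
    delta = i0 - i_init                      # fraction of hosts still to infect
    J0, cum_s, cum_w = 0, 0.0, 0.0
    for j, (s, w) in enumerate(zip(s0, omega)):
        cum_s += s
        cum_w += w
        # Subnet j qualifies if its initial density exceeds the final
        # density of the prefix set {1, ..., j}.
        if s / w > (cum_s - delta) / cum_w:
            J0 = j + 1
    S_A, W_A = sum(s0[:J0]), sum(omega[:J0])
    if delta >= S_A:
        raise ValueError("target fraction is not reachable within the target set")
    weights = [omega[j] * math.log((s0[j] / S_A) / (omega[j] / W_A)
                                   / (1.0 - delta / S_A))
               for j in range(J0)]
    alpha = 1.0 / sum(weights)
    p = [alpha * w for w in weights] + [0.0] * (len(s0) - J0)
    return J0, p, alpha

# Hypothetical example: reach i0 = 51% starting from 1% infected.
J0, p, alpha = opt_static([0.5, 0.3, 0.19], [0.1, 0.3, 0.6], i0=0.51, i_init=0.01)
```

In this example the two densest subnets form the target set and the sparsest subnet is never sampled. Plugging `p` into Eq. (5) with $u = 1/(\alpha\beta)$ leaves exactly $s(0) - (i_0 - i(0))$ susceptible hosts, which is one way to sanity-check the reconstruction.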
Theorem 2 For any given target fraction of infected hosts, the strategy OPT-STATIC is optimal in minimising the total number of samplings over all static sampling strategies. The total number of required samplings for a target fraction of infected hosts $i_0$ is given by

$$u_{STA}(i_0, A) = \frac{1}{\beta} \sum_{j \in A} \omega_j \left[ \log \frac{1}{1 - \frac{i_0 - i(0)}{\sum_{k \in A} s_k(0)}} - D(\omega^A \,\|\, s^A(0)) \right]$$

where $D(\cdot \| \cdot)$ denotes Kullback-Leibler (KL) divergence.²

Proof: See Appendix [8].

¹ In addition, at the end of the infection, all subnets in the target set will have equal density of susceptible nodes.
² Kullback-Leibler divergence between two probability measures $p$ and $q$ is defined by $D(p \| q) := \sum_i p(i) \log(p(i)/q(i))$.

What does this tell us? We compare the required total number of samplings of the optimal static strategy to uniform random sampling. We first note the following easy result:

Proposition 3 Under uniform random sampling of a subset of subnets $A$, we have that the per host total number of required samplings for a target fraction of infected hosts $i_0$ is given by

$$u_{UNI}(i_0, A) = \frac{1}{\beta} \sum_{j \in A} \omega_j \log \frac{1}{1 - \frac{i_0 - i(0)}{\sum_{j \in A} s_j(0)}}.$$

We note that the required total number of samplings for OPT-STATIC differs from that of uniform random sampling only in the KL term $D(\cdot)$, presuming that both strategies target the set $A$ specified by OPT-STATIC. The KL term measures the deviation between the initial susceptible host population sizes $s_j(0)$ and the subnet address space sizes $\omega_j$ over subnets in the target set. In particular, if the address sizes of subnets are equal, then the KL term measures the deviation of $s_j(0)$ from the uniform distribution on the target set $A$. Note that

$$D(\omega^A \,\|\, s^A(0)) = \log \frac{A_{\omega^A}(\rho(0))}{G_{\omega^A}(\rho(0))}$$

where $\rho_j(0) := s_j(0)/\omega_j$ is the density of a subnet $j$, and $A_{\omega^A}(\rho(0))$ and $G_{\omega^A}(\rho(0))$ are the arithmetic and geometric means, respectively:

$$A_{\omega^A}(\rho(0)) = \sum_{j \in A} \frac{\omega_j}{\sum_{k \in A} \omega_k}\, \rho_j(0), \qquad G_{\omega^A}(\rho(0)) = \prod_{j \in A} \rho_j(0)^{\omega_j / \sum_{k \in A} \omega_k}.$$

Indeed, we have that $D(\omega^A \| s^A(0)) > 0$ unless all subnets are of the same density, i.e., $s_i(0)/\omega_i = s_j(0)/\omega_j$ for all $i, j \in A$ (elementary property of means, Hardy, Littlewood, and Pólya [9, Section 2.5]).

The question of interest is how significant the KL term in Theorem 2 is relative to the logarithmic term therein. If the KL term is insignificant relative to the logarithmic term, then this suggests that, provided that one identifies the target set $A$, simple uniform random sampling over the set $A$ would yield nearly-optimal performance. Furthermore, it is of interest to evaluate how critical the selection of the target set is to the resulting total number of samplings. We will address these questions in Section VI.

IV. DYNAMIC SAMPLING STRATEGIES

In this section we consider strategies for which subnet preference probabilities are allowed to vary over time.

A. Optimal Dynamic Strategy

We now consider what optimal performance can be achieved over the entire set of feasible sampling strategies. A priori, it may not be clear whether enlarging the set of strategies from static to dynamic will yield better performance. We show here that the answer is no. The following scheme is optimal over the entire set of dynamic sampling strategies:

OPT: At any point in time $t$, with $S(t)$ the subset of densest subnets, each infected host samples uniformly at random an address over the address space of the subnets in $S(t)$.
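The relation between Theorem 2 and Proposition 3 (Section III-A) can be checked numerically. The Python sketch below is illustrative only: the function names and the two-subnet target set are assumptions. It computes the KL term in two ways, directly and as the log-ratio of the weighted arithmetic and geometric means of the densities, and then compares the two sampling budgets.

```python
import math

def kl_term(s0_A, omega_A):
    """D(omega^A || s^A(0)) over the target set, after normalising both vectors."""
    S, W = sum(s0_A), sum(omega_A)
    return sum((w / W) * math.log((w / W) / (s / S))
               for s, w in zip(s0_A, omega_A))

def log_mean_ratio(s0_A, omega_A):
    """log(A/G) of the densities rho_j = s_j(0)/omega_j, weighted by omega^A."""
    W = sum(omega_A)
    arith = sum((w / W) * (s / w) for s, w in zip(s0_A, omega_A))
    log_geo = sum((w / W) * math.log(s / w) for s, w in zip(s0_A, omega_A))
    return math.log(arith) - log_geo

def u_uni(s0_A, omega_A, delta, beta):
    """Proposition 3: samplings per host for uniform sampling of the set A,
    where delta = i0 - i(0)."""
    return sum(omega_A) / beta * math.log(1.0 / (1.0 - delta / sum(s0_A)))

def u_sta(s0_A, omega_A, delta, beta):
    """Theorem 2: samplings per host for OPT-STATIC on the same set A."""
    log_term = math.log(1.0 / (1.0 - delta / sum(s0_A)))
    return sum(omega_A) / beta * (log_term - kl_term(s0_A, omega_A))

# Hypothetical target set of two subnets, delta = 0.5, beta = 0.5.
s0_A, omega_A = [0.5, 0.3], [0.1, 0.3]
gap = u_uni(s0_A, omega_A, 0.5, 0.5) - u_sta(s0_A, omega_A, 0.5, 0.5)
```

The gap between the two budgets equals $(\sum_{j \in A} \omega_j / \beta)\, D(\omega^A \| s^A(0))$, so uniform sampling of the right target set is near-optimal when the KL term is small relative to the logarithmic term.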
The optimality of this strategy would appear to be very intuitive. The strategy was claimed to be optimal in [6]; however, we are unaware of a proof in the literature that shows that this is indeed an optimal strategy. We also characterize the host evolution for this strategy, which we later use to compare with sub-optimal strategies.

Theorem 4 OPT has the following properties:
1) The strategy is optimal in that at any point of its execution the fraction of infected hosts is maximised.
2) For any given target fraction of infected hosts $i_0$, the total number of samplings is given by the relation in Theorem 2.
3) For the total number of samplings $u \ge 0$, the fraction of susceptible hosts is given by, for $u_{n-1} \le u < u_n$,

$$s(u) = \sum_{j=n+1}^{J} s_j(0) + G_n\, e^{-\beta u / \sum_{j=1}^{n} \omega_j} \tag{9}$$

where

$$G_n = \sum_{j=1}^{n} \omega_j \prod_{j=1}^{n} \left( \frac{s_j(0)}{\omega_j} \right)^{\omega_j / \sum_{k=1}^{n} \omega_k}$$

and $u_0 = 0$ with

$$u_n = \frac{\sum_{j=1}^{n} \omega_j}{\beta} \log\left( \frac{\omega_{n+1}}{\sum_{k=1}^{n} \omega_k} \cdot \frac{G_n}{s_{n+1}(0)} \right) \tag{10}$$

for $n = 1, 2, \ldots, J-1$, and $u_J = +\infty$.

The result entails the following corollary:

Corollary 5 For any given target fraction of infected hosts $i_0$, there exists a static sampling strategy that achieves the smallest possible total number of samplings to infect the fraction of hosts $i_0$ over all dynamic sampling strategies. This static sampling strategy is OPT-STATIC.

We end this sub-section with a comparison with uniform random sampling.

Corollary 6 For uniform random sampling, we have
1) With the target fraction of infected hosts going to 1, the total number of samplings is asymptotically optimal.
2) In the prevailing limit, the fractions of susceptible hosts under uniform random sampling and under the optimal strategy satisfy

$$\lim_{u \to +\infty} \frac{s_{UNI}(u)}{s_{OPT}(u)} = e^{D(\omega \| s(0))} \tag{11}$$

where $D(\omega \| s(0))$ is the Kullback-Leibler divergence between $(\omega_1, \ldots, \omega_J)$ and $(s_1(0), \ldots, s_J(0))$.

Item 1 is rather intuitive. Let $i_0 = 1 - 1/N$ be the fraction where all but one host are infected. Then, from Theorem 4, we have that the log term is logarithmic in $N$ while the KL term is a constant, thus asymptotically negligible.
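Item 3 of Theorem 4 describes a phase-by-phase evolution: in phase $n$ the $n$ densest subnets are sampled uniformly until their common density drops to that of subnet $n+1$. A minimal Python sketch of Eqs. (9) and (10) follows. The function name and example values are illustrative assumptions, and the code assumes strictly positive $s_j(0)$ with subnets sorted by decreasing density.

```python
import math

def s_opt(u, s0, omega, beta):
    """Susceptible fraction under OPT after u samplings per host (Eqs. (9)-(10)).
    Requires s0[j] > 0 and subnets sorted by decreasing s0[j] / omega[j]."""
    J = len(s0)
    for n in range(1, J + 1):
        W = sum(omega[:n])
        # G_n: address share of the n densest subnets times the weighted
        # geometric mean of their initial densities.
        G = W * math.prod((s0[j] / omega[j]) ** (omega[j] / W) for j in range(n))
        if n < J:
            u_n = W / beta * math.log(omega[n] / W * G / s0[n])
        else:
            u_n = math.inf
        if u < u_n:
            return sum(s0[n:]) + G * math.exp(-beta * u / W)

# Hypothetical three-subnet example.
s0, omega, beta = [0.5, 0.3, 0.19], [0.1, 0.3, 0.6], 0.5
early = s_opt(0.1, s0, omega, beta)   # phase 1: only the densest subnet is sampled
late = s_opt(0.6, s0, omega, beta)    # phase 2: the two densest subnets are sampled
```

At $u = 0$ the function returns $s(0)$. Evaluating it at the budget from Theorem 2 returns the target susceptible fraction, which illustrates item 2 of the theorem.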
Item 2 follows directly from Theorem 4, Proposition 3, and Eq. (7).

B. Proportional Sampling: A Sub-Optimal Dynamic Strategy

The optimal strategies discussed so far would in many cases be difficult to implement as they require global knowledge about densities of susceptible hosts over subnets. In particular, the optimal static strategy requires knowing initial densities of susceptible hosts over subnets, while the dynamic optimum strategy requires knowing the subset of densest subnets at any point in time during its execution.

In this section, we consider a sampling strategy that is based on sampling a subnet proportional to the number of susceptible hosts in this subnet. It follows from the analysis below that, in general, proportional sampling is a sub-optimal strategy. It is not clear, though, how far proportional sampling is from optimal for distributions of hosts over subnets in practice; we investigate this in Section VI. We next characterise a generalized version of proportional sampling (which we call PROP) that is a mix of uniform random sampling and sampling proportional to the density of susceptibles per subnet. The strategy is specified by the parameter $0 \le q \le 1$ denoting the probability that a host samples a subnet proportional to the number of susceptibles in this subnet. ($q = 1$ corresponds to pure proportional sampling.)

PROP: An infected host with probability $q$ samples a subnet proportional to the current number of susceptible hosts in this subnet, or else samples a subnet by uniform random sampling of the entire address space.

It turns out that under strategy PROP the fraction of susceptible hosts for any given total number of samplings $u \ge 0$ is given in a simple analytical form:

Theorem 7 The fraction of susceptible hosts in a subnet $j$ is given by

$$s_j(u) = s_j(0)\, \frac{e^{-\beta(1-q)u}}{1 + \frac{s_j(0)}{\omega_j}\psi(u)} \tag{12}$$

where $\psi$ is the implicit function

$$\sum_{k=1}^{J} \omega_k \log\left( 1 + \frac{s_k(0)}{\omega_k}\psi(u) \right) = \beta q u. \tag{13}$$

The proof (Appendix [8]) shows that in fact the dynamics under PROP is entirely described by a scalar differential equation for $\psi$ and that the evolution of the susceptible hosts in a subnet $j$ is given by the function of $\psi$ given in the theorem.

V. SAMPLING STRATEGIES THAT USE ONLY LOCAL KNOWLEDGE

We consider sampling strategies that are local in that each host biases its sampling over subnets based solely on the observed successes and failures of its own samplings. Moreover, we confine our attention to strategies that at any time keep state of only a fixed number of subnets with respect to the total number of subnets in the system. We consider several sampling strategies and describe their dynamics by differential systems. This enables our numerical evaluations in Section VI.

A. Local Subnet Preference

We first consider the well-known local subnet preference strategy (e.g., used by the CodeRed-II worm), defined as follows:

LOC-PREF: An infected host in a subnet $j$, with probability $q_j$, samples an address uniformly over the address space of the subnet $j$, or else it samples an address uniformly over the entire address space.

This strategy can be seen as a dynamic sampling strategy specified by the following subnet preference probabilities

$$p_j(t) = \omega_j \left( 1 - \sum_{k=1}^{J} q_k v_k(t) \right) + q_j v_j(t) \tag{14}$$

where $v_j(t) := i_j(t) / \sum_{k=1}^{J} i_k(t)$.

B. K-FAIL Strategy

We next consider another strategy that biases to subnets from which a host observes successful samplings. The strategy is described as follows ($K \ge 1$ is a configuration parameter):

K-FAIL:
1) Initially, infected hosts use uniform random sampling.
2) When a host that performs uniform random sampling successfully samples a host in a subnet $j$, it continues to sample uniformly at random on the subnet $j$ until $K$ consecutive scan failures.
3) Upon $K$ consecutive failures, the host switches to uniform random sampling, until it successfully samples a susceptible host, when it goes to item 2.

We are unaware of a dissemination system in practice that uses this strategy. The strategy is perhaps most closely related to the Zotob family of worms, which use a similar but different strategy. With some Zotob worms, an infected host starts with scanning its local subnet until $K$ consecutive failures, and then switches to uniform random scanning and remains a uniform random scanner indefinitely. Instead, the K-FAIL strategy sticks to a subnet from which it successfully samples a susceptible host and may switch back and forth between the uniform random sampling mode and the sticking-to-a-subnet mode. We next describe the dynamics of the K-FAIL strategy.
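The per-host logic of K-FAIL amounts to a small state machine. The Python sketch below illustrates rules 1) to 3) under stated assumptions: the class and method names are hypothetical, a subnet is identified by its index, and sampling the entire address space uniformly is modelled as picking subnet $j$ with probability $\omega_j$.

```python
import random

class KFailHost:
    """O(1)-state sketch of one K-FAIL host (hypothetical helper class)."""

    def __init__(self, K, omega, rng=None):
        self.K = K
        self.omega = list(omega)      # address-space share of each subnet
        self.rng = rng or random.Random()
        self.subnet = None            # None means uniform random sampling mode
        self.fails_left = 0           # failures tolerated before leaving the subnet

    def choose_subnet(self):
        """Subnet of the next sample: the current subnet if sticking to one,
        otherwise a subnet drawn with probability omega_j."""
        if self.subnet is not None:
            return self.subnet
        return self.rng.choices(range(len(self.omega)), weights=self.omega)[0]

    def observe(self, subnet, success):
        """Update the state with the outcome of a sample sent to `subnet`."""
        if success:
            # Stick to the subnet and reset the consecutive-failure budget.
            self.subnet, self.fails_left = subnet, self.K
        elif self.subnet is not None:
            self.fails_left -= 1
            if self.fails_left == 0:
                self.subnet = None    # K consecutive failures: back to uniform

host = KFailHost(K=2, omega=[0.5, 0.5], rng=random.Random(0))
host.observe(1, True)     # success in subnet 1 while sampling uniformly
stuck_on = host.choose_subnet()
```

The host stores only the current subnet and a counter, which matches the O(1) state requirement of this section.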
Each infected host is in one of $K + 1$ states: 0, denoting the state in which the host performs uniform random sampling, or state $k$, where $K - k$ denotes the number of successive failures that the host has already incurred, $k = 1, \ldots, K$. We denote with $r_0$ the fraction of infected hosts that are in state 0 and with $r_{j,k}$ the fraction of infected hosts in a subnet $j$ that are in state $k$. The dynamics of the system is entirely described by the following differential equations:

$$\frac{d}{dt} s_j = -\beta \left( r_0 + \frac{1}{\omega_j} \sum_{k=1}^{K} r_{j,k} \right) s_j$$

$$\frac{d}{dt} r_0 = \sum_{i=1}^{J} r_{i,1} \left( 1 - \beta \frac{s_i}{\omega_i} \right) - r_0\, \beta \sum_{i=1}^{J} s_i$$

$$\frac{d}{dt} r_{j,k} = -r_{j,k} + r_{j,k+1} \left( 1 - \beta \frac{s_j}{\omega_j} \right), \quad 1 \le k < K$$

$$\frac{d}{dt} r_{j,K} = -r_{j,K} + 2 \left( r_0\, \beta s_j + \beta \frac{s_j}{\omega_j} \sum_{k=1}^{K} r_{j,k} \right).$$

The equations capture the transitions of host states. We use this system in our numerical evaluations in Section VI.

C. Subnet Preference Strategy

We consider a subnet preference sampling strategy where each host maintains a candidate set of a fixed number $K \ge 1$ of subnets. Each host splits its effort between sampling subnets from its candidate set and uniform random sampling. Again, we are unaware of a system that uses this strategy.

It appears natural to consider strategies for which each host keeps a (small) list of subnets and uses sorting strategies on the candidate subnets in the list, such as for example the move-to-front sort heuristic, in order to bias its sampling to subnets at the head of the list. We instead consider the K-CANDSET strategy introduced below, which keeps no order of the items in the list, for ease of analysis. Indeed, for a candidate set of size at most 1, i.e. $K = 1$, the two schemes are equivalent. In this case, it can be considered as a generalization of local subnet preference sampling by letting the preference subnet not be fixed to the local subnet but change to subnets from which the host observes successful samplings.

The strategy is introduced in the following, where $0 \le q < 1$ is a configuration parameter.

K-CANDSET:
1) Init: infected hosts set their candidate sets according to a policy (default: empty set).
2) With probability $q$, a host samples a subnet by picking uniformly at random from its candidate set.
3) Otherwise, the host samples by uniform random sampling of the entire address space. If the sampling to a subnet $k$ is successful and subnet $k$ is not in the candidate set of this host, the following happens:
a) If the candidate set of this host is smaller than $K$, then subnet $k$ is added to the candidate set,
b) else, the subnet $k$ replaces a subnet from the candidate set, evicted uniformly at random.
4) A host infected by an instigator host inherits the candidate set of the instigator host (the candidate set of the instigator updated after successful sampling of this host).

The dynamics of the above described sampling strategy is entirely specified by the following system of ordinary differential equations. In order to keep the notation simple, we display equations only for the special case $K = 1$, but note that similar equations are easily derived for the general case.
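The K-CANDSET rules translate into a small per-host data structure. The following Python sketch is illustrative: the class and method names are hypothetical, and it assumes that a host with an empty candidate set always falls back to uniform random sampling, a case the rules above do not spell out.

```python
import random

class CandSetHost:
    """Sketch of one K-CANDSET host (hypothetical helper class)."""

    def __init__(self, K, q, omega, candidates=(), rng=None):
        self.K, self.q = K, q
        self.omega = list(omega)              # address-space share of each subnet
        self.candidates = list(candidates)    # at most K subnet indices, unordered
        self.rng = rng or random.Random()

    def choose_subnet(self):
        """Rules 2 and 3: with probability q pick from the candidate set,
        otherwise pick subnet j with probability omega_j."""
        if self.candidates and self.rng.random() < self.q:
            return self.rng.choice(self.candidates)
        return self.rng.choices(range(len(self.omega)), weights=self.omega)[0]

    def on_success(self, subnet):
        """Rule 3a/3b: add the subnet, evicting a random member if the set is full."""
        if subnet in self.candidates:
            return
        if len(self.candidates) < self.K:
            self.candidates.append(subnet)
        else:
            self.candidates[self.rng.randrange(self.K)] = subnet

    def spawn(self):
        """Rule 4: a newly infected host inherits a copy of the updated set."""
        return CandSetHost(self.K, self.q, self.omega, self.candidates, self.rng)

parent = CandSetHost(K=1, q=0.5, omega=[0.5, 0.5], rng=random.Random(1))
parent.on_success(0)      # the set was empty, so subnet 0 is added
parent.on_success(1)      # the set is full (K = 1), so subnet 1 replaces subnet 0
child = parent.spawn()
```

Calling `on_success` before `spawn` reproduces the order in rule 4, where the infected host inherits the instigator's set as updated by the successful sampling.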
Let $r_k$ denote the fraction of infected hosts of type $k$, i.e., those with the candidate set $\{k\}$, and let $r_0$ denote the fraction of hosts of type 0, i.e., with empty candidate set.

$$\frac{d}{dt} s_j = -\beta s_j \left[ (1-q) \sum_{k=0}^{J} r_k + q \left( r_0 + \frac{r_j}{\omega_j} \right) \right]$$

$$\frac{d}{dt} r_0 = -r_0\, \beta \sum_{k=1}^{J} s_k$$

$$\frac{d}{dt} r_j = 2(1-q) \sum_{k=0}^{J} r_k\, \beta s_j + 2 q\, r_0\, \beta s_j + r_j \left[ q \beta \frac{s_j}{\omega_j} - (1-q) \beta \sum_{k=1}^{J} s_k \right].$$

Again, we will use these equations to derive our numerical results in Section VI.

VI. EXPERIMENTAL RESULTS

In this section we perform an extensive evaluation of the strategies described throughout Sections III-V. We first outline the set of analysed datasets, which cover diverse Internet measurements reflecting the distributions of IP addresses over the IP space (Section VI-A). We then examine the factors that determine the dynamics of the optimum strategy (Section VI-B) and finally we evaluate the performance of the proposed strategies that require no side knowledge about the distribution of hosts over subnets (Section VI-C).

Fig. 2. Subnet density for all datasets.

A. Datasets

Our datasets consist of measurement traces of IP addresses. We aggregate IPs into groups or subnets of various sizes such as for example /8 and /16 subnets, or into groups based on Autonomous Systems (AS). Without loss of generality we will use /16 subnet groups for the remainder of the paper unless otherwise specified. Throughout our evaluation, we make use of the following datasets:³

- WU: The dataset refers to IIS logs collected at the Windows Update system [10]. In our experiments we will use the 117 million IP addresses observed during the first day of the measurement. (Populated /16s: )
- Hotmail: The dataset consists of approximately 103 million IP addresses which were observed over a period of three months from logs of user-logins at the Hotmail service [4]. (Populated /16s: )
- DShield: The dataset consists of roughly 7.6 million IP addresses that were collected by a set of firewall and intrusion detection systems provided by DShield [11] and were used in [12] and [13]. This dataset may contain spoofed source IP addresses. (Populated /16s: )
- Witty: A list of IPs (roughly 55 thousand) corresponding to hosts spreading the Witty worm, provided by CAIDA [14]. (Populated /16s: 5271.)

Fig. 2 presents the density of hosts in /16 subnets as seen in each of our datasets. Density here refers to the fraction $N_j/N$, while the x-axis presents the rank of each subnet with respect to its density (i.e., $x = 1$ refers to a densest subnet). Note that while overall the shapes of the curves appear similar, densities of distinct subnets appear quite different across the datasets. We will extensively examine these differences in the following section, as they significantly impact the performance of the sampling strategies, e.g. the selection of the target set $A$ described in Section III-A.

³ Due to space limitations results will be presented for only some datasets interchangeably. Our findings however apply to all datasets.
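The aggregation step described above, grouping observed addresses into /16 subnets and ranking them by the density $N_j/N$, can be sketched with the Python standard library. The function name and the sample addresses are illustrative assumptions, not part of the datasets.

```python
import ipaddress
from collections import Counter

def subnet_densities(ips, prefix_len=16):
    """Group IPv4 addresses by their enclosing /prefix_len subnet and return
    (subnet, N_j / N) pairs ranked from densest to sparsest."""
    counts = Counter(
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) for ip in ips
    )
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return [(str(net), count / total) for net, count in ranked]

# Hypothetical trace of three observed addresses.
ranked = subnet_densities(["10.1.2.3", "10.1.200.7", "10.2.0.1"])
```

Here `ranked[0]` is the densest /16 (rank $x = 1$ in Fig. 2). Passing `prefix_len=8` gives the /8 grouping mentioned above.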

Fig. 3. Log term vs. KL term in the four datasets. The KL term does not appear to be significant.

Fig. 4. Optimal strategy (opt) vs. uniform random scanning of the target set (rnd) and the optimal set of densest subnets (rndopt). All curves fall on top of one another.

Fig. 5. LEFT: Augmenting the target set size and its effect on the number of samplings. RIGHT: The optimal target set size and the minimum number of sets to cover a fraction of the population.

B. Optimal Strategy Evaluation

We first examine the factors that affect the total number of samplings for the optimal sampling strategy. Note that in Theorem 2 we found that the total number of samplings per susceptible host depends on the logarithmic and the KL term therein. We now examine the significance of these two individual factors. This is of interest because, if it turns out that the KL term can be neglected relative to the logarithmic term, then the implication is that simple uniform random sampling of the target set $A$ would already be near-optimal and fine-tuning of the subnet preference probabilities may not be needed.

Fig. 3 specifically examines these two factors in the four datasets as the fraction of infected hosts $i_0$ grows. Indeed, in all cases we observe that the KL term is several orders of magnitude smaller than the logarithmic term, especially in the larger datasets for smaller $i_0$. Furthermore, while for the DShield and Witty datasets the KL term appears larger for a range of values for small $i_0$, this is only a side-effect of the smaller number of IP addresses in these datasets and does not reflect true operational regions (e.g., for $i_0 = 10^{-6}$ in the Witty dataset there are no hosts to be infected since the total number of IP addresses is roughly 55 thousand).
The above observation suggests that optimizing the scanning rates over the subnets of the optimal target set yields an insignificant or moderate gain compared to simply using uniform random scanning over the optimal target set. This is further validated by Fig. 4, which compares the optimal strategy with uniform random sampling within the target set and with uniform random sampling on the smallest set of densest subnets that covers the target fraction of infected hosts (the rnd and rndopt curves in the figure). In Fig. 4, all curves fall on top of one another, showing that these strategies perform similarly if the optimal target set has been well identified.

Fig. 6. Examples of employing a non-optimal target set with a prior distribution. TOP: Prior distribution: DShield. True: Hotmail. BOTTOM: Prior distribution: WU. True: Hotmail.

However, the choice of the target set is critical with respect to the total number of samplings, since a poor choice of the target set $A$ may have a dramatic impact on the number of samplings required to reach a target fraction of the host population. To evaluate the discrepancy from the optimal strategy under a poor choice of $A$, we perform the following experiment. For a given target fraction of infected hosts $i_0$, we augment the optimal target set $A$ by adding subnets in decreasing order of their densities, and we then examine the incurred penalty. For example, if the optimum target set is $A = \{1, 2, \ldots, j\}$, we define sets $A'$ as $A \cup \{j+1\}$, $A \cup \{j+1, j+2\}$, etc. In our evaluations, we enlarge the target set by a factor $k$, i.e., $k = |A'|/|A|$. This effect is shown in Fig. 5 (left) where, for various $i_0$ of the WU dataset, we show the ratio of the number of samplings by uniform random sampling of the set $A'$ to that of uniform random sampling of the optimum target set $A$ vs. the target fraction of infected hosts. For reference, Fig. 5 (right) presents the size of the optimal target set $A$ and the minimum possible set of subnets that covers a given fraction $i_0$ of the target host population.

We observe that as $i_0$ increases, the penalty factor becomes quite significant, especially when $i$ is larger than 1% (e.g., to reach roughly 14% of the population, a tenfold increase of the target set produces 5 times more samplings). When $i_0$ is small, augmenting the target set size does not incur a high penalty, since $i_0$ is already reachable by the densest subnets already in $A$. Fig. 5 also shows that the penalty increases roughly linearly with $k$ for a given $i_0$. (In Fig. 5, left, $i$ denotes $i_0$.)

The criticality of the target set is also highlighted when the optimal static sampling strategy is configured using a prior distribution of hosts over subnets and then applied to a host population following a different distribution. We compare the final fraction of sampled hosts when this strategy is applied to (a) a population distributed according to the prior distribution, and (b) a population distributed according to another, true reference distribution. This can be seen as having imperfect knowledge about the true distribution, which can result in using a non-optimal target set $A$.
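The construction of the augmented target sets used in the experiment of Fig. 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the starting set is the smallest set of densest subnets covering the target fraction, used here as a stand-in for the optimal target set $A$ of Section III-A, which this section does not restate. The names `densest_cover` and `augment` are hypothetical, and the number of samplings itself is not computed because its formula is not given here.

```python
import math


def densest_cover(host_counts, target_fraction):
    """Smallest set of densest subnets covering target_fraction of hosts.

    Returns (chosen, order): the indices of the chosen subnets and the
    full ordering of subnet indices from densest to sparsest.
    """
    order = sorted(range(len(host_counts)), key=lambda j: -host_counts[j])
    total = sum(host_counts)
    covered, chosen = 0, []
    for j in order:
        if covered >= target_fraction * total:
            break
        chosen.append(j)
        covered += host_counts[j]
    return chosen, order


def augment(chosen, order, k):
    """Enlarge the target set by a factor k = |A'| / |A|.

    Subnets are added in decreasing order of density, as in the text.
    """
    size = min(len(order), math.ceil(k * len(chosen)))
    return order[:size]


# Hypothetical host counts per subnet, for illustration only.
counts = [50, 5, 30, 10, 5]
target, order = densest_cover(counts, 0.7)
enlarged = augment(target, order, k=2)
print(target, enlarged)
```

Comparing the sampling cost over `enlarged` with the cost over `target` for several values of `k` would give the penalty ratio that the text reports for Fig. 5 (left).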
Fig. 6 presents the results of two such experiments: one using DShield as the prior and Hotmail as the true distribution, and one using WU as the prior and Hotmail as the true distribution. For reference, we also plot the cumulative fraction of hosts of the true distribution residing in the densest subnets of the prior distribution that hold a given total fraction of the prior population. Fig. 6 highlights that the discrepancy with respect to the optimal is significant.

How sub-optimal might the target set $A$ be under partial knowledge? Two important factors determine how useful a prior distribution might be:

1. The fraction of the population in the true distribution for which the corresponding subnets in the prior are empty. The importance of this factor is highlighted when comparing Hotmail vs. DShield. Approximately 30% of Hotmail hosts reside in subnets which are empty in the DShield dataset ($x = 1$, $y = 0.7$ in the top right plot of Fig. 6). This implies that when using DShield as a prior for Hotmail, we can never reach more than 70% of Hotmail's population.

2. The difference in the distribution of hosts among the populated subnets. The importance of this factor is highlighted when using WU vs. Hotmail. In both cases, approximately 10% of hosts in one dataset reside in subnets which are empty in the other dataset. However, the curves comparing the distributions look drastically different. The plot of Hotmail's distribution in the densest WU subnets, as a function of the cumulative fraction of hosts in WU, is almost a straight line and very close to $y = x$ (Fig. 6, bottom right). Thus, using WU as a prior for Hotmail performs better than using DShield; yet the other way round (i.e., Hotmail as a prior for WU) does not (plot omitted due to lack of space; similar to Fig. 6, top).

Fig. 7. Comparing the densest subnets across the datasets and across time.

Fig. 8. The impact of the group size on the KL term and the samplings under the optimal strategy.
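The reference curve described above, i.e., the cumulative fraction of true-distribution hosts found in the densest subnets of the prior, can be sketched as follows. This is a minimal illustration assuming each distribution is given as a mapping from a subnet identifier to a host count; the function name `prior_coverage_curve` and the toy data are hypothetical.

```python
def prior_coverage_curve(prior, true):
    """Coverage of the true population by the densest subnets of the prior.

    prior, true: dicts mapping a subnet identifier to its host count.
    Returns a list of (x, y) points, one per populated prior subnet taken
    from densest to sparsest, where x is the cumulative fraction of prior
    hosts and y is the cumulative fraction of true hosts in those subnets.
    """
    prior_total = sum(prior.values())
    true_total = sum(true.values())
    ranked = sorted(
        (s for s in prior if prior[s] > 0),
        key=lambda s: (-prior[s], s),
    )
    x = y = 0
    points = []
    for subnet in ranked:
        x += prior[subnet]
        y += true.get(subnet, 0)  # subnets empty in the true data add nothing
        points.append((x / prior_total, y / true_total))
    return points


# Toy data: 30% of the true hosts sit in a subnet the prior never saw.
prior = {"a": 6, "b": 3, "c": 1}
true = {"a": 2, "b": 5, "d": 3}
curve = prior_coverage_curve(prior, true)
print(curve)
```

In this toy case the curve ends at $y = 0.7$ when $x = 1$, mirroring the first factor above: hosts in subnets that are empty in the prior can never be reached by a strategy configured from that prior.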
To further examine the differences between the distributions across the various datasets, we compare the analysed traces by examining the fraction of common densest subnets within the set of top-$k$ densest subnets of each distribution. Fig. 7 presents the extent to which the densest subnets vary from one distribution to another by comparing the datasets pairwise. Similarly, Fig. 7 (right) presents the same result for different points in time for the WU dataset, where timing information is available. Surprisingly, the discrepancy appears to be substantial. For example, for $k = 10$ the sets of the 10 densest subnets of the WU and the Hotmail distributions have only one subnet in common, while no common subnets exist when considering the DShield distribution. While the densest subnets across time for the same distribution appear to be more stable, the discrepancy may still be nontrivial. These observations clearly raise the problem of accurately estimating the true subnet distribution in the Internet.

Finally, we consider the impact of how hosts are partitioned into subnets on the performance of the optimal strategy. To this end, Fig. 8 presents the KL term and the number of samplings under the optimal strategy for various subnet definitions of the WU trace. Specifically, IP addresses are grouped into /n subnets, with n ranging from 8 to 20. We find that while grouping does not appear to have a significant effect on the KL term, the number of samplings increases as the groups become larger (i.e., as n becomes smaller). (This is expected from our analytical results, as splitting subnets into sub-subnets indeed enlarges the set over which the optimisation is done.) This effect highlights that smaller groupings appear more attractive, since effort is concentrated in small populated subnets rather than spent on redundant samplings in large subnets, which are in general much sparser. However, smaller groupings may add to the complexity of identifying the optimum target set of subnets.

Fig. 9. The fraction of sampled hosts $i(u)$ vs. the number of samplings $u$ for sub-optimal strategies. LEFT: Proportional sampling. MIDDLE: Local-knowledge strategies. RIGHT: Zooming in on the middle figure for small $i(u)$.

C. Sub-optimal strategies

Here, we examine the performance of our proposed strategies with respect to the optimal and the uniform random sampling strategies. We start by studying proportional sampling (PROP, Section IV-B) in Fig. 9 (left), where we plot PROP vs. the optimal (OPT) and uniform random sampling (RND) for all datasets. In all cases, PROP closely follows the optimal. This is an interesting property, as it suggests that proportional sampling already brings us very close to the optimal and may inform the design of online sampling strategies.

Similarly, Fig. 9 (middle) and Fig. 9 (right) show the performance of sampling strategies that only take advantage of local knowledge (Section V) for the WU dataset. For reference, we also present the optimal, uniform random and PROP strategies. Fig. 9 (right) zooms in on a particular range of Fig. 9 (middle) for small $i$. From these figures we can make the following observations for this specific dataset: Local subnet preference strategies perform close to RND for small $q$ (the local sampling probability).
For larger $q$, this strategy appears to suffer from persistently sampling exhausted subnets, especially for larger $i$, thus performing close to RND or worse, while showing better performance for small $i(u)$. K-FAIL appears to consistently outperform all other local strategies and asymptotically follows the optimal strategy. CANDSET appears to suffer from issues similar to those of the local preference strategy, persisting in scanning subnets that have been exhausted. For smaller $i$ (Fig. 9, right), we see that both CANDSET and K-FAIL perform very close to the optimal (up to roughly $i = 0.1$). Note, however, that $i = 0.1$ for the WU dataset corresponds to a substantial number of hosts (over 10 million). Overall, all our proposed strategies perform significantly better than uniform random sampling and the local subnet preference strategy in the majority of the cases.

VII. CONCLUDING REMARKS

This paper studies the problem of epidemic-style information dissemination using random sampling. We identify optimal static and dynamic strategies to reach a target fraction of the host population in the minimum number of samplings. We also propose and evaluate simple strategies that use no prior information and constant state, and that provide significant gain. Future work may further investigate the space of simple strategies that perform near-optimally.

REFERENCES

[1] W. Vogels and C. Re, "WS-Membership Failure Management in a Web-Services World," in Proc. of the Twelfth International World Wide Web Conference.
[2] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry, "Epidemic algorithms for replicated database maintenance," in Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing.
[3] L. Massoulie, A. Twigg, C. Gkantsidis, and P. Rodriguez, "Decentralized broadcasting algorithms," in Proc. of the IEEE INFOCOM.
[4] Y. Xie, F. Yu, K. Achan, E. Gillum, M. Goldszmidt, and T. Wobber, "How Dynamic are IP Addresses?" in Proc. of the ACM SIGCOMM.
[5] C. C. Zou, D. Towsley, and W. Gong, "On the performance of internet worm scanning strategies," Performance Evaluation, vol. 63.
[6] Z. Chen and C. Ji, "A Self-Learning Worm Using Importance Scanning," in ACM WORM 05, Fairfax, Virginia, USA.
[7] T. Kurtz, Approximation of Population Processes. CBMS-NSF Regional Conference Series in Applied Mathematics, 1981, vol. 36.
[8] M. Vojnovic, V. Gupta, T. Karagiannis, and C. Gkantsidis, "Sampling Strategies for Epidemic-Style Information Dissemination," Microsoft Research, MSR-TR, Tech. Rep.
[9] G. Hardy, J. E. Littlewood, and G. Pólya, Inequalities, 2nd ed. Cambridge Mathematical Library.
[10] C. Gkantsidis, T. Karagiannis, P. Rodriguez, and M. Vojnovic, "Planet Scale Software Updates," in Proc. of the ACM SIGCOMM.
[11] DShield: Cooperative Network Security Community.
[12] P. Barford, R. Nowak, R. Willett, and V. Yegneswaran, "Toward a Model for Source Addresses of Internet Background Radiation," in Proc. of the Passive and Active Measurement Conference.
[13] Z. Chen and C. Ji, "Measuring network-aware worm spreading ability," in Proc. of the IEEE INFOCOM.
[14] Cooperative Association for Internet Data Analysis.
