Summarizing Data using Bottom-k Sketches

Edith Cohen
AT&T Labs Research, 180 Park Avenue, Florham Park, NJ 07932, USA
edith@research.att.com

Haim Kaplan
School of Computer Science, Tel Aviv University, Tel Aviv, Israel
haimk@cs.tau.ac.il

ABSTRACT

A bottom-k sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank drawn from a probability distribution that depends on the weight of the item and including the k items with smallest rank value.

Bottom-k sketches are an alternative to k-mins sketches [9], which consist of the k minimum ranked items in k independent rank assignments, and of min-hash [5] sketches, where hash functions replace random rank assignments. Sketches support approximate aggregations, including weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable for datasets where items lie in some metric space such as data streams (time) or networks. These sketches compactly encode the respective plain sketches of all neighborhoods of a location. They support queries posed over time windows or neighborhoods and time/spatially decaying aggregates.

An important advantage of bottom-k sketches, established in a line of recent work, is much tighter estimators for several basic aggregates. To materialize this benefit, we must adapt traditional k-mins applications to use bottom-k sketches. We propose all-distances bottom-k sketches and develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches.

Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators.
(In fact, we obtain tighter estimators since each bottom-k sketch is a distribution over k-mins sketches.)

Categories and Subject Descriptors: E.2 Data Storage Representations; G.3: probabilistic algorithms; E.1 Data Structures

General Terms: Algorithms, Measurement, Performance, Theory

Keywords: all-distances sketches, data streams, bottom-k sketches

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PODC'07, August 12-15, 2007, Portland, Oregon, USA.
Copyright 2007 ACM 978-1-59593-616-5/07/0008...$5.00.

1. INTRODUCTION

Sketching or sampling is an extremely useful tool for storage and queries on massive data sets. Sketches allow us to process approximate queries on the original data sets while occupying a fraction of the storage space required for the full data set and using a fraction of the computation resources required for the exact answer. The value of a sketching method depends on the efficiency of its implementation, its versatility in terms of the operations supported, and the quality of the estimates obtained.

Bottom-k and k-mins sketches are summaries of a set of items with positive weights. k-mins sketches (the min-rank method [9]) are obtained by assigning independent random ranks to items, where the distribution used for each item depends on the weight of the item. We retain the minimum rank of an item in the set. This is repeated with k independent rank assignments for some integer k, and we obtain a k-vector of independent minimum ranks and k independent weighted samples.

Bottom-k sketches are an emerging alternative to k-mins sketches. Bottom-k sketches are constructed using a single rank assignment. The bottom-k sketch of a subset contains the k items with smallest ranks in the subset.
Bottom-k sketches were mentioned, without analysis, in [9, 22].

The sketch supports approximate query processing over the original data set and subpopulations of this dataset. Basic aggregations include the weight of the set or the selectivity of a subpopulation (subset) of the set, and derived aggregations include approximate quantiles, average weight, and variance and higher moments []. The sketch of a set is a weighted random sample. When used with exponentially distributed ranks, bottom-k sketches are a weighted sample without replacement (WS-sketches) whereas k-mins sketches are a weighted sample with replacement (WSR-sketches).

In applications where there are multiple subsets that are defined over the same ground set of items, a sketch is produced for each subset. The sketches of different subsets are coordinated, sharing the same rank assignments to the items of the ground set, and support queries over subset relations, such as the weight of the union or intersection, their weight ratio, and resemblance or Jaccard similarity coefficient. A useful property of coordinated sketches is that the sketch of a union can be computed from the sketches of the subsets. Therefore, given sketches of subsets, we can perform aggregations on unions of subsets.

An example of an application with multiple subsets is when items are associated with nodes of a directed graph and we compute k-mins sketches for the reachability set of each node. These sketches can be computed in $\tilde{O}(km)$ time (and storage) whereas an explicit representation of the subsets requires $O(mn)$ time [9]. Applications include maintaining a sketch of influencing events for each process in a computer system [5]: when a process A affects process B, the new sketch of B becomes the sketch of the union. Using the property that the sketches reduce the approximate sum problem to that of finding a minimum, k-mins sketches were also used for aggregations on gossip networks [2].

Other applications with multiple subsets where sketches support fast computation of subset relations are near-duplicate detection for Web pages [5] (a sketch is produced for each Web page), study of similar Web sites [2], mining of association rules [22] from market basket data, and eliminating redundant network traffic [23]. In these applications, a variant termed min-hash sketches substitutes random rank assignments with random hash functions (families of min-wise independent hash functions or ε-min-wise functions [5, 6]). With random hash functions, the rank assignment of an item depends on the item identifier, and it has the property that all copies of the same item across different subsets obtain the same rank, without additional bookkeeping or coordination between all occurrences of each item. This allows for efficient aggregations over distinct occurrences (see [9]) and supports subset-relation queries.

Bottom-k sketches encode more information than k-mins sketches. (Intuitively, sampling without replacement is more informative than sampling with replacement.) A line of recent work showed that bottom-k sketches are superior to k-mins sketches in terms of estimate quality. Estimators for subpopulation weight using priority ranks (PRI-sketches) were provided in [, 24] and estimators for general families of rank functions were provided in [, 2]. The improvement in estimate quality is significant on weight distributions and values of k such that items are likely to be sampled multiple times in a k-sample drawn with replacement, such as skewed Zipf-like distributions that often arise in practice.
For subset relations such as the weight of the intersection or union, bottom-k sketches improve over k-mins sketches even when weights are uniform [, 2]: carefully designed estimators are applied to the combined bottom-k sketches, which reveal more members of the union and intersection than two corresponding k-mins sketches.

Our contributions. We facilitate the use of bottom-k sketches by developing and analyzing data structures that construct these sketches. Our results allow applications that use k-mins sketches to use the superior bottom-k sketches. An inherent difference we had to tackle is that k-mins sketches are obtained using k independent rank functions, which allows for k independent copies of the same simple data structure to be used, whereas bottom-k entries are dependent.

Sketches are constructed incrementally as items are processed. The sketch is manipulated through two basic operations: a test operation, which tests if the sketch has to be updated, and an update operation, which inserts the new item if the sketch indeed has to be updated. We make this distinction since test operations can be performed much more efficiently than update operations. The number of update operations depends on the order in which items are processed and on the weight distribution of the data. The number of test operations is typically larger than the number of updates. The extent to which it is larger, however, depends heavily on the application.

We distinguish between applications with explicit representation [3, 2, 22, 23] or implicit representation [9, 3, 5] of the data. In applications with an explicit representation, item-subset pairs are provided explicitly. The dataset could be distributed, presented as a data stream, or in external memory, but the pairs are explicitly provided and are all processed to produce the sketches. In applications with implicit representation, the subsets are specified as neighborhoods in a graph or some metric space. With explicit representation, the number of test operations is much larger than the number of update operations.
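The test/update distinction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the data structure analyzed in this paper: the class name `BottomKSketch`, its `tests` and `updates` counters, and the use of a binary heap in place of a sorted list are all choices made for the example, and ranks are drawn uniformly (the uniform-weight case).

```python
import heapq
import random


class BottomKSketch:
    """Illustrative bottom-k sketch with separate test and update operations."""

    def __init__(self, k):
        self.k = k
        self._heap = []    # max-heap via negated ranks: entries are (-rank, item)
        self.tests = 0     # number of test operations performed
        self.updates = 0   # number of update operations performed

    def test(self, rank):
        """Constant-time check: would an item with this rank change the sketch?"""
        self.tests += 1
        return len(self._heap) < self.k or rank < -self._heap[0][0]

    def update(self, rank, item):
        """O(log k) modification: insert the item, evicting the largest rank if full."""
        self.updates += 1
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-rank, item))
        else:
            heapq.heapreplace(self._heap, (-rank, item))

    def process(self, rank, item):
        if self.test(rank):
            self.update(rank, item)

    def entries(self):
        """Sketch entries as (rank, item) pairs, smallest rank first."""
        return sorted((-neg_rank, item) for neg_rank, item in self._heap)


# Hypothetical usage: 10,000 uniform-weight items presented in a fixed order.
rng = random.Random(0)
ranks = [rng.random() for _ in range(10_000)]
sketch = BottomKSketch(k=16)
for item, rank in enumerate(ranks):
    sketch.process(rank, item)
print("tests:", sketch.tests, "updates:", sketch.updates)
```

Every item costs one test, but only the items that enter the current sketch cost an update, so the update counter stays far below the test counter on an input like this one.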
In Section 3 we analyze the number of test and update operations and how it depends on the way the data is presented and on the distribution of the item weights.

All-distances sketches are a generalization of plain sketches that are used when the underlying dataset has items associated with locations in some metric space, and subsets are specified by neighborhoods of a location. All-distances k-mins sketches were used for data streams (where aggregation is over windows of elapsed time to the present time) [4], the Euclidean plane (where we are presented with a query point and distance) [3, 2], a graph (the query is a node and distance) [9], or distributed spatial aggregation over a network [9, 3]. An all-distances sketch is a compact encoding of the plain sketches of all neighborhoods of a certain location q. For a given distance d, the sketch for the d-neighborhood of the location can be constructed from the all-distances sketch. All-distances sketches also support time-decaying and spatially-decaying aggregates using arbitrary decay functions [4, 3]. In Section 4 we define bottom-k all-distances sketches and present efficient data structures for maintaining both all-distances k-mins sketches and all-distances bottom-k sketches. We analyze the number of operations required to construct all-distances sketches under different arrival orders of the items.

In Section 6 we provide a method to derive WSR-sketches (k-mins with exponential ranks) from WS-sketches (bottom-k with exponential ranks). This mimicking process provides a general method of applying estimators designed for WSR-sketches to WS-sketches. It enables us to use bottom-k sketches in applications (such as those with explicit representation of the data) where they can be obtained much more efficiently than k-mins sketches, and to use readily available WSR-sketch estimators. In fact, since each WS-sketch corresponds to a distribution over WSR-sketches, we obtain estimators with smaller variance than the underlying WSR-sketch estimators. This reduction also shows that WS-sketches are strictly superior to WSR-sketches.
We provide examples of applications of the mimicking process.

2. PRELIMINARIES

Let $I$ be a ground set of items, where item $i \in I$ has weight $w(i)$. A rank assignment maps each item $i$ to a random rank $r(i)$. The ranks of items are drawn independently using a family of distributions $f_w$ ($w \ge 0$), where the rank of an item with weight $w(i)$ is drawn according to $f_{w(i)}$. We use random rank assignments to obtain sketches of subsets as follows. For a subset $J$ of items and a rank assignment $r$ we define $B_1(r, J) = \arg\min_{j \in J} r(j)$ to be the item in $J$ with smallest rank according to $r$. For $i \in \{1, \ldots, |J|\}$, we define $B_i(r, J)$ to be the item in $J$ with the $i$th smallest rank according to $r$, and $r_i(J) \equiv r(B_i(r, J))$ to be the $i$th smallest rank value in $J$ according to $r$.

Definition 2.1. k-mins sketches are produced from $k$ independent rank assignments, $r^{(1)}, \ldots, r^{(k)}$. The k-mins sketch of a subset $J$ is the k-vector $(r^{(1)}_1(J), r^{(2)}_1(J), \ldots, r^{(k)}_1(J))$. To support some queries, we may need to include with each entry an identifier or some other attributes such as the weight of the items $B_1(r^{(j)}, J)$ ($j = 1, \ldots, k$).

Definition 2.2. Bottom-k sketches are produced from a single rank assignment $r$. The bottom-k sketch $s(r, J)$ of the subset $J$ is a list of entries $(r_i(J), w(B_i(r, J)))$ for $i = 1, \ldots, k$. The list is ordered by rank, from smallest to largest.

The bottom-k sketch of a subset is therefore a list with up to $k$ entries. The size of the list is the minimum of $k$ and the number of items in the subset. For a single item (a subset of size 1), the bottom-k sketch is a list with a single entry $(r_1(J), w(B_1(r, J)))$. To support queries, in addition to the weight, entries in the sketch may include an identifier and attribute values of items $B_i(r, J)$ ($i = 1, \ldots, k$).

Bottom-k and k-mins sketches have the following useful property: the sketch of a union of two sets can be generated from the sketches of the two sets. Let $J$ and $H$ be two subsets. For any rank assignment $r$, $r_1(J \cup H) = \min\{r_1(J), r_1(H)\}$. Therefore, for k-mins sketches we have $(r^{(1)}_1(J \cup H), \ldots, r^{(k)}_1(J \cup H)) = (\min\{r^{(1)}_1(J), r^{(1)}_1(H)\}, \ldots, \min\{r^{(k)}_1(J), r^{(k)}_1(H)\})$. For bottom-k sketches, the $k$ smallest ranks in the union $J \cup H$ are contained in the union of the sets of the $k$ smallest ranks in each of $J$ and $H$. That is, $s(r, J \cup H) \subseteq s(r, J) \cup s(r, H)$. Therefore, the bottom-k sketch of $J \cup H$ can be computed by taking the entries with the $k$ smallest ranks in the combined sketches of $J$ and $H$.

To support sketch-based set operations and queries, we need to store the rank values of items. To perform sketch-based queries on a single subset, however, we do not need all rank values. With bottom-k sketches, it is sufficient to store the $(k+1)$st smallest rank value, $r_{k+1}$: we (re)draw random rank values for each item $i$ in the sketch using $f_{w(i)}$ conditioned on the rank being smaller than $r_{k+1}$. This is just like (re)drawing a random bottom-k sketch from the probability subspace where the minimum rank of items not in the sketch is equal to $r_{k+1}$ and all items in the sketch have ranks smaller than $r_{k+1}$. Beyond reduced storage, this observation often enables us to obtain tighter estimators. The unbiased rank conditioning estimator for subpopulation weight [, 2] is applied to the value $r_{k+1}$ and the weights of the items in the (unordered) sketch. In some cases, however, it is easier to derive an estimator that is applied to the ordered sketch with rank values (the mimicking process in Section 6 is applied to an ordered bottom-k sketch).
In this case, instead of applying an estimator to the original sketch and rank values, we take its expectation over re-drawn sketches or its average over multiple draws (if the expectation is hard to compute). This results in an estimator with at most the same variance and often smaller variance. Correctness follows from a basic property of variances:

Lemma 2.3. Let $a_1$ and $a_2$ be two random variables over $\Omega$. Suppose there is a partition of $\Omega$ such that the value of $a_2$ on each part is equal to the expectation of $a_1$ on that part. Then $\mathrm{VAR}(a_2) \le \mathrm{VAR}(a_1)$.

The choice of which family of random rank functions to use matters only when items are weighted. Otherwise, sketches produced using one rank function can be transformed to any other rank function.

WS-sketches and WSR-sketches. A convenient choice for the rank function $f_w$ is an exponential distribution with parameter $w$ [9]. The density function of this distribution is $f_w(x) = w e^{-wx}$, and its cumulative distribution function is $F_w(x) = 1 - e^{-wx}$. We refer to k-mins sketches with these ranks as WSR-sketches and to bottom-k sketches with these ranks as WS-sketches.

The minimum rank $r_1(J)$ of an item in a subset $J \subset I$ is exponentially distributed with parameter $w(J) = \sum_{i \in J} w(i)$. This follows from the fact that the minimum of random variables, each drawn from an exponential distribution, is also an exponentially distributed random variable with parameter equal to the sum of the parameters of these distributions. The item with the minimum rank, $B_1(r, J)$, is a weighted random sample from $J$: the probability that an item $i \in J$ is the minimum rank item is $w(i)/w(J)$. (We assume, to simplify the analysis, that all random values are distinct.) Therefore we can conclude that a WSR-sketch of size $k$ of a subset $J$ is a weighted random sample of size $k$, drawn with replacement from $J$ (hence the term WSR-sketches). The ranks of these items are a set of $k$ independent samples from an exponential distribution with parameter $w(J)$. Hence, if the weight $w(J)$ is provided and we do not use subset-relation queries, rank values are redundant.
If $w(J)$ is not provided, the rank values can be used in unbiased estimators for both $w(J)$ and the inverse weight $1/w(J)$ [9] (see footnote 2). On the other hand, the items in a WS-sketch are samples drawn without replacement from $J$:

Lemma 2.4. A WS-sketch of size $k$ of a subset $J$ is a sample of size $k$ drawn without replacement from $J$.

PROOF. The probability that item $i \in J$ is $B_1(r, J)$ is $w(i)/w(J)$. Conditioned on the bottom-$j$ ranked items in $J$ being $i_1, \ldots, i_j$, $B_{j+1}(r, J)$ is $i \in J \setminus \{i_1, \ldots, i_j\}$ with probability $w(i)/(w(J) - \sum_{h=1}^{j} w(i_h))$.

If the weight $w(J)$ is provided and we do not use the sketches for subset-relation queries, it suffices to store the unordered set of items in $s(r, J)$. This information allows us to draw at random a bottom-k sketch from the probability subspace that contains all sketches where the set of the bottom-k ranked items is $s(r, J)$.

PRI-sketches. With priority ranks [8, ] the rank value of an item with weight $w$ is selected uniformly at random from $[0, 1/w]$. This is equivalent to choosing rank value $r/w$, where $r \in U[0, 1]$ is selected from the uniform distribution on the interval $[0, 1]$. It is well known that if $r \in U[0, 1]$ then $-\ln(r)/w$ is an exponential random variable with parameter $w$. Therefore exponential ranks correspond to using rank values $-\ln(r)/w$ where $r \in U[0, 1]$.

Choice of a rank function. The appeal of PRI-sketches is estimators that (nearly) minimize $\sum_{i \in I} \mathrm{VAR}(\tilde{w}(i))$ [24], where $\tilde{w}(i)$ denotes the estimate (adjusted weight) of $w(i)$. More precisely, Szegedy showed that the sum of per-item variances using PRI-sketches of size $k$ is no larger than the smallest sum of variances attainable by an estimator that uses sketches with average size $k-1$ (see footnote 3). WS-sketches offer several other distinct advantages. First, they support unbiased estimators for selectivity (subpopulation fraction). Second, the estimators for selectivity, and for subpopulation weight when the weight of the set is known (as in data streams), feature negative covariances between different items. Therefore, selectivity and weight estimators for larger subpopulations are much tighter than with the known estimator for PRI-sketches [2].
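The definitions of this section can be sketched in Python as follows. This is only an illustration under assumptions: the function names (`exponential_rank`, `priority_rank`, `assign_ranks`, `bottom_k`, `k_mins`, `merge_bottom_k`) and the toy weights are hypothetical, sketches store `(rank, item)` pairs rather than `(rank, weight)` pairs, and all ranks are assumed distinct.

```python
import math
import random


def exponential_rank(weight, u):
    """Rank -ln(u)/w, exponentially distributed with parameter w for uniform u."""
    return -math.log(u) / weight


def priority_rank(weight, u):
    """Priority rank u/w, uniform on [0, 1/w] for uniform u."""
    return u / weight


def assign_ranks(weights, rank_fn, rng):
    """One independent rank per item; `weights` maps item -> positive weight."""
    # 1 - rng.random() lies in (0, 1], which avoids log(0).
    return {item: rank_fn(w, 1.0 - rng.random()) for item, w in weights.items()}


def bottom_k(ranks, subset, k):
    """Bottom-k sketch: (rank, item) pairs with the k smallest ranks in `subset`."""
    return sorted((ranks[i], i) for i in subset)[:k]


def k_mins(rank_assignments, subset):
    """k-mins sketch: the minimum-rank entry under each of k rank assignments."""
    return [min((r[i], i) for i in subset) for r in rank_assignments]


def merge_bottom_k(sketch_a, sketch_b, k):
    """Sketch of the union of two subsets from their coordinated sketches."""
    return sorted(set(sketch_a) | set(sketch_b))[:k]


# Hypothetical usage on two overlapping subsets of a 200-item ground set.
rng = random.Random(1)
weights = {i: 1.0 + (i % 5) for i in range(200)}
ranks = assign_ranks(weights, exponential_rank, rng)
J, H, k = set(range(0, 120)), set(range(80, 200)), 8
merged = merge_bottom_k(bottom_k(ranks, J, k), bottom_k(ranks, H, k), k)
assert merged == bottom_k(ranks, J | H, k)

# A k-mins sketch needs k independent rank assignments instead of one.
assignments = [assign_ranks(weights, exponential_rank, rng) for _ in range(k)]
wsr_sketch = k_mins(assignments, J)
```

The merge works only because both sketches share one rank assignment (they are coordinated); sketches built from independently drawn ranks could not be combined this way.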
Unbiased subpopulation weight estimators exist for bottom-k sketches obtained using arbitrary rank functions [2]. These estimators are useful when we want to obtain good estimators with respect to multiple weight functions (e.g., for IP flow datasets we are interested in the count of distinct flows and in total bandwidth).

Footnote 2: Estimators for the inverse weight are useful for obtaining unbiased estimates for quantities where the weight appears in the denominator. These include the weight ratio of two different subsets, the set resemblance of two subsets, and the average weight of a subset.

Footnote 3: Szegedy's proof applies only to estimators based on adjusted weight assignments. It also does not apply to estimators on the weight of subpopulations.

3. MAINTAINING SKETCHES

Sketches are produced for each subset of interest in a collection of subsets over a ground set of items. The algorithms for constructing sketches are application-dependent, but on a high level, sketches are constructed using an incremental process, where a current sketch is maintained for each subset of interest, and the sketch is updated when new information (an item, or an item and rank value) is presented. We identify two operations on the current sketch: a test operation that checks whether incorporating the new information causes a modification of the current sketch, and an update operation, which is a modification of the current sketch. We make the distinction between test and update because, as a general rule, applications require more tests than updates, and in some applications updates are costlier than tests.

We consider the time bounds of constructing k-mins and bottom-k sketches for two representative classes of applications. We show that when subsets are represented explicitly (each occurrence of an item in a subset is specified), it is much more efficient to construct bottom-k sketches. This point, for uniform weights, was already noted in [4, 22]. We review it and extend the analysis to weighted items. For implicit representation of the subsets, via a graph, we show that the time bounds for generating the two types of sketches are comparable.

3.1 Explicit representation of subsets

Examples of applications with explicit specification are [3, 2, 22, 23]. Among these are market-basket data, Web duplicate analysis and more.

To construct a k-mins sketch for a subset, we maintain a current sketch $(m_1, \ldots, m_k)$ of the smallest rank value observed so far for each of the $k$ rank functions (along with attributes of the items with smallest rank). Initially, $m_j = +\infty$ for $j = 1, \ldots, k$. When an item $i$ is processed we compute $r^{(1)}(i), r^{(2)}(i), \ldots, r^{(k)}(i)$. We then update the sketch so that $m_j \leftarrow \min\{m_j, r^{(j)}(i)\}$. Therefore, the processing time for each occurrence of an item in a subset is $\Theta(k)$ (it is $\Theta(k)$ time for both the test and update operations).

To construct a bottom-k sketch, we use a current sketch that contains the $k$ smallest rank values observed so far, $m_1 < m_2 < \cdots < m_k$, as a sorted list. When an item $i$ is processed, we compute $r(i)$, which is compared to $m_k$ (test operation).
If $r(i) < m_k$, the rank value $m_k$ (and corresponding item) is deleted from the list and $r(i)$ is inserted (update operation). A test operation takes $O(1)$ time and an update takes $O(\log k)$ time. Therefore, the time bound for generating a sketch for a subset of size $s$ is $O(sk)$ for a k-mins sketch and $O(s \log k)$ for a bottom-k sketch.

We next show that for uniform weights the expected number of update operations while constructing a bottom-k sketch of a set of size $s$ is $O(k \log s)$. This implies a better bound of $O(s + k \log s \log k)$ on the expected running time to generate a bottom-k sketch.

Lemma 3.1. If items have uniform weights then the expected number of updates to a bottom-k sketch of a set of size $s$ is at most about $k \ln s$.

PROOF. A presented item triggers an update of the current sketch if and only if it has one of the bottom-k ranks among items presented so far. If $j$ items were presented so far, the probability of that happening is $\min\{1, k/j\}$. Summing over all positions in the presentation order we obtain that the expected number of updates is at most $\sum_{j=1}^{s} k/j \approx k \ln s$.

For weighted items we consider two cases. First is the case where items are presented in an order determined by a random permutation.

Lemma 3.2. If items are presented in random order then the expected number of updates to a bottom-k sketch of a set of size $s$ is at most about $k \ln s$.

PROOF. Fix the rank assignment. The probability that the $j$th item in the presentation order has one of the $k$ smallest ranks of the first $j$ items is $\min\{1, k/j\}$. Continue as in the proof of Lemma 3.1.

From Lemma 3.2 it follows that if items are weighted and are presented in random order, the bottom-k sketch is constructed in $O(s + k \log k \log s)$ expected time. To bound the number of updates when items are presented in an arbitrary order, we need the rank assignment to define a close to random permutation of the items if weights are, say, within a factor of two from each other. This will hold if the rank functions satisfy the following property.

Definition 3.3.
A family of rank functions is c-moderate if for any $w > 0$ and $0 < w' \le 2w$, there is probability at least $1/c$ that an item drawn according to $f_{w'}$ has a larger rank than an item drawn according to $f_w$.

If the family of rank functions is c-moderate for some constant $c$ and the weights of all items are within a factor of two from each other, then the probability that the rank of a particular item, say $i$, is among the $k$ smallest ranks is at most $ck/j$, where $j$ is the number of items (see footnote 4). One can check that exponential ranks are 3-moderate and priority ranks are 4-moderate.

Lemma 3.4. If items are weighted and presented in arbitrary (worst-case) order, and the family of rank functions is c-moderate for some constant $c$, then the expected number of updates of the bottom-k sketch of a set of size $s$ is $O(k \log(\max_i w(i)/\min_i w(i)) \log s)$.

PROOF. Consider a partition of the items into $\log(\max_j w(j)/\min_j w(j))$ groups according to the weight, so that items of weight in $[2^{i} \min_j w(j), 2^{i+1} \min_j w(j)]$ are in the same group. We bound the number of updates within one group. From the fact that the rank assignment is c-moderate it follows that the probability of the $j$th presented item in a group to be within the bottom-k items presented so far from its group is at most $ck/j$, and hence the expected number of updates within a group is at most $ck \ln s$. The statement of the lemma follows by summing over all groups.

From Lemma 3.4 it follows that if weighted items are presented in arbitrary order, and the set of rank functions is c-moderate for some constant $c$, then we build the bottom-k sketch in $O(s + k \log(\max_i w(i)/\min_i w(i)) \log s \log k)$ expected time.

3.2 Graph representation of subsets

In some applications, items and locations are embedded in a graph or a metric space and subsets correspond to all items in a certain neighborhood or the reachability set of a node [9, 3, 5].
The computation of the sketches is performed concurrently for all subsets, with items and ranks being propagated in a controlled way such that an item is tested for a subset only if it is fairly likely to occur in the sketch of the subset, and the number of test operations is much smaller than with an explicit representation.

Footnote 4: To see that, replace item $i$ by $c$ duplicates, and consider a random permutation of the new set of items and the probability that one of the duplicates is among the bottom-k. This probability is smaller than $ck/j$ and larger than the probability that item $i$ is among the bottom $k$.

We review the computation of sketches for reachability sets of nodes in a graph [9]. In this application each node is an item. Each node computes the sketch of its reachability set. Rank values (and associated information) are propagated using a graph traversal method such as breadth-first or depth-first search. When a rank value does not result in an update at a node, the propagation of the rank value is halted at that node. Therefore, the number of test operations is at most $m/n$ times the number of update operations, where $m$ is the number of edges and $n$ the number of nodes.

For k-mins sketches, each item and a rank value associated with it are propagated separately (therefore, $k$ truncated traversals are performed for each item). If, within each rank assignment, items are propagated in increasing rank order, then the combined number of updates for all subsets is $n$. Therefore, the total number of updates, for all $k$ rank assignments and subsets, is $O(kn)$ and the number of tests (and total time) is $O(km)$ [9].

Bottom-k sketches are computed by propagating each item and its associated rank using a truncated graph traversal (note that, in contrast to k-mins sketches, one traversal is performed for each item). The current sketch at a node is updated when an item arrives and its rank value is smaller than the $k$th smallest current rank at the node. The traversal is halted at nodes where the item did not result in an update of the current sketch. When items are presented in increasing rank order, items can only be appended to bottom-k sketches and it is never necessary to remove an item. Therefore, the total number of updates is $O(kn)$ and the total number of tests (and total time) is $O(km)$. These bounds are the same as the bounds obtained for k-mins sketches.

Arbitrary order. When items are not presented ordered by their ranks [3], the number of update operations increases. Similarly to Lemma 3.1 and Lemma 3.4 we prove the following.

Lemma 3.5. Suppose we maintain the minimum rank in a subset of size $s$.
Then, if items have uniform weights and are presented in a fixed but arbitrary order, or if items are weighted and presented in a random order, the expected number of updates to the minimum rank is about $\ln s$. If items are weighted and presented in a fixed but arbitrary order and the family of rank functions is c-moderate, the expected number of updates is $O(\log(\max_i w(i)/\min_i w(i)) \log s)$.

It follows that the total number of updates when computing k-mins sketches of all reachability sets is $O(kn \log n)$ for uniform weights and for weighted items presented in random order, and $O(kn \log(\max_i w(i)/\min_i w(i)) \log n)$ for weighted items presented in arbitrary order. We perform a test or update in $O(1)$ time and the number of tests is at most $m/n$ times the number of updates. Therefore, the total time is $m/n$ times the number of updates.

The number of updates for bottom-k sketches is given in Lemmas 3.1, 3.2, and 3.4. Each update takes $O(\log k)$ time, and a test takes $O(1)$ time. The number of tests is $m/n$ times the number of updates. Therefore, the total time is $O(\log k + m/n)$ times the number of updates given in each of these lemmas.

4. ALL-DISTANCES SKETCHES

An all-distances sketch is an encoding of plain sketches of all neighborhoods of a certain location $q$. For a given distance $d$, the sketch for the $d$-neighborhood of the location can be retrieved from the all-distances sketch.

We review k-mins all-distances sketches and introduce bottom-k all-distances sketches. We consider the size of the all-distances sketches, their construction time, and the time it takes to retrieve the sketch of a particular distance. We consider incremental construction, where current all-distances sketches are maintained and updated upon the arrival of new information (item, distance, rank). The operations we consider are a test that determines if the current sketch needs to be modified when new information arrives, an update of the current sketch, and a distance query issued to the final sketch. The distance query retrieves from the all-distances sketch the plain sketch for the neighborhood of the location $q$ specified by the query distance.
We show that the expected size of the representation of the all-distances bottom-k sketch matches that of the k-mins sketch. When subsets are represented explicitly, the computation time of the all-distances bottom-k sketches is about a factor of $k$ faster than that of the all-distances k-mins sketches. When subsets are represented via a graph, the construction times are comparable.

All-distances k-mins sketches: We review all-distances k-mins sketches. Consider a single rank assignment. An MV/D list of a location $q$ (Minimum Value/Distance list) encodes the minimum rank in any neighborhood (query distance) of $q$ in a compact way. It is a list of triples where each triple contains an item $e$, its rank, and its distance from $q$. An item $e$ is in the MV/D list of $q$ if there is no item with smaller rank closer to $q$. The MV/D list is sorted in increasing distance and decreasing rank order. For a query distance $d$, the smallest-rank item in the MV/D list of $q$ that lies at distance at most $d$ from $q$ is the item of smallest rank in the $d$-neighborhood of $q$. The expected size of the list depends on the rank function and on the weight distribution of the items.

Lemma 4.1. The size of an MV/D list of $n$ weighted items from a location $q$ is bounded as follows:
1. When weights are uniform, the expected size is $O(\log n)$ [9].
2. If weights are arbitrary but items are assigned to locations at random, then the expected size, over assignments of items to locations and over rank assignments, is $O(\log n)$.
3. If items have arbitrary weights and are placed in arbitrary locations, and ranks are assigned using a c-moderate family of rank functions for some constant $c$, then the expected size is $O(\log(\max_i w(i)/\min_i w(i)) \log n)$.

PROOF. Fix the rank assignment. Order the locations in increasing distance from $q$. The assignment of items to locations defines a random permutation of the ranks. Therefore, the probability that the rank value in location $j$ is smaller than the rank values in all closer locations (and therefore the item occurs on the MV/D list) is $1/j$.
By summing over all positions, we obtain that the expected size of the MV/D list is $\sum_{j=1}^{n} 1/j \approx \ln n$.

If the relation of the weights and the locations of items is arbitrary, the expected size of the MV/D list depends on the location of items: if item weights are decreasing with distance then the expected size of the MV/D list is smaller, and if item weights are increasing with distance then the expected size is larger (it can be linear in the worst case). The worst-case size of the MV/D list, however, can be bounded by the weight distribution of the items. The proof of the following lemma is similar to that of Lemma 3.4.

Lemma 4.2. If items have arbitrary weights and are placed in arbitrary locations, and ranks are assigned using a c-moderate family of rank functions for some constant $c$, the expected size of the MV/D list is $O(\log(\max_i w(i)/\min_i w(i)) \log n)$.

PROOF. Let $w_0 = \min_i w(i)$. Consider a partition of the items so that all items with weight in $[w_0 2^i, w_0 2^{i+1})$ are in group i, for $i = 0, \ldots, \log_2(\max w(i)/\min w(i))$. By the property of c-moderate rank functions, the expected number of items from each group that appear on the MV/D list is logarithmic in its size. Therefore, the total expected number of items on the MV/D list is bounded by $2 \ln n \,(1 + \ln(\max w(i)/\min w(i)))$.

The MV/D list can be constructed incrementally: When presented with a new item, its rank, and distance, the list is updated only if the new item has smaller rank than all items on the list that have the same or smaller distance. If items are presented in order of increasing rank (or increasing (distance, rank) in lexicographic order), then items are never removed from the list during updates [9]. Other orders of presenting items were analyzed in [3]. We summarize and extend these results in the following lemma.

Lemma 4.3. Assume that we construct an MV/D list of a location q, and there are n weighted items. Then,
1. When items are presented in random order and there are uniform weights, the expected number of updates is O(log^2 n) [3].
2. If items are assigned to locations at random, the expected number of updates to the MV/D list, over assignments of items to locations, rank assignments, and presentation order of items is O(log^2 n).
3. If ranks are assigned using a c-moderate family of rank functions for some constant c, then the expected number of updates to the MV/D list, over rank assignments, and presentation order of items is O(log(max w(i)/min w(i)) log^2 n).

All-distances bottom-k sketches: An all-distances bottom-k sketch encodes the bottom-k items in a neighborhood defined by any query distance from a location q. The all-distances bottom-k sketch is a data structure that generalizes a single MV/D list. An item i, its rank value r(i), and distance d(i) are represented in the sketch if and only if the item has one of the bottom-k ranks in the d(i)-neighborhood of the location.
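The membership condition just stated can be checked directly from the definition. The following is a minimal brute-force sketch, not the construction analyzed in the paper: the function name `all_distances_bottom_k` and the `(item_id, rank, distance)` triple layout are illustrative assumptions, ranks are assumed distinct, and the running time is quadratic. With k = 1 it returns the entries of a single MV/D list.

```python
def all_distances_bottom_k(items, k):
    """Return the triples belonging to the all-distances bottom-k sketch.

    items: list of (item_id, rank, distance) triples (hypothetical layout).
    An item is kept iff fewer than k items at distance <= its own distance
    have a strictly smaller rank, i.e. it is one of the bottom-k ranks
    in its own neighborhood.
    """
    sketch = []
    for item_id, rank, dist in sorted(items, key=lambda t: (t[2], t[1])):
        # Count items in the dist-neighborhood with a smaller rank.
        smaller = sum(1 for (_, r, d) in items if d <= dist and r < rank)
        if smaller < k:
            sketch.append((item_id, rank, dist))
    return sketch


# Toy input: five items with made-up ranks and distances.
items = [("a", 0.5, 1.0), ("b", 0.2, 2.0), ("c", 0.3, 3.0),
         ("d", 0.1, 4.0), ("e", 0.9, 5.0)]
mvd_list = all_distances_bottom_k(items, 1)   # k = 1: a single MV/D list
bottom_2 = all_distances_bottom_k(items, 2)
```

The output is sorted by increasing distance, which matches the ordering used for the MV/D list.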
It is convenient to think of the all-distances bottom-k sketch as a list of lists arranged by increasing distance. For each distance d where the set of bottom-k items within distance d changes, we record the list of bottom-k items within this distance. This list is valid until the next distance for which there is a change. The list of lists representation, however, is not storage efficient, since all but one item are repeated in two consecutive lists. This sketch can be more compactly represented if we only record the changes to the list. In Section 5 we discuss compact representations for an all-distances bottom-k sketch that require storage proportional to the number of distances where the bottom-k set changes.

We bound the number of distances for which the bottom-k list changes. These bounds imply that the storage for an all-distances bottom-k sketch is comparable to the storage for k MV/D lists in an all-distances k-mins sketch.

Lemma 4.4. Consider an all-distances bottom-k sketch for n items of a location q. We bound the expected number of distances from q where the set of bottom-k items changes.
1. For uniform weights, the expected number of distances is O(k log n).
2. For a set of items with arbitrary weights that are randomly assigned to locations the expected number of distances (over assignments of items to locations, and over rank assignments) is O(k log n).
3. If items have arbitrary weights and are placed in arbitrary locations and ranks are assigned using a c-moderate family of rank functions for some constant c, the expected number of distances is O(k log(max w(i)/min w(i)) log n).

PROOF. Order the items by increasing distance from q. Let d(j) be the distance of the jth item in this order from q. The jth item is in the bottom-k set of items within distance d(j) from q if it is one of the k smallest-ranked items among the j closest items to q. Since weights are uniform, the ranks define a random permutation of the items which is independent of their distances to q. So the jth item is among the smallest k with probability min{k/j, 1}.
Summing over all items we obtain that the expected number of items which are among the k smallest items within their distance from q is at most $\sum_{j=1}^{n} k/j \approx k \ln n$.

As in Lemma 3.1 and Lemma 3.4, for weighted items we can show the following.

Lemma 4.5.
1. For a set of items with arbitrary weights and a set of locations, the expected number of distances from a location q where the set of bottom-k items changes, over assignments of items to locations, and over rank assignments is O(k log n).
2. If items have arbitrary weights and are placed in arbitrary locations and ranks are assigned using a c-moderate family of rank functions for some constant c, the expected number of distances from a location q where the set of bottom-k items changes is O(k log(max w(i)/min w(i)) log n).

If items are presented in order of increasing distances from q we can obtain a bottom-k list for the current distance from the bottom-k list of the previous distance by doing an insertion and a deletion. Similarly, if items arrive sorted by rank value, then the number of updates to the bottom-k sketch is proportional to the size (number of breakpoint distances) of the sketch. We can also bound the number of updates performed if items arrive in a random order.

Lemma 4.6. Consider the expected number of updates that is performed in an incremental construction of an all-distances bottom-k sketch of a location q when items are presented in a random order (the order is a random permutation).
1. When item weights are uniform, the expected number of updates is O(k log^2 n).
2. When items have arbitrary weights, the expected number of updates over assignments of weights to locations, over rank assignments, and arrival order, is O(k log^2 n).
3. When items have arbitrary weights, and the family of rank functions is c-moderate, the expectation over rank assignments and arrival orders of the number of updates is O(k log(max w(i)/min w(i)) log^2 n).

PROOF. Consider uniform weights (Part 1). An item would result in an update if at the time it is presented, it has one of the k smallest ranks amongst items already presented that are at least

as close to q. Consider the jth closest item to q. It has probability 1/j of having the ith rank among all items that are at least as close to the location. We now calculate the probability that the item results in an update given that it has the ith rank. Consider the i − 1 items that have smaller ranks and are at least as close. The probability that at most k − 1 of them are presented before our item is that of being in one of the first k positions in a random permutation of i items, which is min{k/i, 1}. We obtain that the expected number of updates for the jth closest item is $\sum_{i=1}^{j} \min\{k/i, 1\}/j \le (1/j) \sum_{i=1}^{j} k/i \approx (k/j) \ln j$. Summing over all n items, we obtain that the expected number of updates is $\sum_{j=1}^{n} (k/j) \ln j \le k \ln^2 n$. The proof of Part 2 and Part 3 follows by an argument as for Lemma 3.1 and Lemma 3.4.

As in the case of a single sketch in Section 3, the number of test operations depends on the representation of the subsets. If this representation is explicit then, since a k-mins sketch consists of k independent MV/D lists, the number of tests required for a k-mins sketch is larger by a factor of k than for a bottom-k sketch. In a graph representation, the number of tests is at most (m/n) times the number of updates for both kinds of sketches. In Section 5 we discuss representations of sketches that allow efficient implementations of test and update operations.

5. REPRESENTATIONS OF SKETCHES

We consider possible representations for k-mins sketches and bottom-k sketches. We are interested in bounding the size of the data structure that encodes the sketch, and the time required to incrementally construct the sketch when items are presented in sorted or other orders. For all-distances sketches we also consider the time it takes to find the sketch for a particular query distance.

Representation of an MV/D list: An efficient data structure for MV/D list construction and querying was not explicitly discussed in earlier works.
If items arrive sorted, by increasing rank value or increasing distance, we represent an MV/D list sorted by increasing distances (and decreasing ranks) as a binary search tree. With this representation we can support distance queries in expected O(log M) time, where M is the expected size of the list. If items do not arrive in a sorted order, we represent the current MV/D list as a dynamic binary search tree. Test operations then require expected O(log M) time. An update is performed in O(log M) expected amortized time: Each item requires an insertion to the tree if it has the smallest rank within its distance from the query location, and possibly a series of deletions of items which are further away from the query location and of larger rank. Since each item can be deleted at most once, we can charge each deletion to the respective insertion.

The all-distances k-mins sketch consists of k independent MV/D lists, one for each rank assignment. Therefore, for any query distance, we can obtain the min-rank sketch over the items that lie within that distance in O(k log M) time, by searching independently in each of the k lists. The query time can be improved to O(k + log M) using fractional cascading [7]. Using fractional cascading, we perform a binary search only on one list and use links between items to find the position in the next list in O(1) time. Another approach to obtain an O(k + log M) bound per query is to use an interval tree or a segment tree (see e.g. [6]) to represent the kM intervals defined by consecutive points on the same list. We can then do stabbing queries to find the k intervals of a query distance, which correspond to the min-rank in that neighborhood in each of the k rank functions.

Constructing and querying the bottom-k sketch: A natural representation for a single bottom-k sketch is a list of the items sorted by increasing ranks represented as a search tree, as mentioned in Section 3.
However, for an all-distances bottom-k sketch one needs to be more careful so that the size of the representation is proportional to the number of distances where the list changes, as mentioned in Section 4. We suggest possible efficient representations for an all-distances bottom-k sketch.

Ordered insertion of items: When items are presented in an order related to their distances or ranks, we can use the following data structures.

If items are presented in order of increasing distances from q we can obtain a bottom-k list for the current distance from the bottom-k list of the previous distance by doing an insertion and a deletion. If we use a persistent list [7] to represent each bottom-k list, then we can update a bottom-k list to obtain the next one in O(k) time while consuming only O(1) space. We can reduce the update time to O(log k) by using persistent search trees instead of persistent lists; the space required per operation is still O(1).

We can also construct the bottom-k all-distances sketch if items are presented in order of increasing ranks so that it takes space proportional to the number of updates. We construct the first list after the k items with smallest ranks are presented. This list is associated with the distance of the item among these k which is furthest from the query location q. When the next item arrives, say item j, if item j is closer to q than the furthest item in the most recently constructed list, we construct a new bottom-k list L'. That is, assume that the previous list L which we constructed was associated with distance d > d(j). We construct L' from L by deleting from L the item at distance d from q and adding item j instead. The distance associated with L' is the distance of the furthest item in L' from q. Using persistent lists or persistent search trees to represent the bottom-k lists we construct all lists in space which is proportional to the number of updates. The update time is O(k) with persistent lists and O(log k) with persistent trees (we keep the items in each list sorted by increasing distances from q).
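The increasing-distance construction can be illustrated with a short sketch that records only the changes to the bottom-k list. This is an illustration under stated assumptions rather than the persistent-structure representation described above: a plain change log replaces the persistent list or tree, queries replay the log linearly, ranks are assumed distinct, and the names `build_change_log` and `bottom_k_at` are hypothetical.

```python
import heapq


def build_change_log(items, k):
    """Scan items by increasing distance and log each change to the bottom-k set.

    items: iterable of (item_id, rank, distance) triples (hypothetical layout).
    Returns a list of (distance, added_id, evicted_id_or_None) entries, so the
    storage is proportional to the number of changes.
    """
    log = []
    heap = []  # max-heap on rank, implemented by negating ranks
    for item_id, rank, dist in sorted(items, key=lambda t: t[2]):
        if len(heap) < k:
            heapq.heappush(heap, (-rank, item_id))
            log.append((dist, item_id, None))
        elif rank < -heap[0][0]:
            # One deletion (current largest rank) and one insertion.
            _, evicted = heapq.heapreplace(heap, (-rank, item_id))
            log.append((dist, item_id, evicted))
    return log


def bottom_k_at(log, d):
    """Replay the change log up to query distance d (linear scan, for clarity)."""
    current = set()
    for dist, added, evicted in log:
        if dist > d:
            break
        current.add(added)
        if evicted is not None:
            current.discard(evicted)
    return current


items = [("a", 0.5, 1.0), ("b", 0.2, 2.0), ("c", 0.3, 3.0),
         ("d", 0.1, 4.0), ("e", 0.9, 5.0)]
log = build_change_log(items, k=2)
```

Each update costs O(log k) heap time here; a persistent search tree would additionally allow a query at distance d without replaying the log.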
Insertion of items in arbitrary order: To support arbitrary insertion order, we can think of the all-distances bottom-k sketch as a set of intervals on a line. Each item corresponds to an interval over the range of distances in which it is a bottom-k item. Let D be the current set of intervals. A query is a point stabbing query: the bottom-k list consists of the set of intervals in D intersecting the query point. When a new item z arrives at distance d we should figure out if the sketch should be updated. Let $I_1 = [d_1, d_2)$ be the interval spanning distance d with the largest rank. We should update the sketch if the rank of z is smaller than the rank of the item corresponding to $I_1$. We update the sketch as follows. We replace $I_1$ with $I_1' = [d_1, d)$. Then we find the interval $I_2 = [d_2, d_3)$ with largest rank at distance $d_2$. If the rank of $I_2$ is larger than the rank of z we delete $I_2$, and we continue in the same way, finding for i > 2 the interval $I_i$ of largest rank at distance $d_i$, and deleting $I_i$ if the rank of the corresponding item is larger than the rank of z. Let $d_j$ be the right endpoint of the last interval which we deleted. We insert the interval $[d, d_j)$ corresponding to item z.

Since each interval is inserted and deleted once, the total number of insertions and deletions of intervals is proportional to the number of intervals. An interval I may split many times. However, each split of I is associated with a newly inserted interval immediately following I. Since each inserted interval may cause at most one split, the total number of splits is also proportional to the total number of intervals.
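The test-and-update logic for arbitrary arrival order can be sketched without the interval machinery. The code below is a deliberately simplified illustration and not the interval-based procedure above: it keeps only the (item, rank, distance) entries currently in the sketch in a plain Python list, so each operation costs time linear in the sketch size. The names `insert_item` and `build_in_order` are hypothetical, and ranks are assumed distinct. It relies on the fact that the k smallest-ranked items of any neighborhood are themselves in the sketch, so counting within the sketch is enough.

```python
def insert_item(sketch, k, item_id, rank, dist):
    """Insert one item into an all-distances bottom-k sketch; return True on update.

    sketch: list of (item_id, rank, distance) entries, modified in place.
    """
    def dominated(r, d, entries):
        # True if at least k entries within distance d have a smaller rank.
        return sum(1 for (_, r2, d2) in entries if d2 <= d and r2 < r) >= k

    # Test: the new item must be among the bottom-k of its own neighborhood.
    if dominated(rank, dist, sketch):
        return False
    # Update: add the item, then drop entries it pushes out of their bottom-k.
    candidate = sketch + [(item_id, rank, dist)]
    sketch[:] = [e for e in candidate if not dominated(e[1], e[2], candidate)]
    return True


def build_in_order(items, k):
    """Insert items in the given order; return the sorted ids left in the sketch."""
    sketch = []
    for item_id, rank, dist in items:
        insert_item(sketch, k, item_id, rank, dist)
    return sorted(e[0] for e in sketch)


items = [("a", 0.5, 1.0), ("b", 0.2, 2.0), ("c", 0.3, 3.0),
         ("d", 0.1, 4.0), ("e", 0.9, 5.0)]
forward = build_in_order(items, 2)
backward = build_in_order(list(reversed(items)), 2)
```

The final sketch does not depend on the arrival order, only the number of updates does. With k = 1 the same routine maintains a single MV/D list.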

To support these interval operations, we can maintain the intervals either in a dynamic interval tree or in a dynamic segment tree [8]. Let M denote the number of intervals in the tree. A dynamic interval tree takes O(M) space, and using it we can report the k intervals stabbed at a particular distance in O(log(M) log(k) + k) time. We can update an interval tree in O(log(M) log(k)) amortized time. A dynamic segment tree requires O(M log M) space and supports queries in O(log(M) + k) time and updates in O(log(M) log(k)) amortized time.

By a standard modification to an interval tree in which we store at every secondary node the item of maximum rank in its subtree we can find the interval of maximum rank stabbed by a query distance in O(log(M) log(k)) time. Similarly, by maintaining at each node of a segment tree the maximum rank interval that it contains we can find the maximum rank interval stabbed by a query distance in O(log(M)) time. This allows us to test if the bottom-k sketch changes when a new item arrives in polylogarithmic time. (This is in contrast with O(k log(n)) time for k independent MV/D lists that form a k-mins all-distances sketch.)

6. MIMICKED SAMPLING WITH REPLACEMENT

We present a randomized procedure that uses a WS-sketch (weighted sampling without replacement until k items are obtained) to emulate weighted sampling with replacement. Using this process, we can derive a size-k WSR-sketch from a size-k WS-sketch. By mimicking we mean that the probability to obtain a particular sketch by first obtaining a WS-sketch and then applying the procedure is the same as when directly obtaining a WSR-sketch.

The process is described as generating a sequence of items (and rank values). The process is randomized and therefore every WS-sketch b corresponds to a distribution M(b) over such sequences. If we stop the process after k samples, we obtain a WSR-sketch. We can use a different stopping rule and continue until the (k + 1)st distinct item is sampled. We refer to a weighted sample with replacement with this stopping rule as a WSRD-sketch.
The WSRD-sketch contains the same set of items as the WS-sketch but also has a count for each item that corresponds to the number of times the item is sampled until the process is stopped.

Mimicking allows us to apply an estimator ν designed for WSR-sketches or WSRD-sketches to WS-sketches. A WS-sketch estimator can be obtained by drawing a mimicked sketch $s \in M(b)$ using this process and returning ν(s). This estimator is equivalent to using the estimator ν on WSR or WSRD-sketches. The estimator $\nu'(b) = E(\nu(s) \mid s \in M(b))$ has lower variance (a consequence of Lemma 2.3). It can be approximated by taking the average of ν(s) over multiple draws of $s \in M(b)$; this approximation preserves unbiasedness. A lower variance estimator (another consequence of Lemma 2.3) is obtained by considering the subspace L(b) of WS-sketches with the same subset of items as b and, if w(j) is not provided, the same rank value $r_{k+1}$. L(b) is an equivalence relation that defines a partition of the sample space. The estimator $\nu''(b) = E(\nu'(b') \mid b' \in L(b))$ can be approximated by averaging ν(b') over multiple draws of $b' \in L(b)$.

We first provide a mimicking process when the total weight w(I) of the ground set is known. Let $i_1, \ldots, i_k$ be the items in the WS-sketch b, ordered by increasing ranks. The first item in the mimicked sample is $i_1$. We then select $i_1$ with probability $w(i_1)/w(I)$ and $i_2$ otherwise, and repeat this until we have k samples or until $i_2$ is selected. In phase j, after outputting at least one sample of each of $i_1, \ldots, i_j$, we select $i_l$ with probability $w(i_l)/w(I)$ (for $l \le j$) and $i_{j+1}$ otherwise. Each phase can be simulated efficiently using the geometric distribution to determine the number of samples until the next item from b is sampled and the multinomial distribution to determine the number of times each item is sampled.

We now provide a mimicking procedure when w(I) is not known. The procedure is applied to an ordered sketch where all items have rank values. We use properties of the exponential distribution and the ranks of the items in the WS-sketch.
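The mimicking process for a known total weight can be sketched as a direct draw-by-draw simulation. This is a minimal illustration under assumptions: the function name `mimic_wsr`, the dictionary of weights, and the `rng` parameter are hypothetical, and each draw is simulated individually instead of using the geometric and multinomial shortcut mentioned above. The stopping rule is the WSR one (stop after k samples).

```python
import random


def mimic_wsr(sketch_items, weights, total_weight, k, rng=random):
    """Generate k mimicked with-replacement samples from a WS-sketch.

    sketch_items: the k sketch items i_1..i_k, ordered by increasing rank.
    weights: dict mapping each sketch item to its weight w(i).
    total_weight: the known total weight w(I) of the ground set.
    """
    samples = [sketch_items[0]]  # the first mimicked sample is always i_1
    j = 1                        # phase: i_1..i_j have each been output
    while len(samples) < k:
        u = rng.random() * total_weight
        acc = 0.0
        chosen = None
        # Select i_l (l <= j) with probability w(i_l) / w(I).
        for item in sketch_items[:j]:
            acc += weights[item]
            if u < acc:
                chosen = item
                break
        if chosen is None:
            # Otherwise the next not-yet-seen sketch item i_{j+1} is output.
            chosen = sketch_items[j]
            j += 1
        samples.append(chosen)
    return samples


# Hypothetical example: three sketch items from a ground set of total weight 10.
draw = mimic_wsr(["a", "b", "c"], {"a": 1.0, "b": 2.0, "c": 3.0}, 10.0, 3,
                 random.Random(0))
```

Because j never exceeds the number of samples already produced, the index `sketch_items[j]` stays within the k sketch items as long as the loop runs.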
We first establish a few lemmas about the distribution of the differences between the ranks of the items in a WS-sketch. The first lemma follows from the memoryless nature of the exponential distribution.

Lemma 6.1. Consider a subspace of rank assignments where the order of the items according to rank values is fixed, say $i_1, \ldots, i_n$, and the rank values of the first j items are fixed. Let $r(i_{j+1})$ be the random variable that is the (j + 1)st smallest rank. The conditional distribution of $r(i_{j+1}) - r(i_j)$ is exponential with parameter $\sum_{h=j+1}^{n} w(i_h)$.

PROOF. Since rank values of different items are independent, the probability density for the event: items $i_1, \ldots, i_j$ have the bottom-j ranks with the values $r(i_1) < \cdots < r(i_j)$ and items $i_{j+1}, \ldots, i_n$ have the next n − j smallest ranks in that order, is the product $p_1 p_2$ where

$$p_1 = w(i_1)\exp(-r(i_1)w(i_1))\, w(i_2)\exp(-r(i_2)w(i_2)) \cdots w(i_j)\exp(-r(i_j)w(i_j))$$

(the probability density that the items $i_1, \ldots, i_j$ have the rank values $r(i_1), \ldots, r(i_j)$) and

$$p_2 = \int_{r(i_j)}^{\infty} w(i_{j+1})\exp(-x_{j+1}w(i_{j+1})) \int_{x_{j+1}}^{\infty} w(i_{j+2})\exp(-x_{j+2}w(i_{j+2})) \cdots \int_{x_{n-1}}^{\infty} w(i_n)\exp(-x_n w(i_n))\, dx_n \cdots dx_{j+2}\, dx_{j+1}$$

is the probability density that items $i_{j+1}, \ldots, i_n$ have rank values in that order and all larger than $r(i_j)$. Performing the integration, we obtain that

$$p_2 = p_3 \exp\Big(-r(i_j) \sum_{h=j+1}^{n} w(i_h)\Big),$$

where

$$p_3 = \frac{w(i_{j+1})}{\sum_{h=j+1}^{n} w(i_h)} \cdot \frac{w(i_{j+2})}{\sum_{h=j+2}^{n} w(i_h)} \cdots \frac{w(i_{n-1})}{w(i_{n-1}) + w(i_n)}.$$

($p_3$ is the probability that the rank values of items $i_{j+1}, \ldots, i_n$ are in that order and $\exp(-r(i_j) \sum_{h=j+1}^{n} w(i_h))$ is the probability that the minimum rank among $i_{j+1}, \ldots, i_n$ is at least $r(i_j)$.) Therefore, the probability density is

$$p_1 p_2 = p_1 p_3 \exp\Big(-r(i_j) \sum_{h=j+1}^{n} w(i_h)\Big). \qquad (1)$$

We next calculate the probability density for the following event: items $i_1, \ldots, i_n$ have increasing ranks, the bottom-j ranks are equal to $r(i_1) < \cdots < r(i_j)$, and the (j + 1)st rank has value $r(i_j) + d$. It follows from independence of the rank values that the probability density is