Distinct Sampling on Streaming Data with Near-Duplicates*


Jiecao Chen
Indiana University Bloomington
Bloomington, IN, USA

ABSTRACT

In this paper we study how to perform distinct sampling in the streaming model where data contain near-duplicates. The goal of distinct sampling is to return a distinct element uniformly at random from the universe of elements, given that all the near-duplicates are treated as the same element. We also extend the result to the sliding window cases in which we are only interested in the most recent items. We present algorithms with provable theoretical guarantees for datasets in the Euclidean space, and also verify their effectiveness via an extensive set of experiments.

1 INTRODUCTION

Real world datasets are always noisy; imprecise references to the same real-world entities are ubiquitous in business and scientific databases. For example, YouTube contains many videos of almost the same content; they appear to be slightly different due to cuts, compression and change of resolutions. A large number of webpages on the Internet are near-duplicates of each other. Numerous tweets and WhatsApp/WeChat messages are re-sent with small edits. This phenomenon makes data analytics more difficult. It is clear that direct statistical analysis on such noisy datasets will be erroneous. For instance, if we perform standard distinct sampling, then the sampling will be biased towards those elements that have a large number of near-duplicates. On the other hand, due to the sheer size of the data it becomes infeasible to perform a comprehensive data cleaning step before the actual analytic phase.

In this paper we study how to process datasets containing near-duplicates in the data stream model [4, 23], where we can only make a sequential scan of data items using a small memory space before the query-answering phase. When answering queries we need to treat all the near-duplicates as the same universe element.
This general problem was recently proposed in [9], where the authors studied the estimation of the number of distinct elements of the data stream (also called $F_0$). In this paper we extend this line of research by studying another fundamental problem in the data stream literature: distinct sampling (a.k.a. $\ell_0$-sampling), where at the time of query we need to output a random sample among all the distinct elements of the dataset. $\ell_0$-sampling has many applications that we shall mention shortly.

We remark, as also pointed out in [9], that we cannot place our hope on a magic hash function that can map all the near-duplicates into the same element and otherwise into different elements, simply because such a magic hash function, if it exists, needs a lot of bits to describe.

Qin Zhang
Indiana University Bloomington
Bloomington, IN, USA
qzhangcs@indiana.edu

*Both authors are supported by NSF CCF and IIS grants.
PODS '18, June 10-15, 2018, Houston, TX, USA. © 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.

The Noisy Data Model and Problems. Let us formally define the noisy data model and the problems we shall study. In this paper we will focus on points in the Euclidean space. More complicated data objects such as documents and images can be mapped to points in their feature spaces. We first introduce a few concepts (first introduced in [9]) to facilitate our discussion.
Let $d(\cdot, \cdot)$ be the distance function of the Euclidean space, and let $\alpha$ be a parameter (distance threshold) representing the maximum distance between any two points in the same group.

Definition 1.1 (data sparsity). We say a dataset $S$ is $(\alpha, \beta)$-sparse in the Euclidean space for some $\beta \ge \alpha$ if for any $u, v \in S$ we have either $d(u, v) \le \alpha$ or $d(u, v) > \beta$. We call $\max_\beta \beta/\alpha$ the separation ratio.

Definition 1.2 (well-separated dataset). We say a dataset $S$ is well-separated if the separation ratio of $S$ is larger than 2.

Definition 1.3 (natural partition; $F_0$ of well-separated dataset). We can naturally partition a well-separated dataset $S$ into a set of groups such that the intra-group distance is at most $\alpha$, and the inter-group distance is more than $2\alpha$. We call this the unique natural partition of $S$. Define the number of distinct elements of a well-separated dataset w.r.t. $\alpha$, denoted as $F_0(S, \alpha)$, to be the number of groups in the natural partition.

We will assume that $\alpha$ is given as a user-chosen input to our algorithms. In practice, $\alpha$ can be obtained, for example, by sampling a small number of items of the dataset and then comparing their labels.

For a general dataset, we need to define the number of distinct elements as an optimization problem as follows.

Definition 1.4 ($F_0$ of general dataset). The number of distinct elements of $S$ given a distance threshold $\alpha$, denoted by $F_0(S, \alpha)$, is defined to be the size of the minimum cardinality partition $\mathcal{G} = \{G_1, G_2, \ldots, G_n\}$ of $S$ such that for any $i = 1, \ldots, n$, and for any pair of points $u, v \in G_i$, we have $d(u, v) \le \alpha$.

Note that the definition for general datasets is consistent with the one for well-separated datasets.

We next define $\ell_0$-sampling for noisy datasets. To differentiate it from the standard $\ell_0$-sampling we will call it robust $\ell_0$-sampling; but

we may omit the word robust in the rest of the paper when it is clear from the context. We start with well-separated datasets.

Definition 1.5 (robust $\ell_0$-sampling on well-separated dataset). Let $S$ be a well-separated dataset with natural partition $\mathcal{G} = \{G_1, G_2, \ldots, G_n\}$. The robust $\ell_0$-sampling on $S$ outputs a point $u \in S$ such that

    $\forall i \in [n],\ \Pr[u \in G_i] = 1/n.$    (1)

That is, we output a point from each group with equal probability; we call the outputted point the robust $\ell_0$-sample.

It is a little more subtle to define robust $\ell_0$-sampling on general datasets, since there could be multiple minimum cardinality partitions, and without fixing a particular partition we cannot define $\ell_0$-sampling. We will circumvent this issue by targeting a slightly weaker sampling goal.

Definition 1.6 (robust $\ell_0$-sampling on general dataset). Let $S$ be a dataset and let $n = F_0(S, \alpha)$. The robust $\ell_0$-sampling on $S$ outputs a point $q$ such that

    $\forall p \in S,\ \Pr[q \in \mathrm{Ball}(p, \alpha) \cap S] = \Theta(1/n),$    (2)

where $\mathrm{Ball}(p, \alpha)$ is the ball centered at $p$ with radius $\alpha$.

Let us compare Equations (1) and (2). It is easy to see that when $S$ is well-separated, letting $G(p)$ denote the group that $p$ belongs to in the natural partition of $S$, we have $G(p) = \mathrm{Ball}(p, \alpha) \cap S$, and thus we can rewrite (1) as

    $\forall p \in S,\ \Pr[q \in \mathrm{Ball}(p, \alpha) \cap S] = 1/n.$    (3)

Comparing (2) and (3), one can see that the definition of robust $\ell_0$-sampling on general datasets is consistent with that on well-separated datasets, except that we have relaxed the sample probability by a constant factor.

Computational Models. We study robust $\ell_0$-sampling in the standard streaming model, where the points $p_1, \ldots, p_m \in S$ come one by one in order, and we need to maintain a sketch of $S_t = \{p_1, \ldots, p_t\}$ (denoted by $sk(S_t)$) such that at any time $t$ we can output an $\ell_0$-sample of $S_t$ using $sk(S_t)$. The goal is to minimize the size of the sketch $sk(S_t)$ (that is, the memory space usage) and the processing time per point under certain accuracy/approximation guarantees.

We also study the sliding window models. Let $w$ be the window size.
In the sequence-based sliding window model, at any time step $t$ we should be able to output an $\ell_0$-sample of $\{p_{\ell-w+1}, \ldots, p_\ell\}$, where $p_\ell$ is the latest point that we receive by time $t$. In the time-based sliding window model, we should be able to output an $\ell_0$-sample of $\{p_{\ell'}, \ldots, p_\ell\}$, where $p_{\ell'}, \ldots, p_\ell$ are the points received in the last $w$ time steps $t-w+1, \ldots, t$. The sliding window models are generalizations of the standard streaming model (which we call the infinite window model), and are very useful when we are only interested in the most recent items. Our algorithms for sliding windows will work for both the sequence-based and time-based cases. The only difference is that the definitions of the expiration of a point are different in the two cases.

Our Contributions. This paper makes the following theoretical contributions.

(1) We propose a robust $\ell_0$-sampling algorithm for well-separated datasets in the streaming model in constant dimensional Euclidean spaces; the algorithm uses $O(\log m)$ words of space ($m$ is the length of the stream) and $O(\log m)$ processing time per point, and succeeds with probability $(1 - 1/m)$ during the whole streaming process. This result matches the one in the corresponding noiseless data setting. See Section 2.1.

(2) We next design an algorithm for sliding windows under the same setting. The algorithm works for both sequence-based and time-based sliding windows, using $O(\log n \log w)$ words of space and $O(\log n \log w)$ processing time per point with success probability $(1 - 1/m)$ during the whole streaming process. We comment that the sliding window algorithm is much more complicated than the one for the infinite window, and is our main technical contribution. See Section 2.2.

(3) For general datasets, we manage to show that the proposed $\ell_0$-sampling algorithms for well-separated datasets still produce almost uniform samples on general datasets. More precisely, they achieve the guarantee (2). See Section 3.
(4) We further show that our algorithms can also handle datasets in high dimensional Euclidean spaces given sufficiently large separation ratios. See Section 4.

(5) Finally, we show that our $\ell_0$-sampling algorithms can be used to efficiently estimate $F_0$ in both the standard streaming model and the sliding window models. See Section 5.

We have also implemented and tested our $\ell_0$-sampling algorithm for the infinite window case, and verified its effectiveness on various datasets. See Section A.

Related Work. We now briefly survey related works on distinct sampling, and previous work dealing with datasets with near-duplicates.

The problem of $\ell_0$-sampling is among the most well studied problems in the data stream literature. It was first investigated in [14, 24, 26], and the current best result is due to Jowhari et al. [28]. We refer readers to [13] for an overview of a number of $\ell_0$-samplers under a unified framework. Besides being used in various statistical estimations [14], $\ell_0$-sampling finds applications in dynamic geometric problems (e.g., $\epsilon$-approximation, minimum spanning tree [24]), and dynamic graph streaming algorithms (e.g., connectivity [1], graph sparsifiers [2, 3], vertex cover [10, 11], maximum matching [1, 5, 10, 30], etc.; see [32] for a survey). However, all the algorithms for $\ell_0$-sampling proposed in the literature only work for noiseless streaming datasets.

$\ell_0$-sampling in the sliding windows on noiseless datasets can be done by running the algorithm in [6] with the rank of each item being generated by a random hash function. As before, this approach cannot work for datasets with near-duplicates simply because the hash values assigned to near-duplicates will be different.

$\ell_0$-sampling has also been studied in the distributed streaming setting [12] where there are multiple streams and we want to maintain a distinct sample over the union of the streams.
The sampling algorithm in [12] is essentially an extension of the random sampling algorithms in [15, 34] by using a hash function to generate random ranks for items, and is thus again unsuitable for datasets with near-duplicates.

The list of works on $F_0$ estimation is even longer (e.g., [7, 19, 22, 23, 25, 29], just to mention a few). Estimating $F_0$ in the sliding window

model was studied in [37]. Again, all these works target noiseless data.

The general problem of processing noisy data streams without a comprehensive data cleaning step was only studied fairly recently [9] for the $F_0$ problem. A number of statistical problems ($F_0$, $\ell_0$-sampling, heavy hitters, etc.) were studied in the distributed model under the same noisy data model [36]. Unfortunately the multi-round algorithms designed in the distributed model cannot be used in the data stream model because on data streams we can only scan the whole dataset once without looking back.

This line of research is closely related to entity resolution (also called data deduplication, record linkage, etc.); see, e.g., [17, 20, 27, 31]. However, all these works target finding and merging all the near-duplicates, and thus cannot be applied to the data stream model where we only have a small memory space and cannot store all the items.

Techniques Overview. The high level idea of our algorithm for the infinite window is very simple. Suppose we can modify the stream by only keeping one representative point (e.g., the first point according to the order of the data stream) of each group; then we can just perform uniform random sampling on the representative points, which can be done, for example, by the following folklore algorithm: we assign each point a random rank in $(0, 1)$, and maintain the point with the minimum rank as the sample during the streaming process.

Now the question becomes: can we identify (not necessarily store) the first point of each group space-efficiently? Unfortunately, we will need to use $\Omega(n)$ space ($n$ is the number of groups) to identify the first point of each group for a noisy streaming dataset, since we have to store at least 1 bit to record the first point of each group to avoid selecting other later-coming points of the same group. One way to deal with this challenge is to subsample a set of groups in advance, and then only focus on the first points of this set of groups.
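The folklore minimum-rank sampler mentioned above can be sketched as follows. This is only an illustration under the assumption that the stream already consists of one representative per group; the function name `min_rank_sample` and the use of Python's `random` module are choices made here, not part of the paper.

```python
import random


def min_rank_sample(representatives, seed=0):
    """One-pass uniform sample: give every item a fresh random rank and
    keep only the item with the smallest rank seen so far (O(1) memory)."""
    rng = random.Random(seed)
    best_rank, best_item = None, None
    for item in representatives:
        rank = rng.random()  # rank in [0, 1); ties have probability ~0
        if best_rank is None or rank < best_rank:
            best_rank, best_item = rank, item
    return best_item
```

Because the ranks are independent and identically distributed, each of the $n$ representatives is equally likely to hold the minimum, which is why the retained item is a uniform sample.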
Two issues remain to be dealt with: (1) How to sample a set of groups in advance? (2) How to determine the sample rate? Note that before we have seen all points in the group, the group itself is not well-defined, and thus it is difficult to assign an ID to a group at the beginning and perform the subsampling. Moreover, the number of groups will keep increasing as we see more points; we therefore have to decrease the sample rate along the way to keep the space usage small.

For the first question, the idea is to post a random grid of side length $\Theta(\alpha)$ ($\alpha$ is the group distance threshold) upon the point set, and then sample cells of the grid instead of groups using a hash function. We then say a group $G$ is

(1) sampled if and only if $G$'s first point falls into a sampled cell;
(2) rejected if $G$ has a point in a sampled cell, but $G$'s first point is not in a sampled cell;
(3) ignored if $G$ has no point in a sampled cell.

We note that the second item is critical since we want to judge a group only by its first point; even if there is another point in the group that is sampled, if it is not the first point of the group, then we will still consider the group as rejected. On the other hand, we do not need to worry about the ignored groups since they are not considered at the very beginning.

To guarantee that our decision is consistent on each group we have to keep some neighborhood information on each rejected group as well to avoid double-counting, which seems to be space-expensive at first glance. Fortunately, for constant dimensional Euclidean space, we can show that if grid cells are randomly sampled, then the number of non-sampled groups is within a constant factor of that of sampled groups. We thus can control the space usage of the algorithm by dynamically decreasing the sample rate for grid cells. More precisely, we try to maintain a sample rate as low as possible while guaranteeing that there is at least one group that is sampled. This answers the second question.
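The three-way classification of a group described above can be made concrete with a small sketch. It assumes 2-D points, a grid anchored at the origin (the random offset of the grid is omitted), and an explicitly given set of sampled cells; `cell_of` and `classify_group` are hypothetical helper names introduced only for this illustration.

```python
import math


def cell_of(p, side):
    """Index of the grid cell (of the given side length) containing point p."""
    return (math.floor(p[0] / side), math.floor(p[1] / side))


def classify_group(points_in_arrival_order, sampled_cells, side):
    """Classify one group by the cells its points fall into.

    The decision looks at the first point before any other point, so a
    group whose later point hits a sampled cell is rejected, not sampled.
    """
    cells = [cell_of(p, side) for p in points_in_arrival_order]
    if cells[0] in sampled_cells:
        return "sampled"
    if any(c in sampled_cells for c in cells[1:]):
        return "rejected"
    return "ignored"
```

For instance, with `side = 0.5` a group arriving as `[(0.1, 0.1), (0.6, 0.1)]` occupies cells `(0, 0)` and `(1, 0)`; it is sampled if `(0, 0)` is a sampled cell, rejected if only `(1, 0)` is, and ignored if neither is.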
The situation in the sliding window case becomes complicated because points will expire, and consequently we cannot keep decreasing the grid cell sample rate. In fact, we have to increase the cell sample rate when there are not enough groups being sampled. However, if we increase the cell sample rate in the middle of the process, then the neighborhood information of those previously ignored groups has already been lost. To handle this dilemma we choose to maintain a hierarchical sampling structure. We describe the high level ideas as well as the actual algorithm later, after some basic algorithms and concepts have been introduced.

For general datasets, we show that our algorithms for well-separated datasets can still return an almost uniform random distinct sample. We first relate our robust $\ell_0$-sampling algorithm to a greedy partition process, and show that our algorithm will return a random group among the groups generated by that greedy partition. We then compare that particular greedy partition with the minimum cardinality partition, and show that the numbers of groups produced by the two partitions are within a constant factor of each other.

Comparison with [9]. We note that although this work follows the noisy data model of [9] and the roadmap of this paper is similar to that of [9] (which we think is the natural way for the presentation), the contents of this paper, namely, the ideas, proposed algorithms, and analysis techniques, are all very different from those in [9]. After all, the $\ell_0$-sampling problem studied in this paper is different from the $F_0$ estimation studied in [9]. We note, however, that there are natural connections between distinct elements and distinct sampling, and thus would like to mention a few points.

(1) In the infinite window case, we can easily use our robust $\ell_0$-sampling algorithm to get an algorithm for $(1+\epsilon)$-approximating robust $F_0$ using the same amount of space as that in [9] (see Section 5).
We note that in the noiseless data setting, the problems of $\ell_0$-sampling and $F_0$ estimation can be reduced to each other by easy reductions. However, it is not clear how to straightforwardly use $F_0$ estimation to perform $\ell_0$-sampling in the noisy data setting using the same amount of space as we have achieved. We believe that since there is no magic hash function, a procedure similar to finding the representative point of each group is necessary in any $\ell_0$-sampling algorithm in the noisy data setting.

(2) Our sliding window $\ell_0$-sampling algorithm can also be used to obtain a sliding window algorithm for $(1+\epsilon)$-approximating $F_0$ (also see Section 5). However, it is not clear how to extend

the $F_0$ algorithm in [9] to the sliding window case, which was not studied in [9].

(3) In order to deal with general datasets, in [9] the authors introduced a concept called $F_0$-ambiguity and used it as a parameter in the analysis. Intuitively, $F_0$-ambiguity measures the least fraction of points that we need to remove in order to make the dataset well-separated. This definition works for problems whose answer is a single number, which does not depend on the actual group partition. However, different group partitions do affect the result of $\ell_0$-sampling, even if all those partitions have the minimum cardinality. In Section 3 we show that by introducing a relaxed version of random sampling we can bypass the issue of data ambiguity.

Preliminaries. In Table 1 we summarize the main notations used in this paper.

Table 1: Notations

    Notation                      Definition
    $S$                           stream of points
    $m$                           length of the stream
    $w$                           length of the sliding window
    $n = F_0(S)$                  number of groups
    $\mathcal{G}$ / $G$           set of groups / a group
    $G(p)$                        group containing point $p$
    $\alpha$                      threshold of group diameter
    $\mathbb{G}$ / $C$            grid / a grid cell
    CELL($p$)                     cell containing point $p$
    ADJ($p$)                      set of cells adjacent to CELL($p$)
    $\mathrm{Ball}(p, \alpha)$    $\{q \mid d(p, q) \le \alpha\}$
    $\epsilon$                    approximation ratio for $F_0$

We use $[n]$ to denote $\{1, 2, \ldots, n\}$. We say $x$ is a $(1+\epsilon)$-approximation of $y$ if $x \in [(1-\epsilon)y, (1+\epsilon)y]$.

We need the following versions of the Chernoff bound.

LEMMA 1.7 (STANDARD CHERNOFF BOUND). Let $X_1, \ldots, X_n$ be independent Bernoulli random variables such that $\Pr[X_i = 1] = p_i$. Let $X = \sum_{i \in [n]} X_i$. Let $\mu = E[X]$. It holds that $\Pr[X \ge (1+\delta)\mu] \le e^{-\delta^2 \mu/3}$ and $\Pr[X \le (1-\delta)\mu] \le e^{-\delta^2 \mu/2}$ for any $\delta \in (0, 1)$.

LEMMA 1.8 (VARIANT OF CHERNOFF BOUND). Let $Y_1, \ldots, Y_n$ be $n$ independent random variables such that $Y_i \in [0, T]$ for some $T > 0$. Let $\mu = E[\sum_i Y_i]$. Then for any $a > 0$, we have

    $\Pr\left[\sum_{i \in [n]} Y_i > a\right] \le e^{-(a - 2\mu)/T}.$

2 WELL-SEPARATED DATASETS IN CONSTANT DIMENSIONS

We start with the discussion of $\ell_0$-sampling on well-separated datasets in constant dimensional Euclidean space.

2.1 Infinite Window

We first consider the infinite window case.
We present our algorithm for the 2-dimensional Euclidean space, but it can be trivially extended to $O(1)$ dimensions by appropriately changing the constant parameters.

Let $\mathcal{G} = \{G_1, \ldots, G_n\}$ be the natural group partition of the well-separated stream of points $S$. We post a random grid $\mathbb{G}$ with side length $\alpha/2$ on $\mathbb{R}^2$, and call each grid square a cell. For a point $p$, define CELL($p$) to be the cell $C \in \mathbb{G}$ containing $p$. Let ADJ($p$) $= \{C \in \mathbb{G} \mid d(p, C) \le \alpha\}$, where $d(p, C)$ is defined to be the minimum distance between $p$ and a point in $C$. We say a group $G$ intersects a cell $C$ if $G \cap C \neq \emptyset$.

Assume that all points have $x$ and $y$ coordinates in the range $[0, M]$ for a large enough value $M$. Let $\Delta = \lceil 2M/\alpha \rceil + 1$. We assign the cell on the $i$-th row and the $j$-th column of the grid $\mathbb{G} \cap ([0, M] \times [0, M])$ a numerical identification (ID) $(i-1) \cdot \Delta + j$. For convenience we will use a cell and its ID interchangeably throughout the paper when there is no confusion.

For ease of presentation, we will assume that we can use fully random hash functions for free. In fact, by Chernoff-Hoeffding bounds for limited independence [18, 33], all our analysis still holds when we adopt $\Theta(\log m)$-wise independent hash functions, which does not affect the asymptotic space and time costs of our algorithms.

Let $h : [\Delta^2] \to \{0, 1, \ldots, 2^{\lceil 2 \log \Delta \rceil} - 1\}$ be a fully random hash function, and define $h_R$ for a given parameter $R = 2^k$ ($k \in \mathbb{N}$) to be $h_R(x) = h(x) \bmod R$. We will use $h_R$ to perform sampling. In particular, given a set of IDs $Y = \{y_1, \ldots, y_t\}$, we call $\{y \in Y \mid h_R(y) = 0\}$ the set of sampled IDs of $Y$ by $h_R$. We also call $1/R$ the sample rate of $h_R$.

As discussed in the techniques overview in the introduction, our main idea is to sample cells instead of groups in advance using a hash function.

Definition 2.1 (sampled cell). A cell $C$ is sampled by $h_R$ if and only if $h_R(C) = 0$.

By our choices of the grid cell size and the hash function we have:

FACT 1. (a) Each cell will intersect at most one group, and each group will intersect at most $O(1)$ cells.
(b) For any set of points $P = \{p_1, \ldots, p_t\}$, $\{p \in P \mid h_{2R}(\mathrm{CELL}(p)) = 0\} \subseteq \{p \in P \mid h_R(\mathrm{CELL}(p)) = 0\}$.

In the infinite window case (this section) we choose the representative point of each group to be the first point of the group. We note that the representative points are fully determined by the input stream, and are independent of the hash function. We will define the representative point slightly differently in the sliding window case (next section).

We define a few sets which we will maintain in our algorithms.

Definition 2.2. Let $S^{rep}$ be the set of representative points of all groups. Define the accept set to be

    $S^{acc} = \{p \in S^{rep} \mid h_R(\mathrm{CELL}(p)) = 0\},$

and the reject set to be

    $S^{rej} = \{p \in S^{rep} \setminus S^{acc} \mid \exists C \in \mathrm{ADJ}(p) \text{ s.t. } h_R(C) = 0\}.$

For convenience we also introduce the following concepts.

Definition 2.3 (sampled, rejected, candidate group). We say a group $G$ is a sampled group if $G \cap S^{acc} \neq \emptyset$, a rejected group if $G \cap S^{rej} \neq \emptyset$, and a candidate group if $G \cap (S^{acc} \cup S^{rej}) \neq \emptyset$.
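The grid, the adjacency set ADJ($p$) and the hash-based cell sampling can be sketched as below. This is a rough illustration, not the paper's implementation: it assumes 2-D points, a grid of side length $\alpha/2$ anchored at the origin (the random offset is omitted), cells identified by index pairs instead of numerical IDs, and SHA-256 as a stand-in for the fully random hash $h$. All function names are hypothetical.

```python
import hashlib
import math


def cell_of(p, alpha):
    """Index (column, row) of the cell of side alpha/2 that contains p."""
    side = alpha / 2.0
    return (math.floor(p[0] / side), math.floor(p[1] / side))


def adjacent_cells(p, alpha):
    """ADJ(p): all cells whose minimum distance to p is at most alpha."""
    side = alpha / 2.0
    cx, cy = cell_of(p, alpha)
    cells = []
    # Offsets up to 3 are scanned so that boundary cases are not missed.
    for i in range(cx - 3, cx + 4):
        for j in range(cy - 3, cy + 4):
            dx = max(i * side - p[0], 0.0, p[0] - (i + 1) * side)
            dy = max(j * side - p[1], 0.0, p[1] - (j + 1) * side)
            if math.hypot(dx, dy) <= alpha:
                cells.append((i, j))
    return cells


def h(cell, seed=0):
    """Stand-in for the random hash h: a 64-bit value derived from SHA-256."""
    data = f"{seed}:{cell[0]}:{cell[1]}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def is_sampled(cell, R, seed=0):
    """A cell is sampled by h_R iff h(cell) mod R == 0 (R is a power of two)."""
    return h(cell, seed) % R == 0
```

Since $h(x) \bmod 2R = 0$ implies $h(x) \bmod R = 0$, every cell sampled at rate $1/(2R)$ is also sampled at rate $1/R$, which is the nesting property stated in Fact 1(b).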

Figure 1: Each square is a cell; each light blue square is a sampled cell. Each gray dashed circle stands for a group. Red points ($p_1$, $p_2$ and $p_3$) are representative points; $p_1$ is in the accept set and $p_2$ is in the reject set. Gray cells form ADJ($p_3$). $\alpha = 2$ in this illustration.

Figure 1 illustrates some of the concepts we have introduced so far. Obviously, the set of sampled groups and the set of rejected groups are disjoint, and their union is the set of candidate groups. Also note that $S^{acc}$ is the set of representative points of the sampled groups, and $S^{rej}$ is the set of representative points of the rejected groups.

We comment that it is important to keep the set $S^{rej}$, even though at the end we will only sample a point from $S^{acc}$. This is because otherwise we will have no information regarding the first points of those groups that may have points other than the first ones falling into a sampled cell, and consequently points in $S \setminus S^{rep}$ may also be included into $S^{acc}$, which will make the final sampling non-uniform among the groups. One may wonder whether this additional storage will cost too much space. Fortunately, since each group has diameter at most $\alpha$, we only need to monitor groups that are at a distance of at most $\alpha$ away from sampled cells, whose cardinality can be shown to be small. More precisely, for a group $G$, letting $p$ be its representative point, we monitor $G$ if and only if there exists a sampled cell $C$ such that $C \in \mathrm{ADJ}(p)$. The set of representative points of such groups is precisely $S^{acc} \cup S^{rej}$.

Our algorithm for $\ell_0$-sampling in the infinite window case is presented in Algorithm 1. We control the sample rate by doubling the range $R$ of the hash function when the number of points in $S^{acc}$ exceeds a threshold $\Theta(\log m)$ (Line 10 of Algorithm 1). We will also update $S^{acc}$ and $S^{rej}$ accordingly to maintain Definition 2.2.
When a new point $p$ comes, if CELL($p$) is sampled and $p$ is the first point in $G(p)$ (Line 6), we add $p$ to $S^{acc}$; that is, we make $p$ the representative point of the sampled group $G(p)$. Otherwise if CELL($p$) is not sampled but there is at least one sampled cell in ADJ($p$), and $p$ is the first point in $G(p)$ (Line 8), then we add $p$ to $S^{rej}$; that is, we make $p$ the representative point of the rejected group $G(p)$.

On the other hand, if there is at least one sampled cell in ADJ($p$) (i.e., $G(p)$ is a candidate group) and $p$ is not the first point (Line 4), then we simply discard $p$. Note that we can test this since we have already stored the representative points of all candidate groups. In the remaining case in which $G(p)$ is not a candidate group, we also discard $p$.

At the time of query, we return a random point in $S^{acc}$.

Algorithm 1: ROBUST $\ell_0$-SAMPLING-IW
     1  $R \leftarrow 1$; $S^{acc} \leftarrow \emptyset$; $S^{rej} \leftarrow \emptyset$
     2  $\kappa_0$ is chosen to be a large enough constant
        /* dataset is fed as a stream of points */
     3  for each arriving point $p$ do
            /* if p is not the first point of a candidate group, skip it */
     4      if $\exists u \in S^{acc} \cup S^{rej}$ s.t. $d(u, p) \le \alpha$ then
     5          continue
            /* if p is the first point of a candidate group */
     6      if $h_R(\mathrm{CELL}(p)) = 0$ then
     7          $S^{acc} \leftarrow S^{acc} \cup \{p\}$
     8      else if $\exists C \in \mathrm{ADJ}(p)$ s.t. $h_R(C) = 0$ then
     9          $S^{rej} \leftarrow S^{rej} \cup \{p\}$
    10      if $|S^{acc}| > \kappa_0 \log m$ then
    11          $R \leftarrow 2R$
    12          update $S^{acc}$ and $S^{rej}$ according to the updated hash function $h_R$
        /* at the time of query: */
    13  return a random point in $S^{acc}$

Correctness and Complexity. We show the following theorem.

THEOREM 2.4. In constant dimensional Euclidean space for a well-separated dataset, there exists a streaming algorithm (Algorithm 1) that, with probability $1 - 1/m$, outputs a robust $\ell_0$-sample at any time step. The algorithm uses $O(\log m)$ words of space and $O(\log m)$ processing time per point.

The correctness of the algorithm is straightforward. First, $S^{acc}$ is a random subset of $S^{rep}$ since each point $p \in S^{rep}$ is included in $S^{acc}$ if and only if $h_R(\mathrm{CELL}(p)) = 0$. Second, the outputted point is a random point in $S^{acc}$.
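A compact Python sketch of Algorithm 1 may help in following the case analysis above. It is written under several assumptions that are not from the paper: points are 2-D tuples, the grid is anchored at the origin without a random shift, SHA-256 replaces the fully random hash, `kappa0=4` is an arbitrary choice of the constant $\kappa_0$, and the class and method names are hypothetical.

```python
import hashlib
import math
import random


class RobustL0SamplerIW:
    """Illustrative sketch of Algorithm 1 for 2-D points given as (x, y)."""

    def __init__(self, alpha, m, kappa0=4, seed=0):
        self.alpha = alpha
        self.side = alpha / 2.0                     # grid side length alpha/2
        self.threshold = kappa0 * math.log2(max(m, 2))
        self.seed = seed
        self.R = 1                                  # the sample rate is 1/R
        self.acc, self.rej = [], []                 # S^acc and S^rej

    def _sampled(self, cell):
        # h_R(cell) = h(cell) mod R, with SHA-256 standing in for h.
        data = f"{self.seed}:{cell[0]}:{cell[1]}".encode()
        value = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        return value % self.R == 0

    def _cell(self, p):
        return (math.floor(p[0] / self.side), math.floor(p[1] / self.side))

    def _adj(self, p):
        # Cells whose minimum distance to p is at most alpha.
        cx, cy = self._cell(p)
        for i in range(cx - 3, cx + 4):
            for j in range(cy - 3, cy + 4):
                dx = max(i * self.side - p[0], 0.0, p[0] - (i + 1) * self.side)
                dy = max(j * self.side - p[1], 0.0, p[1] - (j + 1) * self.side)
                if math.hypot(dx, dy) <= self.alpha:
                    yield (i, j)

    def _near_sampled(self, p):
        return any(self._sampled(c) for c in self._adj(p))

    def insert(self, p):
        # Lines 4-5: a representative of p's group is already stored.
        if any(math.dist(u, p) <= self.alpha for u in self.acc + self.rej):
            return
        # Lines 6-9: p is the first point of its group that we keep.
        if self._sampled(self._cell(p)):
            self.acc.append(p)
        elif self._near_sampled(p):
            self.rej.append(p)
        # Lines 10-12: halve the sample rate and re-filter both sets.
        if len(self.acc) > self.threshold:
            self.R *= 2
            reps = self.acc + self.rej
            self.acc = [u for u in reps if self._sampled(self._cell(u))]
            kept = set(self.acc)
            self.rej = [u for u in reps
                        if u not in kept and self._near_sampled(u)]

    def query(self, rng=random):
        # Line 13: a uniformly random point of S^acc (None if it is empty).
        return rng.choice(self.acc) if self.acc else None
```

A typical use is to call `insert` on every arriving point and `query` whenever a sample is needed. The linear scans over `acc` and `rej` mirror the $O(|S^{acc}| + |S^{rej}|)$ per-point processing time discussed in the analysis.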
The only thing left to be shown is that we have $|S^{acc}| > 0$ at any time step.

LEMMA 2.5. With probability $1 - 1/m$, we have $|S^{acc}| > 0$ throughout the execution of the algorithm.

PROOF. At the first time step of the streaming process, $p_1$ is added into $S^{acc}$ with certainty since $R$ is initialized to be 1. Then $S^{acc}$ keeps growing. At the moment when $|S^{acc}| > \kappa_0 \log m$, $R$ is doubled so that each point in $S^{acc}$ is resampled with probability $1/2$. After the resampling,

    $\Pr[|S^{acc}| = 0] \le (1/2)^{\kappa_0 \log m} \le 1/m^2.$    (4)

By a union bound over at most $m$ resample steps, we conclude that with probability $1 - 1/m$, $|S^{acc}| > 0$ throughout the execution of the algorithm.

We next analyze the space and time complexities of Algorithm 1.

LEMMA 2.6. With probability $(1 - 1/m)$ we have $|S^{rej}| = O(\log m)$ throughout the execution of the algorithm.

PROOF. Consider a fixed time step. Let $S = S^{acc} \cup S^{rej}$. For a fixed $p \in S^{rep}$, since $|\mathrm{ADJ}(p)| \le 25$ (we are in the 2-dimensional Euclidean space), and each cell is sampled randomly, we have

    $\Pr[p \in S^{rej}] \le \frac{24}{25} \Pr[p \in S].$    (5)

We only need to prove the lemma for the case $\Pr[p \in S^{rej}] = \frac{24}{25} \Pr[p \in S]$; the case $\Pr[p \in S^{rej}] < \frac{24}{25} \Pr[p \in S]$ follows directly since $p$ is less likely to be included in $S^{rej}$.

For each $p \in S$, define $X_p$ to be a random variable such that $X_p = 1$ if $p \in S^{rej}$, and $X_p = 0$ otherwise. Let $X = \sum_{p \in S} X_p$. We have $X = |S^{rej}|$ and $E[X] = \frac{24}{25} |S|$. By a Chernoff bound (Lemma 1.7), we have

    $\Pr[X - E[X] > 0.01\, E[X]] \le e^{-0.01^2 E[X]/3}.$    (6)

If $|S| \le c \log m$ for a sufficiently large constant $c$, then we immediately have $|S^{rej}| \le |S| = O(\log m)$. Otherwise by (6) we have $\Pr[X > 1.01\, E[X]] < 1/m^2$. We thus have

    $1/m^2 > \Pr[X > 1.01\, E[X]]$
    $\quad = \Pr[X > 1.01 \cdot \frac{24}{25} |S|]$
    $\quad = \Pr[X > 0.9696\, (X + |S^{acc}|)]$
    $\quad = \Pr[0.0304\, X > 0.9696\, |S^{acc}|].$

According to Algorithm 1 it always holds that $|S^{acc}| = O(\log m)$. Therefore $|S^{rej}| = X = O(\log m)$ with probability at least $1 - 1/m^2$. The lemma follows by a union bound over the $m$ time steps of the streaming process.

By Lemma 2.6 the space used by the algorithm can be bounded by $O(|S^{acc}| + |S^{rej}|) = O(\log m)$ words. The processing time per point is also bounded by $O(|S^{acc}| + |S^{rej}|)$.

2.2 Sliding Windows

We now consider the sliding window case. Let $w$ be the window size. We first present an algorithm that maintains a set of sampled points in $S^{acc}$ with a fixed sample rate $1/R$; it will be used as a subroutine in our final sliding window algorithm (Section 2.2.2).

2.2.1 A Sliding Window Algorithm with Fixed Sample Rate. We describe the algorithm in Algorithm 2, and explain it in words below.
Besides maintaining the accept set and the reject set as in the infinite window case, Algorithm 2 also maintains a set $A$ consisting of key-value pairs $(u, p)$, where $u$ is the representative point of a candidate group ($u$ can be a point outside the sliding window as long as the group has at least one point inside the sliding window), and $p$ is the latest point of the same group (thus $p$ must be in the sliding window). Define $A(S^{acc}) = \{p \mid \exists u \in S^{acc} \text{ s.t. } (u, p) \in A\}$.

Algorithm 2: SW WITH FIXED SAMPLE RATE $1/R$
     1  for each expired point $p$ do
     2      if $\exists (u, p) \in A$ then
     3          delete $(u, p)$ from $A$, delete $u$ from $S^{acc} \cup S^{rej}$
     4  for each arriving point $p$ do
            /* if there already exists a point of the same group in $S^{acc} \cup S^{rej}$ */
     5      if $\exists u \in S^{acc} \cup S^{rej}$ s.t. $d(u, p) \le \alpha$ then
     6          $A \leftarrow (u, p) \cup A \setminus (u, \cdot)$
            /* otherwise we set p as a representative of its group */
     7      else if $h_R(\mathrm{CELL}(p)) = 0$ then
     8          $S^{acc} \leftarrow S^{acc} \cup \{p\}$, $A \leftarrow A \cup (p, p)$
     9      else if $\exists C \in \mathrm{ADJ}(p)$ s.t. $h_R(C) = 0$ then
    10          $S^{rej} \leftarrow S^{rej} \cup \{p\}$, $A \leftarrow A \cup (p, p)$

Figure 2: Representative points in sliding windows. There are two different groups, and the red window is the current sliding window (of size $w = 5$). Point $c$ is not the representative point of Group 1 because the window right before it (inclusive) contains point $b$ which is also in Group 1. Point $b$ is the representative point because it is the latest point such that there is no other point of Group 1 in the window right before $b$.

For each sliding window, we guarantee that each candidate group $G$ has exactly one representative point. This is achieved by the following process: for each candidate group $G$, if there is no maintained representative point, then we take the first point $u$ as the representative point (Line 8 and Line 10). When the last point $p$ of the group expires, we delete the maintained representative point $u$ from $S^{acc} \cup S^{rej}$, and delete $(u, p)$ from $A$ (Line 3).

For a newly arriving point $p$, if there already exists a point $u \in S^{acc} \cup S^{rej}$ in the same group $G$, then we simply update the last point in the pair $(u, \cdot)$ we maintain for $G$ (Line 6).
Otherwise $p$ is the first point of $G(p)$ in the sliding window. If $G(p)$ is a sampled group, then we add $p$ to $S^{acc}$ and $(p, p)$ to $A$ (Line 8); else if $G(p)$ is a rejected group, then we add $p$ to $S^{rej}$ and $(p, p)$ to $A$ (Line 10).

The following observation is a direct consequence of Algorithm 2. It follows from the discussion above and the testing at Line 7 of Algorithm 2.

OBSERVATION 1. In Algorithm 2, at any time for the current sliding window, we have

(1) Each group has exactly one representative point, which is fully determined by the stream and is independent of the hash function. More precisely, a point $p$ becomes the representative

point of group $G$ in the current window if $p$ is the latest point in $G$ such that the window right before $p$ (inclusive) has no point in $G$. See Figure 2 for an illustration.

(2) The representative point of each group in the current window is included in $S^{acc}$ with probability $1/R$.

2.2.2 A Space-Efficient Algorithm for Sliding Windows. We now present our space-efficient sliding window algorithm. Note that the algorithm presented in Section 2.2.1, though being able to produce a random sample in the sliding window setting, does not have a good space usage guarantee; it may use space up to $w/R$ where $w$ is the window size.

The sliding window algorithm presented in this section works simultaneously for both sequence-based and time-based sliding windows.

High Level Ideas. As mentioned, the main challenge of the sliding window algorithm design is that points will expire, and thus we cannot keep decreasing the sample rate. On the contrary, if at some point there are too few non-expired sampled points, then we have to increase the sample rate to guarantee that there is at least one point in the sliding window belonging to $S^{acc}$. However, if we increase the sample rate in the middle of the streaming process, then the neighborhood information of a newly sampled group may already be lost. In other words, we cannot properly maintain $S^{rej}$ when the sample rate increases. To resolve this issue we have to prepare for such a decrease of $|S^{acc}|$ in advance. To this end, we maintain a hierarchical set of instances of Algorithm 2, with sample rates $1/R$ being $1, 1/2, 1/4, \ldots$ respectively. We thus can guarantee that in the lowest level (the one with sample rate 1) we must have at least one sampled point.

Of course, to achieve a good space usage we cannot endlessly insert points into all the Algorithm 2 instances. We instead make sure that each level $\ell$ stores at most $|S_\ell^{acc} \cup S_\ell^{rej}| = O(\log m)$ points, where $S_\ell^{acc}$ and $S_\ell^{rej}$ are the accept set and reject set respectively in the run of an Algorithm 2 instance at level $\ell$.
We achieve this by maintaining a dynamic partition of the sliding window. In the $\ell$-th subwindow we run an instance of Algorithm 2 with sample rate $1/2^\ell$. For each incoming point, we accept it at the highest level in which the point falls into $S^{acc}$, and then delete all points in the accept and reject sets in all the lower levels. Whenever the number of points in $S_\ell^{acc}$ at some level $\ell$ exceeds the threshold $c \log m$ for some constant $c$, we promote most of its points to level $\ell + 1$. The process may cascade to the top level. At the time of query we properly resample the points maintained at each $S_\ell^{acc}$ ($\ell = 0, 1, \ldots$) to unify their sample probabilities, and then merge them into $S^{acc}$. In order to guarantee that we always have at least one sampled point in $S^{acc}$ whenever the sliding window is not empty, during the algorithm (in particular, the promotion procedure) we make sure that the last point of each level $\ell$ is always in the accept set $S_\ell^{acc}$.

REMARK 1. The hierarchical set of windows is reminiscent of the exponential histogram technique by Datar et al. [16] for basic counting in the sliding window model. However, on a careful look one will notice that our algorithm is very different from the exponential histogram, and is (naturally) more complicated since we need to deal with both distinct elements and near-duplicates. For example, the exponential histogram algorithm in [16] partitions the sliding window deterministically into subwindows of size 1, 2, 4, .... If we are only interested in the representative point of each group, we basically need to delete all the other points of each group in the sliding window, which will change the sizes of the subwindows. Handling near-duplicates adds another layer of difficulty to the algorithm design; we handle this by employing Algorithm 2 (which is a variant of the algorithm for the infinite window in Section 2.1) at each of the subwindows with different sample rates. The interplay between these components makes the overall algorithm involved.

Algorithm 3: Robust $\ell_0$-SAMPLING-SW
 1  $R_\ell \leftarrow 2^\ell$ for all $\ell = 0, 1, \ldots, L$
 2  for $\ell \leftarrow 0$ to $L$ do
        /* create an algorithm instance according to Algorithm 2 with fixed sample rate $1/R_\ell$ */
 3      $\mathrm{ALG}_\ell \leftarrow (\emptyset, \emptyset, \emptyset, R_\ell)$
 4  for each arriving point $p$ do
 5      for $\ell \leftarrow L$ downto 0 do
 6          $\mathrm{ALG}_\ell(p)$   /* feed $p$ to the instance $\mathrm{ALG}_\ell$ */
 7          if $(u, p) \in A_\ell$ for some $u$ then
                /* prune all subsequent levels */
 8              for $j \leftarrow \ell - 1$ downto 0 do
 9                  $\mathrm{ALG}_j \leftarrow (\emptyset, \emptyset, \emptyset, R_j)$
10              if $|S_\ell^{acc}| > \kappa_0 \log m$ then
11                  $j \leftarrow \ell$
12                  create a temporary instance $\mathrm{ALG}$
13                  while $|S_j^{acc}| > \kappa_0 \log m$ do
14                      $(\mathrm{ALG}, \mathrm{ALG}_j) \leftarrow \mathrm{SPLIT}(\mathrm{ALG}_j)$
15                      $\mathrm{ALG}_{j+1} \leftarrow \mathrm{MERGE}(\mathrm{ALG}, \mathrm{ALG}_{j+1})$
16                      $j \leftarrow j + 1$
17                      if $j > L$ then return error
18              break
    /* at the time of query: */
19  $S \leftarrow \emptyset$
20  Let $c$ be the maximum index such that $S_c^{acc} \neq \emptyset$
21  for $\ell \leftarrow 1$ to $c$ do
22      include each point in the set $\{p \mid (\cdot, p) \in A_\ell\}$ in $S$ with probability $R_\ell / R_c$
23  return a random point from $S$

The Algorithm. We present our sliding window algorithm in Algorithm 3, using Algorithm 4 and Algorithm 5 as subroutines. We use $\mathrm{ALG}$ to represent an instance of Algorithm 2. For convenience, we also use $\mathrm{ALG}$ to represent all the data structures that are maintained during the run of Algorithm 2, and write $\mathrm{ALG} = (S^{acc}, S^{rej}, A, R)$, where $S^{acc}$ and $S^{rej}$ are the accept and reject sets respectively, $A$ is the key-value pair store, and $R$ is the reciprocal of the sample rate.
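The query step of Algorithm 3 (Lines 19 to 23) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Level` and `query_sample` are hypothetical names, each level is assumed to already hold its accept set and key-value store, and only pairs whose key lies in the accept set are considered. The loop also covers level 0, which is an assumption made here so that a non-empty accept set at level 0 alone still yields a sample.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Level:
    """Hypothetical container mirroring ALG = (S_acc, S_rej, A, R)."""
    R: int                                   # reciprocal of the sample rate
    acc: set = field(default_factory=set)    # accept set S_acc
    rej: set = field(default_factory=set)    # reject set S_rej (not used at query time)
    A: dict = field(default_factory=dict)    # representative -> latest point of its group


def query_sample(levels, rng=random):
    """Resample every level down to rate 1/R_c, then return one random point (or None)."""
    nonempty = [i for i, lvl in enumerate(levels) if lvl.acc]
    if not nonempty:
        return None                          # no sampled group in the window
    c = max(nonempty)                        # largest level with a non-empty accept set
    R_c = levels[c].R
    S = []
    for lvl in levels[: c + 1]:
        keep = lvl.R / R_c                   # R_l / R_c, which equals 1 at level c
        for rep in lvl.acc:
            if rng.random() < keep:          # assumption: A holds an entry for every accepted rep
                S.append(lvl.A[rep])
    return rng.choice(S) if S else None
```

A group accepted at level $\ell$ with probability $1/R_\ell$ survives the resampling with probability $R_\ell / R_c$, so every group ends up in $S$ with the same overall probability $1/R_c$.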

Algorithm 4: SPLIT($\mathrm{ALG}_\ell$)
1  create instances $\mathrm{ALG}_a = (S_a^{acc}, S_a^{rej}, A_a, R_a)$ and $\mathrm{ALG}_b = (S_b^{acc}, S_b^{rej}, A_b, R_b)$
2  $t = \max\{t' \mid (p_{t'} \in S_\ell^{acc}) \wedge (h_{R_{\ell+1}}(\mathrm{CELL}(p_{t'})) = 0)\}$
3  $S_a^{acc} \leftarrow \{p_k \in S_\ell^{acc} \mid (k \le t) \wedge (h_{R_{\ell+1}}(\mathrm{CELL}(p_k)) = 0)\}$;
   $S_a^{rej} \leftarrow \{p_k \in S_\ell^{rej} \mid (k \le t) \wedge (h_{R_{\ell+1}}(\mathrm{CELL}(p_k)) = 0)\}$;
   $A_a \leftarrow \{(u, \cdot) \in A_\ell \mid u \in S_\ell^{acc}\}$; $R_a \leftarrow R_{\ell+1}$
4  $S_b^{acc} \leftarrow \{p_k \in S_\ell^{acc} \mid k > t\}$; $S_b^{rej} \leftarrow \{p_k \in S_\ell^{rej} \mid k > t\}$;
   $A_b \leftarrow \{(u, \cdot) \in A_\ell \mid u \in S_\ell^{acc}\}$; $R_b \leftarrow R_\ell$
5  return $(\mathrm{ALG}_a, \mathrm{ALG}_b)$

Algorithm 5: MERGE($\mathrm{ALG}_a$, $\mathrm{ALG}_b$)
1  create a temporary instance $\mathrm{ALG} = (S^{acc}, S^{rej}, A, R)$
2  $S^{acc} \leftarrow S_a^{acc} \cup S_b^{acc}$; $S^{rej} \leftarrow S_a^{rej} \cup S_b^{rej}$; $A \leftarrow A_a \cup A_b$; $R \leftarrow R_a$
3  return $\mathrm{ALG}$

Set $R_\ell = 2^\ell$ for $\ell = 0, 1, \ldots, L = \log w$. In Algorithm 3 we create one instance of Algorithm 2 for each level $\ell = 0, 1, \ldots, L$, with empty $S^{acc}$, $S^{rej}$, $A$ (denoted by $\emptyset$) and sample rate $1/R_\ell$. We call the instance with $R_\ell = 2^\ell$ the level $\ell$ instance.

When a new point $p$ comes, we first find the highest level $\ell$ such that $p$ is sampled by $\mathrm{ALG}_\ell$ (i.e., $p \in S_\ell^{acc}$), and then delete all the structures of $\mathrm{ALG}_j$ ($j < \ell$), keeping only their sample rates $1/R_j$ (Line 5 to Line 9).

If, after including $p$, the size of $S_\ell^{acc}$ becomes more than $\kappa_0 \log m$, we have to do a series of updates to restore the invariant that the accept set of each level contains at most $\kappa_0 \log m$ points at any time step (Line 10 to Line 16). To do this, we first split the instance $\mathrm{ALG}_\ell$ into two instances (Algorithm 4). Let $p_t$ be the last point in $S_\ell^{acc}$ which is sampled by the hash function $h_{R_{\ell+1}}$. We promote all the points in $S_\ell^{acc} \cup S_\ell^{rej}$ arriving before (and including) $p_t$ to level $\ell + 1$ by resampling them using $h_{R_{\ell+1}}$, which gives a new level $\ell + 1$ instance $\mathrm{ALG}$. We next merge $\mathrm{ALG}$ with $\mathrm{ALG}_{\ell+1}$, which has the same sample rate, by merging their accept/reject sets and their sets of key-value pairs (Algorithm 5). The merge may result in $|S_{\ell+1}^{acc}| > \kappa_0 \log m$, in which case we have to perform the split and merge again. These operations may propagate to the upper levels until we reach a level in which the accept set has at most $\kappa_0 \log m$ points after the merge.

At the time of query, we have to integrate the maintained samples in all the levels.
Since at each level $\ell$ we sample points using a different sample rate $1/R_\ell$, we have to resample each point in $S_\ell^{acc}$ with probability $R_\ell / R_c$, where $c$ is the largest level that has a non-empty accept set (Line 20 to Line 22).

Correctness and Complexity. In this section we prove the following theorem.

THEOREM 2.7. In constant dimensional Euclidean space, for a well-separated dataset, there exists a sliding window algorithm (Algorithm 3) that, with probability $1 - 1/m$, at any time step outputs a robust $\ell_0$-sample. The algorithm uses $O(\log w \log m)$ words of space and $O(\log w \log m)$ amortized processing time per point.

Figure 3: An illustration of subwindows of a sliding window; the subwindow at level 0 is an empty set.

First, it is easy to show that the probability that Algorithm 3 outputs error is negligible.

LEMMA 2.8. With probability $1 - 1/m^2$, Algorithm 3 will not output error at Line 17 during the whole streaming process.

PROOF. Fix a sliding window $W$. Let $G_1, \ldots, G_k$ ($k \le w$) be the groups in $W$. The sample rate at level $L$ is $1/R_L = 1/2^L = 1/w$. Let $X_\ell$ be a random variable such that $X_\ell = 1$ if the $\ell$-th group is sampled, and $X_\ell = 0$ otherwise. Let $X = \sum_{\ell=1}^{k} X_\ell$. Thus $E[X] = k \cdot 1/R_L \le w \cdot 1/w = 1$. By a Chernoff bound (Lemma 1.8), with probability $1 - 1/m^3$ we have $X \le \kappa_0 \log m$ for a large enough constant $\kappa_0$. The lemma then follows by a union bound over at most $m$ sampling steps.

The following definition is useful for the analysis.

Definition 2.9 (subwindows). For a fixed sliding window $W$, we define a subwindow $W_\ell$ for each instance $\mathrm{ALG}_\ell$ ($\ell = 0, 1, \ldots, L$) as follows. $W_L$ starts from the first point in the sliding window and ends at the last point (denoted by $p_{t_L}$) in $A(S_L^{acc})$. For $\ell = L - 1, \ldots, 1$, $W_\ell$ starts from $p_{t_{\ell+1}+1}$ and ends at the last point (denoted by $p_{t_\ell}$) in $A(S_\ell^{acc})$. $W_0$ starts from $p_{t_1+1}$ and ends at the last point in the window $W$.

See Figure 3 for an illustration of subwindows. We note that a subwindow can be empty. We also note the following immediate facts, which follow from the definition of subwindows.

FACT 2. $W_0 \cup W_1 \cup \cdots \cup W_L$ covers the whole window $W$.

FACT 3. Each subwindow $W_\ell$ ($\ell = 1, \ldots, L$) ends with a point in $A(S_\ell^{acc})$.

For $\ell = 0, 1, \ldots, L$, let $\mathcal{G}_\ell$ be the set of groups whose last points lie in $W_\ell$, and let $S_\ell^{rep}$ be the set of their representative points. From Algorithms 3, 4 and 5 it is easy to see that the following is maintained during the whole streaming process.

FACT 4. During the run of Algorithm 3, at any time step, $S_\ell^{acc}$ is formed by sampling each point in $S_\ell^{rep}$ with probability $1/R_\ell$.

The following lemma guarantees that at the time of query we can always output a sample.

LEMMA 2.10. During the run of Algorithm 3, at any time step, if the sliding window contains at least one point, then when querying we can always return a sample, i.e., $|S| \ge 1$.

PROOF. The lemma follows from Fact 3 and the fact that $\mathrm{ALG}_0$ includes every point in $S_0^{rep}$ ($R_0 = 1$).

Now we are ready to prove the theorem.

PROOF (FOR THEOREM 2.7). We have the following facts:
(1) $S_0^{rep}, S_1^{rep}, \ldots, S_L^{rep}$ are sets of representatives of disjoint sets of groups $\mathcal{G}_0, \mathcal{G}_1, \ldots, \mathcal{G}_L$, and $\bigcup_{\ell=0}^{L} \mathcal{G}_\ell$ is the set of all groups whose last points are inside the sliding window.
(2) By Fact 4, each $S_\ell^{acc}$ is formed by sampling each point in $S_\ell^{rep}$ with probability $1/R_\ell$.
(3) Each point in $S_\ell^{rep}$ is included in $S$ with probability $R_\ell / R_c$ (Line 22 of Algorithm 3).
(4) By Lemma 2.10, $|S| \ge 1$ if the sliding window is not empty.
(5) The final sample returned is a random sample of $S$.
(6) By Lemma 2.8, with probability $(1 - 1/m)$ the algorithm will not output error.
By the first three items we know that $S$ is a random sample of the last points of all groups within the sliding window, which, combined with Items 4, 5 and 6, gives the correctness of the theorem.

We now analyze the space and time complexities. The space usage at each level can be bounded by $O(\log m)$ words. This is due to the fact that $|S_j^{acc}|$ is always at most $\kappa_0 \log m$, and consequently $A_j$ has $O(\log m)$ key-value pairs. Using Lemma 2.6 we have that with probability $1 - 1/m^2$, $|S_j^{rej}| = O(\log m)$ (footnote 1). Thus by a union bound, with probability $(1 - O(\log w / m^2))$, the total space is bounded by $O(\log w \log m)$ words since we have $O(\log w)$ levels.

For the time cost, simply note that the time spent on each point at each level during the whole streaming process can be bounded by $O(\log m)$, and thus the amortized processing time per item can be bounded by $O(\log w \log m)$.

2.3 Discussions

We conclude the section with some discussions and easy extensions.

Sampling k Points with/without Replacement. Sampling $k$ groups with replacement can be trivially achieved by running $k$ instances of the algorithm for sampling one group (Algorithm 1 or Algorithm 3) in parallel.
For sampling $k$ groups without replacement, we can increase the threshold at Line 10 of Algorithm 1 to $\kappa_0 k \log m$, by which we can show, using exactly the same analysis as in Section 2.1, that with probability $1 - 1/m$ we have $|S^{acc}| \ge k$. Similarly, for the sliding window case we can increase the threshold at Line 10 of Algorithm 3 to $\kappa_0 k \log m$.

Random Point As Group Representative. We can easily augment our algorithms so that, instead of always returning the (fixed) representative point of a randomly sampled group, they return a random point of the group. In other words, we want to return each point $p \in G$ with equal probability $1/n_G$.

For the infinite window case we can simply plug the classical Reservoir sampling [35] into Algorithm 1. We can implement this as follows. For each group $G$ that has a point stored in $S^{acc} \cup S^{rej}$, we maintain a pair $e_G = (v, ct)$, where $ct$ is a counter of the number of points of this group and $v$ is the random representative point. At the beginning (when the first point $u$ of group $G$ comes) we set $e_G = (u, 1)$. When a new point $p$ is inserted, if there exists $u \in S^{acc}$ such that $d(u, p) \le \alpha$ (i.e., $u$ and $p$ are in the same group), we increment the counter $ct$ for group $G(u)$, and replace $v$ by $p$ (that is, reset $e_G = (p, ct)$) with probability $1/ct$.

For the sliding window case, we can just replace Reservoir sampling with a random sampling algorithm for sliding windows (e.g., the one in [8]).

Footnote 1: We can reduce the failure probability $1/m$ to $1/m^2$ by appropriately changing the constants in the proof.

3 GENERAL DATASETS

In this section we consider general datasets which may not be well-separated, and for which there is consequently no natural partition into groups. However, we show that Algorithm 1 still gives the following guarantee.

THEOREM 3.1. For a general dataset $S$ in constant dimensional Euclidean space, there exists a streaming algorithm (Algorithm 1) that, with probability $1 - 1/m$, at any time step outputs a point $q$ satisfying Equality (2), that is,
$$\forall p \in S, \quad \Pr[q \in \mathrm{Ball}(p, \alpha) \cap S] = \Theta\!\left(\frac{1}{F_0(S, \alpha)}\right),$$
where $\mathrm{Ball}(p, \alpha)$ is the ball centered at $p$ with radius $\alpha$.

Before proving the theorem, we first study group partitions generated by a greedy process.

Definition 3.2 (Greedy Partition). Given a dataset $S$, a greedy partition is generated by the following process: pick an arbitrary point $p \in S$, create a new group $G(p) \leftarrow \mathrm{Ball}(p, \alpha) \cap S$ and update $S \leftarrow S \setminus G(p)$; repeat this process until $S = \emptyset$.

LEMMA 3.3. Given a dataset $S$, let $n_{opt}$ be the number of groups in the minimum cardinality partition of $S$, and $n_{gdy}$ be the number of groups in an arbitrary greedy partition. We always have $n_{opt} = \Theta(n_{gdy})$.

PROOF. We first show $n_{gdy} \le n_{opt}$. Let $G(p_1), \ldots, G(p_{n_{gdy}})$ be the groups in the greedy partition in the order in which they were created, and let $H_1, \ldots, H_{n_{opt}}$ be the minimum partition. We prove the claim by induction. First, it is easy to see that $G(p_1)$ must cover the group containing $p_1$ in the minimum partition (w.l.o.g. denote that group by $H_1$). Suppose that $\bigcup_{j=1}^{i} G(p_j)$ covers $i$ groups $H_1, \ldots, H_i$ in the minimum partition, that is, $\bigcup_{j=1}^{i} H_j \subseteq \bigcup_{j=1}^{i} G(p_j)$. We can show that there must be a new group $H_{i+1}$ in the minimum partition such that $\bigcup_{j=1}^{i+1} H_j \subseteq \bigcup_{j=1}^{i+1} G(p_j)$, which gives $n_{gdy} \le n_{opt}$. The induction step follows from the following facts.
(1) $p_{i+1} \notin \bigcup_{j=1}^{i} G(p_j)$.
(2) $\mathrm{Ball}(p_{i+1}, \alpha) \subseteq \bigcup_{j=1}^{i+1} G(p_j)$.
(3) The diameter of each group in the minimum partition is at most $\alpha$.
Indeed, by (1) and the induction hypothesis we have $p_{i+1} \notin \bigcup_{j=1}^{i} H_j$. Let $H_{i+1}$ be the group containing $p_{i+1}$ in the minimum partition. Then by (2) and (3) we must have $H_{i+1} \subseteq \mathrm{Ball}(p_{i+1}, \alpha) \subseteq \bigcup_{j=1}^{i+1} G(p_j)$.
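The greedy process of Definition 3.2 can be sketched in Python as follows. This is an illustrative sketch only: `greedy_partition` is a hypothetical name, points are assumed to be coordinate tuples, the "arbitrary" pick is taken to be the first remaining point, and no attempt is made to avoid the quadratic running time.

```python
import math


def greedy_partition(points, alpha):
    """Repeatedly pick a remaining point p and carve out Ball(p, alpha) ∩ S as a new group."""
    remaining = list(points)
    groups = []
    while remaining:
        p = remaining[0]                                   # arbitrary choice: first remaining point
        group = [q for q in remaining if math.dist(p, q) <= alpha]
        groups.append((p, group))                          # p acts as the center of its group
        remaining = [q for q in remaining if math.dist(p, q) > alpha]
    return groups


pts = [(0.0, 0.0), (0.5, 0.0), (0.9, 0.0), (1.6, 0.0), (5.0, 0.0)]
in_stream_order = greedy_partition(pts, alpha=1.0)
center_first = greedy_partition([pts[2]] + pts[:2] + pts[3:], alpha=1.0)
```

On this toy input the number of groups depends on the order of the picks: processing the points in the listed order gives three groups, while picking (0.9, 0) first gives two.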

We next show $n_{opt} \le O(n_{gdy})$. This is not obvious since the diameter of a group in the greedy partition may be larger than $\alpha$ (but is at most $2\alpha$), while groups in the minimum partition have diameter at most $\alpha$. However, in constant dimensional Euclidean space, each group in a greedy group partition can intersect at most $O(1)$ groups in the minimum cardinality partition. We thus still have $n_{opt} \le O(n_{gdy})$.

Now we are ready to prove the theorem.

PROOF (FOR THEOREM 3.1). We can think of the group partition in Algorithm 1 as a greedy process. Let $(q_1, \ldots, q_z)$ be the sequence of points that are included in $S^{acc}$, in the order of their arrival in the stream. We can generate a greedy group partition on $\bigcup_{i=1}^{z} \mathrm{Ball}(q_i, \alpha)$ as follows: for $i = 1, \ldots, z$, create a new group $G(q_i) \leftarrow \mathrm{Ball}(q_i, \alpha) \cap S$ and update $S \leftarrow S \setminus G(q_i)$. Let $\mathcal{G}_{sub} = \{G(q_1), \ldots, G(q_z)\}$. We then apply the greedy partition process to the remaining points in $S$, again in the order of their arrival in the stream. Let $q_{z+1}, \ldots, q_{n_{gdy}}$ be the representative points of the remaining groups. Let $\mathcal{G} = \{G(q_1), \ldots, G(q_{n_{gdy}})\}$ be the final group partition of $S$. We have the following facts.
(1) Each group in $\mathcal{G}$ intersects $\Theta(1)$ grid cells in the grid $\mathbb{G}$.
(2) Each grid cell in $\mathbb{G}$ is sampled by the hash function $h_R$ with equal probability.
(3) $q_1, \ldots, q_z$ are the representative points of their groups in $\mathcal{G}_{sub}$.
(4) Algorithm 1 returns a sample chosen randomly from $q_1, \ldots, q_z$.
By items 1 and 2, each group in $\mathcal{G}$ is included in $\mathcal{G}_{sub}$ with probability $\Theta(|\mathcal{G}_{sub}| / |\mathcal{G}|)$. By items 3 and 4, Algorithm 1 returns a random group from $\mathcal{G}_{sub}$. Therefore each group $G \in \mathcal{G}$ is sampled by Algorithm 1 with probability $\Theta(1/n_{gdy}) = \Theta(1/n_{opt})$, where the last equality is due to Lemma 3.3.

Now for any $p \in S$, according to the greedy process and Algorithm 1, there must be some $q \in S$ such that $G(p) \subseteq \mathrm{Ball}(q, \alpha)$, and if $G(p)$ is sampled then $q$ is the sampled point. So the probability that $q$ is sampled is at least the probability that $G(p)$ is sampled.
Finally, note that if $p \in \mathrm{Ball}(q, \alpha)$ then we also have $q \in \mathrm{Ball}(p, \alpha)$. We thus have
$$\Pr[\exists q \in \mathrm{Ball}(p, \alpha) \text{ s.t. } q \text{ is sampled}] = \Omega(1/n_{opt}). \qquad (7)$$
On the other hand, in constant dimensional Euclidean space $\mathrm{Ball}(p, \alpha)$ can only intersect $O(1)$ groups in the greedy partition. We thus also have
$$\Pr[\exists q \in \mathrm{Ball}(p, \alpha) \text{ s.t. } q \text{ is sampled}] = O(1/n_{opt}). \qquad (8)$$
The theorem follows from (7) and (8).

It is easy to see that the above arguments can also be applied to the sliding window case with respect to Algorithm 3.

COROLLARY 3.4. For a general dataset in constant dimensional Euclidean space, there exists a sliding window algorithm (Algorithm 3) that, with probability $1 - 1/m$, at any time step outputs a point $q$ such that
$$\forall p \in S, \quad \Pr[q \in \mathrm{Ball}(p, \alpha)] = \Theta(1/n_{opt}),$$
where $S$ is the set of all the points in the sliding window, and $n_{opt}$ is the size of the minimum cardinality partition of $S$ with group radius $\alpha$.

4 HIGH DIMENSIONS

In this section we consider datasets in $d$-dimensional Euclidean space for general $d$. We show that Algorithm 1, with some small modifications, can handle $(\alpha, \beta)$-sparse datasets in $d$-dimensional Euclidean space with $\beta > d^{1.5} \alpha$ as well.

THEOREM 4.1. In the $d$-dimensional Euclidean space, for an $(\alpha, \beta)$-sparse dataset with $\beta > d^{1.5} \alpha$, there is a streaming algorithm that, with probability $1 - 1/m$, at any time step outputs a robust $\ell_0$-sample. The algorithm uses $O(d \log m)$ words of space and $O(d \log m)$ processing time per item.

REMARK 2. We can use Johnson-Lindenstrauss dimension reduction to weaken the sparsity assumption to $\beta \ge c_\alpha \log^{1.5} m \cdot \alpha$ for some large enough constant $c_\alpha$.

We place a random grid $\mathbb{G}$ with side length $d\alpha$. Since the dataset is $(\alpha, \beta)$-sparse with $\beta > d^{1.5} \alpha$, each grid cell can intersect at most one group. However, in the $d$-dimensional space a group can intersect $2^d$ grid cells in the worst case, which may make it difficult to maintain $S^{rej}$ in small space: in the worst case we would have $|S^{rej}| = \Omega(2^d)$ while $|S^{acc}|$ is still small.
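To get a feel for how the random placement of the grid affects the number of cells a group can touch, the following Python sketch estimates the average number of cells met by an axis-aligned box of side $3\alpha$ under a grid of side $d\alpha$. It is a simplified illustration under assumptions not made in the text: the neighborhood of a group is replaced by a bounding box, each axis is shifted independently and uniformly at random, $d \ge 3$, and the names `cells_hit` and `average_cells` are hypothetical.

```python
import math
import random


def cells_hit(d, alpha, rng):
    """Cells met by a box of side 3*alpha under a randomly shifted grid of side d*alpha."""
    side = d * alpha
    cells = 1
    for _ in range(d):
        lo = rng.uniform(0.0, side)      # offset of the box's lower face inside its cell
        if lo + 3 * alpha > side:        # the box crosses a cell boundary along this axis
            cells *= 2
    return cells


def average_cells(d, alpha, trials, rng):
    """Monte Carlo estimate of the expected number of cells met by the box."""
    return sum(cells_hit(d, alpha, rng) for _ in range(trials)) / trials


d = 10
estimate = average_cells(d, alpha=1.0, trials=20000, rng=random.Random(0))
exact = (1 + 3 / d) ** d                 # each axis is cut independently with probability 3/d
```

Under these assumptions each axis is cut with probability $3/d$, so the expected count is $(1 + 3/d)^d < e^3$, a constant, even though the worst case is $2^d$.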
Fortunately, in the following lemma we show that for any $p \in S^{rep}$, the probability that $p \in S^{rej}$ will not be too large compared with the probability that $p \in S^{acc}$.

LEMMA 4.2. For any fixed $p \in S^{rep}$, we have $\Pr[p \in S^{rej}] \le \kappa_1 \Pr[p \in S^{acc} \cup S^{rej}]$, where $\kappa_1 \in (0, 1)$ is a constant.

PROOF. For a group $G$, let $\mathrm{Ball}(G, \alpha) = \{p \mid d(p, G) \le \alpha\}$, where $d(p, G) = \min_{q \in G} d(p, q)$. It is easy to see that $\mathrm{Ball}(G, \alpha)$ has diameter at most $3\alpha$ because the diameter of $G$ is at most $\alpha$. Since the random grid has side length $d\alpha$, the probability that $\mathrm{Ball}(G, \alpha)$ is cut by the boundaries of cells in each dimension is at most $\mu = 3/d$. If $\mathrm{Ball}(G, \alpha)$ is cut in $i$ dimensions, the number of cells it intersects is at most $2^i$, and consequently $|\mathrm{ADJ}(p)| \le 2^i$ for each $p \in G$. Recall that each cell is sampled with probability $1/R$; we thus have
$$\Pr[p \in S^{rej} \cup S^{acc}] = \sum_{i \ge 1} \Pr[p \in S^{rej} \cup S^{acc} \mid |\mathrm{ADJ}(p)| = i] \, \Pr[|\mathrm{ADJ}(p)| = i] \le \sum_{i=0}^{d} \binom{d}{i} \mu^i (1 - \mu)^{d-i} \, \frac{2^i}{R} = \frac{(2\mu + 1 - \mu)^d}{R} \le \frac{(1 + 3/d)^d}{R} = O\!\left(\frac{1}{R}\right).$$
Since $S^{acc} \cap S^{rej} = \emptyset$, we have
$$\Pr[p \in S^{rej}] = \Pr[p \in S^{rej} \cup S^{acc}] - \Pr[p \in S^{acc}] \le \kappa_1 \Pr[p \in S^{acc} \cup S^{rej}]$$
for some constant $\kappa_1 \in (0, 1)$.

By Lemma 4.2, and basically the same analysis as that in Lemma 2.6, we can bound the space usage of Algorithm 1 by $O(d \log m)$ ($O(\log m)$
