An improved data stream summary: the count-min sketch and its applications


Journal of Algorithms 55 (2005)

Graham Cormode a,*,1, S. Muthukrishnan b,2

a Center for Discrete Mathematics and Computer Science (DIMACS), Rutgers University, Piscataway, NJ, USA
b Division of Computer and Information Systems, Rutgers University and AT&T Research, USA

Received 6 June 2003
Available online 4 February 2004

Abstract

We introduce a new sublinear space data structure, the count-min sketch, for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known, typically from 1/ε^2 to 1/ε in factor. © 2003 Elsevier Inc. All rights reserved.

1. Introduction

We consider a vector a, which is presented in an implicit, incremental fashion. This vector has dimension n, and its current state at time t is a(t) = [a_1(t), ..., a_i(t), ..., a_n(t)]. Initially, a is the zero vector, a_i(0) = 0 for all i. Updates to individual entries of the vector are presented as a stream of pairs. The t-th update is (i_t, c_t), meaning that a_{i_t}(t) = a_{i_t}(t−1) + c_t, and a_i(t) = a_i(t−1) for all i ≠ i_t. At any time t, a query calls for computing certain functions of interest on a(t). This setup is the data stream scenario that has emerged recently. Algorithms for computing functions within the data stream context need to satisfy the following desiderata.

* Corresponding author.
E-mail addresses: graham@dimacs.rutgers.edu (G. Cormode), muthu@cs.rutgers.edu (S. Muthukrishnan).
1 Supported by NSF ITR and NSF EIA grants.
2 Supported by NSF CCR, NSF ITR and NSF EIA grants.

First, the space used by the algorithm should be small, at most polylogarithmic in n, the space required to represent a explicitly. Since the space is sublinear in data and input size, the data structure used by the algorithms to represent the input data stream is merely a summary, also known as a sketch or synopsis [17], of it; because of this compression, almost no function that one needs to compute on a can be computed precisely, so some approximation is provably needed.

Second, processing an update should be fast and simple; likewise, answering queries of a given type should be fast and have usable accuracy guarantees. Typically, accuracy guarantees will be made in terms of a pair of user-specified parameters, ε and δ, meaning that the error in answering a query is within a factor of ε with probability 1 − δ. The space and update time will consequently depend on ε and δ; our goal will be to limit this dependence as much as is possible.

Many applications that deal with massive data, such as Internet traffic analysis and monitoring contents of massive databases, motivate this one-pass data stream setup. There has been a frenzy of activity recently in the Algorithms, Database and Networking communities on such data stream problems, with multiple surveys, tutorials, workshops and research papers. See [3,12,28] for a detailed description of the motivations driving this area.

In recent years, several different sketches have been proposed in the data stream context that allow a number of simple aggregation functions to be approximated. Quantities for which efficient sketches have been designed include the L_1 and L_2 norms of vectors [2,14,23], the number of distinct items in a sequence (i.e., the number of non-zero entries in a(t)) [6,15,18], join and self-join sizes of relations (representable as inner products of vectors a(t), b(t)) [1,2], and item and range sum queries [5,20].
These sketches are of interest not simply because they can be used to directly approximate quantities of interest, but also because they have been used considerably as black box devices in order to compute more sophisticated aggregates and complex quantities: quantiles [21], wavelets [20], histograms [19,29], database aggregates and multi-way join sizes [10], etc. Sketches thus far designed are typically linear functions of their input, and can be represented as projections of an underlying vector representing the data with certain randomly chosen projection matrices. This means that it is easy to compute certain functions on data that is distributed over sites, by casting them as computations on their sketches. So, they are suited for distributed applications too.

While sketches have proved powerful, they have the following drawbacks.

Although sketches use small space, the space used typically has a Ω(1/ε^2) multiplicative factor. This is discouraging because ε = 0.1 or 0.01 is quite reasonable and already this factor proves expensive in space, and consequently, often, in per-update processing and function computation times as well.

Many sketch constructions require time linear in the size of the sketch to process each update to the underlying data [2,21]. Sketches are typically a few kilobytes up to a megabyte or so, and processing this much data for every update severely limits the update speed.

Sketches are typically constructed using hash functions with strong independence guarantees, such as p-wise independence [2], which can be complicated to evaluate, particularly for a hardware implementation. One of the fundamental questions is to what extent such sophisticated independence properties are needed.

Many sketches described in the literature are good for one single, pre-specified aggregate computation. Given that in data stream applications one typically monitors multiple aggregates on the same stream, this calls for using many different types of sketches, which is a prohibitive overhead.

Known analyses of sketches hide large multiplicative constants inside big-oh notation.

Given that the area of data streams is being motivated by extremely high performance monitoring applications (e.g., see [12] for response time requirements for data stream algorithms that monitor IP packet streams), these drawbacks ultimately limit the use of many known data stream algorithms within suitable applications.

We will address all these issues by proposing a new sketch construction, which we call the count-min, or CM, sketch. This sketch has the advantages that: (1) the space used is proportional to 1/ε; (2) the update time is significantly sublinear in the size of the sketch; (3) it requires only pairwise independent hash functions that are simple to construct; (4) this sketch can be used for several different queries and multiple applications; and (5) all the constants are made explicit and are small. Thus, for the applications we discuss, our constructions strictly improve the space bounds of previous results from 1/ε^2 to 1/ε and the time bounds from 1/ε^2 to 1, which is significant.

Recently, a Ω(1/ε^2) space lower bound was shown for a number of data stream problems: approximating frequency moments F_k(t) = Σ_i (a_i(t))^k, estimating the number of distinct items, and computing the Hamming distance between two strings [30].^3 It is an interesting contrast that for a number of similar-seeming problems (finding heavy hitters and quantiles in the most general data stream model) we are able to give an O(1/ε) upper bound.
Conceptually, the CM sketch also represents progress since it shows that pairwise independent hash functions suffice for many of the fundamental data stream applications. From a technical point of view, the CM sketch and its analyses are quite simple. We believe that this approach moves some of the fundamental data stream algorithms from the theoretical realm to the practical.

Our results have some technical nuances: The accuracy estimates for individual queries depend on the L_1 norm of a(t), in contrast to the previous works that depend on the L_2 norm. This is a consequence of working with simple counts. The resulting estimates are often not as tight on individual queries, since the L_2 norm is never greater than the L_1 norm. But nevertheless, our estimates for individual queries suffice to give improved bounds for the applications here, where it is desired to state results in terms of L_1.

3 This bound has virtually been met for distinct items by results in [4], where clever use of hashing improves previous bounds of O((log n/ε^2) log(1/δ)) to Õ((1/ε^2 + log n) log(1/δ)).

Most prior sketch constructions relied on embedding into small dimensions to estimate norms. For example, [2] relies on an embedding inspired by the Johnson-Lindenstrauss lemma [24] for estimating L_2 norms. But accurate estimation of the L_2 norm of a stream requires Ω(1/ε^2) space [30]. Currently, all data stream algorithms that rely on such methods to estimate L_p norms use Ω(1/ε^2) space. One of the observations that underlie our work is that while embedding into small space is needed for small space algorithms, it is not necessary that the methods accurately estimate the L_2, or in fact any L_p, norm for most queries and applications in data streams. Our CM sketch does not help estimate the L_2 norm of the input; however, it accurately estimates the queries that are needed, which suffices for our data stream applications.

Most data stream algorithm analyses thus far have followed the outline from [2], where one uses Chebyshev and Chernoff bounds in succession to boost the probability of success as well as the accuracy. This process contributes to the complexity bounds. Our analysis is simpler, relying only on the Markov inequality. Perhaps surprisingly, in this way we get tighter, cleaner bounds.

The remainder of this paper is as follows: in Section 2 we discuss the queries of our interest. We describe our count-min sketch construction and how it answers queries of interest in Sections 3 and 4 respectively, and apply it to a number of problems to improve the best known complexity in Section 5. In each case, we state our bounds and directly compare them with the best known previous results. All previously known sketches have many similarities. Our CM sketch lies in the same framework, and finds inspiration from these previous sketches. Section 6 compares our results to past work, and shows how all relevant sketches can be compared in terms of a small number of parameters.
This should prove useful to readers in contrasting the vast number of results that have emerged recently in this area. Conclusions are in Section 7.

2. Preliminaries

We consider a vector a, which is presented in an implicit, incremental fashion. This vector has dimension n, and its current state at time t is a(t) = [a_1(t), ..., a_i(t), ..., a_n(t)]. For convenience, we shall usually drop t and refer only to the current state of the vector. Initially, a is the zero vector, 0, so a_i(0) = 0 for all i. Updates to individual entries of the vector are presented as a stream of pairs. The t-th update is (i_t, c_t), meaning that a_{i_t}(t) = a_{i_t}(t−1) + c_t, and a_i(t) = a_i(t−1) for all i ≠ i_t.

In some cases, the c_t's will be strictly positive, meaning that entries only increase; in other cases, the c_t's are allowed to be negative also. The former is known as the cash register case and the latter the turnstile case [28]. There are two important variations of the turnstile case to consider: whether the a_i's may become negative, or whether the application generating the updates guarantees that this will never be the case. We refer to the first of these as the general case, and the second as the non-negative case. Many applications that use

sketches to compute queries of interest, such as monitoring database contents or analyzing IP traffic seen in a network link, guarantee that counts will never be negative. However, the general case occurs in important scenarios too, for example in distributed settings where one considers the subtraction of one vector from another, say.

At any time t, a query calls for computing certain functions of interest on a(t). We focus on approximating answers to three types of query based on vectors a and b:

A point query, denoted Q(i), is to return an approximation of a_i.

A range query Q(l, r) is to return an approximation of Σ_{i=l}^{r} a_i.

An inner product query, denoted Q(a, b), is to approximate a · b = Σ_{i=1}^{n} a_i b_i.

These queries are related: a range query is a sum of point queries; both point and range queries are specific inner product queries. However, in terms of approximations to these queries, results will vary. These are the queries that are fundamental to many applications in data stream algorithms, and have been extensively studied. In addition, they are of interest in a non-data stream context. For example, in databases, the point and range queries are of interest in summarizing the data distribution approximately; and inner product queries allow approximation of the join size of relations. Fuller discussion of these aspects can be found in [16,28].

We will also study the use of these queries to compute more complex functions on data streams. As examples, we will focus on the two following problems. Recall that ||a||_1 = Σ_{i=1}^{n} |a_i(t)|; more generally, ||a||_p = (Σ_{i=1}^{n} |a_i(t)|^p)^{1/p}.

(φ-quantiles) The φ-quantiles of the cardinality-||a||_1 multiset of (integer) values each in the range 1...n consist of those items with rank kφ||a||_1 for k = 0...1/φ after sorting the values. Approximation comes by accepting any integer that is between the item with rank (kφ − ε)||a||_1 and the one with rank (kφ + ε)||a||_1 for some specified ε < φ.
(heavy hitters) The φ-heavy hitters of a multiset of ||a||_1 (integer) values each in the range 1...n consist of those items whose multiplicity exceeds the fraction φ of the total cardinality, i.e., a_i ≥ φ||a||_1. There can be between 0 and 1/φ heavy hitters in any given sequence of items. Approximation comes by accepting any i such that a_i ≥ (φ − ε)||a||_1 for some specified ε < φ.

We will assume the RAM model, where each machine word can store integers up to max{||a||_1, n}. Standard word operations take constant time, so we count space in terms of the number of words and we count time in terms of the number of word operations. So, if one estimates our space and time bounds in terms of the number of bits instead, a multiplicative factor of log max{||a||_1, n} is needed.

Our goal is to solve the queries and the problems above using a sketch data structure, that is, using space and time significantly sublinear (polylogarithmic) in the input size n and ||a||_1. All our algorithms will be approximate and probabilistic; they need two parameters, ε and δ, meaning that the error in answering a query is within a factor of ε with probability 1 − δ. Both these parameters will affect the space and time needed by our solutions. Each of these queries and problems has a rich history of

work in the data stream area. We refer the readers to the surveys [3,28], tutorials [16], as well as the general literature.

3. Count-min sketches

We now introduce our data structure, the count-min, or CM, sketch. It is named after the two basic operations used to answer point queries: counting first and computing the minimum next. We use e to denote the base of the natural logarithm function, ln.

3.1. Data structure

A count-min (CM) sketch with parameters (ε, δ) is represented by a two-dimensional array of counts with width w and depth d: count[1, 1] ... count[d, w]. Given parameters (ε, δ), set w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. Each entry of the array is initially zero. Additionally, d hash functions h_1 ... h_d : {1...n} → {1...w} are chosen uniformly at random from a pairwise-independent family.

3.2. Update procedure

When an update (i_t, c_t) arrives, meaning that item a_{i_t} is updated by a quantity of c_t, then c_t is added to one count in each row; the counter is determined by h_j. Formally, for 1 ≤ j ≤ d, set

count[j, h_j(i_t)] ← count[j, h_j(i_t)] + c_t.

This procedure is illustrated in Fig. 1. The space used by count-min sketches is the array of wd counts, which takes wd words, and d hash functions, each of which can be stored using 2 words when using the pairwise functions described in [27].

Fig. 1. Each item i is mapped to one cell in each row of the array of counts: when an update of c_t to item i_t arrives, c_t is added to each of these cells.
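The data structure and update procedure above can be sketched in a few lines. The following is an illustrative sketch, not the authors' code: the class name is ours, and the pairwise-independent hash family is realized here with the standard (a·x + b) mod p construction over a large prime, with w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ as in the text.

```python
import math
import random

class CountMinSketch:
    """Illustrative CM sketch: a d x w array of counters plus d hash functions."""

    def __init__(self, epsilon, delta, seed=None):
        self.w = math.ceil(math.e / epsilon)        # width: ceil(e / epsilon)
        self.d = math.ceil(math.log(1.0 / delta))   # depth: ceil(ln(1 / delta))
        self.count = [[0] * self.w for _ in range(self.d)]
        rng = random.Random(seed)
        self.p = (1 << 61) - 1                      # a large (Mersenne) prime
        # each h_j(x) = ((a*x + b) mod p) mod w is pairwise independent
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(self.d)]

    def _bucket(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self.p) % self.w

    def update(self, i, c):
        """Process update (i, c): add c to one counter in each of the d rows."""
        for j in range(self.d):
            self.count[j][self._bucket(j, i)] += c
```

Note that every update touches exactly d = ⌈ln(1/δ)⌉ counters, one per row, which is where the sublinear update time comes from.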

4. Approximate query answering using CM sketches

For each of the three queries introduced in Section 2 (point, range, and inner product queries), we show how they can be answered using count-min sketches.

4.1. Point query

We first show the analysis for point queries for the non-negative case.

Estimation procedure. The answer to Q(i) is given by â_i = min_j count[j, h_j(i)].

Theorem 1. The estimate â_i has the following guarantees: a_i ≤ â_i; and, with probability at least 1 − δ, â_i ≤ a_i + ε||a||_1.

Proof. We introduce indicator variables I_{i,j,k}, which are 1 if (i ≠ k) ∧ (h_j(i) = h_j(k)), and 0 otherwise. By pairwise independence of the hash functions,

E(I_{i,j,k}) = Pr[h_j(i) = h_j(k)] ≤ 1/range(h_j) = ε/e.

Define the variable X_{i,j} (random over the choices of h_j) to be X_{i,j} = Σ_{k=1}^{n} I_{i,j,k} a_k. Since all a_i are non-negative in this case, X_{i,j} is a non-negative variable. By construction, count[j, h_j(i)] = a_i + X_{i,j}. So, clearly, min_j count[j, h_j(i)] ≥ a_i. For the other direction, observe that

E(X_{i,j}) = E(Σ_{k=1}^{n} I_{i,j,k} a_k) ≤ Σ_{k=1}^{n} a_k E(I_{i,j,k}) ≤ (ε/e)||a||_1

by pairwise independence of h_j, and linearity of expectation. By the Markov inequality,

Pr[â_i > a_i + ε||a||_1] = Pr[∀j. count[j, h_j(i)] > a_i + ε||a||_1]
= Pr[∀j. a_i + X_{i,j} > a_i + ε||a||_1] ≤ Pr[∀j. X_{i,j} > e·E(X_{i,j})] < e^{−d} ≤ δ.

The time to produce the estimate is O(ln(1/δ)), since finding the minimum count can be done in linear time; the same time bound holds for updates. The constant e is used here to minimize the space used: more generally, we can set w = ⌈b/ε⌉ and d = ⌈log_b(1/δ)⌉ for any b > 1 to get the same accuracy guarantee. Choosing b = e minimizes the space used, since this solves d(wd)/db = 0, giving a cost of (2 + e/ε) ln(1/δ) words. For implementations, it may be preferable to use other (integer) values of b for simpler computations or faster updates.
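The estimation procedure is a one-line minimum over the d counters hashed to by i. The following self-contained sketch illustrates it; the helper names (make_cm, update, point_query) are ours, and the (a·x + b) mod p hash construction is one choice of pairwise-independent family.

```python
import math
import random

def make_cm(epsilon, delta, seed=0):
    """Build an illustrative CM sketch: (counters, hash params, width, prime)."""
    rng = random.Random(seed)
    w, d = math.ceil(math.e / epsilon), math.ceil(math.log(1.0 / delta))
    p = (1 << 61) - 1
    hashes = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
    return [[0] * w for _ in range(d)], hashes, w, p

def update(cm, i, c):
    count, hashes, w, p = cm
    for j, (a, b) in enumerate(hashes):
        count[j][((a * i + b) % p) % w] += c

def point_query(cm, i):
    """Answer Q(i) as min_j count[j, h_j(i)]: never below a_i, and above
    a_i + epsilon * ||a||_1 with probability at most delta (Theorem 1)."""
    count, hashes, w, p = cm
    return min(count[j][((a * i + b) % p) % w]
               for j, (a, b) in enumerate(hashes))
```

For example, after update(cm, 3, 5) and update(cm, 3, 1), point_query(cm, 3) is at least 6, and equals 6 unless item 3 collides with other mass in every one of the d rows.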
Note that for values of a_i that are large relative to ||a||_1, the bound in terms of ε||a||_1 can be translated into a relative error in terms of a_i. This has implications for

certain applications which rely on retrieving large values, such as large wavelet or Fourier coefficients. The best known previous result using sketches was in [5]: there, sketches were used to approximate point queries. Results were stated in terms of the frequencies of individual items. For arbitrary distributions, the space used is O((1/ε^2) log(1/δ)), and the dependency on ε is 1/ε^2 in every case considered.

A significant difference between CM sketches and previous work comes in the analysis: All prior analyses of sketch structures compute the variance of their estimators in order to apply the Chebyshev inequality, which brings the dependency on ε^2. Directly applying the Markov inequality yields a more direct analysis which depends only on ε. Practitioners may have discovered that less than O(1/ε^2) space is needed in practice: here, we give proof of why this is so, and the tighter bound.

Because only positive quantities are added to the counters, it is possible to take the minimum instead of the median for the estimate. This allows a simple calculation of the failure probability, without recourse to Chernoff bounds. This significantly improves the constants involved: in [5], for example, the constant factor within the big-oh notation is at least 256; here, the constant factor is less than 3.

The error bound here is one-sided, as opposed to all previous constructions which gave two-sided errors. This brings benefits for many applications which use sketches. In Section 6 we show how all existing sketch constructions can be viewed as variations of a common procedure. This emphasizes the importance of our attempt to find the simplest sketch construction which has the best guarantees and smallest constants.

A similar result holds when entries of the implicit vector a may be negative, which is the general case.
Estimation procedure for the general case. This time Q(i) is answered with â_i = median_j count[j, h_j(i)].

Theorem 2. With probability 1 − δ^{1/4}, a_i − 3ε||a||_1 ≤ â_i ≤ a_i + 3ε||a||_1.

Proof. Observe that E(|count[j, h_j(i)] − a_i|) ≤ ε||a||_1/e, and so the probability that any count is off by more than 3ε||a||_1 is less than 1/8. Applying Chernoff bounds tells us that the probability of the median of ln(1/δ) copies of this procedure being wrong is less than δ^{1/4}.

The time to produce the estimate is O(ln(1/δ)) and the space used is (2 + e/ε) ln(1/δ) words. The best prior result for this problem was the method of [5]. Again, the dependence on ε here is improved from 1/ε^2 to 1/ε. By avoiding analyzing the variance of the estimator, again the analysis is simplified, and the constants are significantly smaller than in previous works.
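The contrast between the two estimators can be seen on a hand-made example. The per-row counters below are hypothetical (chosen by us to simulate collisions under negative updates); they are not output of the sketch itself.

```python
import statistics

def estimate_nonnegative(row_counts):
    """Non-negative case: every row's counter only overestimates a_i,
    so the minimum over rows is the right choice (Theorem 1)."""
    return min(row_counts)

def estimate_general(row_counts):
    """General case: negative updates mean a row may over- or under-estimate,
    so the median over rows is used instead (Theorem 2)."""
    return statistics.median(row_counts)

# Hypothetical counters count[j, h_j(i)] for one item i with true a_i = 10:
# one row collided with +4 of other mass, another row with -3.
rows = [10, 14, 7, 10, 10]
print(estimate_nonnegative(rows))  # 7: the minimum is misled by the -3 collision
print(estimate_general(rows))      # 10: the median recovers the true value
```

Only a minority of rows suffer large collisions with high probability, so the median sits among the uncorrupted counters even when individual rows err in either direction.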

4.2. Inner product query

Estimation procedure. Set (â·b)_j = Σ_{k=1}^{w} count_a[j, k] · count_b[j, k]. Our estimate of Q(a, b) for non-negative vectors a and b is â·b = min_j (â·b)_j.

Theorem 3. a · b ≤ â·b and, with probability 1 − δ, â·b ≤ a · b + ε||a||_1 ||b||_1.

Proof.

(â·b)_j = Σ_{i=1}^{n} a_i b_i + Σ_{p≠q, h_j(p)=h_j(q)} a_p b_q.

Clearly, a · b ≤ (â·b)_j for non-negative vectors. By pairwise independence of h_j,

E((â·b)_j − a · b) = Σ_{p≠q} Pr[h_j(p) = h_j(q)] a_p b_q ≤ Σ_{p≠q} (ε a_p b_q)/e ≤ (ε/e)||a||_1 ||b||_1.

So, by the Markov inequality, Pr[â·b − a · b > ε||a||_1 ||b||_1] ≤ δ, as required.

The space and time to produce the estimate is O((1/ε) log(1/δ)). Updates are performed in time O(log(1/δ)).

Note that in the special case where b_i = 1, and b is zero at all other locations, this procedure is identical to the above procedure for point estimation, and gives the same error guarantee (since ||b||_1 = 1). A similar result holds in the general case, where vectors may have negative entries. Taking the median of the (â·b)_j would give a guaranteed good quality approximation; however, we do not know of any application which makes use of inner products of such vectors, so we do not give the full details here.

We next consider the application of inner product computation to join size estimation, where the vectors generated have non-negative entries. Join size estimation is important in database query planners in order to determine the best order in which to evaluate queries. The join size of two database relations on a particular attribute is the number of items in the cartesian product of the two relations which agree on the value of that attribute. We assume without loss of generality that attribute values in the relations are integers in the range 1...n. We represent the relations being joined as vectors a and b, so that the value a_i represents the number of tuples which have value i in the first relation, and b_i similarly for the second relation.
Then clearly a · b is the join size of the two relations. Using sketches allows estimates to be made in the presence of items being inserted to and deleted from relations. The following corollary follows from the above theorem.

Corollary 1. The join size of two relations on a particular attribute can be approximated up to ε||a||_1 ||b||_1 with probability 1 − δ, by keeping space O((1/ε) log(1/δ)).
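The inner product estimator of Theorem 3 can be sketched as below. The counter arrays here are hand-filled for illustration (and must in practice come from two sketches built with the same hash functions); the function name is ours.

```python
def inner_product_estimate(count_a, count_b):
    """Estimate a . b from two CM sketch counter arrays built with the SAME
    hash functions: take the dot product of corresponding counter rows, then
    the minimum over rows. For non-negative vectors this never underestimates
    a . b (Theorem 3)."""
    return min(sum(ca * cb for ca, cb in zip(row_a, row_b))
               for row_a, row_b in zip(count_a, count_b))

# Tiny hand-filled example with d = 2 rows and w = 4 counters per row:
# a's mass 3 and 2, and b's mass 1 and 4, land in matching buckets.
count_a = [[3, 0, 2, 0], [0, 3, 0, 2]]
count_b = [[1, 0, 4, 0], [0, 1, 0, 4]]
print(inner_product_estimate(count_a, count_b))  # 3*1 + 2*4 = 11 in each row
```

In the collision-free example above both rows give exactly a · b = 11; a row in which distinct items of a and b collide would only push its row estimate upward, which the outer minimum then discards.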

Previous results have used the tug-of-war sketches [1]. However, here some care is needed in the comparison of the two methods: the prior work gives guarantees in terms of the L_2 norm of the underlying vectors, with additive error of ε||a||_2 ||b||_2; here, the result is in terms of the L_1 norm. In some cases, the L_2 norm can be quadratically smaller than the L_1 norm. However, when the distribution of items is non-uniform, for example when certain items contribute a large amount to the join size, then the two norms are closer, and the guarantee of the CM sketch method is closer to the existing method. As before, the space cost of previous methods was Ω(1/ε^2), so there is a significant space saving to be had with CM sketches.

4.3. Range query

Define χ(l, r) to be the vector of dimension n such that χ(l, r)_i = 1 for l ≤ i ≤ r, and 0 otherwise. Then Q(l, r) can straightforwardly be re-posed as Q(a, χ(l, r)). However, this method has two drawbacks: first, the error guarantee is in terms of ||a||_1 ||χ(l, r)||_1, and therefore large range sums have an error guarantee which increases linearly with the length of the range; and second, the time cost to directly compute the sketch for χ(l, r) depends linearly on the length of the range, r − l + 1. In fact, it is clear that computing range sums in this way using our sketches is not much different from simply computing point queries for each item in the range, and summing the estimates. One way to avoid the time complexity is to use range-sum random variables from [20] to quickly determine a sketch of χ(l, r), but that is expensive and still does not overcome the first drawback. Instead, we adopt the use of dyadic ranges from [21]: a dyadic range is a range of the form [x2^y + 1 ... (x + 1)2^y] for parameters x and y.

Estimation procedure. Keep log_2 n CM sketches, in order to answer range queries Q(l, r) approximately.
Any range query can be reduced to at most 2 log_2 n dyadic range queries, which in turn can each be reduced to a single point query. Each point in the range [1...n] is a member of log_2 n dyadic ranges, one for each y in the range 0 ... log_2 n − 1. A sketch is kept for each set of dyadic ranges of length 2^y, and each of these is updated for every update that arrives. Then, given a range query Q(l, r), compute the at most 2 log_2 n dyadic ranges which canonically cover the range, and pose that many point queries to the sketches, returning the sum of the queries as the estimate.

Example 1. For n = 256, the range [48, 107] is canonically covered by the non-overlapping dyadic ranges [48...48], [49...64], [65...96], [97...104], [105...106], [107...107].

Let a[l, r] = Σ_{i=l}^{r} a_i be the answer to the query Q(l, r) and let â[l, r] be the estimate using the procedure above.

Theorem 4. a[l, r] ≤ â[l, r] and, with probability at least 1 − δ, â[l, r] ≤ a[l, r] + 2ε log n ||a||_1.

Proof. Applying the inequality of Theorem 1, a[l, r] ≤ â[l, r]. Consider the estimators used to form â[l, r]; the expectation of the total additive error is 2 log n (ε/e)||a||_1, by linearity of expectation of the errors of each point estimate. Applying the same Markov inequality argument as before, the probability that this additive error exceeds 2ε log n ||a||_1 for any given row of estimators is less than 1/e; hence, the probability that it does so for all rows is at most e^{−d} ≤ δ.

The time to compute the estimate or to make an update is O(log n log(1/δ)). The space used is O((log n/ε) log(1/δ)).

The above theorem states the bound for the standard CM sketch size. The guarantee will be more useful when stated without terms of log n in the approximation bound. This can be achieved by increasing the size of the sketch, which is equivalent to rescaling ε. In particular, if we want to estimate a range sum correct up to ε′||a||_1 with probability 1 − δ, then set ε = ε′/(2 log n). The space used is O((log^2 n/ε′) log(1/δ)).

An obvious improvement of this technique in practice is to keep exact counts for the first few levels of the hierarchy, where there are only a small number of dyadic ranges. This improves the space, time and accuracy of the algorithm in practice, although the asymptotic bounds are unaffected. For smaller ranges, ranges that are powers of 2, or more generally any range whose endpoints can be expressed in binary using a small number of 1s, improved bounds are possible; we have given the worst case bounds above.

One way to compute approximate range sums is via approximate quantiles: use an algorithm such as [22,25] to find the ε-quantiles of the stream, and then count how many quantiles fall within the range of interest to give an O(ε) approximation of the range query.
Such an approach has several disadvantages:

(1) Existing approximate quantile methods work in the cash register model, rather than the more general turnstile model that our solutions work in.

(2) The time cost to update the data structure can be high, sometimes linear in the size of the structure.

(3) Existing algorithms assume single items arriving one by one, so they do not handle fractional values or large values being added, which can be easily handled by sketch-based approaches.

(4) The worst case space bound depends on O((1/ε) log(||a||_1/ε)), which can grow indefinitely. The sketch based solution works in fixed space that is independent of ||a||_1.

The best previous bounds for this problem in the turnstile model are given in [21], where range queries are answered by keeping O(log n) sketches, each of size O((1/ε^2) log n log(log n/δ)), to give approximations with additive error ε||a||_1 with probability 1 − δ. Thus the space used there is O((log^2 n/ε^2) log(log n/δ)) and the time for updates is linear in the space used. The CM sketch improves the space and time bounds; it improves the constant factors as well as the asymptotic behavior. The time to process an update is significantly improved, since only a few entries in the sketch are modified, rather than a linear number.
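The dyadic decomposition behind the estimation procedure can be sketched as a short greedy routine that repeatedly takes the longest aligned dyadic range starting at the left endpoint; on the range of Example 1 it reproduces the canonical cover. The function name is ours.

```python
def dyadic_cover(l, r):
    """Decompose [l, r] (1-indexed, inclusive) into non-overlapping dyadic
    ranges of the form [x*2^y + 1, (x+1)*2^y], greedily taking the longest
    dyadic range that starts at l and fits inside [l, r]. For a universe of
    size n this yields at most 2*log2(n) ranges."""
    cover = []
    while l <= r:
        y = 0
        # double the range length while it stays aligned and inside [l, r]
        while (l - 1) % (1 << (y + 1)) == 0 and l + (1 << (y + 1)) - 1 <= r:
            y += 1
        cover.append((l, l + (1 << y) - 1))
        l += 1 << y
    return cover

print(dyadic_cover(48, 107))
# [(48, 48), (49, 64), (65, 96), (97, 104), (105, 106), (107, 107)]
```

Each returned range maps to a single point query against the sketch for its level y, and the range-sum estimate is the sum of those point estimates.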

5. Applications of count-min sketches

By using CM sketches, we show how to improve the best known time and space bounds for the two problems from Section 2.

5.1. Quantiles in the turnstile model

In [21] the authors showed that finding the approximate φ-quantiles of the data subject to insertions and deletions can be reduced to the problem of computing range sums. Put simply, the algorithm is to do binary searches for ranges 1...r whose range sum a[1, r] is kφ||a||_1 for 0 ≤ k ≤ 1/φ. The method of [21] uses random subset sums to compute range sums. By replacing this structure with count-min sketches, the improved results follow immediately. By keeping log n sketches, one for each dyadic range, and setting the accuracy parameter for each to be ε/log n and the probability guarantee to δφ/log n, the overall probability guarantee for all 1/φ quantiles is achieved.

Theorem 5. ε-approximate φ-quantiles can be found with probability at least 1 − δ by keeping a data structure with space O((1/ε) log^2 n log(log n/(φδ))). The time for each insert or delete operation is O(log n log(log n/(φδ))), and the time to find each quantile on demand is O(log n log(log n/(φδ))).

Choosing CM sketches over random subset sums improves both the query time and the update time from O((1/ε^2) log^2 n log(log n/(εδ))), by a factor of more than (34/ε^2) log n. The space requirements are also improved by a factor of at least 34/ε.

It is illustrative to contrast our bounds with those for the problem in the weaker cash register model, where items are only inserted (recall that in our stronger turnstile model, items are deleted as well). The previously best known space bounds for finding approximate quantiles are O((1/ε)(log^2(1/ε) + log^2 log(1/δ))) space for a randomized sampling solution [25] and O((1/ε) log(ε||a||_1)) space for a deterministic solution [22].
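The binary-search reduction from quantiles to range sums can be sketched as follows. For clarity, exact prefix sums stand in for the CM-sketch range-sum estimates a[1, r] (the real algorithm substitutes â[1, r]); the function name is ours.

```python
def phi_quantiles(a, phi):
    """Return the phi-quantiles of the multiset encoded by vector a, where
    a[i-1] is the count of value i, by binary-searching range sums a[1, r].
    Exact prefix sums stand in here for sketch estimates."""
    total = sum(a)                      # ||a||_1
    prefix = [0]                        # prefix[r] = a[1, r]
    for count in a:
        prefix.append(prefix[-1] + count)
    quantiles = []
    for k in range(1, int(1 / phi) + 1):
        target = k * phi * total        # rank of the k-th phi-quantile
        lo, hi = 1, len(a)              # find smallest r with a[1, r] >= target
        while lo < hi:
            mid = (lo + hi) // 2
            if prefix[mid] >= target:
                hi = mid
            else:
                lo = mid + 1
        quantiles.append(lo)
    return quantiles

# counts over values 1..4, each value appearing twice:
print(phi_quantiles([2, 2, 2, 2], 0.25))  # [1, 2, 3, 4]
```

With sketched range sums the binary search proceeds level by level down the dyadic hierarchy, which is where the log n factors in Theorem 5 come from.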
These bounds are not completely comparable, but our result is the first on the more powerful turnstile model to be comparable to the cash register model bounds in the leading 1/ε term.

5.2. Heavy hitters

We consider this problem in both the cash register model (meaning that all updates are positive) and the more challenging turnstile model (where updates are both positive and negative, with the restriction that the count of any item is never less than zero, i.e., a_i(t) ≥ 0).

Cash register case. It is possible to maintain the current value of ||a||_1 throughout, since ||a(t)||_1 = Σ_{i=1}^{t} c_i. On receiving item (i_t, c_t), update the sketch as before and pose the point query Q(i_t): if the estimate â_{i_t} is above the threshold of φ||a(t)||_1, then i_t is added to a heap. The heap is kept small by checking that the current estimated count for the item with lowest count is above

threshold; if not, it is deleted from the heap, as in [5]. At the end of the input, the heap is scanned, and all items in the heap whose estimated count is still above φ‖a‖₁ are output.

Theorem 6. The heavy hitters can be found from an inserts-only sequence of length ‖a‖₁, by using CM sketches with space O((1/ε) log(‖a‖₁/δ)), and time O(log(‖a‖₁/δ)) per item. Every item which occurs with count more than φ‖a‖₁ times is output, and with probability 1 − δ, no item whose count is less than (φ − ε)‖a‖₁ is output.

Proof. This procedure relies on the fact that the threshold value increases monotonically: therefore, if an item did not pass the threshold in the past, it cannot do so in the future without its count increasing. By checking the estimated value every time an item's value increases, no heavy hitters will be omitted. By the one-sided error guarantee of the sketches, every heavy hitter is included in the output, but there is some possibility of including non-heavy hitters in the output. To control this, the parameter δ is scaled to ensure that, over all ‖a‖₁ queries posed to the sketch, the probability of mistakenly outputting any infrequent item is bounded by δ, using the union bound.

We can compare our results to the best known previous work. The algorithm in [5] solves this problem using Count sketches in worst case space O((1/ε²) log(‖a‖₁/δ)), which we strictly improve here. A randomized algorithm given in [26] has expected space cost O((1/ε) log(1/(φδ))), slightly better than our worst case space usage. Meanwhile, a deterministic algorithm in the same paper solves the problem in worst case space O((1/ε) log(‖a‖₁/ε)). However, for both algorithms in [26] the time cost of processing each insertion can be high (Ω(1/ε)): periodically, there are operations with cost linear in the space used. For high speed applications, our worst case time bound may be preferable.
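The cash register procedure above (maintain ‖a‖₁, point-query each arrival, keep candidates in a small heap) can be sketched in code. The following is a minimal illustrative Python rendering, not the paper's implementation: the width, depth, linear hash family, and all class and function names are our own choices for the example.

```python
import heapq
import random

class CountMinSketch:
    """Minimal count-min sketch: d rows of w counters; each row uses a
    pairwise-independent linear hash modulo a Mersenne prime."""
    P = 2**31 - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.counts = [[0] * w for _ in range(d)]
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(d)]

    def _bucket(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self.P) % self.w

    def update(self, i, c=1):
        for j in range(self.d):
            self.counts[j][self._bucket(j, i)] += c

    def query(self, i):
        # Min over rows: never underestimates when all updates are positive.
        return min(self.counts[j][self._bucket(j, i)] for j in range(self.d))

def heavy_hitters(stream, phi, w=200, d=5):
    """One-pass heavy hitters in the cash-register model: track ||a||_1,
    point-query each arriving item, and keep candidates in a heap keyed
    by their estimated count at insertion time."""
    cm = CountMinSketch(w, d)
    total = 0
    heap = []          # entries are (estimated count when pushed, item)
    in_heap = set()
    for item, c in stream:
        cm.update(item, c)
        total += c
        est = cm.query(item)
        if est >= phi * total and item not in in_heap:
            heapq.heappush(heap, (est, item))
            in_heap.add(item)
        # Prune candidates whose stored estimate fell below the (growing)
        # threshold; refresh the key if the item is in fact still heavy.
        while heap and heap[0][0] < phi * total:
            _, old = heapq.heappop(heap)
            e = cm.query(old)
            if e >= phi * total:
                heapq.heappush(heap, (e, old))
            else:
                in_heap.discard(old)
    # Final scan: keep only items still estimated above the threshold.
    return {item for _, item in heap if cm.query(item) >= phi * total}
```

With w and d sized as in Theorem 6 (w = O(1/ε), d = O(log(‖a‖₁/δ))), every item of count above φ‖a‖₁ appears in the output, since the estimates never undercount.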
Turnstile case. We adopt the solution given in [8], which describes a divide and conquer procedure to find the heavy hitters. This keeps sketches for computing range sums: log n different sketches, one for each different dyadic range. When an update (i_t, c_t) arrives, each of these is updated as before. In order to find all the heavy hitters, a parallel binary search is performed, descending one level of the hierarchy at each step. Nodes in the hierarchy (corresponding to dyadic ranges) whose estimated weight exceeds the threshold of (φ + ε)‖a‖₁ are split into two ranges and investigated recursively. All single items found in this way whose approximate count exceeds the threshold are output.

We must limit the number of items output whose true frequency is less than the fraction φ. This is achieved by setting the probability of failure for each sketch to δφ/(2 log n). This is because, at each level, there are at most 1/φ items with frequency more than φ. At most twice this number of queries are made at each level, for all of the log n levels. Scaling δ like this and applying the union bound ensures that, over all the queries, the total probability that any one (or more) of them overestimates by more than a fraction ε is bounded by δ, and so the probability that every query succeeds is 1 − δ. It follows that

Theorem 7. The algorithm uses space O((1/ε) log n · log(2 log n/(δφ))) and time O(log n · log(2 log n/(δφ))) per update. Every item with frequency at least (φ + ε)‖a‖₁ is output, and with probability 1 − δ no item whose frequency is less than φ‖a‖₁ is output.

The previous best known bound appears in [8], where a non-adaptive group testing approach was described. Here, the space bounds agree asymptotically but have been improved in constant factors; a further improvement is in the nature of the guarantee: previous methods gave only probabilistic guarantees about outputting the heavy hitters. Here, there is absolute certainty that this procedure will find and output every heavy hitter, because CM sketches never underestimate counts, and strong guarantees are given that no non-heavy hitters will be output. This is often desirable. In some situations in practice, it is vital that updates are as fast as possible, and here update time can be traded off against search time: ranges based on powers of two can be replaced with an arbitrary branching factor k, which reduces the number of levels to log_k n, at the expense of costlier queries and weaker guarantees on outputting non-heavy hitters.

Hierarchical heavy hitters. A generalization of this problem is finding hierarchical heavy hitters [7], which assumes that the items are leaves in a hierarchy of depth h. Here the goal is to find all nodes in the hierarchy which are heavy hitters, after discounting the contribution of any descendant heavy hitter nodes. Using our CM sketch, the cost of the solution given in [7] for the turnstile model can be improved from O((h/ε²) log(1/δ)) space and time per update to O((h/ε) log(1/δ)) space and O(h log(1/δ)) time per update.

6. Comparison of sketch techniques

We give a common framework to summarize known sketch constructions, and compare the time and space requirements for each of the fundamental queries (point, range, and inner product) using them. Here is a brief summary of known sketch constructions. The first sketch construction was that of Alon, Matias and Szegedy [2], whose "tug-of-war" sketches are computed using 4-wise random hash functions g_j mapping items to {+1, −1}. The jth entry of the sketch, which is a vector of length O((1/ε²) log(1/δ)), is defined to be Σ_i a_i g_j(i), which is easy to maintain under updates. This structure was applied to finding inner products in [1,20], where, in our notation, it was shown that it is possible to compute inner products with additive error ±ε‖a‖₂‖b‖₂. In [20], the authors use these sketches to compute large wavelet coefficients. In particular, they show how the structure allows point queries to be computed up to additive error of ±ε‖a‖₂. They also compute range sums Q(l, r): here range-summable random variables are used to compute the sums efficiently, but this incurs a factor of O(log n) additional time and space. Also note that here the error guarantees are much worse for large ranges: ε(r − l + 1)‖a‖₁.

For point queries only, pairwise independence suffices for tug-of-war sketches, as observed by [5]. These authors additionally used a second set of hash functions, h_j, to spread out the effect of high frequency items in their count sketch. For point queries, this gives the same space bounds, but an improved time bound of O(log(1/δ)) for updating the

sketch.⁴ Random subset sums were introduced in [21] in order to compute point queries and range sums. Here, 2-universal hash functions⁵ h_j map items to {0, 1}, and the jth entry of the sketch is maintained as Σ_{i=1}^n a_i h_j(i).

The asymptotic space and time bounds for the different techniques, in terms of their dependence on ε and n, are summarized in Table 1.

Table 1
Comparison of different space and time requirements of sketching methods

  Method              Ref.          Query          Space      Update time  Query time  Randomness needed
  Tug-of-war          [1]           Inner product  1/ε²       1/ε²         1/ε²        4-wise
  Tug-of-war          [20]          Point          log n/ε²   log n/ε²     log n/ε²    4-wise
                                    Range          log n/ε²   log n/ε²     log n/ε²    4-wise
  Random subset-sums  [21]          Range          log²n/ε²   log²n/ε²     log²n/ε²    Pairwise
  Count sketches      [5]           Point          1/ε²       1            1           Pairwise
  Count-min sketches  (this paper)  Point          1/ε        1            1           Pairwise
                                    Inner product  1/ε        1            1/ε         Pairwise
                                    Range          log n/ε    log n        log n/ε     Pairwise

The dependency on ε and n is shown; a factor of O(log(1/δ)) applies to each entry and is omitted.

All of the above sketching techniques, and the one which we propose in this paper, can be described in a common way. Firstly, they can all be viewed as linear projections of the vector a with appropriately chosen random vectors, but more than this, the computation of these linear projections is very similar between all the methods. Define a sketch to be a two-dimensional array of dimension w by d. Let h_1 ... h_d be pairwise independent hash functions mapping from {1...n} to {1...w}, and let g_1 ... g_d be another set of hash functions whose range and randomness varies from construction to construction. The (j, k)th entry of the sketch is defined to be

  Σ_{i: h_k(i)=j} a_i g_k(i).

The contents of the sketch for each of the above techniques is specified by the parameters w, d, and g; methods of answering queries using the sketch vary from technique to technique.
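This common framework can be rendered schematically in Python. The sketch below is our own illustrative code, with simplifications: the O(log(1/δ)) repetition factor is fixed at a small constant, the hash families are simple linear hashes, and all names (GenericSketch, make_cm, make_count_sketch) are ours.

```python
import math
import random
import statistics

class GenericSketch:
    """The common framework: a w-by-d counter array where cell (j, k)
    holds the sum over {i : h_k(i) = j} of a_i * g_k(i). The choice of
    w, d, and g distinguishes the individual constructions."""
    P = 2**31 - 1  # prime modulus for the pairwise-independent h_k

    def __init__(self, w, d, make_g, seed=1):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.h = [(rng.randrange(1, self.P), rng.randrange(self.P))
                  for _ in range(d)]
        self.g = [make_g(rng) for _ in range(d)]

    def bucket(self, k, i):
        a, b = self.h[k]
        return ((a * i + b) % self.P) % self.w

    def update(self, i, c):
        # O(d) hash computations per update, as noted in the text.
        for k in range(self.d):
            self.table[k][self.bucket(k, i)] += c * self.g[k](i)

def make_cm(eps, d=5):
    """Count-min instantiation: w = e/eps, g(i) = 1; point query = min."""
    sk = GenericSketch(math.ceil(math.e / eps), d, lambda rng: (lambda i: 1))
    sk.point = lambda i: min(sk.table[k][sk.bucket(k, i)] for k in range(sk.d))
    return sk

def make_count_sketch(eps, d=5):
    """Count sketch instantiation: w = O(1/eps^2), g(i) in {+1, -1};
    point query = median of g_k(i) * cell over the d rows."""
    def make_g(rng):
        a, b = rng.randrange(1, GenericSketch.P), rng.randrange(GenericSketch.P)
        return lambda i: 1 if ((a * i + b) % GenericSketch.P) & 1 else -1
    sk = GenericSketch(math.ceil(1 / eps ** 2), d, make_g)
    sk.point = lambda i: statistics.median(
        sk.g[k](i) * sk.table[k][sk.bucket(k, i)] for k in range(sk.d))
    return sk
```

Only the parameters and the query-time combining rule (min versus median) differ between the instantiations; the update path is identical, which is the point of the framework.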
The update time for each sketch is O(d) computations of g and h, and the space requirement is dominated by the O(wd) counters, provided that the hash functions can be stored efficiently. Tug-of-war sketches have w = 1, d = O((1/ε²) log(1/δ)), and g(i) ∈ {+1, −1} with 4-wise independence. Count sketches have w = O(1/ε²), d = O(log(1/δ)), and g(i) ∈ {+1, −1} with 2-wise independence.

⁴ We observe that the same idea, of replacing the averaging of O(1/ε²) copies of an estimator with a 2-universal hash function distributing to O(1/ε²) buckets, can also reduce the update time for tug-of-war sketches and similar constructions to O(log(1/δ)).

⁵ In [21] the authors use a 3-wise independent hash function onto {0, 1}, chosen because it is easy to compute, but pairwise suffices.

Random subset sums have w = 2, d = (24/ε²) log(1/δ), and g(i) = 1. Count-min sketches have w = e/ε, d = ln(1/δ), and g(i) = 1.

We briefly mention some other well-known sketch constructions, and explain why they are not appropriate for the queries we study here. [14] gave a sketch construction for computing the L₁ difference between vectors. It extends the tug-of-war construction, the technical contribution being a demonstration of how to compute range sums of 4-wise random variables efficiently. Indyk [23] pioneered the use of stable distributions in sketch computations, again in order to compute L_p norms of vectors presented as a sequence of updates. However, these norm computations do not directly answer the three query types we consider here. Lastly, we note that our count-min sketches appear similar in outline to both the uniscan algorithm due to Fang et al. [13] and the parallel multistage filters of Estan and Varghese [11]. Our construction differs for the following reasons: (1) the structures from prior work are not sketches: they do not approximate any value, but instead return only a binary answer about whether an item has exceeded a certain numeric threshold; (2) the algorithms used require updates to be only positive; and (3) the analysis does not consider any limited independence needed for the hash functions, in contrast to CM sketches, which require only pairwise independence. The methods in [11,13] as such seem to use fully independent hash functions, which is prohibitive in principle.

7. Conclusions

We have introduced the count-min sketch, and shown how to estimate fundamental queries such as point, range or inner product queries, as well as solve more sophisticated problems such as quantiles and heavy hitters. The space and/or time bounds of our solutions improve the previously best known bounds for these problems.
Typically the improvement is from a 1/ε² factor to 1/ε, which is significant in real applications. Our CM sketch is quite simple, and is likely to find many applications, including in hardware solutions for these problems. We have recently applied these ideas to the problem of change detection on data streams [9], and we also believe that they can be applied to improve the time and space bounds for constructing approximate wavelet and histogram representations of data streams [19]. The CM sketch can also be naturally extended to solve problems on streams that describe multidimensional arrays rather than the unidimensional arrays we have discussed so far. Our CM sketch is not effective when one wants to compute the norms of data stream inputs. These have applications to computing correlations between data streams and tracking the number of distinct elements in streams, both of which are of great interest. It is an open problem to design extremely simple, practical sketches such as our CM sketch for estimating such correlations and for more complex data stream applications.

References

[1] N. Alon, P. Gibbons, Y. Matias, M. Szegedy, Tracking join and self-join sizes in limited storage, in: Proceedings of the Eighteenth ACM Symposium on Principles of Database Systems (PODS), 1999.
[2] N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, in: Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, 1996; journal version in J. Comput. System Sci. 58 (1999).
[3] B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, Models and issues in data stream systems, in: Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 2002.
[4] Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan, Counting distinct elements in a data stream, in: Proceedings of RANDOM 2002, 2002.
[5] M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams, in: Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP), 2002.
[6] G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan, Comparing data streams using Hamming norms, in: Proceedings of the 28th International Conference on Very Large Data Bases, 2002; journal version in IEEE Trans. Knowledge Data Engrg. 15 (3) (2003).
[7] G. Cormode, F. Korn, S. Muthukrishnan, D. Srivastava, Finding hierarchical heavy hitters in data streams, in: International Conference on Very Large Data Bases, 2003.
[8] G. Cormode, S. Muthukrishnan, What's hot and what's not: tracking most frequent items dynamically, in: Proceedings of the ACM Symposium on Principles of Database Systems, 2003.
[9] G. Cormode, S. Muthukrishnan, What's new: finding significant differences in network data streams, in: Proceedings of IEEE INFOCOM.
[10] A. Dobra, M. Garofalakis, J.E. Gehrke, R. Rastogi, Processing complex aggregate queries over data streams, in: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, 2002.
[11] C. Estan, G. Varghese, New directions in traffic measurement and accounting, in: Proceedings of ACM SIGCOMM, Computer Communication Review 32 (4) (2002).
[12] C. Estan, G. Varghese, Data streaming in computer networks, in: Proceedings of the Workshop on Management and Processing of Data Streams, 2003.
[13] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, J.D. Ullman, Computing iceberg queries efficiently, in: Proceedings of the Twenty-Fourth International Conference on Very Large Data Bases, 1998.
[14] J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan, An approximate L1-difference algorithm for massive data streams, in: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, 1999.
[15] P. Flajolet, G.N. Martin, Probabilistic counting, in: 24th Annual Symposium on Foundations of Computer Science, 1983; journal version in J. Comput. System Sci. 31 (1985).
[16] M. Garofalakis, J. Gehrke, R. Rastogi, Querying and mining data streams: you only get one look, in: Proceedings of the ACM SIGMOD International Conference on Management of Data.
[17] P. Gibbons, Y. Matias, Synopsis structures for massive data sets, in: DIMACS Ser. Discrete Math. Theoret. Comput. Sci. A.
[18] P. Gibbons, S. Tirthapura, Estimating simple functions on the union of data streams, in: Proceedings of the 13th ACM Symposium on Parallel Algorithms and Architectures, 2001.
[19] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss, Fast, small-space algorithms for approximate histogram maintenance, in: Proceedings of the 34th ACM Symposium on Theory of Computing, 2002.
[20] A. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss, Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, in: Proceedings of the 27th International Conference on Very Large Data Bases, 2001; journal version in IEEE Trans. Knowledge Data Engrg. 15 (3) (2003).
[21] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss, How to summarize the universe: dynamic maintenance of quantiles, in: Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
[22] M. Greenwald, S. Khanna, Space-efficient online computation of quantile summaries, SIGMOD Record (ACM Special Interest Group on Management of Data) 30 (2) (2001).
[23] P. Indyk, Stable distributions, pseudorandom generators, embeddings and data stream computation, in: Proceedings of the 40th Symposium on Foundations of Computer Science, 2000.
[24] W.B. Johnson, J. Lindenstrauss, Extensions of Lipschitz mapping into Hilbert space, Contemp. Math. 26 (1984).
[25] G.S. Manku, S. Rajagopalan, B.G. Lindsay, Approximate medians and other quantiles in one pass and with limited memory, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998.
[26] G.S. Manku, R. Motwani, Approximate frequency counts over data streams, in: Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
[27] R. Motwani, P. Raghavan, Randomized Algorithms, Cambridge University Press.
[28] S. Muthukrishnan, Data streams: algorithms and applications, in: ACM-SIAM Symposium on Discrete Algorithms, 2003.
[29] N. Thaper, P. Indyk, S. Guha, N. Koudas, Dynamic multidimensional histograms, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002.
[30] D. Woodruff, Optimal space lower bounds for all frequency moments, in: ACM-SIAM Symposium on Discrete Algorithms, 2004.


More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

arxiv: v1 [cs.db] 30 Jun 2012

arxiv: v1 [cs.db] 30 Jun 2012 Sketch-based Querying of Distributed Sliding-Window Data Streams Odysseas Papapetrou, Minos Garofalakis, Antonios Deligiannakis Technical University of Crete {papapetrou,minos,adeli}@softnet.tuc.gr arxiv:1207.0139v1

More information

Sublinear Algorithms for Big Data Analysis

Sublinear Algorithms for Big Data Analysis Sublinear Algorithms for Big Data Analysis Michael Kapralov Theory of Computation Lab 4 EPFL 7 September 2017 The age of big data: massive amounts of data collected in various areas of science and technology

More information

Cache-Oblivious Traversals of an Array s Pairs

Cache-Oblivious Traversals of an Array s Pairs Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious

More information

Computing Diameter in the Streaming and Sliding-Window Models 1

Computing Diameter in the Streaming and Sliding-Window Models 1 Computing Diameter in the Streaming and Sliding-Window Models 1 Joan Feigenbaum 2 5 Sampath Kannan 3 5 Jian Zhang 4 5 Abstract We investigate the diameter problem in the streaming and slidingwindow models.

More information

Superconcentrators of depth 2 and 3; odd levels help (rarely)

Superconcentrators of depth 2 and 3; odd levels help (rarely) Superconcentrators of depth 2 and 3; odd levels help (rarely) Noga Alon Bellcore, Morristown, NJ, 07960, USA and Department of Mathematics Raymond and Beverly Sackler Faculty of Exact Sciences Tel Aviv

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Optimal Time Bounds for Approximate Clustering

Optimal Time Bounds for Approximate Clustering Optimal Time Bounds for Approximate Clustering Ramgopal R. Mettu C. Greg Plaxton Department of Computer Science University of Texas at Austin Austin, TX 78712, U.S.A. ramgopal, plaxton@cs.utexas.edu Abstract

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Run Times. Efficiency Issues. Run Times cont d. More on O( ) notation

Run Times. Efficiency Issues. Run Times cont d. More on O( ) notation Comp2711 S1 2006 Correctness Oheads 1 Efficiency Issues Comp2711 S1 2006 Correctness Oheads 2 Run Times An implementation may be correct with respect to the Specification Pre- and Post-condition, but nevertheless

More information

Chapter 5 VARIABLE-LENGTH CODING Information Theory Results (II)

Chapter 5 VARIABLE-LENGTH CODING Information Theory Results (II) Chapter 5 VARIABLE-LENGTH CODING ---- Information Theory Results (II) 1 Some Fundamental Results Coding an Information Source Consider an information source, represented by a source alphabet S. S = { s,

More information

Space-efficient Online Approximation of Time Series Data: Streams, Amnesia, and Out-of-order

Space-efficient Online Approximation of Time Series Data: Streams, Amnesia, and Out-of-order Space-efficient Online Approximation of Time Series Data: Streams, Amnesia, and Out-of-order Luca Foschini joint work with Sorabh Gandhi and Subhash Suri University of California Santa Barbara ICDE 2010

More information

Interleaving Schemes on Circulant Graphs with Two Offsets

Interleaving Schemes on Circulant Graphs with Two Offsets Interleaving Schemes on Circulant raphs with Two Offsets Aleksandrs Slivkins Department of Computer Science Cornell University Ithaca, NY 14853 slivkins@cs.cornell.edu Jehoshua Bruck Department of Electrical

More information

MAXIMAL PLANAR SUBGRAPHS OF FIXED GIRTH IN RANDOM GRAPHS

MAXIMAL PLANAR SUBGRAPHS OF FIXED GIRTH IN RANDOM GRAPHS MAXIMAL PLANAR SUBGRAPHS OF FIXED GIRTH IN RANDOM GRAPHS MANUEL FERNÁNDEZ, NICHOLAS SIEGER, AND MICHAEL TAIT Abstract. In 99, Bollobás and Frieze showed that the threshold for G n,p to contain a spanning

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

Finding Frequent Items in Data Streams

Finding Frequent Items in Data Streams Finding Frequent Items in Data Streams Graham Cormode Marios Hadjieleftheriou AT&T Labs Research, Florham Park, NJ {graham,marioh}@research.att.com ABSTRACT The frequent items problem is to process a stream

More information

Mining Frequent Itemsets from Data Streams with a Time- Sensitive Sliding Window

Mining Frequent Itemsets from Data Streams with a Time- Sensitive Sliding Window Mining Frequent Itemsets from Data Streams with a Time- Sensitive Sliding Window Chih-Hsiang Lin, Ding-Ying Chiu, Yi-Hung Wu Department of Computer Science National Tsing Hua University Arbee L.P. Chen

More information

Fast and Simple Algorithms for Weighted Perfect Matching

Fast and Simple Algorithms for Weighted Perfect Matching Fast and Simple Algorithms for Weighted Perfect Matching Mirjam Wattenhofer, Roger Wattenhofer {mirjam.wattenhofer,wattenhofer}@inf.ethz.ch, Department of Computer Science, ETH Zurich, Switzerland Abstract

More information

Scribe from 2014/2015: Jessica Su, Hieu Pham Date: October 6, 2016 Editor: Jimmy Wu

Scribe from 2014/2015: Jessica Su, Hieu Pham Date: October 6, 2016 Editor: Jimmy Wu CS 267 Lecture 3 Shortest paths, graph diameter Scribe from 2014/2015: Jessica Su, Hieu Pham Date: October 6, 2016 Editor: Jimmy Wu Today we will talk about algorithms for finding shortest paths in a graph.

More information

INSERTION SORT is O(n log n)

INSERTION SORT is O(n log n) Michael A. Bender Department of Computer Science, SUNY Stony Brook bender@cs.sunysb.edu Martín Farach-Colton Department of Computer Science, Rutgers University farach@cs.rutgers.edu Miguel Mosteiro Department

More information

CS369G: Algorithmic Techniques for Big Data Spring

CS369G: Algorithmic Techniques for Big Data Spring CS369G: Algorithmic Techniques for Big Data Spring 2015-2016 Lecture 11: l 0 -Sampling and Introduction to Graph Streaming Prof. Moses Charikar Scribe: Austin Benson 1 Overview We present and analyze the

More information

Exact Algorithms Lecture 7: FPT Hardness and the ETH

Exact Algorithms Lecture 7: FPT Hardness and the ETH Exact Algorithms Lecture 7: FPT Hardness and the ETH February 12, 2016 Lecturer: Michael Lampis 1 Reminder: FPT algorithms Definition 1. A parameterized problem is a function from (χ, k) {0, 1} N to {0,

More information

RANGE-EFFICIENT COUNTING OF DISTINCT ELEMENTS IN A MASSIVE DATA STREAM

RANGE-EFFICIENT COUNTING OF DISTINCT ELEMENTS IN A MASSIVE DATA STREAM SIAM J. COMPUT. Vol. 37, No. 2, pp. 359 379 c 2007 Society for Industrial and Applied Mathematics RANGE-EFFICIENT COUNTING OF DISTINCT ELEMENTS IN A MASSIVE DATA STREAM A. PAVAN AND SRIKANTA TIRTHAPURA

More information

5 MST and Greedy Algorithms

5 MST and Greedy Algorithms 5 MST and Greedy Algorithms One of the traditional and practically motivated problems of discrete optimization asks for a minimal interconnection of a given set of terminals (meaning that every pair will

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Spring 2019 Alexis Maciel Department of Computer Science Clarkson University Copyright c 2019 Alexis Maciel ii Contents 1 Analysis of Algorithms 1 1.1 Introduction.................................

More information

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES)

A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Chapter 1 A SIMPLE APPROXIMATION ALGORITHM FOR NONOVERLAPPING LOCAL ALIGNMENTS (WEIGHTED INDEPENDENT SETS OF AXIS PARALLEL RECTANGLES) Piotr Berman Department of Computer Science & Engineering Pennsylvania

More information

Lecture 19. Lecturer: Aleksander Mądry Scribes: Chidambaram Annamalai and Carsten Moldenhauer

Lecture 19. Lecturer: Aleksander Mądry Scribes: Chidambaram Annamalai and Carsten Moldenhauer CS-621 Theory Gems November 21, 2012 Lecture 19 Lecturer: Aleksander Mądry Scribes: Chidambaram Annamalai and Carsten Moldenhauer 1 Introduction We continue our exploration of streaming algorithms. First,

More information

Approximate Continuous Querying over Distributed Streams

Approximate Continuous Querying over Distributed Streams Approximate Continuous Querying over Distributed Streams GRAHAM CORMODE AT&T Labs Research and MINOS GAROFALAKIS Yahoo! Research & University of California, Berkeley While traditional database systems

More information

16 Greedy Algorithms

16 Greedy Algorithms 16 Greedy Algorithms Optimization algorithms typically go through a sequence of steps, with a set of choices at each For many optimization problems, using dynamic programming to determine the best choices

More information

Distinct random sampling from a distributed stream

Distinct random sampling from a distributed stream Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2013 Distinct random sampling from a distributed stream Yung-Yu Chung Iowa State University Follow this and additional

More information

5 MST and Greedy Algorithms

5 MST and Greedy Algorithms 5 MST and Greedy Algorithms One of the traditional and practically motivated problems of discrete optimization asks for a minimal interconnection of a given set of terminals (meaning that every pair will

More information

Discrete Optimization. Lecture Notes 2

Discrete Optimization. Lecture Notes 2 Discrete Optimization. Lecture Notes 2 Disjunctive Constraints Defining variables and formulating linear constraints can be straightforward or more sophisticated, depending on the problem structure. The

More information

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the Heap-on-Top Priority Queues Boris V. Cherkassky Central Economics and Mathematics Institute Krasikova St. 32 117418, Moscow, Russia cher@cemi.msk.su Andrew V. Goldberg NEC Research Institute 4 Independence

More information

A Connection between Network Coding and. Convolutional Codes

A Connection between Network Coding and. Convolutional Codes A Connection between Network Coding and 1 Convolutional Codes Christina Fragouli, Emina Soljanin christina.fragouli@epfl.ch, emina@lucent.com Abstract The min-cut, max-flow theorem states that a source

More information

6 Randomized rounding of semidefinite programs

6 Randomized rounding of semidefinite programs 6 Randomized rounding of semidefinite programs We now turn to a new tool which gives substantially improved performance guarantees for some problems We now show how nonlinear programming relaxations can

More information

Average Case Analysis for Tree Labelling Schemes

Average Case Analysis for Tree Labelling Schemes Average Case Analysis for Tree Labelling Schemes Ming-Yang Kao 1, Xiang-Yang Li 2, and WeiZhao Wang 2 1 Northwestern University, Evanston, IL, USA, kao@cs.northwestern.edu 2 Illinois Institute of Technology,

More information

Disjoint directed cycles

Disjoint directed cycles Disjoint directed cycles Noga Alon Abstract It is shown that there exists a positive ɛ so that for any integer k, every directed graph with minimum outdegree at least k contains at least ɛk vertex disjoint

More information

General properties of staircase and convex dual feasible functions

General properties of staircase and convex dual feasible functions General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia

More information

c 2003 Society for Industrial and Applied Mathematics

c 2003 Society for Industrial and Applied Mathematics SIAM J. COMPUT. Vol. 32, No. 2, pp. 538 547 c 2003 Society for Industrial and Applied Mathematics COMPUTING THE MEDIAN WITH UNCERTAINTY TOMÁS FEDER, RAJEEV MOTWANI, RINA PANIGRAHY, CHRIS OLSTON, AND JENNIFER

More information

On the Complexity of the Policy Improvement Algorithm. for Markov Decision Processes

On the Complexity of the Policy Improvement Algorithm. for Markov Decision Processes On the Complexity of the Policy Improvement Algorithm for Markov Decision Processes Mary Melekopoglou Anne Condon Computer Sciences Department University of Wisconsin - Madison 0 West Dayton Street Madison,

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Frequency Estimation of Internet Packet Streams with Limited Space

Frequency Estimation of Internet Packet Streams with Limited Space Frequency Estimation of Internet Packet Streams with Limited Space Erik D. Demaine 1, Alejandro López-Ortiz 2, and J. Ian Munro 2 1 Laboratory for Computer Science, Massachusetts Institute of Technology,

More information

A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS

A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS Chapter 9 A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS Charu C. Aggarwal IBM T. J. Watson Research Center Hawthorne, NY 10532 charu@us.ibm.com Philip S. Yu IBM T. J. Watson Research Center Hawthorne,

More information

Fast Approximate PCPs for Multidimensional Bin-Packing Problems

Fast Approximate PCPs for Multidimensional Bin-Packing Problems Fast Approximate PCPs for Multidimensional Bin-Packing Problems Tuğkan Batu, Ronitt Rubinfeld, and Patrick White Department of Computer Science, Cornell University, Ithaca, NY 14850 batu,ronitt,white @cs.cornell.edu

More information

Merging Frequent Summaries

Merging Frequent Summaries Merging Frequent Summaries M. Cafaro, M. Pulimeno University of Salento, Italy {massimo.cafaro, marco.pulimeno}@unisalento.it Abstract. Recently, an algorithm for merging counter-based data summaries which

More information

Computing Frequent Elements Using Gossip

Computing Frequent Elements Using Gossip Computing Frequent Elements Using Gossip Bibudh Lahiri and Srikanta Tirthapura Iowa State University, Ames, IA, 50011, USA {bibudh,snt}@iastate.edu Abstract. We present algorithms for identifying frequently

More information

Secretary Problems and Incentives via Linear Programming

Secretary Problems and Incentives via Linear Programming Secretary Problems and Incentives via Linear Programming Niv Buchbinder Microsoft Research, New England and Kamal Jain Microsoft Research, Redmond and Mohit Singh Microsoft Research, New England In the

More information

arxiv: v2 [cs.ds] 22 May 2017

arxiv: v2 [cs.ds] 22 May 2017 A High-Performance Algorithm for Identifying Frequent Items in Data Streams Daniel Anderson Pryce Bevin Kevin Lang Edo Liberty Lee Rhodes Justin Thaler arxiv:1705.07001v2 [cs.ds] 22 May 2017 Abstract Estimating

More information

Comparing the strength of query types in property testing: The case of testing k-colorability

Comparing the strength of query types in property testing: The case of testing k-colorability Comparing the strength of query types in property testing: The case of testing k-colorability Ido Ben-Eliezer Tali Kaufman Michael Krivelevich Dana Ron Abstract We study the power of four query models

More information

arxiv: v2 [cs.ds] 30 Sep 2016

arxiv: v2 [cs.ds] 30 Sep 2016 Synergistic Sorting, MultiSelection and Deferred Data Structures on MultiSets Jérémy Barbay 1, Carlos Ochoa 1, and Srinivasa Rao Satti 2 1 Departamento de Ciencias de la Computación, Universidad de Chile,

More information