Distinct Stream Counting
COMP 3801 Final Project
Christopher Ermel
1 Problem Statement

Given an arbitrary stream of data, we are interested in determining the number of distinct elements seen at any given time since the stream was opened. Given the magnitude of data that travels over modern data streams, we face the constraint that not all elements can be stored in memory. We therefore do not seek a deterministic method of counting distinct stream elements; rather, we use a variety of probabilistic algorithms to estimate this number. Since we solve the problem by estimation, we seek algorithms that minimize the error of the estimate. Furthermore, these algorithms must process each element of a data stream of arbitrary size, so the processing time per stream element must also be minimized. Given these constraints, we examine different algorithms that have been published as solutions to the Distinct Stream Counting (DSC) problem, discuss their theoretical analyses, and consider an application of their concrete implementations in Java.

2 Background Work

The concrete side of this project deals with the implementation of different published algorithms for the DSC problem. Using the concrete implementations, we can process the inputs of a data stream and output the estimate that each algorithm produces. Furthermore, if the data stream is relatively limited in size, we can maintain a deterministic algorithm as a basis for measuring the realized error of the estimates. We therefore have a means of comparing the theoretical and realized errors of the various DSC algorithms implemented, provided we have a data stream from which elements can be processed. To this end, the publicly available Twitter Streaming API was used [2]. An application of the DSC problem supposes that a company such as Twitter may be interested in determining the number of distinct tweeters who tweet in a given period of time.
By opening a stream of tweet data, we can receive and process tweets as they arrive. It is worth noting that the Twitter Streaming APIs come at varying levels of access; only the sample stream was used in this project, which limited the magnitude of tweets received and thereby ensured that a deterministic algorithm could be implemented as a means of measuring error. Tweets received over the stream contain a username string, which can be hashed into a 32-bit integer. We can therefore use the tweet usernames as our stream elements and process them through various DSC algorithms in order to measure realized error and contrast it with the theoretical error.

3 Theoretical Analysis

3.1 Categories of Stream Counting Algorithms

In their 2007 paper, Flajolet et al. observe that the best known DSC algorithms rely on making observations on the hash values of the input stream. As stream elements are processed, these algorithms maintain some number of observables that are used in the final estimate. DSC algorithms can be divided into two categories based on the types of observables they track: bit-pattern observable algorithms and order statistics observable algorithms [1]. Bit-pattern observables rely on making an estimation based on some unique
pattern observed in the bits of the hashes of the stream elements. Order statistics observables maintain some order statistic, e.g., the kth smallest or largest hash value seen, to make estimations about the distinct stream elements. The algorithms implemented in this project are based on bit-pattern observables.

3.2 Stochastic Averaging

Let us define the process of hashing a stream value, observing the bit pattern, and updating the observable as an experiment. Supposing that we have some DSC algorithm that maintains an observable value based on the hashes of stream elements, we run into a problem: the error produced by estimating the number of distinct stream elements from a single observable is too great. We therefore seek some efficient method of maintaining multiple observables, perhaps by performing multiple experiments on each stream element. An initial approach is as follows: for each stream element, rather than performing a single experiment and maintaining one observable, perform m experiments by maintaining m hash functions and m observables. In the end, we take some average of the estimates generated by the m observables. To Flajolet et al., however, this approach is problematic as well [1]. As we increase m, the error rate decreases because outlier observable values are evened out. However, to make use of this strategy we require O(m) additional space to store the m hash functions, and each stream element processed requires an experiment with each of the m hash functions. If the initial experiment could be executed in constant time, we now must take O(m) time to compute all m observables. We also require O(m) additional space to store the m observables; this, however, is not problematic. We introduce stochastic averaging as a means to solve this problem.
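As a concrete illustration, the naive approach might look like the following sketch. The m "independent" hash functions are simulated here, for illustration only, by XORing one base hash with m fixed seeds; the class and method names are hypothetical and not taken from the project code.

```java
// Hypothetical sketch of the naive multi-hash approach: m seeded hash
// functions each feed their own max-tail-length observable, and the final
// estimate averages the m individual estimates 2^observable.
public class NaiveAveraging {
    private final int m;
    private final int[] seeds;       // one seed per simulated hash function
    private final int[] observables; // max tail length seen per function

    public NaiveAveraging(int m) {
        this.m = m;
        this.seeds = new int[m];
        this.observables = new int[m];
        for (int i = 0; i < m; i++) {
            seeds[i] = 0x9E3779B9 * (i + 1); // arbitrary fixed seeds
        }
    }

    // O(m) work per stream element: every hash function is evaluated.
    public void processInput(String s) {
        for (int i = 0; i < m; i++) {
            int h = s.hashCode() ^ seeds[i]; // crude seeded variant
            int tail = Integer.numberOfTrailingZeros(h);
            if (tail == 32) tail = 0;
            if (tail > observables[i]) observables[i] = tail;
        }
    }

    // Average the m individual estimates 2^observable.
    public double reportDistinctElements() {
        double sum = 0;
        for (int obs : observables) {
            sum += Math.pow(2.0, obs);
        }
        return sum / m;
    }
}
```

Note the cost this sketch makes visible: processInput loops over all m functions for every element, which is exactly the O(m) per-element overhead that stochastic averaging removes.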
This technique is described in Flajolet and Martin's 1985 paper, Probabilistic Counting Algorithms for Data Base Applications [3]. It emulates the act of performing m experiments on an input data stream while using a single hash function. Only one hash function is maintained, and only one hash value is computed per stream element. It works by partitioning the hashed bit value of the stream element into two parts: the first b bits identify which of the m observables is considered in the experiment, and the remaining bits are used for the experiment itself. Note that m is chosen such that m = 2^b, thereby ensuring that the first b bits, when converted to decimal, correspond to index values {1, 2, ..., m} that can be used to access the observable values stored in a particular data structure of choice. This technique provides a method of increasing the number of observables m as we see fit in order to decrease the error of a DSC algorithm, assuming that the hash function produces a bit string of sufficient length relative to the chosen value of b.
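A minimal sketch of this partitioning, with hypothetical names and b = 4 (so m = 2^4 = 16); the buckets are indexed 0 to m-1 here, as is natural for a Java array, rather than 1 to m:

```java
// Sketch of the stochastic-averaging bit split. The top b bits of the
// 32-bit hash choose one of m = 2^b buckets; the remaining 32 - b bits
// drive the experiment itself.
public class StochasticSplit {
    static final int B = 4;
    static final int M = 1 << B; // m = 2^b

    // Top b bits of the hash: a bucket index in 0 .. m-1.
    static int bucket(int hash) {
        return hash >>> (32 - B);
    }

    // Remaining 32 - b bits, with the top b bits cleared.
    static int experimentBits(int hash) {
        return (hash << B) >>> B;
    }
}
```

For example, the hash 0xF0000000 lands in bucket 15 and contributes all-zero experiment bits, while 0x0000000F lands in bucket 0 with experiment bits 0xF.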
3.3 The HyperLogLog Algorithm

Published in 2007 by Flajolet, Fusy, Gandouet, and Meunier, HyperLogLog is a DSC algorithm that provides a near-optimal estimate using bit-pattern observables and stochastic averaging. It offers an error improvement over the previously published (2003) LogLog algorithm [5], as illustrated in Figure 1.

Figure 1: An error rate comparison of various DSC algorithms, taken from Flajolet et al. [1]

The pseudocode for the HyperLogLog algorithm is given in Figure 2.

Figure 2: Pseudocode for the HyperLogLog algorithm, taken from Flajolet et al. [1]

As can be seen from Figure 2, each stream element v is hashed into a binary value x. We then apply the stochastic averaging technique described in section 3.2 to split the hash into values j and w. The bit-pattern observables that HyperLogLog maintains are the positions of the leftmost 1-bit in w, which are stored in an array M of length m. The improvement that HyperLogLog offers over LogLog can be found in the final two lines of Figure 2. LogLog returned 2 raised to the arithmetic mean of the observables (a geometric mean of the per-observable estimates), multiplied by m and some constant α [5]. In Figure 2, we instead compute the harmonic mean of 2 raised to the observables. According to Flajolet et al., this improvement results from the fact that the distribution of the m observables is skewed to the right; the harmonic mean reduces the variance relative to the geometric mean, thereby providing an estimate with reduced error [1].
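Written out, the estimator computed in the last two lines of Figure 2 is the normalized harmonic mean of the per-register estimates:

```latex
E \;=\; \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}
```

whereas LogLog returned the geometric-mean form E = α_m · m · 2^((1/m) Σ_j M[j]) [5]. Both formulas use the same register array M and multiplicity m defined above; only the averaging rule differs.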
4 Experimental Design and Analysis

4.1 The Tweet Stream Simulator

In order to test the concrete implementations of the DSC algorithms, we required some method of simulating a data stream. As mentioned in section 2, this project tested the application of DSC algorithms using the Twitter Streaming APIs. Since the concrete implementations were done in Java, the Twitter4j library [6] was used to access Twitter's API. To acquire data on the estimates of the implemented DSC algorithms, a sample TwitterStream object was initialized. The stream instance was passed a StatusListener object, which fired an onStatus() event whenever a tweet was sent and passed a Status object containing all relevant tweet data. The code in the onStatus() event executed three steps:

1. Pull the Twitter user's username from the Status object to acquire our stream element.
2. Process the username through each of the implemented DSC algorithms.
3. Report each DSC algorithm's estimate to standard out in CSV format.

The Java program was executed at the command line, with output redirected into a text file to support importing into Excel later on.

4.2 Algorithms

Before considering the algorithms discussed below, an important fact is that the maximum length of a Twitter username is 15 characters [7]. Furthermore, all hashes discussed in this section hash to a 32-bit Java integer. Because the running time of each hash function is proportional to the length of the string being hashed, and username length is bounded by this constant, each hash function executes in O(1) time.

4.2.1 The Deterministic Algorithm

The sample Twitter stream generated roughly one million tweets over a 7-8 hour period. As such, we could be confident that all usernames processed during this time could be stored in main memory.
Consequently, we implemented a deterministic DSC algorithm that places all Twitter usernames into a Java HashSet. To report the number of distinct stream elements, HashSet.size() is returned. For each stream element processed, the only operation performed is adding the element to the HashSet, so the processing time of this algorithm is O(1) expected per element. Since the maximum length of a Twitter username is 15 characters, each Java character takes 16 bits, and a stream of length n may contain n distinct elements, the space complexity is O(16 · 15 · n) = O(n) bits in the worst case. Finally, note that this algorithm is exact: although usernames may collide when hashed into the set's buckets, a HashSet resolves such collisions by comparing elements for equality, so the reported count carries no error.

4.2.2 The Probabilistic Counting Algorithm

Although not covered in section 3 (as it was discussed in class), the Probabilistic Counting algorithm published by Flajolet and Martin in 1985 [3] was implemented according to its description in chapter 4 of Mining of Massive Datasets by Ullman, Rajaraman, and Leskovec [4]. The simple version of this algorithm was implemented in Java as follows:

```java
private int maxTailLength;

public int reportDistinctElements() {
    return (int) Math.pow(2.0, (double) maxTailLength);
}

public void processInput(String s) {
    int i = s.hashCode();
    int tailLength = findTailLength(i);
    if (tailLength > maxTailLength) {
        maxTailLength = tailLength;
    }
}

private int findTailLength(int i) {
    int tailLength = Integer.numberOfTrailingZeros(i);
    if (tailLength == 32) tailLength = 0;
    return tailLength;
}
```

Figure 3: Java code for the simple version of the Probabilistic Counting algorithm.

The processing time for a given input depends on two methods: String.hashCode() and Integer.numberOfTrailingZeros(). As discussed earlier, String.hashCode() executes in constant time here. Integer.numberOfTrailingZeros() also executes in constant time, as it performs a constant number of bit-shift operations on its input. This algorithm therefore processes Twitter usernames in O(1) time, as the remaining code executes a constant number of conditionals and assignments. The algorithm maintains one bit-pattern observable: the maximum tail length of the binary values of the hashes of the stream elements seen so far, where the tail length is the number of zeros after the rightmost 1-bit. Java integers take up 32 bits of space, and since we store only one value, our space complexity is O(1).
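The tail-length observable at the heart of the simple algorithm can be checked in isolation; a small self-contained sketch (the class and helper names are hypothetical):

```java
// Tail length = number of zeros after the rightmost 1-bit of the hash.
// A hash of 0 has no 1-bit, so its tail length is treated as 0 here,
// matching the guard used in the algorithm above.
public class TailLength {
    static int tailLength(int hash) {
        int t = Integer.numberOfTrailingZeros(hash);
        return (t == 32) ? 0 : t;
    }
}
```

For example, 8 is 1000 in binary, so tailLength(8) is 3, and the simple algorithm's estimate after seeing only that hash would be 2^3 = 8; the estimate is always a power of two.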
Furthermore, from Figure 1 we can determine the theoretical error rate of this implementation as 0.78/√m = 0.78/√1 = 0.78, since m = 1 for the single observable we maintain. Next, we consider ways to extend this algorithm to reduce its error rate. As discussed in section 3.2, DSC estimation based on a single observable is inefficient. Instead, by using m hash functions, maintaining m maximum tail lengths, dividing the resulting estimates into k groups, averaging the estimates within each group, and then taking the median of the k group averages, we can reduce the error observed by our algorithm [4].
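One way to sketch this combining rule, assuming k divides m and following the reading of [4] in which the per-observable estimates 2^tail are averaged within groups (class and method names hypothetical):

```java
import java.util.Arrays;

// Group-and-median combining: average the 2^tail estimates within each
// of k groups, then take the median of the k group averages. Grouping
// damps outliers; the median discards any group that remains extreme.
public class MedianOfAverages {
    // tails: the m maximum tail lengths; k: number of groups (k | m).
    static double combine(int[] tails, int k) {
        int groupSize = tails.length / k;
        double[] averages = new double[k];
        for (int g = 0; g < k; g++) {
            double sum = 0;
            for (int i = 0; i < groupSize; i++) {
                sum += Math.pow(2.0, tails[g * groupSize + i]);
            }
            averages[g] = sum / groupSize;
        }
        Arrays.sort(averages);
        // Median of the k group averages.
        return (k % 2 == 1)
            ? averages[k / 2]
            : (averages[k / 2 - 1] + averages[k / 2]) / 2.0;
    }
}
```

With tails {1, 1, 1, 1} and k = 2, every estimate is 2, so the combined value is 2.0; with {0, 2, 0, 2} each group averages (1 + 4)/2 = 2.5.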
Given this, we made the necessary modifications to the code from Figure 3 to support the error reduction. We wished to track 4 observables and as such required 4 independent hash functions in order to guarantee that the hash values of the stream elements differed. The hashes used were Java's String.hashCode(), the FNV-1 and FNV-1a hashes [8], and the hash function described in section 5.3.3 of Open Data Structures by Pat Morin [9]. Again, the problem with this approach is the need to maintain m independent hash functions, as well as the necessity of processing every stream element through each of the m hash functions, thereby increasing the processing time by O(m). In this case, since we maintained only 4 observables, the overall time and space complexity remained O(1). Two different algorithms were implemented: one that maintained 4 groups of 1 observable, and another that maintained 2 groups of 2 observables. Ultimately, the division into different group sizes does not modify the theoretical error rate, which depends only on the number of observables tracked. The theoretical error for both algorithms, with m = 4, is 0.78/√4 = 0.39.

4.2.3 The HyperLogLog Algorithm

In order to contrast the power of the HyperLogLog algorithm with the probabilistic algorithms discussed in section 4.2.2, HyperLogLog was implemented corresponding to Figure 2:

```java
// MAX_BITS is 32; the fields alpha (the bias-correction constant) and
// the register initialization are omitted here, as discussed below.
private int b = 11;
private int m = (int) Math.pow(2.0, (double) b);
private int[] registers;

public void processInput(String s) {
    int x = s.hashCode();
    int j = x >>> (MAX_BITS - b);
    int w = (x << b) >>> b;
    int p = findFirstOnePosition(w) - b;
    if (p > registers[j]) {
        registers[j] = p;
    }
}

public double reportDistinctElements() {
    double Z = 0;
    for (int j = 0; j < m; j++) {
        Z += (1 / Math.pow(2.0, registers[j]));
    }
    Z = 1 / Z;
    return alpha * Math.pow(m, 2.0) * Z;
}

private int findFirstOnePosition(int i) {
    int headLength = Integer.numberOfLeadingZeros(i);
    if (headLength == 32) headLength = 0;
    return headLength + 1;
}
```

Figure 4: Java code for the HyperLogLog algorithm.

An important thing to note is that Figure 4 omits the initialization of the register array, which sets all observable values to -2,147,483,648 (Integer.MIN_VALUE). The algorithm therefore will not produce an estimate of the number of distinct stream elements until every register has replaced this initial value, because otherwise at least one term added to Z (the indicator function) would be 2^2,147,483,648.
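Figure 4 leaves the constant alpha undefined. The standard bias-correction values from the HyperLogLog paper [1] are α₁₆ = 0.673, α₃₂ = 0.697, α₆₄ = 0.709, and α_m = 0.7213/(1 + 1.079/m) for m ≥ 128; a sketch with a hypothetical class name:

```java
// Bias-correction constant alpha_m from the HyperLogLog paper [1],
// selected by the number of registers m.
public class HllAlpha {
    static double alpha(int m) {
        switch (m) {
            case 16: return 0.673;
            case 32: return 0.697;
            case 64: return 0.709;
            default: return 0.7213 / (1.0 + 1.079 / m); // m >= 128
        }
    }
}
```

With b = 11 as in Figure 4, m = 2048 and alpha(2048) ≈ 0.7209.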
The processing time of this algorithm depends on the Java String.hashCode() and Integer.numberOfLeadingZeros() methods. As discussed before, String.hashCode() executes in O(1) time here. Integer.numberOfLeadingZeros() executes a constant number of bit-shift operations and therefore also takes O(1) time. Otherwise, the input processing function performs a constant number of comparison, assignment, and bit-shift operations. Therefore, our processing time is O(1). Note that by using stochastic averaging in the sense discussed above, the theoretical running time of this algorithm is not O(m): the first b bits of each stream element's hash index into the register array of observables, allowing us to use a single hash function. The space complexity of the algorithm, however, is O(m). This is inescapable if we wish to maintain m observables in order to reduce the error of the estimate. In this particular case, we set b = 11, so m = 2^11, meaning that we maintain 2^11 32-bit integers, i.e., O(2^11 · 32) = O(1) bits. Our theoretical error rate, according to Figure 1, is 1.04/√(2^11) ≈ 0.023.

4.3 Simulation Results

The tweet stream simulation described in section 4.1 was run for a period of 7-8 hours, processing roughly one million tweets. As mentioned in section 4.2.3, there is a period of time during which the HyperLogLog algorithm cannot output an estimate of the number of stream elements. A small subset of the estimation data was examined, starting from the point at which the HyperLogLog algorithm first produced an estimate. The following graph was produced:
Figure 5: A comparison of estimates of various DSC algorithms.

The graph in Figure 5 contrasts the deterministic number of distinct tweeters with the estimates of the various DSC algorithms discussed. The total number of tweets examined is provided as a reference line. The values on the x axis correspond to the relative number of tweets seen within this snapshot of the total data. The yellow line corresponds to a prototype algorithm not discussed in this report, as it fails to provide a meaningful estimate. As can be seen in Figure 5, the behaviour of the Flajolet-Martin (FM) algorithms discussed in section 4.2.2 is quite erratic. In fact, the estimate of the single-observable version maxes out at around 17,000. This behaviour makes sense: the probability of seeing a binary tail length of n is 2^-n, since there is a ½ chance of a 0 at each bit position. This means that as the deterministic count of distinct stream elements grows, we have an increasingly reduced probability of seeing a maximum tail length large enough to estimate it correctly. The estimates of the FM algorithms that maintain 4 observables do not stray as far from the deterministic count. This also makes sense, as each observable maintained is an additional 2^-n chance of seeing a binary tail length of n. Therefore, the more observables are added, the higher the chance of seeing a long maximum tail length, which is essential for counting distinct stream elements of a large magnitude. Furthermore, we can see that the HyperLogLog algorithm does the best job of estimating the number of distinct elements. This can be attributed to the 2^11 observables that it maintains, which is made possible by stochastic averaging. It is worth noting that the smoothness of the HyperLogLog estimation line is also attributable to the magnitude of observables maintained.
In fact, if we were to increase the number of observables for the FM algorithms implemented, we would see similar behaviour in the estimates produced (as this would reduce the error). To contrast realized error with theoretical error, we calculated the first differences of the algorithm estimates relative to the deterministic algorithm:
Figure 6: First differences of the algorithm estimates displayed in Figure 5.

In general, the realized errors of the Probabilistic Counting algorithms were not as bad as their theoretical estimates. The single-observable FM algorithm produced 30% less error in practice, while the four-observable FM algorithms both produced roughly 17% less error. Finally, the HyperLogLog algorithm in fact produced 3% more error than theory predicts. This can be explained by examining Figure 5, in which we see that after relative tweet count 10,000 the estimate began to diverge from the deterministic count. Indeed, as more stream elements are processed, the HyperLogLog algorithm's error increases. This is anticipated by Flajolet et al. in the HyperLogLog paper [1], and is counteracted by a large range correction factor that is not discussed in this document.

5 Conclusion

An integral part of accurate DSC algorithms is maintaining more than a single observable value on which to base estimates. We have seen stochastic averaging, a technique that simulates an arbitrary number of per-element experiments while using only a single hash function. This technique is the key to reducing the error of bit-pattern observable DSC algorithms by an arbitrarily large amount, and we have contrasted the realized and theoretical errors of a few bit-pattern observable algorithms to illustrate this fact. In summary, the HyperLogLog algorithm as implemented for this project provided the best estimate of the number of distinct stream elements. However, comparable accuracy could have been attained with the other algorithms implemented by using the stochastic averaging technique.
6 References

1. Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm." Conference on Analysis of Algorithms (2007). Web.
2. "Streaming APIs." Twitter Developer Documentation. Twitter, n.d. Web. <https://dev.twitter.com/streaming/overview>.
3. Flajolet, Philippe, and G. Nigel Martin. "Probabilistic Counting Algorithms for Data Base Applications." Journal of Computer and System Sciences (1985): 182-209. Web.
4. Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman. Mining of Massive Datasets. (n.d.): 131-62. Web.
5. Durand, Marianne, and Philippe Flajolet. "Loglog Counting of Large Cardinalities." (2003). Web.
6. "Twitter4J." Twitter4J - A Java Library for the Twitter API. N.p., n.d. Web. <http://twitter4j.org/en/>.
7. "Changing Your Username." Help Center. Twitter, n.d. Web. <https://support.twitter.com/articles/14609>.
8. Noll, Landon. "FNV Hash." N.p., n.d. Web. <http://isthe.com/chongo/tech/comp/fnv/>.
9. Morin, Pat. "Hash Codes for Arrays and Strings." Open Data Structures. 126-28. Print.