Distinct Stream Counting
COMP 3801 Final Project
Christopher Ermel

1 Problem Statement

Given an arbitrary stream of data, we are interested in determining the number of distinct elements seen at any given time since the stream was opened. Given the magnitude of data that travels over modern data streams, not all elements can be stored in memory. We therefore do not seek a deterministic method of counting distinct stream elements; rather, we use probabilistic algorithms to estimate this number. Since we solve the problem by estimation, we seek algorithms that minimize the error of the estimate. Furthermore, these algorithms must process every element of a data stream of arbitrary size, so the processing time per stream element must also be minimized. Given these constraints, we examine different algorithms that have been published to solve the Distinct Stream Counting (DSC) problem, discuss their theoretical analyses, and consider an application of their concrete implementations in Java.

2 Background Work

The concrete side of this project deals with implementing different algorithms published to solve the DSC problem. Using the concrete implementations, we can process the inputs of a data stream and output the estimate that each algorithm produces. Furthermore, if the data stream is relatively limited in size, we can maintain a deterministic count as a basis for measuring the realized error of the estimations. We therefore have a means of comparing the theoretical and realized errors of the various DSC algorithms implemented, provided we have a data stream from which elements can be processed. To this end, the publicly available Twitter Streaming API was used [2]. An application of the DSC problem supposes that a company such as Twitter may be interested in determining the number of distinct tweeters who tweet in a given period of time. By opening a stream of tweet data, we can receive and process tweets as they arrive. It is worth noting that the Twitter Streaming APIs come at varying levels of access; only the sample stream was used in this project, which limited the volume of tweets received and thereby ensured that a deterministic algorithm could be implemented as a means of error measurement. Tweets received over the stream contain a username string, which can be hashed into a 32-bit integer. We can therefore use tweet usernames as our stream elements and process them through the various DSC algorithms in order to measure realized error and contrast it with the theoretical error.

3 Theoretical Analysis

3.1 Categories of Stream Counting Algorithms

In their 2007 paper, Flajolet et al. observe that the best-known DSC algorithms rely on making observations on the hash values of the input stream. As stream elements are processed, these algorithms maintain some number of observables that are used in the final estimate. DSC algorithms can be divided into two categories according to the type of observable they track: bit-pattern observable algorithms and order statistics observable algorithms [1].

Bit-pattern observables rely on estimating from some unique pattern observed in the bits of the hashes of the stream elements. Order statistics observables maintain some order statistic, e.g., the k-th smallest or largest hash value seen, to make estimations about the distinct stream elements. The algorithms implemented in this project are based on bit-pattern observables.

3.2 Stochastic Averaging

Let us define the process of hashing a stream value, observing the bit pattern, and updating the observable as an experiment. Supposing that we have some DSC algorithm that maintains an observable value based on the hashes of stream elements, we run into a problem: the error produced by estimating the number of distinct stream elements from a single observable is too great. We therefore seek an efficient method of maintaining multiple observables, perhaps by performing multiple experiments on a given stream element. An initial approach is as follows: for each stream element, rather than performing a single experiment and maintaining one observable, perform m experiments by maintaining m hash functions and m observables. In the end, we take some average of the estimates generated by the m observables. To Flajolet et al., however, this approach is also problematic [1]. As we increase m, the error rate decreases, because outlier observable values are evened out. However, to make effective use of this strategy we require O(m) additional space to store the m hash functions. Furthermore, each stream element processed requires m experiments, one per hash function. If the initial experiment could be executed in constant time, we must now take O(m) time to compute all m observables. We also require O(m) additional space to store the m observables, though this is not problematic.

We introduce stochastic averaging as a means of solving this problem. The technique is described in Flajolet and Martin's 1985 paper "Probabilistic Counting Algorithms for Data Base Applications" [3]. It emulates the act of performing m experiments on an input data stream while using a single hash function. To this end, only one hash function is maintained, and only one hash value is computed per stream element. It works by partitioning the hashed bit value of the stream element into two parts: the first b bits identify which of the m observables is updated by the experiment, and the remaining bits are used for the experiment itself. Note that m is chosen such that m = 2^b, thereby ensuring that the first b bits, when converted to decimal, correspond to index values {1, 2, ..., m} that can be used to access the observable values stored in a data structure of choice. This technique lets us increase the number of observables m as we see fit in order to decrease the error of a DSC algorithm, assuming the hash function produces a bit string of sufficient length relative to the chosen b.
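To make the partitioning concrete, the following minimal Java sketch splits a 32-bit hash into an index part and an experiment part, assuming b = 4 (so m = 2^4 = 16). The class name, the choice of experiment (leftmost 1-bit position), and all identifiers are illustrative, not taken from the implementations discussed later.

public class StochasticAveragingSketch {
    private static final int B = 4;          // number of index bits
    private static final int M = 1 << B;     // m = 2^b observables
    private final int[] observables = new int[M];

    public void processInput(String s) {
        int x = s.hashCode();                // a single hash per element
        int j = x >>> (32 - B);              // first b bits select the observable
        int w = x << B;                      // remaining 32 - b bits, left-aligned
        // Example experiment: position of the leftmost 1-bit in w
        // (w == 0 yields 33, a harmless sentinel in this sketch).
        int obs = Integer.numberOfLeadingZeros(w) + 1;
        if (obs > observables[j]) {
            observables[j] = obs;
        }
    }
}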

3.3 The HyperLogLog Algorithm

Published in 2007 by Flajolet, Fusy, Gandouet, and Meunier, HyperLogLog is a DSC algorithm that provides a near-optimal estimate using bit-pattern observables and stochastic averaging. It provides an error improvement over the previously published (2003) LogLog algorithm [5], as illustrated in Figure 1.

Figure 1: An error rate comparison of various DSC algorithms, taken from Flajolet et al. [1]

The pseudocode for the HyperLogLog algorithm is given in Figure 2.

Figure 2: Pseudocode for the HyperLogLog algorithm, taken from Flajolet et al. [1]

As can be seen from Figure 2, each stream element v is hashed into a binary value x. We then apply the stochastic averaging technique described in section 3.2 to split the hash into the values j and w. The bit-pattern observable that the HyperLogLog algorithm tracks is the position of the leftmost 1-bit in w; these positions are stored in an array M of length m. The improvement that HyperLogLog offers over the LogLog algorithm lies in the final two lines of Figure 2. The LogLog algorithm returned 2 raised to the arithmetic mean of the observables (equivalently, the geometric mean of 2 raised to each observable), multiplied by m and some constant α [5]. The difference in Figure 2 is that we now compute the harmonic mean of 2 raised to the observables. According to Flajolet et al., this improvement results from the fact that the distribution of the m observables is skewed to the right; the harmonic mean is less sensitive to such outliers than the geometric mean, thereby providing an estimate with reduced variance and error [1].
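Restated in formula form (following the notation of Figure 2, as given in [1]), the value returned is the normalized harmonic mean of the values 2^{M[j]}:

E := \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}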

4 Experimental Design and Analysis

4.1 The Tweet Stream Simulator

In order to test the concrete implementations of the DSC algorithms, we required some method of simulating a data stream. As mentioned in section 2, this project was concerned with testing the application of DSC algorithms using the Twitter Streaming APIs. Since the concrete implementations were done in Java, the Twitter4j library [6] was used to access Twitter's API. To acquire data on the estimates of the implemented DSC algorithms, a sample TwitterStream object was initialized. The stream instance was passed a StatusListener object, which fired an onStatus() event whenever a tweet was received, passing a Status object containing all relevant tweet data. The code in the onStatus() event executed three steps:

1. Pull the Twitter user's username from the Status object to acquire our stream element.
2. Process the username through each of the implemented DSC algorithms.
3. Report each DSC algorithm's estimate to standard out in CSV format.

The Java program was executed at the command line, with output redirected into a text file to support importing into Excel later on.

4.2 Algorithms

Before considering the algorithms discussed below, an important fact to note is that the maximum length of a Twitter username is 15 characters [7]. Furthermore, all hashes discussed in this section hash to a 32-bit Java integer. Because each hash function's running time is proportional to the length of the string being hashed, this upper bound on username length lets us say that each hash function executes in O(1) time.

4.2.1 The Deterministic Algorithm

The sample Twitter stream generated roughly one million tweets over a 7-8 hour period. As such, we could be confident that all usernames processed during this time could be stored in main memory. Consequently, we implemented a deterministic DSC algorithm that places every Twitter username into a Java HashSet. To report the number of distinct stream elements, HashSet.size() is returned. For each stream element processed, the only operation performed is adding the element to the HashSet; therefore, the processing time of this algorithm is O(1). Since the maximum length of a Twitter username is 15 characters, each Java character takes 16 bits, and a stream of length n may contain n distinct elements, the space complexity is O(16 · 15 · n) = O(n) bits in the worst case. Finally, we note that this algorithm's count is exact: although usernames are hashed into the set, a Java HashSet resolves hash collisions by comparing elements for equality, so distinct usernames are never conflated.
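A minimal sketch of such a deterministic counter follows; the class and method names are illustrative and this is not the project's exact code.

import java.util.HashSet;
import java.util.Set;

public class DeterministicCounter {
    // Feasible only because the sample stream is small enough
    // for every distinct username to fit in main memory.
    private final Set<String> seen = new HashSet<>();

    public void processInput(String username) {
        seen.add(username);        // expected O(1) per element
    }

    public int reportDistinctElements() {
        return seen.size();        // exact distinct count
    }
}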

4.2.2 The Probabilistic Counting Algorithm

Although not discussed in section 3 (as it was covered in class), the Probabilistic Counting algorithm published by Flajolet and Martin in 1985 [3] was implemented according to its description in chapter 4 of Mining of Massive Datasets by Leskovec, Rajaraman, and Ullman [4]. The simple version of this algorithm was implemented in Java as follows:

private int maxTailLength;

public int reportDistinctElements() {
    return (int) Math.pow(2.0, (double) maxTailLength);
}

public void processInput(String s) {
    int i = s.hashCode();
    int tailLength = findTailLength(i);
    if (tailLength > maxTailLength) {
        maxTailLength = tailLength;
    }
}

private int findTailLength(int i) {
    // Integer.numberOfTrailingZeros returns 32 when i == 0; treat that as a tail of 0.
    int tailLength = Integer.numberOfTrailingZeros(i);
    if (tailLength == 32) {
        tailLength = 0;
    }
    return tailLength;
}

Figure 3: Java code for the simple version of the Probabilistic Counting algorithm.

The processing time for a given input depends on two methods: String.hashCode() and Integer.numberOfTrailingZeros(). As discussed earlier, String.hashCode() executes in constant time. Integer.numberOfTrailingZeros() also executes in constant time, as it performs a constant number of bit-shift operations on its input. This algorithm therefore processes Twitter usernames in O(1) time, since the remaining code executes a constant number of conditionals and assignments. The algorithm maintains one bit-pattern observable: the maximum tail length over the binary values of the hashes of the stream elements seen so far, where the tail length of a value is the number of zeros following its rightmost 1-bit. A Java integer occupies 32 bits, and since we store only one value, our space complexity is O(1). Furthermore, from Figure 1 we can determine the error rate of this implementation: 0.78 / √1 = 0.78, because m = 1 with the single observable we maintain.

Next, we consider ways to extend this algorithm to reduce its error rate. As discussed in section 3.2, DSC estimation based on a single observable is too inaccurate. Instead, by using m hash functions, maintaining m maximum tail lengths, dividing the tail lengths into k groups, taking the average within each group, and then taking the median of the k group averages, we can reduce the error observed by our algorithm [4].
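To make the combination step concrete, here is a minimal sketch under stated assumptions: the class and method names are illustrative, k is assumed to divide the number of values evenly, and the sketch is agnostic as to whether the values supplied are raw tail lengths or the estimates 2^tailLength derived from them.

class EstimateCombiner {
    // Average the values within each of k equal groups, then return the
    // median of those group averages (an outlier-resistant combination).
    static double combine(double[] values, int k) {
        int groupSize = values.length / k;
        double[] groupAverages = new double[k];
        for (int g = 0; g < k; g++) {
            double sum = 0;
            for (int i = 0; i < groupSize; i++) {
                sum += values[g * groupSize + i];
            }
            groupAverages[g] = sum / groupSize;
        }
        java.util.Arrays.sort(groupAverages);
        return groupAverages[k / 2];   // (upper) median of the group averages
    }
}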

Given this, we made the necessary modifications to the code from Figure 3 to support the error reduction. We wished to track 4 observables and as such required 4 independent hash functions in order to guarantee that the hash values of the stream elements differed. The hashes used were the Java String.hashCode(), the FNV-1 and FNV-1a hashes [8], and the hash function described in section 5.3.3 of Open Data Structures by Pat Morin [9]. Again, the problem with this approach is the need to maintain m independent hash functions, as well as the necessity of processing every stream element through each of the m hash functions, which increases the processing time by a factor of m. In this case, since we maintained only 4 observables, our overall time and space complexity remained O(1). Two different variants were implemented: one that maintained 4 groups of 1 observable, and another that maintained 2 groups of 2 observables. Ultimately, the division into different group sizes does not modify the theoretical error rate, which depends only on the number of observables tracked. With m = 4, the error rate for both variants is 0.78 / √4 = 0.39.

4.2.3 The HyperLogLog Algorithm

In order to contrast the power of the HyperLogLog algorithm with the Probabilistic Counting algorithms discussed in section 4.2.2, the HyperLogLog algorithm was implemented corresponding to Figure 2:

private static final int MAX_BITS = 32;   // width of a Java int hash
private int b = 11;
private int m = (int) Math.pow(2.0, (double) b);
private int[] registers;                  // initialization omitted; see below
private double alpha;                     // bias-correction constant from [1]

public void processInput(String s) {
    int x = s.hashCode();
    int j = x >>> (MAX_BITS - b);         // first b bits: register index
    int w = (x << b) >>> b;               // remaining bits: the experiment
    int p = findFirstOnePosition(w) - b;  // leftmost 1-bit position within w
    if (p > registers[j]) {
        registers[j] = p;
    }
}

public double reportDistinctElements() {
    double Z = 0;
    for (int j = 0; j < m; j++) {
        Z += 1 / Math.pow(2.0, registers[j]);
    }
    Z = 1 / Z;
    return alpha * Math.pow(m, 2.0) * Z;
}

private int findFirstOnePosition(int i) {
    int headLength = Integer.numberOfLeadingZeros(i);
    if (headLength == 32) {
        headLength = 0;
    }
    return headLength + 1;
}

Figure 4: Java code for the HyperLogLog algorithm.

An important thing to note is that Figure 4 omits the register array initialization, which sets every observable to -2,147,483,648 (Integer.MIN_VALUE). The algorithm therefore cannot produce an estimate of the number of distinct stream elements until every register has replaced this initial value, because until then at least one value added to Z (the indicator function) will be 2^2147483648.
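For completeness, the constant alpha in Figure 4 corresponds to the bias-correction constant α_m defined in [1]; since our m = 2^11 exceeds 128, the paper's closed-form approximation applies:

\alpha_m = \left( m \int_0^{\infty} \left( \log_2 \frac{2+u}{1+u} \right)^{m} du \right)^{-1} \approx \frac{0.7213}{1 + 1.079/m} \qquad (m \ge 128)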

The processing time of this algorithm depends on the Java String.hashCode() and Integer.numberOfLeadingZeros() methods. As discussed before, String.hashCode() executes in O(1) time. Integer.numberOfLeadingZeros() performs a constant number of bit-shift operations and therefore also takes O(1) time. Otherwise, the input-processing function executes a constant number of comparison, assignment, and bit-shift operations. Therefore, our processing time is O(1). Note that by using stochastic averaging in the sense discussed above, the running time of this algorithm is not O(m): the first b bits of each stream element's hash index into our register of observable values, which allows us to use a single hash function. The space complexity of our algorithm, however, is O(m). This is inescapable if we wish to maintain m observables in order to reduce the error of our estimate. In this particular case, we set b = 11 and thus use m = 2^11 registers, meaning that we maintain 2^11 32-bit integers, i.e., O(2^11 · 32) = O(1) bits. Our error rate, according to Figure 1, is 1.04 / √(2^11) ≈ 0.023.

4.3 Simulation Results

The tweet stream simulation described in section 4.1 was run for a period of 7-8 hours, processing roughly one million tweets. As mentioned in section 4.2.3, there is a period of time during which the HyperLogLog algorithm cannot output an estimate of the number of stream elements. A small subset of the estimation data was examined, starting from the point at which the HyperLogLog algorithm first produced an estimate. The following graph was produced:

Figure 5: A comparison of estimates of various DSC algorithms.

The graph shown in Figure 5 contrasts the deterministic number of distinct tweeters with the estimates of the various DSC algorithms discussed. The total number of tweets examined is provided as a reference line. The values on the x-axis correspond to the relative number of tweets seen within this snapshot of the total data. The yellow line corresponds to a prototype algorithm not discussed in this report, as it fails to provide a meaningful estimate. As can be seen in Figure 5, the behaviour of the Flajolet-Martin (FM) algorithms discussed in section 4.2.2 is quite sporadic. In fact, the estimate for the single-observable version maxes out at around 17,000. This behaviour makes sense: the probability of seeing a binary tail of length n is 2^-n, since there is a 1/2 chance of seeing a 0 at any given bit position. This means that as the deterministic count of distinct stream elements grows larger, we have an increasingly small probability of seeing a maximum tail length that correctly estimates it. The estimates of the FM algorithms that maintain 4 observables do not stray as far from the deterministic count. This also makes sense, since each observable maintained is an additional chance, with probability 2^-n, of seeing a binary tail of length n. Therefore, the more observables we add, the higher the chance of seeing a long maximum tail length, which is essential for counting distinct stream elements of large magnitude. Furthermore, we can see that the HyperLogLog algorithm does the best job of estimating the number of distinct elements. This can be attributed to the 2^11 observables it maintains, which stochastic averaging makes possible. It is worth noting that the smoothness of the HyperLogLog estimation line is also attributable to the number of observables maintained. In fact, if we were to increase the number of observables in the FM algorithms implemented, we would see similar behaviour in their estimates (as this would reduce the error). To contrast realized error against theoretical error, we calculated the first differences of the algorithm estimates relative to the deterministic count:

Figure 6: First differences of the algorithm estimates displayed in Figure 5.

In general, the realized errors of the Probabilistic Counting algorithms were not as bad as their theoretical estimates. The single-observable FM algorithm produced 30% less error in practice, while the four-observable FM algorithms both produced roughly 17% less error. Finally, the HyperLogLog algorithm in fact produced 3% more error than predicted by theory. This can be explained by examining Figure 5, where we see that after a relative tweet count of 10,000 the estimate began to diverge from the deterministic count. Indeed, as more stream elements are processed, the HyperLogLog algorithm's error increases. This is anticipated by Flajolet et al. in the HyperLogLog paper [1], and is counteracted by a large-range correction factor that is not discussed in this document.

5 Conclusion

An integral part of accurate DSC algorithms is maintaining more than a single observable value on which to base estimates. We have seen stochastic averaging, a technique that emulates experiments performed with an arbitrary number of hash functions while using only one. This technique is the key to reducing the error of bit-pattern-observable DSC algorithms by an arbitrarily large amount. We have contrasted the realized and theoretical errors of a few bit-pattern observable algorithms in order to illustrate this fact. In summary, the HyperLogLog algorithm as implemented for this project provided the best estimate of the number of distinct stream elements. However, comparable accuracy could have been attained with the other algorithms implemented by applying the stochastic averaging technique.

6 References

1. Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm." Conference on Analysis of Algorithms (2007). Web.
2. "Streaming APIs." Twitter Developer Documentation. Twitter, n.d. Web. <https://dev.twitter.com/streaming/overview>.
3. Flajolet, Philippe, and G. Nigel Martin. "Probabilistic Counting Algorithms for Data Base Applications." Journal of Computer and System Sciences (1985): 182-209. Web.
4. Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman. Mining of Massive Datasets. n.d. 131-62. Web.
5. Durand, Marianne, and Philippe Flajolet. "Loglog Counting of Large Cardinalities." (2003). Web.
6. "Twitter4J." Twitter4J - A Java Library for the Twitter API. n.d. Web. <http://twitter4j.org/en/>.
7. "Changing Your Username." Help Center. Twitter, n.d. Web. <https://support.twitter.com/articles/14609>.
8. Noll, Landon Curt. "FNV Hash." n.d. Web. <http://isthe.com/chongo/tech/comp/fnv/>.
9. Morin, Pat. "Hash Codes for Arrays and Strings." Open Data Structures. 126-28. Print.