Distinct Stream Counting

COMP 3801 Final Project
Christopher Ermel

1 Problem Statement

Given an arbitrary stream of data, we are interested in determining the number of distinct elements seen since the stream was opened. Given the magnitude of data that travels over modern data streams, we face the constraint that not all elements of the stream can be stored in memory. We therefore do not seek a deterministic method of counting distinct stream elements; rather, we use a variety of probabilistic algorithms to estimate this number. Since we solve the problem by estimation, we seek algorithms that minimize the error of the estimate. Furthermore, these algorithms must process each element of a data stream of arbitrary size, so the processing time per stream element must also be minimized. Given these constraints, we examine different algorithms that have been published as a means of solving the Distinct Stream Counting (DSC) problem, discuss their theoretical analyses, and consider an application of their concrete implementations in Java.

2 Background Work

The concrete side of this project deals with the implementation of different algorithms published to solve the DSC problem. Using the concrete implementations, we can process the inputs of a data stream and output the estimate that each algorithm produces. Furthermore, if the data stream is relatively limited in size, we can maintain a deterministic algorithm as a basis for measuring the realized error of the estimations. We therefore have a means of comparing the theoretical and realized errors produced by the various DSC algorithms implemented, assuming we have a data stream from which elements can be processed. To this end, the publicly available Twitter Streaming API was used [2]. An application of the DSC problem supposes that a company such as Twitter may be interested in determining the number of distinct tweeters who tweet in a given period of time.
By opening a stream of tweet data, we can receive and process tweets as they arrive. It is worth noting that the Twitter Streaming APIs come at varying levels of access; only the sample stream was used in this project, which limited the magnitude of tweets received and thereby ensured that a deterministic algorithm could be implemented as a means of error measurement. Tweets received over the stream contain a username string, which can be hashed into a 32-bit integer. We can therefore use the tweet usernames as our stream elements, process them through various DSC algorithms, measure the realized error, and contrast it with the theoretical error.

3 Theoretical Analysis

3.1 Categories of Stream Counting Algorithms

In their 2007 paper, Flajolet et al. observe that the best known DSC algorithms rely on making observations on the hash values of the input stream. As stream elements are processed, these algorithms maintain some number of observables that are used in the final estimate. DSC algorithms can be divided into two categories, based on the types of observables that they track: bit-pattern observable algorithms and order statistics observable algorithms [1]. Bit-pattern observables rely on making an estimation based on some unique
pattern observed in the bits of the hashes of the stream elements. Order statistics observables maintain some order statistic, e.g., the k-th smallest or largest hash value seen, to make estimations about the distinct stream elements. The algorithms implemented in this project are based on bit-pattern observables.

3.2 Stochastic Averaging

Let us define the process of hashing a stream value, observing the bit pattern, and updating the observable as an experiment. Supposing that we have some DSC algorithm that maintains an observable value based on the hashes of stream elements, we run into a problem: the error produced by estimating the number of distinct stream elements from a single observable is too great. We therefore seek some efficient method of maintaining multiple observables, perhaps by performing multiple experiments on each stream element. An initial approach is as follows: for each stream element, rather than perform a single experiment and maintain one observable, perform m experiments by maintaining m hash functions and m observables. In the end, we take some average of the estimations generated by the m observables. To Flajolet et al., however, this approach is also problematic [1]. As we increase m, our error rate decreases, because outlier observable values are evened out. However, to make effective use of this strategy we require O(m) additional space to store all m hash functions. Furthermore, each stream element processed requires m experiments, one per hash function; if the initial experiment could be executed in constant time, we now need O(m) time to update all m observables. We also require O(m) additional space to store the m observables themselves, though this is not problematic. We introduce stochastic averaging as a means to solve this problem.
This technique is described in Flajolet and Martin's 1985 paper Probabilistic Counting Algorithms for Data Base Applications [3]. It emulates the act of performing m experiments on an input data stream using a single hash function: only one hash function is maintained, and only one hash value is computed per stream element. It works by partitioning the hashed bit value of the stream element into two parts: the first b bits identify which of the m observables is considered in the experiment, and the remaining bits are used for the experiment itself. Note that m is chosen such that m = 2^b, thereby ensuring that the first b bits, when converted to decimal, correspond to index values {1, 2, ..., m} which can be used to access the observable values stored in a particular data structure of choice. This technique provides a method of increasing the number of observables m as we see fit in order to decrease the error of a DSC algorithm, assuming that the hash function produces a bit string of sufficient length relative to our chosen b value.
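As an illustration, the partitioning step can be sketched in Java (a minimal sketch; the class and constant names are mine, with b = 4 and an arbitrary example hash value):

```java
public class StochasticAveragingSplit {
    static final int B = 4;        // number of leading bits used as the bucket index
    static final int M = 1 << B;   // m = 2^b observables

    // Split a 32-bit hash into (bucket index, experiment bits).
    static int bucket(int hash) { return hash >>> (32 - B); }
    static int rest(int hash)   { return (hash << B) >>> B; }

    public static void main(String[] args) {
        int hash = 0xDEADBEEF;  // stand-in for some stream element's hash
        // The top 4 bits (0xD = 13) select which observable to update;
        // the remaining 28 bits are used for the experiment itself.
        System.out.println(bucket(hash) + " " + rest(hash));
    }
}
```

Only one hash value is computed per element; the bucket index replaces the need for m separate hash functions.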

3.3 The HyperLogLog Algorithm

Published in 2007 by Flajolet, Fusy, Gandouet, and Meunier, HyperLogLog is a DSC algorithm that provides a near-optimal estimate using bit-pattern observables and stochastic averaging. It provides an error improvement over the previously published (2003) LogLog algorithm [5], which is illustrated in Figure 1.

Figure 1: An error rate comparison of various DSC algorithms, taken from Flajolet et al. [1]

The pseudocode for the HyperLogLog algorithm is as follows:

Figure 2: Pseudocode for the HyperLogLog algorithm, taken from Flajolet et al. [1]

As can be seen from Figure 2, each stream element v is hashed into a binary value x. We then apply the stochastic averaging technique described in section 3.2 to split the hash into values j and w. The bit-pattern observables that the HyperLogLog algorithm maintains are the positions of the leftmost 1-bit in w, which are stored in an array M of length m. The improvement that HyperLogLog offers over LogLog can be found in the final two lines of Figure 2. LogLog returned m times 2 to the power of the arithmetic mean of the observables (equivalently, m times the geometric mean of the values 2^M[j]), multiplied by some constant α [5]. The difference in Figure 2 is that we now compute the harmonic mean of 2 to the power of the observables. According to Flajolet et al., this error improvement results from the fact that the distribution of the m observables is skewed to the right; the harmonic mean is employed to reduce the variance relative to the geometric mean, thereby providing an estimate with reduced error [1].
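To make the contrast concrete, here is a toy Java comparison of the two combination rules (the register values are made up, and the bias-correction constants are omitted):

```java
public class MeanComparison {
    // LogLog-style combination: m * 2^(arithmetic mean of the observables),
    // i.e. m times the geometric mean of the values 2^M[j].
    static double geometricStyle(int[] M) {
        double sum = 0;
        for (int v : M) sum += v;
        return M.length * Math.pow(2.0, sum / M.length);
    }

    // HyperLogLog-style combination: m times the harmonic mean of 2^M[j],
    // which works out to m^2 / (sum of 2^-M[j]).
    static double harmonicStyle(int[] M) {
        double Z = 0;
        for (int v : M) Z += 1.0 / Math.pow(2.0, v);
        return (double) M.length * M.length / Z;
    }

    public static void main(String[] args) {
        int[] M = {3, 4, 2, 9};  // hypothetical registers; 9 is a right-tail outlier
        // The outlier inflates the geometric combination (about 90.5) far more
        // than the harmonic one (about 36.4), illustrating the reduced variance.
        System.out.println(geometricStyle(M) + " " + harmonicStyle(M));
    }
}
```

The harmonic mean is dominated by the smallest 2^M[j] terms, so a single inflated register has little effect on the estimate.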

4 Experimental Design and Analysis

4.1 The Tweet Stream Simulator

In order to test the concrete implementations of the DSC algorithms, we required some method of simulating a data stream. As mentioned in section 2, this project tested the application of DSC algorithms using the Twitter Streaming APIs. Since the concrete implementations were done in Java, the Twitter4j library [6] was used to access Twitter's API. To acquire data on the estimates of the DSC algorithms implemented, a sample TwitterStream object was initialized. The stream instance was passed a StatusListener object, which fired an onStatus() event whenever a tweet was sent, passing a Status object containing all relevant tweet data. The code in the onStatus() event executed three steps:

1. Pull the Twitter user's username from the Status object to acquire our stream element.
2. Process the username through each of the implemented DSC algorithms.
3. Report each DSC algorithm's estimate to standard out in CSV format.

The Java program was executed at the command line, with output redirected into a text file to support importing into Excel later on.

4.2 Algorithms

Before considering the algorithms, an important fact to note is that the maximum length of a Twitter username is 15 characters [7]. Furthermore, all hashes discussed in this section hash to a 32-bit Java integer. Due to the upper bound on the length of usernames, we can say that each hash function executes in O(1) time, because its running time is proportional to the length of the string being hashed.

4.2.1 The Deterministic Algorithm

The sample Twitter stream generated roughly one million tweets over a 7-8 hour period. As such, we could be confident that all usernames processed during this time could be stored in main memory. Consequently, we implemented a deterministic DSC algorithm which functioned by placing all Twitter usernames into a Java HashSet.
To report the number of distinct stream elements, HashSet.size() was returned. For each stream element processed, the only operation performed is adding the element to the HashSet; therefore, the processing time of this algorithm is O(1). Since the maximum length of a Twitter username is 15 characters, each Java character takes 16 bits, and a stream of length n may contain n distinct elements, the space complexity is therefore O(16 · 15 · n) = O(n)
bits in the worst case. Finally, we recognize that this algorithm has a negligible amount of error, due to the small possibility of collisions occurring when stream elements are hashed into the set.

4.2.2 The Probabilistic Counting Algorithm

Although not discussed in section 3 (as it was discussed in class), the Probabilistic Counting algorithm published by Flajolet and Martin in 1985 [3] was implemented according to its description in chapter 4 of Mining of Massive Datasets by Ullman, Rajaraman, and Leskovec [4]. The simple version of this algorithm was implemented in Java as follows:

private int maxTailLength;

public int reportDistinctElements() {
    return (int) Math.pow(2.0, (double) maxTailLength);
}

public void processInput(String s) {
    int i = s.hashCode();
    int tailLength = findTailLength(i);
    if (tailLength > maxTailLength) {
        maxTailLength = tailLength;
    }
}

private int findTailLength(int i) {
    int tailLength = Integer.numberOfTrailingZeros(i);
    if (tailLength == 32) tailLength = 0;
    return tailLength;
}

Figure 3: Java code for the simple version of the Probabilistic Counting algorithm.

The processing time for a given input of this algorithm depends on two functions: String.hashCode() and Integer.numberOfTrailingZeros(). As discussed earlier, String.hashCode() executes in constant time. Integer.numberOfTrailingZeros() also executes in constant time, as it performs a constant number of bit-shift operations on the input. This algorithm therefore processes Twitter usernames in O(1) time, as the remaining code executes a constant number of conditionals and assignments. This algorithm maintains one bit-pattern observable: the maximum tail length of the binary values of the hashes of the stream elements seen so far, where the tail length is the number of zeros after the rightmost 1-bit. Java integers take up 32 bits of space, and since we store only one value, our space complexity is O(1).
Furthermore, looking at Figure 1, we can determine the error rate of this implementation as 0.78/√m = 0.78/√1 = 0.78, because m = 1 due to the single observable that we maintain. Next, we consider ways to extend this algorithm to reduce its error rate. As discussed in section 3.2, DSC estimation based on a single observable is too inaccurate. Instead, by using m hash functions, maintaining m maximum tail lengths, dividing the tail lengths into k groups, taking the average of the tail lengths in each group, and then taking the median of the k group averages, we can reduce the error observed by our algorithm [4].
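That combination step could be sketched as follows (a hypothetical helper, assuming the m maximum tail lengths have already been collected; the values below are illustrative):

```java
import java.util.Arrays;

public class GroupMedianEstimate {
    // Average the tail lengths within each of k groups, take the median of
    // the k group averages, and return 2 to the power of that median.
    static double estimate(int[] tails, int k) {
        int perGroup = tails.length / k;
        double[] averages = new double[k];
        for (int g = 0; g < k; g++) {
            double sum = 0;
            for (int i = 0; i < perGroup; i++) sum += tails[g * perGroup + i];
            averages[g] = sum / perGroup;
        }
        Arrays.sort(averages);
        double median = (k % 2 == 1)
                ? averages[k / 2]
                : (averages[k / 2 - 1] + averages[k / 2]) / 2.0;
        return Math.pow(2.0, median);
    }

    public static void main(String[] args) {
        int[] tails = {3, 4, 2, 13};  // hypothetical observables; 13 is an outlier
        // With 4 groups of 1 observable, the median discards the outlier 13:
        // the estimate is 2^3.5 (roughly 11.3) rather than 2^13 = 8192.
        System.out.println(estimate(tails, 4));
    }
}
```

Averaging within groups suppresses variance, while the median across groups guards against outliers such as a single very long tail.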

Given this, we made the necessary modifications to the code of Figure 3 to support the error reduction. We wished to track 4 observables, and as such required 4 independent hash functions in order to guarantee that the hash values of the stream elements differed. The hashes used were the Java String.hashCode(), the FNV-1 and FNV-1a hashes [8], and the hash function described in Open Data Structures by Pat Morin [9]. Again, the problem with this approach is the need to maintain m independent hash functions, as well as the necessity of processing every stream element through each of the m hash functions, thereby increasing the processing time by O(m). In this case, since we only maintained 4 observables, our overall time and space complexity remained O(1). Two different algorithms were implemented: one that maintained 4 groups of 1 observable, and another that maintained 2 groups of 2 observables. Ultimately, the division into different group sizes does not modify the theoretical error rate, since we are only interested in the number of observables tracked. The error for both algorithms, with m = 4, is 0.78/√4 = 0.39.

4.2.3 The HyperLogLog Algorithm

In order to contrast the power of the HyperLogLog algorithm with the probabilistic algorithms discussed in section 4.2.2, the HyperLogLog algorithm was implemented corresponding to Figure 2:

private int b = 11;
private int m = (int) Math.pow(2.0, (double) b);
private int[] registers;

public void processInput(String s) {
    int x = s.hashCode();
    int j = x >>> (MAX_BITS - b);
    int w = (x << b) >>> b;
    int p = findFirstOnePosition(w) - b;
    if (p > registers[j]) registers[j] = p;
}

public double reportDistinctElements() {
    double Z = 0;
    for (int j = 0; j < m; j++) {
        Z += (1 / Math.pow(2.0, registers[j]));
    }
    Z = 1 / Z;
    return alpha * Math.pow(m, 2.0) * Z;
}

private int findFirstOnePosition(int i) {
    int headLength = Integer.numberOfLeadingZeros(i);
    if (headLength == 32) headLength = 0;
    return headLength + 1;
}

Figure 4: Java code for the HyperLogLog algorithm (MAX_BITS = 32; alpha is the bias-correction constant α of Figure 2).

An important thing to note is that Figure 4 omits the register array initialization, which sets all observable values initially to -2,147,483,648 (Integer.MIN_VALUE). The algorithm therefore will not have an estimate for the number of distinct stream elements until every observable register has replaced this initial value, because until then at least one value added to Z (the indicator function) will be infinite.

The processing time of this algorithm depends on the Java String.hashCode() and Integer.numberOfLeadingZeros() methods. As discussed before, String.hashCode() executes in O(1) time. Integer.numberOfLeadingZeros() executes a constant number of bit-shift operations and therefore also takes O(1) time. Otherwise, the input processing function performs a constant number of comparison, assignment, and bit-shift operations. Therefore, our processing time is O(1). Note that by using stochastic averaging in the sense discussed above, our processing time is not O(m): we use the first b bits of the stream element hashes to index into our register of observable values, thereby allowing us to use a single hash function. The space complexity of our algorithm, however, is O(m). This is inescapable if we wish to maintain m observables in order to reduce the error of our estimate. In this particular case, we use m = 2^11 (we set b = 11), meaning that we maintain 2^11 32-bit integers, i.e., O(2^11 · 32) = O(m) bits. Our error rate, according to Figure 1, is 1.04/√(2^11) ≈ 0.023.

4.3 Simulation Results

The tweet stream simulation described in section 4.1 was run for a period of 7-8 hours, processing roughly one million tweets. As mentioned in section 4.2.3, there is a period of time during which the HyperLogLog algorithm cannot output an estimate of the number of distinct stream elements. A small subset of the estimation data was examined, starting from the point at which the HyperLogLog algorithm first produced an estimate. The following graph was produced:

Figure 5: A comparison of estimates of various DSC algorithms.

The graph shown in Figure 5 contrasts the deterministic number of distinct tweeters with the estimates of the various DSC algorithms discussed. The total number of tweets examined is provided as a reference line. The values on the x axis correspond to the relative number of tweets seen within this snapshot of the total data. The yellow line corresponds to a prototype algorithm not discussed in this report, as it fails to provide a meaningful estimate. As can be seen in Figure 5, the behaviour of the Flajolet-Martin (FM) algorithms discussed in section 4.2.2 is quite erratic. In fact, the estimate for the single-observable version maxes out at around 17,000. This behaviour makes sense, since the probability that we see a binary tail length of n is 2^(-n), as we have a 1/2 chance of seeing a 0 at each binary position. This means that as our deterministic count of distinct stream elements grows larger, we have an increasingly reduced probability of seeing a maximum tail length that correctly estimates this value. The estimates of the FM algorithms that maintain 4 observables do not stray as far from the deterministic count. This also makes sense: each observable maintained gives an additional 2^(-n) chance of seeing a binary tail length of n. Therefore, the more observables are added, the higher the chance of seeing a longer maximum tail length, which is essential to counting distinct stream elements of a large magnitude. Furthermore, we can see that the HyperLogLog algorithm does the best job of estimating the number of distinct elements. This can be attributed to the 2^11 observables that it maintains, made possible by stochastic averaging. It is worth noting that the smoothness of the HyperLogLog estimation line is also attributable to the magnitude of observables maintained.
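The 2^(-n) tail-length intuition can be checked with a quick simulation (a sketch with an arbitrary fixed seed; the class name and constants are mine):

```java
import java.util.Random;

public class TailLengthDemo {
    public static void main(String[] args) {
        Random rng = new Random(42);  // fixed seed so the run is reproducible
        int n = 1 << 16;              // simulate 65,536 distinct random hashes
        int maxTail = 0;
        for (int i = 0; i < n; i++) {
            int tail = Integer.numberOfTrailingZeros(rng.nextInt());
            if (tail == 32) tail = 0;
            if (tail > maxTail) maxTail = tail;
        }
        // Since a tail of length t occurs with probability 2^(-t), the maximum
        // over 2^16 hashes concentrates around 16, so the single-observable
        // estimate 2^maxTail tracks n only up to large constant-factor swings.
        System.out.println(maxTail);
    }
}
```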
In fact, if we were to increase the number of observables for the FM algorithms implemented, we would see similar behaviour in the estimations produced (as this would reduce the error). To contrast realized error versus theoretical error, we calculated the first differences of algorithm estimations in comparison to the deterministic algorithm:

Figure 6: First differences of the algorithm estimates displayed in Figure 5.

In general, the realized error of the Probabilistic Counting algorithms was not as bad as the theoretical estimates. The single-observable FM algorithm produced 30% less error in practice, while the four-observable FM algorithms both produced roughly 17% less error. Finally, the HyperLogLog algorithm in fact produced 3% more error than it did in theory. This can be explained by examining Figure 5, in which we see that after a relative tweet count of 10,000 the estimate began to diverge from the deterministic count. As more stream elements are processed, the HyperLogLog algorithm's error in fact increases. This is anticipated by Flajolet et al. in the HyperLogLog paper [1], and is counteracted by a large range correction factor which is not discussed in this document.

5 Conclusion

An integral part of accurate DSC algorithms is maintaining more than a single observable value on which to base estimates. We have seen stochastic averaging, a technique for simulating an arbitrary number of experiments on stream elements while using only a single hash function. This technique is the key to reducing the error of bit-pattern observable based DSC algorithms by an arbitrarily large amount. We have contrasted the realized and theoretical errors of a few bit-pattern observable algorithms in order to illustrate this fact. In summary, the HyperLogLog algorithm as implemented for this project provided the best estimate of the number of distinct stream elements. However, this accuracy could have been attained with the other algorithms implemented by using the stochastic averaging technique.

6 References

1. Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "HyperLogLog: The Analysis of a Near-optimal Cardinality Estimation Algorithm." Conference on Analysis of Algorithms, 2007.
2. "Streaming APIs." Twitter Developer Documentation. Twitter, n.d. Web.
3. Flajolet, Philippe, and G. Nigel Martin. "Probabilistic Counting Algorithms for Data Base Applications." Journal of Computer and System Sciences, 1985.
4. Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman. Mining of Massive Datasets. Web.
5. Durand, Marianne, and Philippe Flajolet. "Loglog Counting of Large Cardinalities." 2003. Web.
6. "Twitter4J." Twitter4J - A Java Library for the Twitter API. Web.
7. "Changing Your Username." Help Center. Twitter, n.d. Web.
8. Noll, Landon. "FNV Hash." Web.
9. Morin, Pat. "Hash Codes for Arrays and Strings." Open Data Structures. Print.


Optimal Parallel Randomized Renaming Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously

More information

Report on Cache-Oblivious Priority Queue and Graph Algorithm Applications[1]

Report on Cache-Oblivious Priority Queue and Graph Algorithm Applications[1] Report on Cache-Oblivious Priority Queue and Graph Algorithm Applications[1] Marc André Tanner May 30, 2014 Abstract This report contains two main sections: In section 1 the cache-oblivious computational

More information

Algorithm must complete after a finite number of instructions have been executed. Each step must be clearly defined, having only one interpretation.

Algorithm must complete after a finite number of instructions have been executed. Each step must be clearly defined, having only one interpretation. Algorithms 1 algorithm: a finite set of instructions that specify a sequence of operations to be carried out in order to solve a specific problem or class of problems An algorithm must possess the following

More information

Chapter 16 Heuristic Search

Chapter 16 Heuristic Search Chapter 16 Heuristic Search Part I. Preliminaries Part II. Tightly Coupled Multicore Chapter 6. Parallel Loops Chapter 7. Parallel Loop Schedules Chapter 8. Parallel Reduction Chapter 9. Reduction Variables

More information

MinHash Sketches: A Brief Survey. June 14, 2016

MinHash Sketches: A Brief Survey. June 14, 2016 MinHash Sketches: A Brief Survey Edith Cohen edith@cohenwang.com June 14, 2016 1 Definition Sketches are a very powerful tool in massive data analysis. Operations and queries that are specified with respect

More information

Parallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 16 Treaps; Augmented BSTs

Parallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 16 Treaps; Augmented BSTs Lecture 16 Treaps; Augmented BSTs Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) Lectured by Margaret Reid-Miller 8 March 2012 Today: - More on Treaps - Ordered Sets and Tables

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining

More information

Lesson 12 - Operator Overloading Customising Operators

Lesson 12 - Operator Overloading Customising Operators Lesson 12 - Operator Overloading Customising Operators Summary In this lesson we explore the subject of Operator Overloading. New Concepts Operators, overloading, assignment, friend functions. Operator

More information

Statistics for a Random Network Design Problem

Statistics for a Random Network Design Problem DIMACS Technical Report 2009-20 September 2009 Statistics for a Random Network Design Problem by Fred J. Rispoli 1 Fred_Rispoli@dowling.edu Steven Cosares cosares@aol.com DIMACS is a collaborative project

More information

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management Hashing Symbol Table Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management In general, the following operations are performed on

More information

IT 403 Practice Problems (1-2) Answers

IT 403 Practice Problems (1-2) Answers IT 403 Practice Problems (1-2) Answers #1. Using Tukey's Hinges method ('Inclusionary'), what is Q3 for this dataset? 2 3 5 7 11 13 17 a. 7 b. 11 c. 12 d. 15 c (12) #2. How do quartiles and percentiles

More information

Basic Statistical Terms and Definitions

Basic Statistical Terms and Definitions I. Basics Basic Statistical Terms and Definitions Statistics is a collection of methods for planning experiments, and obtaining data. The data is then organized and summarized so that professionals can

More information

31.6 Powers of an element

31.6 Powers of an element 31.6 Powers of an element Just as we often consider the multiples of a given element, modulo, we consider the sequence of powers of, modulo, where :,,,,. modulo Indexing from 0, the 0th value in this sequence

More information

Extraction of Evolution Tree from Product Variants Using Linear Counting Algorithm. Liu Shuchang

Extraction of Evolution Tree from Product Variants Using Linear Counting Algorithm. Liu Shuchang Extraction of Evolution Tree from Product Variants Using Linear Counting Algorithm Liu Shuchang 30 2 7 29 Extraction of Evolution Tree from Product Variants Using Linear Counting Algorithm Liu Shuchang

More information

Dinwiddie County Public Schools Subject: Math 7 Scope and Sequence

Dinwiddie County Public Schools Subject: Math 7 Scope and Sequence Dinwiddie County Public Schools Subject: Math 7 Scope and Sequence GRADE: 7 Year - 2013-2014 9 WKS Topics Targeted SOLS Days Taught Essential Skills 1 ARI Testing 1 1 PreTest 1 1 Quadrilaterals 7.7 4 The

More information

The p-sized partitioning algorithm for fast computation of factorials of numbers

The p-sized partitioning algorithm for fast computation of factorials of numbers J Supercomput (2006) 38:73 82 DOI 10.1007/s11227-006-7285-5 The p-sized partitioning algorithm for fast computation of factorials of numbers Ahmet Ugur Henry Thompson C Science + Business Media, LLC 2006

More information

Lecture 8: Mergesort / Quicksort Steven Skiena

Lecture 8: Mergesort / Quicksort Steven Skiena Lecture 8: Mergesort / Quicksort Steven Skiena Department of Computer Science State University of New York Stony Brook, NY 11794 4400 http://www.cs.stonybrook.edu/ skiena Problem of the Day Give an efficient

More information

Chapter 2: Number Systems

Chapter 2: Number Systems Chapter 2: Number Systems Logic circuits are used to generate and transmit 1s and 0s to compute and convey information. This two-valued number system is called binary. As presented earlier, there are many

More information

For searching and sorting algorithms, this is particularly dependent on the number of data elements.

For searching and sorting algorithms, this is particularly dependent on the number of data elements. Looking up a phone number, accessing a website and checking the definition of a word in a dictionary all involve searching large amounts of data. Searching algorithms all accomplish the same goal finding

More information

Nearest Neighbor Predictors

Nearest Neighbor Predictors Nearest Neighbor Predictors September 2, 2018 Perhaps the simplest machine learning prediction method, from a conceptual point of view, and perhaps also the most unusual, is the nearest-neighbor method,

More information

in this web service Cambridge University Press

in this web service Cambridge University Press 978-0-51-85748- - Switching and Finite Automata Theory, Third Edition Part 1 Preliminaries 978-0-51-85748- - Switching and Finite Automata Theory, Third Edition CHAPTER 1 Number systems and codes This

More information

Search trees, binary trie, patricia trie Marko Berezovský Radek Mařík PAL 2012

Search trees, binary trie, patricia trie Marko Berezovský Radek Mařík PAL 2012 Search trees, binary trie, patricia trie Marko Berezovský Radek Mařík PAL 212 p 2

More information

Lecture 3 Questions that we should be able to answer by the end of this lecture:

Lecture 3 Questions that we should be able to answer by the end of this lecture: Lecture 3 Questions that we should be able to answer by the end of this lecture: Which is the better exam score? 67 on an exam with mean 50 and SD 10 or 62 on an exam with mean 40 and SD 12 Is it fair

More information

Lecture 7 Quicksort : Principles of Imperative Computation (Spring 2018) Frank Pfenning

Lecture 7 Quicksort : Principles of Imperative Computation (Spring 2018) Frank Pfenning Lecture 7 Quicksort 15-122: Principles of Imperative Computation (Spring 2018) Frank Pfenning In this lecture we consider two related algorithms for sorting that achieve a much better running time than

More information

CHAPTER 2: SAMPLING AND DATA

CHAPTER 2: SAMPLING AND DATA CHAPTER 2: SAMPLING AND DATA This presentation is based on material and graphs from Open Stax and is copyrighted by Open Stax and Georgia Highlands College. OUTLINE 2.1 Stem-and-Leaf Graphs (Stemplots),

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

The Bounded Edge Coloring Problem and Offline Crossbar Scheduling

The Bounded Edge Coloring Problem and Offline Crossbar Scheduling The Bounded Edge Coloring Problem and Offline Crossbar Scheduling Jonathan Turner WUCSE-05-07 Abstract This paper introduces a variant of the classical edge coloring problem in graphs that can be applied

More information

Lecture 3 Questions that we should be able to answer by the end of this lecture:

Lecture 3 Questions that we should be able to answer by the end of this lecture: Lecture 3 Questions that we should be able to answer by the end of this lecture: Which is the better exam score? 67 on an exam with mean 50 and SD 10 or 62 on an exam with mean 40 and SD 12 Is it fair

More information

Consider the actions taken on positive integers when considering decimal values shown in Table 1 where the division discards the remainder.

Consider the actions taken on positive integers when considering decimal values shown in Table 1 where the division discards the remainder. 9.3 Mapping Down to 0,..., M 1 In our previous step, we discussed methods for taking various objects and deterministically creating a 32- bit hash value based on the properties of the object. Hash tables,

More information

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms Introduction graph theory useful in practice represent many real-life problems can be if not careful with data structures Chapter 9 Graph s 2 Definitions Definitions an undirected graph is a finite set

More information

Recitation 11. Graph Contraction and MSTs Announcements. SegmentLab has been released, and is due Friday, April 14. It s worth 135 points.

Recitation 11. Graph Contraction and MSTs Announcements. SegmentLab has been released, and is due Friday, April 14. It s worth 135 points. Recitation Graph Contraction and MSTs. Announcements SegmentLab has been released, and is due Friday, April. It s worth 5 points. Midterm is on Friday, April. 5 RECITATION. GRAPH CONTRACTION AND MSTS.

More information

Approximation Algorithms for Clustering Uncertain Data

Approximation Algorithms for Clustering Uncertain Data Approximation Algorithms for Clustering Uncertain Data Graham Cormode AT&T Labs - Research graham@research.att.com Andrew McGregor UCSD / MSR / UMass Amherst andrewm@ucsd.edu Introduction Many applications

More information

Distributed Sampling in a Big Data Management System

Distributed Sampling in a Big Data Management System Distributed Sampling in a Big Data Management System Dan Radion University of Washington Department of Computer Science and Engineering Undergraduate Departmental Honors Thesis Advised by Dan Suciu Contents

More information

COMP 250. Lecture 27. hashing. Nov. 10, 2017

COMP 250. Lecture 27. hashing. Nov. 10, 2017 COMP 250 Lecture 27 hashing Nov. 10, 2017 1 RECALL Map keys (type K) values (type V) Each (key, value) pairs is an entry. For each key, there is at most one value. 2 RECALL Special Case keys are unique

More information

CS125 : Introduction to Computer Science. Lecture Notes #38 and #39 Quicksort. c 2005, 2003, 2002, 2000 Jason Zych

CS125 : Introduction to Computer Science. Lecture Notes #38 and #39 Quicksort. c 2005, 2003, 2002, 2000 Jason Zych CS125 : Introduction to Computer Science Lecture Notes #38 and #39 Quicksort c 2005, 2003, 2002, 2000 Jason Zych 1 Lectures 38 and 39 : Quicksort Quicksort is the best sorting algorithm known which is

More information

Comparing Implementations of Optimal Binary Search Trees

Comparing Implementations of Optimal Binary Search Trees Introduction Comparing Implementations of Optimal Binary Search Trees Corianna Jacoby and Alex King Tufts University May 2017 In this paper we sought to put together a practical comparison of the optimality

More information

Java How to Program, 9/e. Copyright by Pearson Education, Inc. All Rights Reserved.

Java How to Program, 9/e. Copyright by Pearson Education, Inc. All Rights Reserved. Java How to Program, 9/e Copyright 1992-2012 by Pearson Education, Inc. All Rights Reserved. Searching data involves determining whether a value (referred to as the search key) is present in the data

More information

Lecture 12 Notes Hash Tables

Lecture 12 Notes Hash Tables Lecture 12 Notes Hash Tables 15-122: Principles of Imperative Computation (Spring 2016) Frank Pfenning, Rob Simmons 1 Introduction In this lecture we re-introduce the dictionaries that were implemented

More information

Randomized Algorithms, Quicksort and Randomized Selection

Randomized Algorithms, Quicksort and Randomized Selection CMPS 2200 Fall 2017 Randomized Algorithms, Quicksort and Randomized Selection Carola Wenk Slides by Carola Wenk and Charles Leiserson CMPS 2200 Intro. to Algorithms 1 Deterministic Algorithms Runtime for

More information

Project 0: Implementing a Hash Table

Project 0: Implementing a Hash Table CS: DATA SYSTEMS Project : Implementing a Hash Table CS, Data Systems, Fall Goal and Motivation. The goal of Project is to help you develop (or refresh) basic skills at designing and implementing data

More information

Goal of the course: The goal is to learn to design and analyze an algorithm. More specifically, you will learn:

Goal of the course: The goal is to learn to design and analyze an algorithm. More specifically, you will learn: CS341 Algorithms 1. Introduction Goal of the course: The goal is to learn to design and analyze an algorithm. More specifically, you will learn: Well-known algorithms; Skills to analyze the correctness

More information

We have seen that as n increases, the length of our confidence interval decreases, the confidence interval will be more narrow.

We have seen that as n increases, the length of our confidence interval decreases, the confidence interval will be more narrow. {Confidence Intervals for Population Means} Now we will discuss a few loose ends. Before moving into our final discussion of confidence intervals for one population mean, let s review a few important results

More information

Sorting. Order in the court! sorting 1

Sorting. Order in the court! sorting 1 Sorting Order in the court! sorting 1 Importance of sorting Sorting a list of values is a fundamental task of computers - this task is one of the primary reasons why people use computers in the first place

More information

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = (

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = ( Floating Point Numbers in Java by Michael L. Overton Virtually all modern computers follow the IEEE 2 floating point standard in their representation of floating point numbers. The Java programming language

More information

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc.

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc. C is one of many capability metrics that are available. When capability metrics are used, organizations typically provide

More information

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms Chapter 9 Graph Algorithms 2 Introduction graph theory useful in practice represent many real-life problems can be if not careful with data structures 3 Definitions an undirected graph G = (V, E) is a

More information

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul 1 CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul Introduction Our problem is crawling a static social graph (snapshot). Given

More information

A Visualization Program for Subset Sum Instances

A Visualization Program for Subset Sum Instances A Visualization Program for Subset Sum Instances Thomas E. O Neil and Abhilasha Bhatia Computer Science Department University of North Dakota Grand Forks, ND 58202 oneil@cs.und.edu abhilasha.bhatia@my.und.edu

More information

Design and Analysis of Algorithms (VII)

Design and Analysis of Algorithms (VII) Design and Analysis of Algorithms (VII) An Introduction to Randomized Algorithms Guoqiang Li School of Software, Shanghai Jiao Tong University Randomized Algorithms algorithms which employ a degree of

More information

Final Exam in Algorithms and Data Structures 1 (1DL210)

Final Exam in Algorithms and Data Structures 1 (1DL210) Final Exam in Algorithms and Data Structures 1 (1DL210) Department of Information Technology Uppsala University February 0th, 2012 Lecturers: Parosh Aziz Abdulla, Jonathan Cederberg and Jari Stenman Location:

More information

Chapter 5 VARIABLE-LENGTH CODING Information Theory Results (II)

Chapter 5 VARIABLE-LENGTH CODING Information Theory Results (II) Chapter 5 VARIABLE-LENGTH CODING ---- Information Theory Results (II) 1 Some Fundamental Results Coding an Information Source Consider an information source, represented by a source alphabet S. S = { s,

More information

O(n): printing a list of n items to the screen, looking at each item once.

O(n): printing a list of n items to the screen, looking at each item once. UNIT IV Sorting: O notation efficiency of sorting bubble sort quick sort selection sort heap sort insertion sort shell sort merge sort radix sort. O NOTATION BIG OH (O) NOTATION Big oh : the function f(n)=o(g(n))

More information

Lecture 12 Hash Tables

Lecture 12 Hash Tables Lecture 12 Hash Tables 15-122: Principles of Imperative Computation (Spring 2018) Frank Pfenning, Rob Simmons Dictionaries, also called associative arrays as well as maps, are data structures that are

More information

CITS2200 Data Structures and Algorithms. Topic 15. Hash Tables

CITS2200 Data Structures and Algorithms. Topic 15. Hash Tables CITS2200 Data Structures and Algorithms Topic 15 Hash Tables Introduction to hashing basic ideas Hash functions properties, 2-universal functions, hashing non-integers Collision resolution bucketing and

More information

modern database systems lecture 10 : large-scale graph processing

modern database systems lecture 10 : large-scale graph processing modern database systems lecture 1 : large-scale graph processing Aristides Gionis spring 18 timeline today : homework is due march 6 : homework out april 5, 9-1 : final exam april : homework due graphs

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

COMP 250 Winter generic types, doubly linked lists Jan. 28, 2016

COMP 250 Winter generic types, doubly linked lists Jan. 28, 2016 COMP 250 Winter 2016 5 generic types, doubly linked lists Jan. 28, 2016 Java generics In our discussion of linked lists, we concentrated on how to add or remove a node from the front or back of a list.

More information

Disjoint Sets. The obvious data structure for disjoint sets looks like this.

Disjoint Sets. The obvious data structure for disjoint sets looks like this. CS61B Summer 2006 Instructor: Erin Korber Lecture 30: 15 Aug. Disjoint Sets Given a set of elements, it is often useful to break them up or partition them into a number of separate, nonoverlapping groups.

More information

9/24/ Hash functions

9/24/ Hash functions 11.3 Hash functions A good hash function satis es (approximately) the assumption of SUH: each key is equally likely to hash to any of the slots, independently of the other keys We typically have no way

More information