Sublinear Models for Streaming and/or Distributed Data

Size: px

Start display at page:

Download "Sublinear Models for Streaming and/or Distributed Data"

Emil Eaton
5 years ago
Views:

1 Sublinear Models for Streaming and/or Distributed Data Qin Zhang Guest lecture in B649 Feb. 3,

2 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of pictures

3 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of pictures... Magazine covers Nature 06 Nature 08 CACM 08 Economist

4 Source and Challenge Source Retailer databases Logistics, financial & health data Social network Pictures by mobile devices Internet of Things New forms of scientific data 3-1

5 Source and Challenge Source Retailer databases Logistics, financial & health data Social network Pictures by mobile devices Internet of Things New forms of scientific data Challenge Volume Velocity Variety (Documents, Stock records, Personal profiles, Photographs, Audio & Video, 3D models, Location data,... ) 3-2

6 Source and Challenge Source Retailer databases Logistics, financial & health data Social network Pictures by mobile devices Internet of Things New forms of scientific data Challenge Volume Velocity } The focus of algorithm design Variety (Documents, Stock records, Personal profiles, Photographs, Audio & Video, 3D models, Location data,... ) 3-3

7 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,

8 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? 4-2

9 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. Sublinear in time 4-3

10 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Sublinear in time 4-4

11 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Sublinear in time Store on the disk (page/block access) Sublinear in I/O 4-5

12 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Store on the disk (page/block access) Throw some of them away. Sublinear in time Sublinear in I/O Sublinear in space 4-6

13 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Store on the disk (page/block access) Throw some of them away. Sublinear in time Sublinear in I/O Sublinear in space The data is too big to be stored in a single machine. What can we do if we do not want to throw them away? 4-7

14 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Sublinear in time Store on the disk (page/block access) Throw some of them away. Sublinear in I/O Sublinear in space The data is too big to be stored in a single machine. What can we do if we do not want to throw them away? Store in multiple machines, which collaborate via communication Sublinear in communication 4-8

15 What do we mean by sublinear? Time/space/communication spent is o(input size) 5-1

16 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing 6-1

17 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing Sublinear space algorithms Data stream algorithms 6-2

18 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing Sublinear space algorithms Data stream algorithms Sublinear communication algorithms Multiparty communication protocols/algorithms (special case: MapReduce, BSP, MPI,... ) 6-3

19 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing Sublinear space algorithms Data stream algorithms Sublinear communication algorithms Multiparty communication protocols/algorithms (special case: MapReduce, BSP, MPI,... ) Sublinear I/O algorithms External memory data structures/algorithms 6-4

20 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing Sublinear space algorithms Data stream algorithms (today) Sublinear communication algorithms (today) Multiparty communication protocols/algorithms (special case: MapReduce, BSP, MPI,... ) Sublinear I/O algorithms External memory data structures/algorithms 6-5

21 Sublinear in space The data stream model (Alon, Matias and Szegedy 1996) RAM CPU 7-1

22 Sublinear in space The data stream model (Alon, Matias and Szegedy 1996) RAM Applications Internet Router. Packets limited space CPU Router The router wants to maintain some statistics on data. E.g., want to detect anomalies for security. 7-2 Stock data, ad auction, flight logs on tapes, etc.

23 Why hard? You do see everything but then forget! Game 1: A sequence of numbers 8-1

24 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

25 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

26 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

27 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

28 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

29 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

30 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

31 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

32 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

33 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

34 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

35 Why hard? You do see everything but then forget! Game 1: A sequence of numbers Q: What s the median? 8-13

36 Why hard? You do see everything but then forget! Game 1: A sequence of numbers Q: What s the median? A:

37 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul 9-1

38 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Alice and Bob become friends 9-2

39 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Carol and Eva become friends 9-3

40 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Eva and Bob become friends 9-4

41 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Dave and Paul become friends 9-5

42 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Alice and Paul become friends 9-6

43 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Eva and Bob unfriends 9-7

44 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Alice and Dave become friends 9-8

45 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Bob and Paul become friends 9-9

46 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Dave and Paul unfriends 9-10

47 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Dave and Carol become friends 9-11

48 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Q: Are Eva and Bob connected by friends? 9-12

49 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Q: Are Eva and Bob connected by friends? A: YES. Eva Carol Dave Alice Bob 9-13

50 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Q: Are Eva and Bob connected by friends? A: YES. Eva Carol Dave Alice Bob Have to allow approx/randomization given a small memory. 9-14

51 A toy example: Reservoir Sampling Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? 10-1

52 A toy example: Reservoir Sampling Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? Algorithm: Store 1-st item. When the i-th (i > 1) item arrives With probability 1/i, replace the current sample; With probability 1 1/i, throw it away. 10-2

53 A toy example: Reservoir Sampling Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? Algorithm: Store 1-st item. When the i-th (i > 1) item arrives With probability 1/i, replace the current sample; With probability 1 1/i, throw it away. Correctness: each item is included in the final sample w.p. 1 i (1 1 i+1 )... (1 1 n ) = 1 n (n: total # items) Space: O(1) 10-3

54 Sublinear in communication Communication time, energy, bandwidth,... Input Output Sensor networks Map Shuffle Reduce The MapReduce model. E.g., Hadoop, Google MapReduce. The BSP model. E.g., Apache Giraph, Pregel. Data in the cloud 11-1

55 Sublinear in communication Communication time, energy, bandwidth,... Input Output Sensor networks Map Shuffle Reduce The MapReduce model. E.g., Hadoop, Google MapReduce. The BSP model. E.g., Apache Giraph, Pregel. Abstraction Data in the cloud 11-2

model. E.g., Hadoop, Google MapReduce. The BSP model. E.g., Apache Giraph, Pregel.

56 Sublinear in communication Communication time, energy, bandwidth,... Input Output Sensor networks Map Shuffle Reduce The MapReduce model. E.g., Hadoop, Google MapReduce. The BSP model. E.g., Apache Giraph, Pregel. Data in the cloud Abstraction = C S 1 S 2 S 3 S k 11-3

57 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. 12-1

58 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. 12-2

59 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. C S 1 S 2 S 3 S k 12-3

60 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. C S 1 S 2 S 3 S k A trivial solution: each machine sends a local spanning tree to the first machine. Cost O(kn log n) bits. 12-4

61 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. C S 1 S 2 S 3 S k A trivial solution: each machine sends a local spanning tree to the first machine. Cost O(kn log n) bits. Can we do better, e.g., o(kn) bits of communication? 12-5

62 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. C S 1 S 2 S 3 S k A trivial solution: each machine sends a local spanning tree to the first machine. Cost O(kn log n) bits Can we do better, e.g., o(kn) bits of communication? What if the graph is node partitioned among the k machines? That is, each node is stored in 1 machine with all adjancent edges.

63 Problems Statistical problems Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy

64 Problems Statistical problems Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree

65 Problems Statistical problems Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree... L p regression Numerical linear algebra Low-rank approximation

.. Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree.

66 Problems Statistical problems 13-4 Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... Numerical linear algebra L p regression Low-rank approximation... Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree... DB queries Strings Geometry problems Clustering Conjuntive queries Edit distance Earth-Mover Distance... Longest increasing sequence

67 Do the streaming and the multi-party computation model capture all? 14-1

68 Do the streaming and the multi-party computation model capture all? Sensor network: each is monitoring a stream of data 14-2

69 Do the streaming and the multi-party computation model capture all? Sensor network: each is monitoring a stream of data Data can be noisy Music, Docs,... After compressions 14-3

70 Distributed monitoring: sublinear space and comm. C S 1 S 2 S 3 S k The coordinator needs to maintain some function defined on the union of k streams at any time 15-1

71 Distributed monitoring: sublinear space and comm. C S 1 S 2 S 3 S k The coordinator needs to maintain some function defined on the union of k streams at any time e.g., total # items 15-2

72 Distributed monitoring: sublinear space and comm. C S 1 S 2 S 3 S k The coordinator needs to maintain some function defined on the union of k streams at any time e.g., total # items Goal: minimize communication (main), space and processing time per item at sites. 15-3

73 Distributed monitoring: sublinear space and comm. C S 1 S 2 S 3 S k The coordinator needs to maintain some function defined on the union of k streams at any time e.g., total # items Almost always allow approximation Goal: minimize communication (main), space and processing time per item at sites. 15-4

74 Reservoir Sampling doesn t work Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? Algorithm: Store 1-st item. When the i-th (i > 1) item arrives With probability 1/i, replace the current sample; With probability 1 1/i, throw it away. C Why this doesn t work? S 1 S 2 S 3 S k 16-1

75 Random sampling Initialize i = 0 In epoch i: Sites send every item w.pr. 2 i Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. upper lower C S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) 17-1

76 Random sampling Initialize i = 0 In epoch i: upper lower C Sites send every item w.pr. 2 i Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the coordinator broadcasts to k sites advance to epoch i i + 1 Discards the upper sample Randomly splits the lower sample into a new lower and an upper sample 17-2

77 Random sampling Initialize i = 0 In epoch i: upper lower C Sites send every item w.pr. 2 i Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the coordinator broadcasts to k sites advance to epoch i i + 1 Discards the upper sample Randomly splits the lower sample into a new lower and an upper sample Correctness: (1): In epoch i, each item is maintained in C w. pr. 2 i (2): Always s items are maintained in C 17-3

78 Noisy data Datasets may potentially be noisy If we consider similar elements as one element, then what is the number of distinct elements? C S 1 S 2 S 3 S k John Smith, 800 Mountain Av springfield Joe Smith, 100 Mount Av Springfield Joseph Smith, 800 Mt. Road Springfield Joe Smith, 800 Mt. Road Springfield 18-1

79 Noisy data (cont.) Goal: compute some good estimation while minimize communication and #rounds C S 1 S 2 S 3 S k In the coordinator model, how can we perform noise-resilient computation, given that each site employs a comparison metric which can tell whether two elements are the same or not? 19-1

80 Noisy data (cont.) Goal: compute some good estimation while minimize communication and #rounds C S 1 S 2 S 3 S k In the coordinator model, how can we perform noise-resilient computation, given that each site employs a comparison metric which can tell whether two elements are the same or not? The design of comparison metric is a separate issue. Can use some distance functions. 19-2

81 Noisy data (cont.) Goal: compute some good estimation while minimize communication and #rounds C S 1 S 2 S 3 S k In the coordinator model, how can we perform noise-resilient computation, given that each site employs a comparison metric which can tell whether two elements are the same or not? The design of comparison metric is a separate issue. Can use some distance functions Many techniques for noise-free datasets cannot be used. Often needs very different approaches.

82 Summary for today We have introduced a few models for big data : the streaming model, multiparty computation, distributed monitoring, etc. 20-1

83 Summary for today We have introduced a few models for big data : the streaming model, multiparty computation, distributed monitoring, etc. We have talked about why algorithmic design in these models/paradigms are difficult. 20-2

84 Summary for today We have introduced a few models for big data : the streaming model, multiparty computation, distributed monitoring, etc. We have talked about why algorithmic design in these models/paradigms are difficult. We have given a few examples. 20-3

85 Thank you! Good luck with B649! 21-1

86 Attribution Some examples are borrowed from Andrew McGregor s course CS711S12/index.html 22-1

87 Maintain a sample for Sliding Windows Tasks: Find a uniform sample from the last w items. 23-1

88 Maintain a sample for Sliding Windows Tasks: Find a uniform sample from the last w items. Algorithm: For each x i, we pick a random value v i (0, 1). In a window < x j w+1,..., x j >, return value x i with smallest v i. To do this, maintain the set of all x i in sliding window whose v i value is minimal among subsequent values. 23-2

89 Maintain a sample for Sliding Windows Tasks: Find a uniform sample from the last w items. Algorithm: For each x i, we pick a random value v i (0, 1). In a window < x j w+1,..., x j >, return value x i with smallest v i. To do this, maintain the set of all x i in sliding window whose v i value is minimal among subsequent values. Correctness: Obvious. Space (expected): 1/w + 1/(w 1) /1 = log w. 23-3

B561 Advanced Database Concepts Streaming Model. Qin Zhang 1-1

B561 Advanced Database Concepts 2.2. Streaming Model Qin Zhang 1-1 Data Streams Continuous streams of data elements (massive possibly unbounded, rapid, time-varying) Some examples: 1. network monitoring