Mining data streams. Irene Finocchi. finocchi/ Intro Puzzles

Size: px

Start display at page:

Download "Mining data streams. Irene Finocchi. finocchi/ Intro Puzzles"

Derrick Bates
5 years ago
Views:

Intro Puzzles Mining data streams Irene Finocchi finocchi@di.uniroma1.

1 Intro Puzzles Mining data streams Irene Finocchi finocchi/ 1 / 33 Irene Finocchi Mining data streams

Intro Puzzles Stream sources Data stream model Storing data is critical Data is growing faster than our ability to store and index it: networking: phone call networks, Internet,

2 Intro Puzzles Stream sources Data stream model Storing data is critical Data is growing faster than our ability to store and index it: networking: phone call networks, Internet, social networks scientific data: astronomical data, genome sequences, GIS geo-spatial data economic transactions: credit cards, online auctions... 2 / 33 Irene Finocchi Mining data streams

3 Intro Puzzles Stream sources Data stream model Network management Monitoring flow of IP packets through the routers (Internet traffic): how many IP addresses used a given link in the last month? which are the top 100 IP addresses in terms of traffic? which destinations use most bandwidth? what s the average duration of an IP session? which hosts have similar usage patterns (clusters)? does traffic distribution change in different periods of time? 3 / 33 Irene Finocchi Mining data streams

Intro Puzzles Stream sources Data stream model Network management Monitoring flow of IP packets through the routers (Internet traffic): how many IP addresses used a given link in the last month?

4 Intro Puzzles Stream sources Data stream model Network management Monitoring flow of IP packets through the routers (Internet traffic): how many IP addresses used a given link in the last month? which are the top 100 IP addresses in terms of traffic? which destinations use most bandwidth? what s the average duration of an IP session? which hosts have similar usage patterns (clusters)? does traffic distribution change in different periods of time? Up to 1 Billion packets per hour per router Many hundreds of routers per ISP Many terabytes of data per hour! 3 / 33 Irene Finocchi Mining data streams

5 Sensor data Intro Puzzles Stream sources Data stream model Sensors with GPS unit deployed in the ocean: Each sensor reports surface height (4-byte real number) every tenth of second Base station receives 3.5 MB per day per sensor 4 / 33 Irene Finocchi Mining data streams

Sensor data Intro Puzzles Stream sources Data stream model Sensors with GPS unit deployed in the ocean: Each sensor reports surface height (4-byte real number) every tenth of second Base station

6 Sensor data Intro Puzzles Stream sources Data stream model Sensors with GPS unit deployed in the ocean: Each sensor reports surface height (4-byte real number) every tenth of second Base station receives 3.5 MB per day per sensor What about a million sensors? 3.5 TB of data per day, coming at a high rate A million sensors isn t very many: roughly one sensor per 150 square miles of ocean... 4 / 33 Irene Finocchi Mining data streams

7 More streams... Intro Puzzles Stream sources Data stream model Image data satellites send down to earth many TBs of images per day surveillance cameras produce roughly one image per second: London has about six millions such cameras 5 / 33 Irene Finocchi Mining data streams

8 More streams... Intro Puzzles Stream sources Data stream model Image data satellites send down to earth many TBs of images per day surveillance cameras produce roughly one image per second: London has about six millions such cameras Web traffic Google receives several hundreds million search queries per day 5 / 33 Irene Finocchi Mining data streams

down to earth many TBs of images per day surveillance cameras produce roughly

trend analysis in online auction systems, users continuously submit bids for

9 More streams... Intro Puzzles Stream sources Data stream model Image data satellites send down to earth many TBs of images per day surveillance cameras produce roughly one image per second: London has about six millions such cameras Economic trend analysis in online auction systems, users continuously submit bids for items and items for auction Web traffic Google receives several hundreds million search queries per day 5 / 33 Irene Finocchi Mining data streams

10 Intro Puzzles Stream sources Data stream model Issues in data stream processing Some features common to all these applications: huge volumes of data (terabytes, even petabytes) records arrive at a rapid rate need to monitor data continuously to support exploratory analyses and to detect correlations, patterns, rare events, fraud, intrusion, unusual activities 6 / 33 Irene Finocchi Mining data streams

11 Intro Puzzles Stream sources Data stream model Issues in data stream processing Some features common to all these applications: huge volumes of data (terabytes, even petabytes) records arrive at a rapid rate need to monitor data continuously to support exploratory analyses and to detect correlations, patterns, rare events, fraud, intrusion, unusual activities Many problems about streaming data would be easy to solve if we had enough memory, but require new techniques for realistic data rates and sizes What can be computed without even storing the input? 6 / 33 Irene Finocchi Mining data streams

12 Intro Puzzles Stream sources Data stream model Basic data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} 7 / 33 Irene Finocchi Mining data streams

13 Intro Puzzles Stream sources Data stream model Basic data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} Input parameters: m and n 1 Stream σ is massively long. Stream length m is: typically unknown possibly infinite 7 / 33 Irene Finocchi Mining data streams

14 Intro Puzzles Stream sources Data stream model Basic data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} Input parameters: m and n 1 Stream σ is massively long. Stream length m is: typically unknown possibly infinite 2 Universe size n is also typically very large (e.g., IP addresses, URLs, item prices) 7 / 33 Irene Finocchi Mining data streams

15 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 8 / 33 Irene Finocchi Mining data streams

16 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 8 / 33 Irene Finocchi Mining data streams

17 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 8 / 33 Irene Finocchi Mining data streams

18 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 3 Use small per-item processing time t 8 / 33 Irene Finocchi Mining data streams

19 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 3 Use small per-item processing time t 8 / 33 Irene Finocchi Mining data streams

Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits

20 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 3 Use small per-item processing time t s = O(log m + log n) Happy if s = O(polylog(min{n, m}) p = 1 t = O(1) 8 / 33 Irene Finocchi Mining data streams

21 Token frequencies Intro Puzzles Stream sources Data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} 9 / 33 Irene Finocchi Mining data streams

22 Token frequencies Intro Puzzles Stream sources Data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} σ represents a multiset of items and implicitly defines a frequency vector f = f 1, f 2,...f n where f i = number of occurrences of item i [n] in σ Example If σ = 2, 1, 2, 1, 5, 2, 3, 2 and n = 5, then f = 2, 4, 1, 0, 1 9 / 33 Irene Finocchi Mining data streams

23 Token frequencies Intro Puzzles Stream sources Data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} σ represents a multiset of items and implicitly defines a frequency vector f = f 1, f 2,...f n where f i = number of occurrences of item i [n] in σ Example If σ = 2, 1, 2, 1, 5, 2, 3, 2 and n = 5, then f = 2, 4, 1, 0, 1 In many streaming problems, wish to compute some statistical properties of the multiset: e.g., majority token (if any), most frequent items, or number of distinct items 9 / 33 Irene Finocchi Mining data streams

24 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i 10 / 33 Irene Finocchi Mining data streams

25 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i Basic data stream model: c i = 1 (m = stream length) 10 / 33 Irene Finocchi Mining data streams

26 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i Basic data stream model: c i = 1 (m = stream length) Cash register model: c i > 0 (items can only arrive, their frequencies can be incremented by variable amounts) 10 / 33 Irene Finocchi Mining data streams

27 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i Basic data stream model: c i = 1 (m = stream length) Cash register model: c i > 0 (items can only arrive, their frequencies can be incremented by variable amounts) Turnstile model: generic c i (items can arrive and depart from the multiset) 10 / 33 Irene Finocchi Mining data streams

28 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) 11 / 33 Irene Finocchi Mining data streams

29 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity in the last fifteen years: theoretical interest: easy-to-state, but hard-to-solve problems links to other theory areas and to novel computing paradigms (MapReduce) 11 / 33 Irene Finocchi Mining data streams

30 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity in the last fifteen years: theoretical interest: easy-to-state, but hard-to-solve problems links to other theory areas and to novel computing paradigms (MapReduce) practical appeal: fast and effective solutions, wide applicability 11 / 33 Irene Finocchi Mining data streams

Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity

paradigms (MapReduce) practical appeal: fast and effective solutions, wide applicability Alon, Matias & Szegedy: Gödel prize (2005) for their paper

31 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity in the last fifteen years: theoretical interest: easy-to-state, but hard-to-solve problems links to other theory areas and to novel computing paradigms (MapReduce) practical appeal: fast and effective solutions, wide applicability Alon, Matias & Szegedy: Gödel prize (2005) for their paper on frequency moments approximation (STOC 96, JCSS 99), foundational work for streaming and sketching algorithms 11 / 33 Irene Finocchi Mining data streams

32 Intro Puzzles Missing number Fishing Pointer & chaser Recap Three puzzles Data stream challenges 12 / 33 Irene Finocchi Mining data streams

33 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing 13 / 33 Irene Finocchi Mining data streams

34 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing What s the missing number? 13 / 33 Irene Finocchi Mining data streams

35 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing What s the missing number? Constraint: Carole has limited memory: she can only use O(log n) bits 13 / 33 Irene Finocchi Mining data streams

36 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing What s the missing number? Constraint: Carole has limited memory: she can only use O(log n) bits n(n 1) 2 n 1 i=1 π i 13 / 33 Irene Finocchi Mining data streams

37 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! 14 / 33 Irene Finocchi Mining data streams

38 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track { S = n(n+1) 2 n 2 i=1 π i P = n! Π n 2 i=1 π i Solve equations x + y = S and x y = P 14 / 33 Irene Finocchi Mining data streams

39 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track { S = n(n+1) 2 n 2 i=1 π i P = n! Π n 2 i=1 π i Solve equations x + y = S and x y = P How many bits? Ω(log n!) = Ω(n log n) 14 / 33 Irene Finocchi Mining data streams

40 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! 15 / 33 Irene Finocchi Mining data streams

41 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track S 1 = n(n 1) 2 n 2 i=1 π i S 2 = n(n+1)(2n+1) 6 n 2 i=1 π2 i Solve equations x + y = S 1 and x 2 + y 2 = S 2 15 / 33 Irene Finocchi Mining data streams

42 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track S 1 = n(n 1) 2 n 2 i=1 π i S 2 = n(n+1)(2n+1) 6 n 2 i=1 π2 i Solve equations x + y = S 1 and x 2 + y 2 = S 2 How many bits? O(log n 3 ) = O(log n) 15 / 33 Irene Finocchi Mining data streams

43 Lesson 1 Intro Puzzles Missing number Fishing Pointer & chaser Recap Some problems can be deterministically solved in: logarithmic space one pass 16 / 33 Irene Finocchi Mining data streams

44 Lesson 1 Intro Puzzles Missing number Fishing Pointer & chaser Recap Some problems can be deterministically solved in: logarithmic space one pass Most of the times, we re not so lucky 16 / 33 Irene Finocchi Mining data streams

45 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe 17 / 33 Irene Finocchi Mining data streams

46 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t 17 / 33 Irene Finocchi Mining data streams

47 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t 17 / 33 Irene Finocchi Mining data streams

48 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 17 / 33 Irene Finocchi Mining data streams

49 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 Rarity of catch at time t: ρ t = {j f t[j] = 1} u = R t u 17 / 33 Irene Finocchi Mining data streams

50 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 Rarity of catch at time t: ρ t = {j f t[j] = 1} u = R t u George is curious and wants to compute rarity 17 / 33 Irene Finocchi Mining data streams

51 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 Rarity of catch at time t: ρ t = {j f t[j] = 1} u = R t u George is curious and wants to compute rarity 2u-bit vector would suffice... but George s suitcase has o(u) size 17 / 33 Irene Finocchi Mining data streams

52 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction 18 / 33 Irene Finocchi Mining data streams

53 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound 18 / 33 Irene Finocchi Mining data streams

54 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : 18 / 33 Irene Finocchi Mining data streams

55 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : if i S, then R t+1 = R t + 1 and ρ t+1 > ρ t 18 / 33 Irene Finocchi Mining data streams

56 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : if i S, then R t+1 = R t + 1 and ρ t+1 > ρ t if i S, then R t+1 = R t 1 and ρ t+1 < ρ t 18 / 33 Irene Finocchi Mining data streams

57 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : if i S, then R t+1 = R t + 1 and ρ t+1 > ρ t if i S, then R t+1 = R t 1 and ρ t+1 < ρ t Hence ρ decreases i S 18 / 33 Irene Finocchi Mining data streams

58 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (1/2) Sampling: George can approximate ρ t using 2k = o(u) bits pick k random fish species maintain rarity c 1 [t],... c k [t] of each sampled species (2 bits) Return ρ t = {i [1, k] c i[t] = 1} k = R t k 19 / 33 Irene Finocchi Mining data streams

59 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (1/2) Sampling: George can approximate ρ t using 2k = o(u) bits pick k random fish species maintain rarity c 1 [t],... c k [t] of each sampled species (2 bits) Return ρ t = {i [1, k] c i[t] = 1} k Claim: E[ ρ t ] = ρ t = R t k 19 / 33 Irene Finocchi Mining data streams

60 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (1/2) Sampling: George can approximate ρ t using 2k = o(u) bits pick k random fish species maintain rarity c 1 [t],... c k [t] of each sampled species (2 bits) Return ρ t = {i [1, k] c i[t] = 1} k Claim: E[ ρ t ] = ρ t = R t k If ρ t large enough, ρ t is a good estimate for ρ t with arbitrarily small precision and good probability Requires more advanced probabilistic tools: examples later 19 / 33 Irene Finocchi Mining data streams

61 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k 20 / 33 Irene Finocchi Mining data streams

62 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise 20 / 33 Irene Finocchi Mining data streams

63 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t 20 / 33 Irene Finocchi Mining data streams

64 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t E[Y i ] = ρ t 20 / 33 Irene Finocchi Mining data streams

65 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t E[Y i ] = ρ t E[ R t ] = k i=1 E[Y i] = kρ t 20 / 33 Irene Finocchi Mining data streams

66 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t E[Y i ] = ρ t E[ R t ] = k i=1 E[Y i] = kρ t E[ ρ t ] = E[ R t ] k = ρ t 20 / 33 Irene Finocchi Mining data streams

67 Lesson 2 Intro Puzzles Missing number Fishing Pointer & chaser Recap It is often impossible to solve problems precisely and deterministically in small (sublinear) space Randomization and approximation greatly help: find an answer correct within some factor (guarantee that ρ is within 10% of ρ) allow a small probability of failure (answer is correct, except with probability 1 in 10,000) 21 / 33 Irene Finocchi Mining data streams

68 Pointer and chaser Intro Puzzles Missing number Fishing Pointer & chaser Recap Paul has n + 1 pointers For each pointer i, he points to a position P[i] [1, n] n= / 33 Irene Finocchi Mining data streams

69 Pointer and chaser Intro Puzzles Missing number Fishing Pointer & chaser Recap Paul has n + 1 pointers For each pointer i, he points to a position P[i] [1, n] n= Carole has to guess any duplicate pointer Constraints: O(log n) bits O(n) queries cannot move items 22 / 33 Irene Finocchi Mining data streams

70 Repeated scans Intro Puzzles Missing number Fishing Pointer & chaser Recap n= Trivial solution for each i, count how many j are such that P[j]=i O(log n) bits, but O(n 2 ) queries 23 / 33 Irene Finocchi Mining data streams

71 Repeated scans Intro Puzzles Missing number Fishing Pointer & chaser Recap n= Trivial solution for each i, count how many j are such that P[j]=i O(log n) bits, but O(n 2 ) queries 2 Better solution if # of items below n/2 > # of items above n/2 then search for duplicates < n/2 else search for duplicates n/2 O(log n) bits and passes, O(n log n) queries 3 With O(log n) bits, Ω(log n/ log log n) passes are needed 23 / 33 Irene Finocchi Mining data streams

72 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 24 / 33 Irene Finocchi Mining data streams

73 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 25 / 33 Irene Finocchi Mining data streams

74 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 26 / 33 Irene Finocchi Mining data streams

75 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 27 / 33 Irene Finocchi Mining data streams

76 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 28 / 33 Irene Finocchi Mining data streams

77 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 29 / 33 Irene Finocchi Mining data streams

78 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 { a + b = t a + k(b + c) + b = 2t t and k known 29 / 33 Irene Finocchi Mining data streams

79 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 { a + b = t a + k(b + c) + b = 2t { a + b = t b + c = t/k t and k known 29 / 33 Irene Finocchi Mining data streams

80 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 { a + b = t a + k(b + c) + b = 2t { a + b = t b + c = t/k t and k known a = c+ k 1 t k 29 / 33 Irene Finocchi Mining data streams

81 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 t(k-1)/k=6 r2 { a + b = t a + k(b + c) + b = 2t { a + b = t b + c = t/k a = c + k 1 t k t and k known 30 / 33 Irene Finocchi Mining data streams

82 Lesson 3 Intro Puzzles Missing number Fishing Pointer & chaser Recap Tokens come as a stream: no random access Sometimes impossible to achieve the same bounds as in the RAM model 31 / 33 Irene Finocchi Mining data streams

83 Recap on lessons Intro Puzzles Missing number Fishing Pointer & chaser Recap Typically impossible to solve problems precisely and deterministically in small (sublinear) space Randomize and approximate! Sequential data access makes things harder 32 / 33 Irene Finocchi Mining data streams

84 References Intro Puzzles Missing number Fishing Pointer & chaser Recap 1 Data streams: algorithms and applications, S. Muthukrishnan muthu/ 2 Algorithms for data streams, C. Demetrescu & I. Finocchi twiki.di.uniroma1.it/pub/ing_algo/webhome/dfchapter08.pdf 33 / 33 Irene Finocchi Mining data streams

Algorithms for data streams

Intro Puzzles Sampling Sketches Graphs Lower bounds Summing up Thanks! Algorithms for data streams Irene Finocchi finocchi@di.uniroma1.it http://www.dsi.uniroma1.it/ finocchi/ May 9, 2012 1 / 99 Irene