Mining data streams. Irene Finocchi. finocchi/ Intro Puzzles

Size: px
Start display at page:

Download "Mining data streams. Irene Finocchi. finocchi/ Intro Puzzles"

Transcription

1 Intro Puzzles Mining data streams Irene Finocchi finocchi/ 1 / 33 Irene Finocchi Mining data streams

2 Intro Puzzles Stream sources Data stream model Storing data is critical Data is growing faster than our ability to store and index it: networking: phone call networks, Internet, social networks scientific data: astronomical data, genome sequences, GIS geo-spatial data economic transactions: credit cards, online auctions... 2 / 33 Irene Finocchi Mining data streams

3 Intro Puzzles Stream sources Data stream model Network management Monitoring flow of IP packets through the routers (Internet traffic): how many IP addresses used a given link in the last month? which are the top 100 IP addresses in terms of traffic? which destinations use most bandwidth? what s the average duration of an IP session? which hosts have similar usage patterns (clusters)? does traffic distribution change in different periods of time? 3 / 33 Irene Finocchi Mining data streams

4 Intro Puzzles Stream sources Data stream model Network management Monitoring flow of IP packets through the routers (Internet traffic): how many IP addresses used a given link in the last month? which are the top 100 IP addresses in terms of traffic? which destinations use most bandwidth? what s the average duration of an IP session? which hosts have similar usage patterns (clusters)? does traffic distribution change in different periods of time? Up to 1 Billion packets per hour per router Many hundreds of routers per ISP Many terabytes of data per hour! 3 / 33 Irene Finocchi Mining data streams

5 Sensor data Intro Puzzles Stream sources Data stream model Sensors with GPS unit deployed in the ocean: Each sensor reports surface height (4-byte real number) every tenth of second Base station receives 3.5 MB per day per sensor 4 / 33 Irene Finocchi Mining data streams

6 Sensor data Intro Puzzles Stream sources Data stream model Sensors with GPS unit deployed in the ocean: Each sensor reports surface height (4-byte real number) every tenth of second Base station receives 3.5 MB per day per sensor What about a million sensors? 3.5 TB of data per day, coming at a high rate A million sensors isn t very many: roughly one sensor per 150 square miles of ocean... 4 / 33 Irene Finocchi Mining data streams

7 More streams... Intro Puzzles Stream sources Data stream model Image data satellites send down to earth many TBs of images per day surveillance cameras produce roughly one image per second: London has about six millions such cameras 5 / 33 Irene Finocchi Mining data streams

8 More streams... Intro Puzzles Stream sources Data stream model Image data satellites send down to earth many TBs of images per day surveillance cameras produce roughly one image per second: London has about six millions such cameras Web traffic Google receives several hundreds million search queries per day 5 / 33 Irene Finocchi Mining data streams

9 More streams... Intro Puzzles Stream sources Data stream model Image data satellites send down to earth many TBs of images per day surveillance cameras produce roughly one image per second: London has about six millions such cameras Economic trend analysis in online auction systems, users continuously submit bids for items and items for auction Web traffic Google receives several hundreds million search queries per day 5 / 33 Irene Finocchi Mining data streams

10 Intro Puzzles Stream sources Data stream model Issues in data stream processing Some features common to all these applications: huge volumes of data (terabytes, even petabytes) records arrive at a rapid rate need to monitor data continuously to support exploratory analyses and to detect correlations, patterns, rare events, fraud, intrusion, unusual activities 6 / 33 Irene Finocchi Mining data streams

11 Intro Puzzles Stream sources Data stream model Issues in data stream processing Some features common to all these applications: huge volumes of data (terabytes, even petabytes) records arrive at a rapid rate need to monitor data continuously to support exploratory analyses and to detect correlations, patterns, rare events, fraud, intrusion, unusual activities Many problems about streaming data would be easy to solve if we had enough memory, but require new techniques for realistic data rates and sizes What can be computed without even storing the input? 6 / 33 Irene Finocchi Mining data streams

12 Intro Puzzles Stream sources Data stream model Basic data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} 7 / 33 Irene Finocchi Mining data streams

13 Intro Puzzles Stream sources Data stream model Basic data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} Input parameters: m and n 1 Stream σ is massively long. Stream length m is: typically unknown possibly infinite 7 / 33 Irene Finocchi Mining data streams

14 Intro Puzzles Stream sources Data stream model Basic data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} Input parameters: m and n 1 Stream σ is massively long. Stream length m is: typically unknown possibly infinite 2 Universe size n is also typically very large (e.g., IP addresses, URLs, item prices) 7 / 33 Irene Finocchi Mining data streams

15 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 8 / 33 Irene Finocchi Mining data streams

16 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 8 / 33 Irene Finocchi Mining data streams

17 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 8 / 33 Irene Finocchi Mining data streams

18 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 3 Use small per-item processing time t 8 / 33 Irene Finocchi Mining data streams

19 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 3 Use small per-item processing time t 8 / 33 Irene Finocchi Mining data streams

20 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 3 Use small per-item processing time t s = O(log m + log n) Happy if s = O(polylog(min{n, m}) p = 1 t = O(1) 8 / 33 Irene Finocchi Mining data streams

21 Token frequencies Intro Puzzles Stream sources Data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} 9 / 33 Irene Finocchi Mining data streams

22 Token frequencies Intro Puzzles Stream sources Data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} σ represents a multiset of items and implicitly defines a frequency vector f = f 1, f 2,...f n where f i = number of occurrences of item i [n] in σ Example If σ = 2, 1, 2, 1, 5, 2, 3, 2 and n = 5, then f = 2, 4, 1, 0, 1 9 / 33 Irene Finocchi Mining data streams

23 Token frequencies Intro Puzzles Stream sources Data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} σ represents a multiset of items and implicitly defines a frequency vector f = f 1, f 2,...f n where f i = number of occurrences of item i [n] in σ Example If σ = 2, 1, 2, 1, 5, 2, 3, 2 and n = 5, then f = 2, 4, 1, 0, 1 In many streaming problems, wish to compute some statistical properties of the multiset: e.g., majority token (if any), most frequent items, or number of distinct items 9 / 33 Irene Finocchi Mining data streams

24 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i 10 / 33 Irene Finocchi Mining data streams

25 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i Basic data stream model: c i = 1 (m = stream length) 10 / 33 Irene Finocchi Mining data streams

26 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i Basic data stream model: c i = 1 (m = stream length) Cash register model: c i > 0 (items can only arrive, their frequencies can be incremented by variable amounts) 10 / 33 Irene Finocchi Mining data streams

27 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i Basic data stream model: c i = 1 (m = stream length) Cash register model: c i > 0 (items can only arrive, their frequencies can be incremented by variable amounts) Turnstile model: generic c i (items can arrive and depart from the multiset) 10 / 33 Irene Finocchi Mining data streams

28 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) 11 / 33 Irene Finocchi Mining data streams

29 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity in the last fifteen years: theoretical interest: easy-to-state, but hard-to-solve problems links to other theory areas and to novel computing paradigms (MapReduce) 11 / 33 Irene Finocchi Mining data streams

30 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity in the last fifteen years: theoretical interest: easy-to-state, but hard-to-solve problems links to other theory areas and to novel computing paradigms (MapReduce) practical appeal: fast and effective solutions, wide applicability 11 / 33 Irene Finocchi Mining data streams

31 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity in the last fifteen years: theoretical interest: easy-to-state, but hard-to-solve problems links to other theory areas and to novel computing paradigms (MapReduce) practical appeal: fast and effective solutions, wide applicability Alon, Matias & Szegedy: Gödel prize (2005) for their paper on frequency moments approximation (STOC 96, JCSS 99), foundational work for streaming and sketching algorithms 11 / 33 Irene Finocchi Mining data streams

32 Intro Puzzles Missing number Fishing Pointer & chaser Recap Three puzzles Data stream challenges 12 / 33 Irene Finocchi Mining data streams

33 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing 13 / 33 Irene Finocchi Mining data streams

34 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing What s the missing number? 13 / 33 Irene Finocchi Mining data streams

35 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing What s the missing number? Constraint: Carole has limited memory: she can only use O(log n) bits 13 / 33 Irene Finocchi Mining data streams

36 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing What s the missing number? Constraint: Carole has limited memory: she can only use O(log n) bits n(n 1) 2 n 1 i=1 π i 13 / 33 Irene Finocchi Mining data streams

37 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! 14 / 33 Irene Finocchi Mining data streams

38 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track { S = n(n+1) 2 n 2 i=1 π i P = n! Π n 2 i=1 π i Solve equations x + y = S and x y = P 14 / 33 Irene Finocchi Mining data streams

39 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track { S = n(n+1) 2 n 2 i=1 π i P = n! Π n 2 i=1 π i Solve equations x + y = S and x y = P How many bits? Ω(log n!) = Ω(n log n) 14 / 33 Irene Finocchi Mining data streams

40 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! 15 / 33 Irene Finocchi Mining data streams

41 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track S 1 = n(n 1) 2 n 2 i=1 π i S 2 = n(n+1)(2n+1) 6 n 2 i=1 π2 i Solve equations x + y = S 1 and x 2 + y 2 = S 2 15 / 33 Irene Finocchi Mining data streams

42 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track S 1 = n(n 1) 2 n 2 i=1 π i S 2 = n(n+1)(2n+1) 6 n 2 i=1 π2 i Solve equations x + y = S 1 and x 2 + y 2 = S 2 How many bits? O(log n 3 ) = O(log n) 15 / 33 Irene Finocchi Mining data streams

43 Lesson 1 Intro Puzzles Missing number Fishing Pointer & chaser Recap Some problems can be deterministically solved in: logarithmic space one pass 16 / 33 Irene Finocchi Mining data streams

44 Lesson 1 Intro Puzzles Missing number Fishing Pointer & chaser Recap Some problems can be deterministically solved in: logarithmic space one pass Most of the times, we re not so lucky 16 / 33 Irene Finocchi Mining data streams

45 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe 17 / 33 Irene Finocchi Mining data streams

46 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t 17 / 33 Irene Finocchi Mining data streams

47 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t 17 / 33 Irene Finocchi Mining data streams

48 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 17 / 33 Irene Finocchi Mining data streams

49 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 Rarity of catch at time t: ρ t = {j f t[j] = 1} u = R t u 17 / 33 Irene Finocchi Mining data streams

50 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 Rarity of catch at time t: ρ t = {j f t[j] = 1} u = R t u George is curious and wants to compute rarity 17 / 33 Irene Finocchi Mining data streams

51 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 Rarity of catch at time t: ρ t = {j f t[j] = 1} u = R t u George is curious and wants to compute rarity 2u-bit vector would suffice... but George s suitcase has o(u) size 17 / 33 Irene Finocchi Mining data streams

52 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction 18 / 33 Irene Finocchi Mining data streams

53 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound 18 / 33 Irene Finocchi Mining data streams

54 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : 18 / 33 Irene Finocchi Mining data streams

55 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : if i S, then R t+1 = R t + 1 and ρ t+1 > ρ t 18 / 33 Irene Finocchi Mining data streams

56 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : if i S, then R t+1 = R t + 1 and ρ t+1 > ρ t if i S, then R t+1 = R t 1 and ρ t+1 < ρ t 18 / 33 Irene Finocchi Mining data streams

57 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : if i S, then R t+1 = R t + 1 and ρ t+1 > ρ t if i S, then R t+1 = R t 1 and ρ t+1 < ρ t Hence ρ decreases i S 18 / 33 Irene Finocchi Mining data streams

58 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (1/2) Sampling: George can approximate ρ t using 2k = o(u) bits pick k random fish species maintain rarity c 1 [t],... c k [t] of each sampled species (2 bits) Return ρ t = {i [1, k] c i[t] = 1} k = R t k 19 / 33 Irene Finocchi Mining data streams

59 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (1/2) Sampling: George can approximate ρ t using 2k = o(u) bits pick k random fish species maintain rarity c 1 [t],... c k [t] of each sampled species (2 bits) Return ρ t = {i [1, k] c i[t] = 1} k Claim: E[ ρ t ] = ρ t = R t k 19 / 33 Irene Finocchi Mining data streams

60 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (1/2) Sampling: George can approximate ρ t using 2k = o(u) bits pick k random fish species maintain rarity c 1 [t],... c k [t] of each sampled species (2 bits) Return ρ t = {i [1, k] c i[t] = 1} k Claim: E[ ρ t ] = ρ t = R t k If ρ t large enough, ρ t is a good estimate for ρ t with arbitrarily small precision and good probability Requires more advanced probabilistic tools: examples later 19 / 33 Irene Finocchi Mining data streams

61 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k 20 / 33 Irene Finocchi Mining data streams

62 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise 20 / 33 Irene Finocchi Mining data streams

63 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t 20 / 33 Irene Finocchi Mining data streams

64 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t E[Y i ] = ρ t 20 / 33 Irene Finocchi Mining data streams

65 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t E[Y i ] = ρ t E[ R t ] = k i=1 E[Y i] = kρ t 20 / 33 Irene Finocchi Mining data streams

66 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t E[Y i ] = ρ t E[ R t ] = k i=1 E[Y i] = kρ t E[ ρ t ] = E[ R t ] k = ρ t 20 / 33 Irene Finocchi Mining data streams

67 Lesson 2 Intro Puzzles Missing number Fishing Pointer & chaser Recap It is often impossible to solve problems precisely and deterministically in small (sublinear) space Randomization and approximation greatly help: find an answer correct within some factor (guarantee that ρ is within 10% of ρ) allow a small probability of failure (answer is correct, except with probability 1 in 10,000) 21 / 33 Irene Finocchi Mining data streams

68 Pointer and chaser Intro Puzzles Missing number Fishing Pointer & chaser Recap Paul has n + 1 pointers For each pointer i, he points to a position P[i] [1, n] n= / 33 Irene Finocchi Mining data streams

69 Pointer and chaser Intro Puzzles Missing number Fishing Pointer & chaser Recap Paul has n + 1 pointers For each pointer i, he points to a position P[i] [1, n] n= Carole has to guess any duplicate pointer Constraints: O(log n) bits O(n) queries cannot move items 22 / 33 Irene Finocchi Mining data streams

70 Repeated scans Intro Puzzles Missing number Fishing Pointer & chaser Recap n= Trivial solution for each i, count how many j are such that P[j]=i O(log n) bits, but O(n 2 ) queries 23 / 33 Irene Finocchi Mining data streams

71 Repeated scans Intro Puzzles Missing number Fishing Pointer & chaser Recap n= Trivial solution for each i, count how many j are such that P[j]=i O(log n) bits, but O(n 2 ) queries 2 Better solution if # of items below n/2 > # of items above n/2 then search for duplicates < n/2 else search for duplicates n/2 O(log n) bits and passes, O(n log n) queries 3 With O(log n) bits, Ω(log n/ log log n) passes are needed 23 / 33 Irene Finocchi Mining data streams

72 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 24 / 33 Irene Finocchi Mining data streams

73 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 25 / 33 Irene Finocchi Mining data streams

74 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 26 / 33 Irene Finocchi Mining data streams

75 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 27 / 33 Irene Finocchi Mining data streams

76 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 28 / 33 Irene Finocchi Mining data streams

77 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 29 / 33 Irene Finocchi Mining data streams

78 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 { a + b = t a + k(b + c) + b = 2t t and k known 29 / 33 Irene Finocchi Mining data streams

79 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 { a + b = t a + k(b + c) + b = 2t { a + b = t b + c = t/k t and k known 29 / 33 Irene Finocchi Mining data streams

80 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 { a + b = t a + k(b + c) + b = 2t { a + b = t b + c = t/k t and k known a = c+ k 1 t k 29 / 33 Irene Finocchi Mining data streams

81 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 t(k-1)/k=6 r2 { a + b = t a + k(b + c) + b = 2t { a + b = t b + c = t/k a = c + k 1 t k t and k known 30 / 33 Irene Finocchi Mining data streams

82 Lesson 3 Intro Puzzles Missing number Fishing Pointer & chaser Recap Tokens come as a stream: no random access Sometimes impossible to achieve the same bounds as in the RAM model 31 / 33 Irene Finocchi Mining data streams

83 Recap on lessons Intro Puzzles Missing number Fishing Pointer & chaser Recap Typically impossible to solve problems precisely and deterministically in small (sublinear) space Randomize and approximate! Sequential data access makes things harder 32 / 33 Irene Finocchi Mining data streams

84 References Intro Puzzles Missing number Fishing Pointer & chaser Recap 1 Data streams: algorithms and applications, S. Muthukrishnan muthu/ 2 Algorithms for data streams, C. Demetrescu & I. Finocchi twiki.di.uniroma1.it/pub/ing_algo/webhome/dfchapter08.pdf 33 / 33 Irene Finocchi Mining data streams

Algorithms for data streams

Algorithms for data streams Intro Puzzles Sampling Sketches Graphs Lower bounds Summing up Thanks! Algorithms for data streams Irene Finocchi finocchi@di.uniroma1.it http://www.dsi.uniroma1.it/ finocchi/ May 9, 2012 1 / 99 Irene

More information

Data Streams: Algorithms and Applications

Data Streams: Algorithms and Applications Foundations and Trends R in Theoretical Computer Science Vol. 1, No 2 (2005) 117 236 c 2005 S. Muthukrishnan Data Streams: Algorithms and Applications S. Muthukrishnan Rutgers University, New Brunswick,

More information

Course : Data mining

Course : Data mining Course : Data mining Lecture : Mining data streams Aristides Gionis Department of Computer Science Aalto University visiting in Sapienza University of Rome fall 2016 reading assignment LRU book: chapter

More information

B561 Advanced Database Concepts Streaming Model. Qin Zhang 1-1

B561 Advanced Database Concepts Streaming Model. Qin Zhang 1-1 B561 Advanced Database Concepts 2.2. Streaming Model Qin Zhang 1-1 Data Streams Continuous streams of data elements (massive possibly unbounded, rapid, time-varying) Some examples: 1. network monitoring

More information

Data Streams: Algorithms and Applications 1

Data Streams: Algorithms and Applications 1 Data Streams: Algorithms and Applications 1 Data Streams: Algorithms and Applications 2 Contents 0.1 Abstract 5 0.2 Introduction 5 0.3 Map 10 0.4 The Data Stream Phenomenon 10 0.5 Data Streaming: Formal

More information

Sublinear Models for Streaming and/or Distributed Data

Sublinear Models for Streaming and/or Distributed Data Sublinear Models for Streaming and/or Distributed Data Qin Zhang Guest lecture in B649 Feb. 3, 2015 1-1 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining

More information

Lecture 5: Data Streaming Algorithms

Lecture 5: Data Streaming Algorithms Great Ideas in Theoretical Computer Science Summer 2013 Lecture 5: Data Streaming Algorithms Lecturer: Kurt Mehlhorn & He Sun In the data stream scenario, the input arrive rapidly in an arbitrary order,

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #6: Mining Data Streams Seoul National University 1 Outline Overview Sampling From Data Stream Queries Over Sliding Window 2 Data Streams In many data mining situations,

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining

More information

Mining Data Streams. Chapter The Stream Data Model

Mining Data Streams. Chapter The Stream Data Model Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make

More information

Data Streaming Algorithms for Geometric Problems

Data Streaming Algorithms for Geometric Problems Data Streaming Algorithms for Geometric roblems R.Sharathkumar Duke University 1 Introduction A data stream is an ordered sequence of points that can be read only once or a small number of times. Formally,

More information

Streaming Algorithms. Stony Brook University CSE545, Fall 2016

Streaming Algorithms. Stony Brook University CSE545, Fall 2016 Streaming Algorithms Stony Brook University CSE545, Fall 2016 Big Data Analytics -- The Class We will learn: to analyze different types of data: high dimensional graphs infinite/never-ending labeled to

More information

CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 4pm

CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 4pm CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 2015 @ 4pm Your Name: Answers SUNet ID: root @stanford.edu Check if you would like exam routed

More information

Tracking Frequent Items Dynamically: What s Hot and What s Not To appear in PODS 2003

Tracking Frequent Items Dynamically: What s Hot and What s Not To appear in PODS 2003 Tracking Frequent Items Dynamically: What s Hot and What s Not To appear in PODS 2003 Graham Cormode graham@dimacs.rutgers.edu dimacs.rutgers.edu/~graham S. Muthukrishnan muthu@cs.rutgers.edu Everyday

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Based on the original slides by Minos Garofalakis Yahoo! Research & UC Berkeley

Based on the original slides by Minos Garofalakis Yahoo! Research & UC Berkeley Based on the original slides by Minos Garofalakis Yahoo! Research & UC Berkeley Traditional DBMS: data stored in finite, persistent data sets Data Streams: distributed, continuous, unbounded, rapid, time

More information

One-Pass Streaming Algorithms

One-Pass Streaming Algorithms One-Pass Streaming Algorithms Theory and Practice Complaints and Grievances about theory in practice Disclaimer Experiences with Gigascope. A practitioner s perspective. Will be using my own implementations,

More information

Summarizing and mining inverse distributions on data streams via dynamic inverse sampling

Summarizing and mining inverse distributions on data streams via dynamic inverse sampling Summarizing and mining inverse distributions on data streams via dynamic inverse sampling Presented by Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Irina Rozenbaum rozenbau@paul.rutgers.edu

More information

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a Data Mining and Information Retrieval Introduction to Data Mining Why Data Mining? Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 1 1 Acknowledgement Several Slides in this presentation are taken from course slides provided by Han and Kimber (Data Mining Concepts and Techniques) and Tan,

More information

B561 Advanced Database Concepts. 6. Streaming Algorithms. Qin Zhang 1-1

B561 Advanced Database Concepts. 6. Streaming Algorithms. Qin Zhang 1-1 B561 Advanced Database Concepts 6. Streaming Algorithms Qin Zhang 1-1 The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store everything.

More information

Optimal Parallel Randomized Renaming

Optimal Parallel Randomized Renaming Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously

More information

CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 4pm

CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 4pm CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 2015 @ 4pm Your Name: SUNet ID: @stanford.edu Check if you would like exam routed back via SCPD:

More information

Information Retrieval II

Information Retrieval II Information Retrieval II David Hawking 30 Sep 2010 Machine Learning Summer School, ANU Session Outline Ranking documents in response to a query Measuring the quality of such rankings Case Study: Tuning

More information

Data Streams: Algorithms and Applications

Data Streams: Algorithms and Applications Data Streams: Algorithms and Applications S. Muthukrishnan Contents 1 Introduction 2 1.1 Puzzle 1: Finding Missing Numbers..... 2 1.2 Puzzle 2: Fishing.... 3 1.3 Lessons... 5 2 Map 6 3 Data Stream Phenomenon

More information

On Biased Reservoir Sampling in the Presence of Stream Evolution

On Biased Reservoir Sampling in the Presence of Stream Evolution Charu C. Aggarwal T J Watson Research Center IBM Corporation Hawthorne, NY USA On Biased Reservoir Sampling in the Presence of Stream Evolution VLDB Conference, Seoul, South Korea, 2006 Synopsis Construction

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2016 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt16 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Seek and Ye shall Find

Seek and Ye shall Find Seek and Ye shall Find The continuum of computer intelligence COS 116, Spring 2010 Adam Finkelstein Final tally: Computer $77,147, Ken Jennings $24,000, Brad Rutter $21,600. Jennings: I, for one, welcome

More information

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification Extended R-Tree Indexing Structure for Ensemble Stream Data Classification P. Sravanthi M.Tech Student, Department of CSE KMM Institute of Technology and Sciences Tirupati, India J. S. Ananda Kumar Assistant

More information

Algorithms for Grid Graphs in the MapReduce Model

Algorithms for Grid Graphs in the MapReduce Model University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department

More information

Cache and Forward Architecture

Cache and Forward Architecture Cache and Forward Architecture Shweta Jain Research Associate Motivation Conversation between computers connected by wires Wired Network Large content retrieval using wireless and mobile devices Wireless

More information

Efficient Approximation of Correlated Sums on Data Streams

Efficient Approximation of Correlated Sums on Data Streams Efficient Approximation of Correlated Sums on Data Streams Rohit Ananthakrishna Cornell University rohit@cs.cornell.edu Flip Korn AT&T Labs Research flip@research.att.com Abhinandan Das Cornell University

More information

Mining for Patterns and Anomalies in Data Streams. Sampath Kannan University of Pennsylvania

Mining for Patterns and Anomalies in Data Streams. Sampath Kannan University of Pennsylvania Mining for Patterns and Anomalies in Data Streams Sampath Kannan University of Pennsylvania The Problem Data sizes too large to fit in primary memory Devices with small memory Access times to secondary

More information

Seek and Ye shall Find

Seek and Ye shall Find Seek and Ye shall Find The continuum of computer intelligence COS 116, Spring 2012 Adam Finkelstein Recap: Binary Representation Powers of 2 2 0 2 1 2 2 2 3 2 4 2 5 2 6 2 7 2 8 2 9 2 10 1 2 4 8 16 32 64

More information

CS246: Mining Massive Data Sets Winter Final

CS246: Mining Massive Data Sets Winter Final CS246: Mining Massive Data Sets Winter 2013 Final These questions require thought, but do not require long answers. Be as concise as possible. You have three hours to complete this final. The exam has

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2012 A second course in data mining!! http://www.it.uu.se/edu/course/homepage/infoutv2/vt12 Kjell Orsborn! Uppsala Database Laboratory! Department of Information Technology,

More information

1 (15 points) LexicoSort

1 (15 points) LexicoSort CS161 Homework 2 Due: 22 April 2016, 12 noon Submit on Gradescope Handed out: 15 April 2016 Instructions: Please answer the following questions to the best of your ability. If you are asked to show your

More information

Frequent Pattern Mining in Data Streams. Raymond Martin

Frequent Pattern Mining in Data Streams. Raymond Martin Frequent Pattern Mining in Data Streams Raymond Martin Agenda -Breakdown & Review -Importance & Examples -Current Challenges -Modern Algorithms -Stream-Mining Algorithm -How KPS Works -Combing KPS and

More information

Sublinear Algorithms for Big Data Analysis

Sublinear Algorithms for Big Data Analysis Sublinear Algorithms for Big Data Analysis Michael Kapralov Theory of Computation Lab 4 EPFL 7 September 2017 The age of big data: massive amounts of data collected in various areas of science and technology

More information

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: KH 116 Fall 2017 First Grading for Reading Assignment Weka v 6 weeks v https://weka.waikato.ac.nz/dataminingwithweka/preview

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 Figures are taken from: M.E.J. Newman, Networks: An Introduction 2

More information

Lecture 2: Streaming Algorithms for Counting Distinct Elements

Lecture 2: Streaming Algorithms for Counting Distinct Elements Lecture 2: Streaming Algorithms for Counting Distinct Elements 20th August, 2008 Streaming Algorithms Streaming Algorithms Streaming algorithms have the following properties: 1 items in the stream are

More information

Detect Cyber Threats with Securonix Proxy Traffic Analyzer

Detect Cyber Threats with Securonix Proxy Traffic Analyzer Detect Cyber Threats with Securonix Proxy Traffic Analyzer Introduction Many organizations encounter an extremely high volume of proxy data on a daily basis. The volume of proxy data can range from 100

More information

A Mathematical Proof. Zero Knowledge Protocols. Interactive Proof System. Other Kinds of Proofs. When referring to a proof in logic we usually mean:

A Mathematical Proof. Zero Knowledge Protocols. Interactive Proof System. Other Kinds of Proofs. When referring to a proof in logic we usually mean: A Mathematical Proof When referring to a proof in logic we usually mean: 1. A sequence of statements. 2. Based on axioms. Zero Knowledge Protocols 3. Each statement is derived via the derivation rules.

More information

Zero Knowledge Protocols. c Eli Biham - May 3, Zero Knowledge Protocols (16)

Zero Knowledge Protocols. c Eli Biham - May 3, Zero Knowledge Protocols (16) Zero Knowledge Protocols c Eli Biham - May 3, 2005 442 Zero Knowledge Protocols (16) A Mathematical Proof When referring to a proof in logic we usually mean: 1. A sequence of statements. 2. Based on axioms.

More information

Clustering. (Part 2)

Clustering. (Part 2) Clustering (Part 2) 1 k-means clustering 2 General Observations on k-means clustering In essence, k-means clustering aims at minimizing cluster variance. It is typically used in Euclidean spaces and works

More information

Draft Notes 1 : Scaling in Ad hoc Routing Protocols

Draft Notes 1 : Scaling in Ad hoc Routing Protocols Draft Notes 1 : Scaling in Ad hoc Routing Protocols Timothy X Brown University of Colorado April 2, 2008 2 Introduction What is the best network wireless network routing protocol? This question is a function

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #13: Mining Streams 1

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #13: Mining Streams 1 CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #13: Mining Streams 1 Data Streams In many data mining situa;ons, we do not know the en;re data set in advance Stream Management is important

More information

Direct Routing: Algorithms and Complexity

Direct Routing: Algorithms and Complexity Direct Routing: Algorithms and Complexity Costas Busch, RPI Malik Magdon-Ismail, RPI Marios Mavronicolas, Univ. Cyprus Paul Spirakis, Univ. Patras 1 Outline 1. Direct Routing. 2. Direct Routing is Hard.

More information

COMP 465 Special Topics: Data Mining

COMP 465 Special Topics: Data Mining COMP 465 Special Topics: Data Mining Introduction & Course Overview 1 Course Page & Class Schedule http://cs.rhodes.edu/welshc/comp465_s15/ What s there? Course info Course schedule Lecture media (slides,

More information

Data Streams Algorithms

Data Streams Algorithms Data Streams Algorithms Phillip B. Gibbons Intel Research Pittsburgh Guest Lecture in 15-853 Algorithms in the Real World Phil Gibbons, 15-853, 12/4/06 # 1 Outline Data Streams in the Real World Formal

More information

Superconcentrators of depth 2 and 3; odd levels help (rarely)

Superconcentrators of depth 2 and 3; odd levels help (rarely) Superconcentrators of depth 2 and 3; odd levels help (rarely) Noga Alon Bellcore, Morristown, NJ, 07960, USA and Department of Mathematics Raymond and Beverly Sackler Faculty of Exact Sciences Tel Aviv

More information

Practice Midterm Exam Solutions

Practice Midterm Exam Solutions CSE 332: Data Abstractions Autumn 2015 Practice Midterm Exam Solutions Name: Sample Solutions ID #: 1234567 TA: The Best Section: A9 INSTRUCTIONS: You have 50 minutes to complete the exam. The exam is

More information

On User-centric QoE Prediction for VoIP & Video Streaming based on Machine-Learning

On User-centric QoE Prediction for VoIP & Video Streaming based on Machine-Learning UNIVERSITY OF CRETE On User-centric QoE Prediction for VoIP & Video Streaming based on Machine-Learning Michalis Katsarakis, Maria Plakia, Paul Charonyktakis & Maria Papadopouli University of Crete Foundation

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Data streams and low latency processing DATA STREAM BASICS What is a data stream? Large data volume, likely structured, arriving at a very high rate Potentially

More information

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition What s the BIG deal?! 2011 2011 2008 2010 2012 What s the BIG deal?! (Gartner Hype Cycle) What s the

More information

Dr. Amotz Bar-Noy s Compendium of Algorithms Problems. Problems, Hints, and Solutions

Dr. Amotz Bar-Noy s Compendium of Algorithms Problems. Problems, Hints, and Solutions Dr. Amotz Bar-Noy s Compendium of Algorithms Problems Problems, Hints, and Solutions Chapter 1 Searching and Sorting Problems 1 1.1 Array with One Missing 1.1.1 Problem Let A = A[1],..., A[n] be an array

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2015/16 IR Chapter 04 Index Construction Hardware In this chapter we will look at how to construct an inverted index Many

More information

A Brief Introduction to Data Mining

A Brief Introduction to Data Mining A Brief Introduction to Data Mining L. Torgo ltorgo@dcc.fc.up.pt Departamento de Ciência de Computadores Faculdade de Ciências / Universidade do Porto Sept, 2014 Introduction Motivation for Data Mining?

More information

NP versus PSPACE. Frank Vega. To cite this version: HAL Id: hal https://hal.archives-ouvertes.fr/hal

NP versus PSPACE. Frank Vega. To cite this version: HAL Id: hal https://hal.archives-ouvertes.fr/hal NP versus PSPACE Frank Vega To cite this version: Frank Vega. NP versus PSPACE. Preprint submitted to Theoretical Computer Science 2015. 2015. HAL Id: hal-01196489 https://hal.archives-ouvertes.fr/hal-01196489

More information

048866: Packet Switch Architectures

048866: Packet Switch Architectures 048866: Packet Switch Architectures Output-Queued Switches Deterministic Queueing Analysis Fairness and Delay Guarantees Dr. Isaac Keslassy Electrical Engineering, Technion isaac@ee.technion.ac.il http://comnet.technion.ac.il/~isaac/

More information

Faster, Space-Efficient Selection Algorithms in Read-Only Memory for Integers

Faster, Space-Efficient Selection Algorithms in Read-Only Memory for Integers Faster, Space-Efficient Selection Algorithms in Read-Only Memory for Integers Timothy M. Chan 1, J. Ian Munro 1, and Venkatesh Raman 2 1 Cheriton School of Computer Science, University of Waterloo, Waterloo,

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions

More information

Heavy-Hitter Detection Entirely in the Data Plane

Heavy-Hitter Detection Entirely in the Data Plane Heavy-Hitter Detection Entirely in the Data Plane VIBHAALAKSHMI SIVARAMAN SRINIVAS NARAYANA, ORI ROTTENSTREICH, MUTHU MUTHUKRSISHNAN, JENNIFER REXFORD 1 Heavy Hitter Flows Flows above a certain threshold

More information

Is IPv4 Sufficient for Another 30 Years?

Is IPv4 Sufficient for Another 30 Years? Is IPv4 Sufficient for Another 30 Years? October 7, 2004 Abstract TCP/IP was developed 30 years ago. It has been successful over the past 30 years, until recently when its limitation started emerging.

More information

BIG DATA TESTING: A UNIFIED VIEW

BIG DATA TESTING: A UNIFIED VIEW http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation

More information

Parallel Disk-Based Computation and Computational Group Theory. Eric Robinson Gene Cooperman Daniel Kunkle

Parallel Disk-Based Computation and Computational Group Theory. Eric Robinson Gene Cooperman Daniel Kunkle Parallel Disk-Based Computation and Computational Group Theory Eric Robinson Gene Cooperman Daniel Kunkle Jürgen Müller Northeastern University, Boston, USA RWTH, Aachen, Germany Applications from Computational

More information

Research Students Lecture Series 2015

Research Students Lecture Series 2015 Research Students Lecture Series 215 Analyse your big data with this one weird probabilistic approach! Or: applied probabilistic algorithms in 5 easy pieces Advait Sarkar advait.sarkar@cl.cam.ac.uk Research

More information

Selection from Read-Only Memory with Limited Workspace

Selection from Read-Only Memory with Limited Workspace Selection from Read-Only Memory with Limited Workspace Srinivasa Rao Satti Seoul National University Joint work with Amr Elmasry, Daniel Dahl Juhl, and Jyrki Katajainen Outline Models and Problem definition

More information

Knowledge Discovery & Data Mining

Knowledge Discovery & Data Mining Announcements ISM 50 - Business Information Systems Lecture 17 Instructor: Magdalini Eirinaki UC Santa Cruz May 29, 2007 News Folio #3 DUE Thursday 5/31 Database Assignment DUE Tuesday 6/5 Business Paper

More information

Video AI Alerts An Artificial Intelligence-Based Approach to Anomaly Detection and Root Cause Analysis for OTT Video Publishers

Video AI Alerts An Artificial Intelligence-Based Approach to Anomaly Detection and Root Cause Analysis for OTT Video Publishers Video AI Alerts An Artificial Intelligence-Based Approach to Anomaly Detection and Root Cause Analysis for OTT Video Publishers Live and on-demand programming delivered by over-the-top (OTT) will soon

More information

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane

More information

ISM 50 - Business Information Systems

ISM 50 - Business Information Systems ISM 50 - Business Information Systems Lecture 17 Instructor: Magdalini Eirinaki UC Santa Cruz May 29, 2007 Announcements News Folio #3 DUE Thursday 5/31 Database Assignment DUE Tuesday 6/5 Business Paper

More information

Visualisation of Abstract Information

Visualisation of Abstract Information Visualisation of Abstract Information Visualisation Lecture 17 Institute for Perception, Action & Behaviour School of Informatics Abstract Information 1 Information Visualisation Previously data with inherent

More information

Advanced Algorithms and Data Structures

Advanced Algorithms and Data Structures Advanced Algorithms and Data Structures Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Prerequisites A seven credit unit course Replaced OHJ-2156 Analysis of Algorithms We take things a bit further than

More information

MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT

MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT Dr. G APPARAO 1*, Mr. A SRINIVAS 2* 1. Professor, Chairman-Board of Studies & Convener-IIIC, Department of Computer Science Engineering,

More information

Sliding HyperLogLog: Estimating cardinality in a data stream

Sliding HyperLogLog: Estimating cardinality in a data stream Sliding HyperLogLog: Estimating cardinality in a data stream Yousra Chabchoub, Georges Hébrail To cite this version: Yousra Chabchoub, Georges Hébrail. Sliding HyperLogLog: Estimating cardinality in a

More information

The Cross-Entropy Method

The Cross-Entropy Method The Cross-Entropy Method Guy Weichenberg 7 September 2003 Introduction This report is a summary of the theory underlying the Cross-Entropy (CE) method, as discussed in the tutorial by de Boer, Kroese,

More information

Problem Set 2 Solutions

Problem Set 2 Solutions 6.893 Problem Set 2 Solutons 1 Problem Set 2 Solutions The problem set was worth a total of 25 points. Points are shown in parentheses before the problem. Part 1 - Warmup (5 points total) 1. We studied

More information

Online Mining of Frequent Query Trees over XML Data Streams

Online Mining of Frequent Query Trees over XML Data Streams Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science National Chiao-Tung University Hsinchu, Taiwan 300, R.O.C. http://www.csie.nctu.edu.tw/~hfli/

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Algorithms for Evolving Data Sets

Algorithms for Evolving Data Sets Algorithms for Evolving Data Sets Mohammad Mahdian Google Research Based on joint work with Aris Anagnostopoulos, Bahman Bahmani, Ravi Kumar, Eli Upfal, and Fabio Vandin Algorithm Design Paradigms Traditional

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 53 INTRO. TO DATA MINING Locality Sensitive Hashing (LSH) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan Parthasarathy @OSU MMDS Secs. 3.-3.. Slides

More information

Algorithms for GIS:! Quadtrees

Algorithms for GIS:! Quadtrees Algorithms for GIS: Quadtrees Quadtree A data structure that corresponds to a hierarchical subdivision of the plane Start with a square (containing inside input data) Divide into 4 equal squares (quadrants)

More information

context: massive systems

context: massive systems cutting the electric bill for internetscale systems Asfandyar Qureshi (MIT) Rick Weber (Akamai) Hari Balakrishnan (MIT) John Guttag (MIT) Bruce Maggs (Duke/Akamai) Éole @ flickr context: massive systems

More information

An Indian Journal FULL PAPER. Trade Science Inc. Research on data mining clustering algorithm in cloud computing environments ABSTRACT KEYWORDS

An Indian Journal FULL PAPER. Trade Science Inc. Research on data mining clustering algorithm in cloud computing environments ABSTRACT KEYWORDS [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 17 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(17), 2014 [9562-9566] Research on data mining clustering algorithm in cloud

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Data Mining Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Why Mine Data? Commercial Viewpoint Lots of data is being collected and warehoused Web data, e-commerce

More information

745: Advanced Database Systems

745: Advanced Database Systems 745: Advanced Database Systems Yanlei Diao University of Massachusetts Amherst Outline Overview of course topics Course requirements Database Management Systems 1. Online Analytical Processing (OLAP) vs.

More information

Large-Scale Data Engineering. Overview and Introduction

Large-Scale Data Engineering. Overview and Introduction Large-Scale Data Engineering Overview and Introduction Administration Blackboard Page Announcements, also via email (pardon html formatting) Practical enrollment, Turning in assignments, Check Grades Contact:

More information

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul Guest Lecture in MIT 6.172 Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 2 I m an MIT

More information

I am a Data Nerd and so are YOU!

I am a Data Nerd and so are YOU! I am a Data Nerd and so are YOU! Not This Type of Nerd Data Nerd Coffee Talk We saw Cloudera as the lone open source champion of Hadoop and the EMC/Greenplum/MapR initiative as a more closed and

More information

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science Entrance Examination, 5 May 23 This question paper has 4 printed sides. Part A has questions of 3 marks each. Part B has 7 questions

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Fall 2017 ECEN Special Topics in Data Mining and Analysis

Fall 2017 ECEN Special Topics in Data Mining and Analysis Fall 2017 ECEN 689-600 Special Topics in Data Mining and Analysis Nick Duffield Department of Electrical & Computer Engineering Teas A&M University Organization Organization Instructor: Nick Duffield,

More information