Mining data streams. Irene Finocchi. finocchi/ Intro Puzzles
|
|
- Derrick Bates
- 5 years ago
- Views:
Transcription
1 Intro Puzzles Mining data streams Irene Finocchi finocchi/ 1 / 33 Irene Finocchi Mining data streams
2 Intro Puzzles Stream sources Data stream model Storing data is critical Data is growing faster than our ability to store and index it: networking: phone call networks, Internet, social networks scientific data: astronomical data, genome sequences, GIS geo-spatial data economic transactions: credit cards, online auctions... 2 / 33 Irene Finocchi Mining data streams
3 Intro Puzzles Stream sources Data stream model Network management Monitoring flow of IP packets through the routers (Internet traffic): how many IP addresses used a given link in the last month? which are the top 100 IP addresses in terms of traffic? which destinations use most bandwidth? what s the average duration of an IP session? which hosts have similar usage patterns (clusters)? does traffic distribution change in different periods of time? 3 / 33 Irene Finocchi Mining data streams
4 Intro Puzzles Stream sources Data stream model Network management Monitoring flow of IP packets through the routers (Internet traffic): how many IP addresses used a given link in the last month? which are the top 100 IP addresses in terms of traffic? which destinations use most bandwidth? what s the average duration of an IP session? which hosts have similar usage patterns (clusters)? does traffic distribution change in different periods of time? Up to 1 Billion packets per hour per router Many hundreds of routers per ISP Many terabytes of data per hour! 3 / 33 Irene Finocchi Mining data streams
5 Sensor data Intro Puzzles Stream sources Data stream model Sensors with GPS unit deployed in the ocean: Each sensor reports surface height (4-byte real number) every tenth of second Base station receives 3.5 MB per day per sensor 4 / 33 Irene Finocchi Mining data streams
6 Sensor data Intro Puzzles Stream sources Data stream model Sensors with GPS unit deployed in the ocean: Each sensor reports surface height (4-byte real number) every tenth of second Base station receives 3.5 MB per day per sensor What about a million sensors? 3.5 TB of data per day, coming at a high rate A million sensors isn t very many: roughly one sensor per 150 square miles of ocean... 4 / 33 Irene Finocchi Mining data streams
7 More streams... Intro Puzzles Stream sources Data stream model Image data satellites send down to earth many TBs of images per day surveillance cameras produce roughly one image per second: London has about six millions such cameras 5 / 33 Irene Finocchi Mining data streams
8 More streams... Intro Puzzles Stream sources Data stream model Image data satellites send down to earth many TBs of images per day surveillance cameras produce roughly one image per second: London has about six millions such cameras Web traffic Google receives several hundreds million search queries per day 5 / 33 Irene Finocchi Mining data streams
9 More streams... Intro Puzzles Stream sources Data stream model Image data satellites send down to earth many TBs of images per day surveillance cameras produce roughly one image per second: London has about six millions such cameras Economic trend analysis in online auction systems, users continuously submit bids for items and items for auction Web traffic Google receives several hundreds million search queries per day 5 / 33 Irene Finocchi Mining data streams
10 Intro Puzzles Stream sources Data stream model Issues in data stream processing Some features common to all these applications: huge volumes of data (terabytes, even petabytes) records arrive at a rapid rate need to monitor data continuously to support exploratory analyses and to detect correlations, patterns, rare events, fraud, intrusion, unusual activities 6 / 33 Irene Finocchi Mining data streams
11 Intro Puzzles Stream sources Data stream model Issues in data stream processing Some features common to all these applications: huge volumes of data (terabytes, even petabytes) records arrive at a rapid rate need to monitor data continuously to support exploratory analyses and to detect correlations, patterns, rare events, fraud, intrusion, unusual activities Many problems about streaming data would be easy to solve if we had enough memory, but require new techniques for realistic data rates and sizes What can be computed without even storing the input? 6 / 33 Irene Finocchi Mining data streams
12 Intro Puzzles Stream sources Data stream model Basic data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} 7 / 33 Irene Finocchi Mining data streams
13 Intro Puzzles Stream sources Data stream model Basic data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} Input parameters: m and n 1 Stream σ is massively long. Stream length m is: typically unknown possibly infinite 7 / 33 Irene Finocchi Mining data streams
14 Intro Puzzles Stream sources Data stream model Basic data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} Input parameters: m and n 1 Stream σ is massively long. Stream length m is: typically unknown possibly infinite 2 Universe size n is also typically very large (e.g., IP addresses, URLs, item prices) 7 / 33 Irene Finocchi Mining data streams
15 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 8 / 33 Irene Finocchi Mining data streams
16 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 8 / 33 Irene Finocchi Mining data streams
17 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 8 / 33 Irene Finocchi Mining data streams
18 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 3 Use small per-item processing time t 8 / 33 Irene Finocchi Mining data streams
19 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 3 Use small per-item processing time t 8 / 33 Irene Finocchi Mining data streams
20 Intro Puzzles Stream sources Data stream model Performance metrics Minimize space, passes, and processing time upon token arrivals 1 Use a sublinear amount of space s: s = o(min{n, m}) where s = bits of random-access working memory 2 Make p passes over the data, for some small integer p (no random access to tokens) 3 Use small per-item processing time t s = O(log m + log n) Happy if s = O(polylog(min{n, m}) p = 1 t = O(1) 8 / 33 Irene Finocchi Mining data streams
21 Token frequencies Intro Puzzles Stream sources Data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} 9 / 33 Irene Finocchi Mining data streams
22 Token frequencies Intro Puzzles Stream sources Data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} σ represents a multiset of items and implicitly defines a frequency vector f = f 1, f 2,...f n where f i = number of occurrences of item i [n] in σ Example If σ = 2, 1, 2, 1, 5, 2, 3, 2 and n = 5, then f = 2, 4, 1, 0, 1 9 / 33 Irene Finocchi Mining data streams
23 Token frequencies Intro Puzzles Stream sources Data stream model Data stream = sequence σ = a 1, a 2,...a m of tokens drawn from universe [n] = {1, 2,...n} σ represents a multiset of items and implicitly defines a frequency vector f = f 1, f 2,...f n where f i = number of occurrences of item i [n] in σ Example If σ = 2, 1, 2, 1, 5, 2, 3, 2 and n = 5, then f = 2, 4, 1, 0, 1 In many streaming problems, wish to compute some statistical properties of the multiset: e.g., majority token (if any), most frequent items, or number of distinct items 9 / 33 Irene Finocchi Mining data streams
24 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i 10 / 33 Irene Finocchi Mining data streams
25 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i Basic data stream model: c i = 1 (m = stream length) 10 / 33 Irene Finocchi Mining data streams
26 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i Basic data stream model: c i = 1 (m = stream length) Cash register model: c i > 0 (items can only arrive, their frequencies can be incremented by variable amounts) 10 / 33 Irene Finocchi Mining data streams
27 Intro Puzzles Stream sources Data stream model Variations of the basic setup Data stream = sequence of tuples σ = (a 1, c 1 ), (a 2, c 2 ),... where (a i, c i ) [n] { F,..., F } Upon arrival of (a i, c i )), update frequency f ai New role for m: m = n j=1 f j = f ai + c i Basic data stream model: c i = 1 (m = stream length) Cash register model: c i > 0 (items can only arrive, their frequencies can be incremented by variable amounts) Turnstile model: generic c i (items can arrive and depart from the multiset) 10 / 33 Irene Finocchi Mining data streams
28 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) 11 / 33 Irene Finocchi Mining data streams
29 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity in the last fifteen years: theoretical interest: easy-to-state, but hard-to-solve problems links to other theory areas and to novel computing paradigms (MapReduce) 11 / 33 Irene Finocchi Mining data streams
30 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity in the last fifteen years: theoretical interest: easy-to-state, but hard-to-solve problems links to other theory areas and to novel computing paradigms (MapReduce) practical appeal: fast and effective solutions, wide applicability 11 / 33 Irene Finocchi Mining data streams
31 Historical remarks Intro Puzzles Stream sources Data stream model Origin in the 70s (seminal paper by Munro & Paterson, STOC 78) Gained popularity in the last fifteen years: theoretical interest: easy-to-state, but hard-to-solve problems links to other theory areas and to novel computing paradigms (MapReduce) practical appeal: fast and effective solutions, wide applicability Alon, Matias & Szegedy: Gödel prize (2005) for their paper on frequency moments approximation (STOC 96, JCSS 99), foundational work for streaming and sketching algorithms 11 / 33 Irene Finocchi Mining data streams
32 Intro Puzzles Missing number Fishing Pointer & chaser Recap Three puzzles Data stream challenges 12 / 33 Irene Finocchi Mining data streams
33 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing 13 / 33 Irene Finocchi Mining data streams
34 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing What s the missing number? 13 / 33 Irene Finocchi Mining data streams
35 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing What s the missing number? Constraint: Carole has limited memory: she can only use O(log n) bits 13 / 33 Irene Finocchi Mining data streams
36 Intro Puzzles Missing number Fishing Pointer & chaser Recap The missing number puzzle π = π 1, π 2,...π n 1 is a permutation of [1, n] with one number missing What s the missing number? Constraint: Carole has limited memory: she can only use O(log n) bits n(n 1) 2 n 1 i=1 π i 13 / 33 Irene Finocchi Mining data streams
37 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! 14 / 33 Irene Finocchi Mining data streams
38 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track { S = n(n+1) 2 n 2 i=1 π i P = n! Π n 2 i=1 π i Solve equations x + y = S and x y = P 14 / 33 Irene Finocchi Mining data streams
39 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track { S = n(n+1) 2 n 2 i=1 π i P = n! Π n 2 i=1 π i Solve equations x + y = S and x y = P How many bits? Ω(log n!) = Ω(n log n) 14 / 33 Irene Finocchi Mining data streams
40 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! 15 / 33 Irene Finocchi Mining data streams
41 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track S 1 = n(n 1) 2 n 2 i=1 π i S 2 = n(n+1)(2n+1) 6 n 2 i=1 π2 i Solve equations x + y = S 1 and x 2 + y 2 = S 2 15 / 33 Irene Finocchi Mining data streams
42 Intro Puzzles Missing number Fishing Pointer & chaser Recap Two missing numbers Now π has two missing numbers, x and y: find them, but use only O(log n) bits! Track S 1 = n(n 1) 2 n 2 i=1 π i S 2 = n(n+1)(2n+1) 6 n 2 i=1 π2 i Solve equations x + y = S 1 and x 2 + y 2 = S 2 How many bits? O(log n 3 ) = O(log n) 15 / 33 Irene Finocchi Mining data streams
43 Lesson 1 Intro Puzzles Missing number Fishing Pointer & chaser Recap Some problems can be deterministically solved in: logarithmic space one pass 16 / 33 Irene Finocchi Mining data streams
44 Lesson 1 Intro Puzzles Missing number Fishing Pointer & chaser Recap Some problems can be deterministically solved in: logarithmic space one pass Most of the times, we re not so lucky 16 / 33 Irene Finocchi Mining data streams
45 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe 17 / 33 Irene Finocchi Mining data streams
46 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t 17 / 33 Irene Finocchi Mining data streams
47 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t 17 / 33 Irene Finocchi Mining data streams
48 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 17 / 33 Irene Finocchi Mining data streams
49 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 Rarity of catch at time t: ρ t = {j f t[j] = 1} u = R t u 17 / 33 Irene Finocchi Mining data streams
50 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 Rarity of catch at time t: ρ t = {j f t[j] = 1} u = R t u George is curious and wants to compute rarity 17 / 33 Irene Finocchi Mining data streams
51 Fishing Intro Puzzles Missing number Fishing Pointer & chaser Recap U = {1,...u} fish species in the universe a t U fish species caught at time t f t [j] = {a i a i = j, i t} frequency of species j up to time t j is rare iff f t [j] = 1 Rarity of catch at time t: ρ t = {j f t[j] = 1} u = R t u George is curious and wants to compute rarity 2u-bit vector would suffice... but George s suitcase has o(u) size 17 / 33 Irene Finocchi Mining data streams
52 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction 18 / 33 Irene Finocchi Mining data streams
53 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound 18 / 33 Irene Finocchi Mining data streams
54 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : 18 / 33 Irene Finocchi Mining data streams
55 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : if i S, then R t+1 = R t + 1 and ρ t+1 > ρ t 18 / 33 Irene Finocchi Mining data streams
56 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : if i S, then R t+1 = R t + 1 and ρ t+1 > ρ t if i S, then R t+1 = R t 1 and ρ t+1 < ρ t 18 / 33 Irene Finocchi Mining data streams
57 Intro Puzzles Missing number Fishing Pointer & chaser Recap Deterministic fish rarity George cannot compute ρ t precisely with a deterministic algorithm using only o(u) bits By contradiction Let S U be a set of species: no duplicates, S = Θ(u) Need Ω( S ) = Ω(u) bits to represent S If claim is false, could break information theoretic lower bound To retrieve S, for each i U, stream S, i to George and compare ρ t and ρ t+1 : if i S, then R t+1 = R t + 1 and ρ t+1 > ρ t if i S, then R t+1 = R t 1 and ρ t+1 < ρ t Hence ρ decreases i S 18 / 33 Irene Finocchi Mining data streams
58 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (1/2) Sampling: George can approximate ρ t using 2k = o(u) bits pick k random fish species maintain rarity c 1 [t],... c k [t] of each sampled species (2 bits) Return ρ t = {i [1, k] c i[t] = 1} k = R t k 19 / 33 Irene Finocchi Mining data streams
59 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (1/2) Sampling: George can approximate ρ t using 2k = o(u) bits pick k random fish species maintain rarity c 1 [t],... c k [t] of each sampled species (2 bits) Return ρ t = {i [1, k] c i[t] = 1} k Claim: E[ ρ t ] = ρ t = R t k 19 / 33 Irene Finocchi Mining data streams
60 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (1/2) Sampling: George can approximate ρ t using 2k = o(u) bits pick k random fish species maintain rarity c 1 [t],... c k [t] of each sampled species (2 bits) Return ρ t = {i [1, k] c i[t] = 1} k Claim: E[ ρ t ] = ρ t = R t k If ρ t large enough, ρ t is a good estimate for ρ t with arbitrarily small precision and good probability Requires more advanced probabilistic tools: examples later 19 / 33 Irene Finocchi Mining data streams
61 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k 20 / 33 Irene Finocchi Mining data streams
62 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise 20 / 33 Irene Finocchi Mining data streams
63 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t 20 / 33 Irene Finocchi Mining data streams
64 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t E[Y i ] = ρ t 20 / 33 Irene Finocchi Mining data streams
65 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t E[Y i ] = ρ t E[ R t ] = k i=1 E[Y i] = kρ t 20 / 33 Irene Finocchi Mining data streams
66 Intro Puzzles Missing number Fishing Pointer & chaser Recap Randomized fish rarity (2/2) ρ t = {i [1, k] c i[t] = 1} k E[ ρ t ] = ρ t = R t k Y i indicator variable: { Yi = 1 if c i [t] = 1 Y i = 0 otherwise Pr{Y i = 1} = Pr{the i-th sampled species is rare} = R t u = ρ t E[Y i ] = ρ t E[ R t ] = k i=1 E[Y i] = kρ t E[ ρ t ] = E[ R t ] k = ρ t 20 / 33 Irene Finocchi Mining data streams
67 Lesson 2 Intro Puzzles Missing number Fishing Pointer & chaser Recap It is often impossible to solve problems precisely and deterministically in small (sublinear) space Randomization and approximation greatly help: find an answer correct within some factor (guarantee that ρ is within 10% of ρ) allow a small probability of failure (answer is correct, except with probability 1 in 10,000) 21 / 33 Irene Finocchi Mining data streams
68 Pointer and chaser Intro Puzzles Missing number Fishing Pointer & chaser Recap Paul has n + 1 pointers For each pointer i, he points to a position P[i] [1, n] n= / 33 Irene Finocchi Mining data streams
69 Pointer and chaser Intro Puzzles Missing number Fishing Pointer & chaser Recap Paul has n + 1 pointers For each pointer i, he points to a position P[i] [1, n] n= Carole has to guess any duplicate pointer Constraints: O(log n) bits O(n) queries cannot move items 22 / 33 Irene Finocchi Mining data streams
70 Repeated scans Intro Puzzles Missing number Fishing Pointer & chaser Recap n= Trivial solution for each i, count how many j are such that P[j]=i O(log n) bits, but O(n 2 ) queries 23 / 33 Irene Finocchi Mining data streams
71 Repeated scans Intro Puzzles Missing number Fishing Pointer & chaser Recap n= Trivial solution for each i, count how many j are such that P[j]=i O(log n) bits, but O(n 2 ) queries 2 Better solution if # of items below n/2 > # of items above n/2 then search for duplicates < n/2 else search for duplicates n/2 O(log n) bits and passes, O(n log n) queries 3 With O(log n) bits, Ω(log n/ log log n) passes are needed 23 / 33 Irene Finocchi Mining data streams
72 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 24 / 33 Irene Finocchi Mining data streams
73 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 25 / 33 Irene Finocchi Mining data streams
74 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 26 / 33 Irene Finocchi Mining data streams
75 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 27 / 33 Irene Finocchi Mining data streams
76 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! r1 r2 28 / 33 Irene Finocchi Mining data streams
77 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 29 / 33 Irene Finocchi Mining data streams
78 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 { a + b = t a + k(b + c) + b = 2t t and k known 29 / 33 Irene Finocchi Mining data streams
79 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 { a + b = t a + k(b + c) + b = 2t { a + b = t b + c = t/k t and k known 29 / 33 Irene Finocchi Mining data streams
80 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 r2 { a + b = t a + k(b + c) + b = 2t { a + b = t b + c = t/k t and k known a = c+ k 1 t k 29 / 33 Irene Finocchi Mining data streams
81 Intro Puzzles Missing number Fishing Pointer & chaser Recap Random access helps n= Chase pointers, starting from position n + 1 Problem equivalent to finding a loop in a linked list Can be solved in O(n) time with just 2 pointers! a=9 b=3 r1 c=3 t(k-1)/k=6 r2 { a + b = t a + k(b + c) + b = 2t { a + b = t b + c = t/k a = c + k 1 t k t and k known 30 / 33 Irene Finocchi Mining data streams
82 Lesson 3 Intro Puzzles Missing number Fishing Pointer & chaser Recap Tokens come as a stream: no random access Sometimes impossible to achieve the same bounds as in the RAM model 31 / 33 Irene Finocchi Mining data streams
83 Recap on lessons Intro Puzzles Missing number Fishing Pointer & chaser Recap Typically impossible to solve problems precisely and deterministically in small (sublinear) space Randomize and approximate! Sequential data access makes things harder 32 / 33 Irene Finocchi Mining data streams
84 References Intro Puzzles Missing number Fishing Pointer & chaser Recap 1 Data streams: algorithms and applications, S. Muthukrishnan muthu/ 2 Algorithms for data streams, C. Demetrescu & I. Finocchi twiki.di.uniroma1.it/pub/ing_algo/webhome/dfchapter08.pdf 33 / 33 Irene Finocchi Mining data streams
Algorithms for data streams
Intro Puzzles Sampling Sketches Graphs Lower bounds Summing up Thanks! Algorithms for data streams Irene Finocchi finocchi@di.uniroma1.it http://www.dsi.uniroma1.it/ finocchi/ May 9, 2012 1 / 99 Irene
More informationData Streams: Algorithms and Applications
Foundations and Trends R in Theoretical Computer Science Vol. 1, No 2 (2005) 117 236 c 2005 S. Muthukrishnan Data Streams: Algorithms and Applications S. Muthukrishnan Rutgers University, New Brunswick,
More informationCourse : Data mining
Course : Data mining Lecture : Mining data streams Aristides Gionis Department of Computer Science Aalto University visiting in Sapienza University of Rome fall 2016 reading assignment LRU book: chapter
More informationB561 Advanced Database Concepts Streaming Model. Qin Zhang 1-1
B561 Advanced Database Concepts 2.2. Streaming Model Qin Zhang 1-1 Data Streams Continuous streams of data elements (massive possibly unbounded, rapid, time-varying) Some examples: 1. network monitoring
More informationData Streams: Algorithms and Applications 1
Data Streams: Algorithms and Applications 1 Data Streams: Algorithms and Applications 2 Contents 0.1 Abstract 5 0.2 Introduction 5 0.3 Map 10 0.4 The Data Stream Phenomenon 10 0.5 Data Streaming: Formal
More informationSublinear Models for Streaming and/or Distributed Data
Sublinear Models for Streaming and/or Distributed Data Qin Zhang Guest lecture in B649 Feb. 3, 2015 1-1 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining
More informationLecture 5: Data Streaming Algorithms
Great Ideas in Theoretical Computer Science Summer 2013 Lecture 5: Data Streaming Algorithms Lecturer: Kurt Mehlhorn & He Sun In the data stream scenario, the input arrive rapidly in an arbitrary order,
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #6: Mining Data Streams Seoul National University 1 Outline Overview Sampling From Data Stream Queries Over Sliding Window 2 Data Streams In many data mining situations,
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining
More informationMining Data Streams. Chapter The Stream Data Model
Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make
More informationData Streaming Algorithms for Geometric Problems
Data Streaming Algorithms for Geometric roblems R.Sharathkumar Duke University 1 Introduction A data stream is an ordered sequence of points that can be read only once or a small number of times. Formally,
More informationStreaming Algorithms. Stony Brook University CSE545, Fall 2016
Streaming Algorithms Stony Brook University CSE545, Fall 2016 Big Data Analytics -- The Class We will learn: to analyze different types of data: high dimensional graphs infinite/never-ending labeled to
More informationCS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 4pm
CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 2015 @ 4pm Your Name: Answers SUNet ID: root @stanford.edu Check if you would like exam routed
More informationTracking Frequent Items Dynamically: What s Hot and What s Not To appear in PODS 2003
Tracking Frequent Items Dynamically: What s Hot and What s Not To appear in PODS 2003 Graham Cormode graham@dimacs.rutgers.edu dimacs.rutgers.edu/~graham S. Muthukrishnan muthu@cs.rutgers.edu Everyday
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams
More informationBased on the original slides by Minos Garofalakis Yahoo! Research & UC Berkeley
Based on the original slides by Minos Garofalakis Yahoo! Research & UC Berkeley Traditional DBMS: data stored in finite, persistent data sets Data Streams: distributed, continuous, unbounded, rapid, time
More informationOne-Pass Streaming Algorithms
One-Pass Streaming Algorithms Theory and Practice Complaints and Grievances about theory in practice Disclaimer Experiences with Gigascope. A practitioner s perspective. Will be using my own implementations,
More informationSummarizing and mining inverse distributions on data streams via dynamic inverse sampling
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling Presented by Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Irina Rozenbaum rozenbau@paul.rutgers.edu
More informationThanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a
Data Mining and Information Retrieval Introduction to Data Mining Why Data Mining? Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 1 1 Acknowledgement Several Slides in this presentation are taken from course slides provided by Han and Kimber (Data Mining Concepts and Techniques) and Tan,
More informationB561 Advanced Database Concepts. 6. Streaming Algorithms. Qin Zhang 1-1
B561 Advanced Database Concepts 6. Streaming Algorithms Qin Zhang 1-1 The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store everything.
More informationOptimal Parallel Randomized Renaming
Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously
More informationCS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 4pm
CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 2015 @ 4pm Your Name: SUNet ID: @stanford.edu Check if you would like exam routed back via SCPD:
More informationInformation Retrieval II
Information Retrieval II David Hawking 30 Sep 2010 Machine Learning Summer School, ANU Session Outline Ranking documents in response to a query Measuring the quality of such rankings Case Study: Tuning
More informationData Streams: Algorithms and Applications
Data Streams: Algorithms and Applications S. Muthukrishnan Contents 1 Introduction 2 1.1 Puzzle 1: Finding Missing Numbers..... 2 1.2 Puzzle 2: Fishing.... 3 1.3 Lessons... 5 2 Map 6 3 Data Stream Phenomenon
More informationOn Biased Reservoir Sampling in the Presence of Stream Evolution
Charu C. Aggarwal T J Watson Research Center IBM Corporation Hawthorne, NY USA On Biased Reservoir Sampling in the Presence of Stream Evolution VLDB Conference, Seoul, South Korea, 2006 Synopsis Construction
More informationDATA MINING II - 1DL460
DATA MINING II - 1DL460 Spring 2016 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt16 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationSeek and Ye shall Find
Seek and Ye shall Find The continuum of computer intelligence COS 116, Spring 2010 Adam Finkelstein Final tally: Computer $77,147, Ken Jennings $24,000, Brad Rutter $21,600. Jennings: I, for one, welcome
More informationExtended R-Tree Indexing Structure for Ensemble Stream Data Classification
Extended R-Tree Indexing Structure for Ensemble Stream Data Classification P. Sravanthi M.Tech Student, Department of CSE KMM Institute of Technology and Sciences Tirupati, India J. S. Ananda Kumar Assistant
More informationAlgorithms for Grid Graphs in the MapReduce Model
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department
More informationCache and Forward Architecture
Cache and Forward Architecture Shweta Jain Research Associate Motivation Conversation between computers connected by wires Wired Network Large content retrieval using wireless and mobile devices Wireless
More informationEfficient Approximation of Correlated Sums on Data Streams
Efficient Approximation of Correlated Sums on Data Streams Rohit Ananthakrishna Cornell University rohit@cs.cornell.edu Flip Korn AT&T Labs Research flip@research.att.com Abhinandan Das Cornell University
More informationMining for Patterns and Anomalies in Data Streams. Sampath Kannan University of Pennsylvania
Mining for Patterns and Anomalies in Data Streams Sampath Kannan University of Pennsylvania The Problem Data sizes too large to fit in primary memory Devices with small memory Access times to secondary
More informationSeek and Ye shall Find
Seek and Ye shall Find The continuum of computer intelligence COS 116, Spring 2012 Adam Finkelstein Recap: Binary Representation Powers of 2 2 0 2 1 2 2 2 3 2 4 2 5 2 6 2 7 2 8 2 9 2 10 1 2 4 8 16 32 64
More informationCS246: Mining Massive Data Sets Winter Final
CS246: Mining Massive Data Sets Winter 2013 Final These questions require thought, but do not require long answers. Be as concise as possible. You have three hours to complete this final. The exam has
More informationDATA MINING II - 1DL460
DATA MINING II - 1DL460 Spring 2012 A second course in data mining!! http://www.it.uu.se/edu/course/homepage/infoutv2/vt12 Kjell Orsborn! Uppsala Database Laboratory! Department of Information Technology,
More information1 (15 points) LexicoSort
CS161 Homework 2 Due: 22 April 2016, 12 noon Submit on Gradescope Handed out: 15 April 2016 Instructions: Please answer the following questions to the best of your ability. If you are asked to show your
More informationFrequent Pattern Mining in Data Streams. Raymond Martin
Frequent Pattern Mining in Data Streams Raymond Martin Agenda -Breakdown & Review -Importance & Examples -Current Challenges -Modern Algorithms -Stream-Mining Algorithm -How KPS Works -Combing KPS and
More informationSublinear Algorithms for Big Data Analysis
Sublinear Algorithms for Big Data Analysis Michael Kapralov Theory of Computation Lab 4 EPFL 7 September 2017 The age of big data: massive amounts of data collected in various areas of science and technology
More informationDS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li
Welcome to DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: KH 116 Fall 2017 First Grading for Reading Assignment Weka v 6 weeks v https://weka.waikato.ac.nz/dataminingwithweka/preview
More informationTELCOM2125: Network Science and Analysis
School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 Figures are taken from: M.E.J. Newman, Networks: An Introduction 2
More informationLecture 2: Streaming Algorithms for Counting Distinct Elements
Lecture 2: Streaming Algorithms for Counting Distinct Elements 20th August, 2008 Streaming Algorithms Streaming Algorithms Streaming algorithms have the following properties: 1 items in the stream are
More informationDetect Cyber Threats with Securonix Proxy Traffic Analyzer
Detect Cyber Threats with Securonix Proxy Traffic Analyzer Introduction Many organizations encounter an extremely high volume of proxy data on a daily basis. The volume of proxy data can range from 100
More informationA Mathematical Proof. Zero Knowledge Protocols. Interactive Proof System. Other Kinds of Proofs. When referring to a proof in logic we usually mean:
A Mathematical Proof When referring to a proof in logic we usually mean: 1. A sequence of statements. 2. Based on axioms. Zero Knowledge Protocols 3. Each statement is derived via the derivation rules.
More informationZero Knowledge Protocols. c Eli Biham - May 3, Zero Knowledge Protocols (16)
Zero Knowledge Protocols c Eli Biham - May 3, 2005 442 Zero Knowledge Protocols (16) A Mathematical Proof When referring to a proof in logic we usually mean: 1. A sequence of statements. 2. Based on axioms.
More informationClustering. (Part 2)
Clustering (Part 2) 1 k-means clustering 2 General Observations on k-means clustering In essence, k-means clustering aims at minimizing cluster variance. It is typically used in Euclidean spaces and works
More informationDraft Notes 1 : Scaling in Ad hoc Routing Protocols
Draft Notes 1 : Scaling in Ad hoc Routing Protocols Timothy X Brown University of Colorado April 2, 2008 2 Introduction What is the best network wireless network routing protocol? This question is a function
More informationCS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #13: Mining Streams 1
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #13: Mining Streams 1 Data Streams In many data mining situa;ons, we do not know the en;re data set in advance Stream Management is important
More informationDirect Routing: Algorithms and Complexity
Direct Routing: Algorithms and Complexity Costas Busch, RPI Malik Magdon-Ismail, RPI Marios Mavronicolas, Univ. Cyprus Paul Spirakis, Univ. Patras 1 Outline 1. Direct Routing. 2. Direct Routing is Hard.
More informationCOMP 465 Special Topics: Data Mining
COMP 465 Special Topics: Data Mining Introduction & Course Overview 1 Course Page & Class Schedule http://cs.rhodes.edu/welshc/comp465_s15/ What s there? Course info Course schedule Lecture media (slides,
More informationData Streams Algorithms
Data Streams Algorithms Phillip B. Gibbons Intel Research Pittsburgh Guest Lecture in 15-853 Algorithms in the Real World Phil Gibbons, 15-853, 12/4/06 # 1 Outline Data Streams in the Real World Formal
More informationSuperconcentrators of depth 2 and 3; odd levels help (rarely)
Superconcentrators of depth 2 and 3; odd levels help (rarely) Noga Alon Bellcore, Morristown, NJ, 07960, USA and Department of Mathematics Raymond and Beverly Sackler Faculty of Exact Sciences Tel Aviv
More informationPractice Midterm Exam Solutions
CSE 332: Data Abstractions Autumn 2015 Practice Midterm Exam Solutions Name: Sample Solutions ID #: 1234567 TA: The Best Section: A9 INSTRUCTIONS: You have 50 minutes to complete the exam. The exam is
More informationOn User-centric QoE Prediction for VoIP & Video Streaming based on Machine-Learning
UNIVERSITY OF CRETE On User-centric QoE Prediction for VoIP & Video Streaming based on Machine-Learning Michalis Katsarakis, Maria Plakia, Paul Charonyktakis & Maria Papadopouli University of Crete Foundation
More informationBig Data Infrastructures & Technologies
Big Data Infrastructures & Technologies Data streams and low latency processing DATA STREAM BASICS What is a data stream? Large data volume, likely structured, arriving at a very high rate Potentially
More informationBig Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition
Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition What s the BIG deal?! 2011 2011 2008 2010 2012 What s the BIG deal?! (Gartner Hype Cycle) What s the
More informationDr. Amotz Bar-Noy s Compendium of Algorithms Problems. Problems, Hints, and Solutions
Dr. Amotz Bar-Noy s Compendium of Algorithms Problems Problems, Hints, and Solutions Chapter 1 Searching and Sorting Problems 1 1.1 Array with One Missing 1.1.1 Problem Let A = A[1],..., A[n] be an array
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2015/16 IR Chapter 04 Index Construction Hardware In this chapter we will look at how to construct an inverted index Many
More informationA Brief Introduction to Data Mining
A Brief Introduction to Data Mining L. Torgo ltorgo@dcc.fc.up.pt Departamento de Ciência de Computadores Faculdade de Ciências / Universidade do Porto Sept, 2014 Introduction Motivation for Data Mining?
More informationNP versus PSPACE. Frank Vega. To cite this version: HAL Id: hal https://hal.archives-ouvertes.fr/hal
NP versus PSPACE Frank Vega To cite this version: Frank Vega. NP versus PSPACE. Preprint submitted to Theoretical Computer Science 2015. 2015. HAL Id: hal-01196489 https://hal.archives-ouvertes.fr/hal-01196489
More information048866: Packet Switch Architectures
048866: Packet Switch Architectures Output-Queued Switches Deterministic Queueing Analysis Fairness and Delay Guarantees Dr. Isaac Keslassy Electrical Engineering, Technion isaac@ee.technion.ac.il http://comnet.technion.ac.il/~isaac/
More informationFaster, Space-Efficient Selection Algorithms in Read-Only Memory for Integers
Faster, Space-Efficient Selection Algorithms in Read-Only Memory for Integers Timothy M. Chan 1, J. Ian Munro 1, and Venkatesh Raman 2 1 Cheriton School of Computer Science, University of Waterloo, Waterloo,
More informationTreewidth and graph minors
Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under
More informationReal-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments
Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions
More informationHeavy-Hitter Detection Entirely in the Data Plane
Heavy-Hitter Detection Entirely in the Data Plane VIBHAALAKSHMI SIVARAMAN SRINIVAS NARAYANA, ORI ROTTENSTREICH, MUTHU MUTHUKRSISHNAN, JENNIFER REXFORD 1 Heavy Hitter Flows Flows above a certain threshold
More informationIs IPv4 Sufficient for Another 30 Years?
Is IPv4 Sufficient for Another 30 Years? October 7, 2004 Abstract TCP/IP was developed 30 years ago. It has been successful over the past 30 years, until recently when its limitation started emerging.
More informationBIG DATA TESTING: A UNIFIED VIEW
http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation
More informationParallel Disk-Based Computation and Computational Group Theory. Eric Robinson Gene Cooperman Daniel Kunkle
Parallel Disk-Based Computation and Computational Group Theory Eric Robinson Gene Cooperman Daniel Kunkle Jürgen Müller Northeastern University, Boston, USA RWTH, Aachen, Germany Applications from Computational
More informationResearch Students Lecture Series 2015
Research Students Lecture Series 215 Analyse your big data with this one weird probabilistic approach! Or: applied probabilistic algorithms in 5 easy pieces Advait Sarkar advait.sarkar@cl.cam.ac.uk Research
More informationSelection from Read-Only Memory with Limited Workspace
Selection from Read-Only Memory with Limited Workspace Srinivasa Rao Satti Seoul National University Joint work with Amr Elmasry, Daniel Dahl Juhl, and Jyrki Katajainen Outline Models and Problem definition
More informationKnowledge Discovery & Data Mining
Announcements ISM 50 - Business Information Systems Lecture 17 Instructor: Magdalini Eirinaki UC Santa Cruz May 29, 2007 News Folio #3 DUE Thursday 5/31 Database Assignment DUE Tuesday 6/5 Business Paper
More informationVideo AI Alerts An Artificial Intelligence-Based Approach to Anomaly Detection and Root Cause Analysis for OTT Video Publishers
Video AI Alerts An Artificial Intelligence-Based Approach to Anomaly Detection and Root Cause Analysis for OTT Video Publishers Live and on-demand programming delivered by over-the-top (OTT) will soon
More informationThe Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science
The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane
More informationISM 50 - Business Information Systems
ISM 50 - Business Information Systems Lecture 17 Instructor: Magdalini Eirinaki UC Santa Cruz May 29, 2007 Announcements News Folio #3 DUE Thursday 5/31 Database Assignment DUE Tuesday 6/5 Business Paper
More informationVisualisation of Abstract Information
Visualisation of Abstract Information Visualisation Lecture 17 Institute for Perception, Action & Behaviour School of Informatics Abstract Information 1 Information Visualisation Previously data with inherent
More informationAdvanced Algorithms and Data Structures
Advanced Algorithms and Data Structures Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Prerequisites A seven credit unit course Replaced OHJ-2156 Analysis of Algorithms We take things a bit further than
More informationMULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT
MULTIDIMENSIONAL INDEXING TREE STRUCTURE FOR SPATIAL DATABASE MANAGEMENT Dr. G APPARAO 1*, Mr. A SRINIVAS 2* 1. Professor, Chairman-Board of Studies & Convener-IIIC, Department of Computer Science Engineering,
More informationSliding HyperLogLog: Estimating cardinality in a data stream
Sliding HyperLogLog: Estimating cardinality in a data stream Yousra Chabchoub, Georges Hébrail To cite this version: Yousra Chabchoub, Georges Hébrail. Sliding HyperLogLog: Estimating cardinality in a
More informationThe Cross-Entropy Method
The Cross-Entropy Method Guy Weichenberg 7 September 2003 Introduction This report is a summary of the theory underlying the Cross-Entropy (CE) method, as discussed in the tutorial by de Boer, Kroese,
More informationProblem Set 2 Solutions
6.893 Problem Set 2 Solutons 1 Problem Set 2 Solutions The problem set was worth a total of 25 points. Points are shown in parentheses before the problem. Part 1 - Warmup (5 points total) 1. We studied
More informationOnline Mining of Frequent Query Trees over XML Data Streams
Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science National Chiao-Tung University Hsinchu, Taiwan 300, R.O.C. http://www.csie.nctu.edu.tw/~hfli/
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationAlgorithms for Evolving Data Sets
Algorithms for Evolving Data Sets Mohammad Mahdian Google Research Based on joint work with Aris Anagnostopoulos, Bahman Bahmani, Ravi Kumar, Eli Upfal, and Fabio Vandin Algorithm Design Paradigms Traditional
More informationCSE 5243 INTRO. TO DATA MINING
CSE 53 INTRO. TO DATA MINING Locality Sensitive Hashing (LSH) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan Parthasarathy @OSU MMDS Secs. 3.-3.. Slides
More informationAlgorithms for GIS:! Quadtrees
Algorithms for GIS: Quadtrees Quadtree A data structure that corresponds to a hierarchical subdivision of the plane Start with a square (containing inside input data) Divide into 4 equal squares (quadrants)
More informationcontext: massive systems
cutting the electric bill for internetscale systems Asfandyar Qureshi (MIT) Rick Weber (Akamai) Hari Balakrishnan (MIT) John Guttag (MIT) Bruce Maggs (Duke/Akamai) Éole @ flickr context: massive systems
More informationAn Indian Journal FULL PAPER. Trade Science Inc. Research on data mining clustering algorithm in cloud computing environments ABSTRACT KEYWORDS
[Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 17 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(17), 2014 [9562-9566] Research on data mining clustering algorithm in cloud
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro
More informationPSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets
2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department
More informationData Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University
Data Mining Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Why Mine Data? Commercial Viewpoint Lots of data is being collected and warehoused Web data, e-commerce
More information745: Advanced Database Systems
745: Advanced Database Systems Yanlei Diao University of Massachusetts Amherst Outline Overview of course topics Course requirements Database Management Systems 1. Online Analytical Processing (OLAP) vs.
More informationLarge-Scale Data Engineering. Overview and Introduction
Large-Scale Data Engineering Overview and Introduction Administration Blackboard Page Announcements, also via email (pardon html formatting) Practical enrollment, Turning in assignments, Check Grades Contact:
More informationHow TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.
6.172 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul Guest Lecture in MIT 6.172 Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 2 I m an MIT
More informationI am a Data Nerd and so are YOU!
I am a Data Nerd and so are YOU! Not This Type of Nerd Data Nerd Coffee Talk We saw Cloudera as the lone open source champion of Hadoop and the EMC/Greenplum/MapR initiative as a more closed and
More informationCHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science
CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science Entrance Examination, 5 May 23 This question paper has 4 printed sides. Part A has questions of 3 marks each. Part B has 7 questions
More informationBig Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.
Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop
More informationFall 2017 ECEN Special Topics in Data Mining and Analysis
Fall 2017 ECEN 689-600 Special Topics in Data Mining and Analysis Nick Duffield Department of Electrical & Computer Engineering Teas A&M University Organization Organization Instructor: Nick Duffield,
More information