B490 Mining the Big Data. 5. Models for Big Data

Qin Zhang

MapReduce

The MapReduce model (Dean & Ghemawat 2004)
Pipeline: Input → Map → Shuffle → Reduce → Output
Standard model in industry for massive data computation, e.g., Hadoop.
Goal: minimize (1) total communication and (2) the number of rounds.

MapReduce (cont.)
Map: transforms a (key, value) pair into other (key, value) pairs using a UDF (user-defined function) called Map. Many mappers can run in parallel on vast amounts of data in a distributed file system.
Shuffle: the infrastructure transfers data from the mapper nodes to the reducer nodes so that all (key, value) pairs with the same key go to the same reducer.
Reduce: a UDF that aggregates all the values corresponding to a key. Many reducers can run in parallel.

MapReduce (cont.): a simple example
Consider as input an enormous text corpus (for instance, all of English Wikipedia) stored in our DFS. The goal is to count how many times each word is used.
Map: for each word $w$, emit the key-value pair $(w, 1)$.
Reduce: given all pairs $(w, v_1), (w, v_2), \ldots$ for a word $w$, output $(w, \sum_i v_i)$.
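The word-count example can be sketched in a few lines of Python. This is a single-process illustration only: the `shuffle` function stands in for what the MapReduce infrastructure does across machines, and all function names here are hypothetical, not part of Hadoop or any other framework.

```python
from collections import defaultdict

def map_words(document):
    """Map: emit (w, 1) for every word occurrence in one document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(word, values):
    """Reduce: sum the values v_i collected for one word."""
    return (word, sum(values))

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_words(doc))
    return dict(reduce_counts(w, vs) for w, vs in shuffle(pairs).items())

counts = word_count(["big data big models", "data models for big data"])
```

In a real deployment the mappers and reducers would run on different machines; only the (key, value) contract between the three stages carries over from this sketch.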

Recall PageRank
PageRank algorithm: we start, say, from Google.com, i.e., $q_0 = [0, \ldots, 0, 1, 0, \ldots, 0]$. In the iterative step, we compute a new vector estimate of PageRanks $q_{i+1}$ from the current estimate $q_i$ and the transition matrix $M$:
$q_{i+1} = \beta M q_i + (1 - \beta) e / n$,
where $\beta$ is a chosen constant, usually in $[0.8, 0.9]$, $e$ is the all-ones vector, and $n$ is the total number of nodes in the web graph.
Let $P = \beta M + (1 - \beta) B$, where $B[a, b] = 1/n$. Since the entries of $q_i$ sum to 1, $B q_i = e/n$ and hence $q_{i+1} = P q_i$, so we can rewrite $q = P^t q_0$ as $t \to \infty$.
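As a rough single-machine illustration of the iteration $q_{i+1} = \beta M q_i + (1 - \beta) e / n$, the sketch below stores $M$ implicitly as adjacency lists. It assumes $M[a][b] = 1/\text{outdeg}(b)$ when $b$ links to $a$ and that every node has at least one out-link; the function name and the tiny graph are made up for the example.

```python
def pagerank(out_links, beta=0.85, rounds=50):
    """Iterate q_{i+1} = beta * M q_i + (1 - beta) * e / n.

    out_links[b] lists the nodes that b links to, so M[a][b] = 1 / outdeg(b).
    Assumes every node has at least one out-link (no dead ends).
    """
    n = len(out_links)
    q = [0.0] * n
    q[0] = 1.0  # start from a single page, as in q_0
    for _ in range(rounds):
        nxt = [(1.0 - beta) / n] * n          # the (1 - beta) * e / n term
        for b, targets in enumerate(out_links):
            share = beta * q[b] / len(targets)  # beta * M[a][b] * q[b]
            for a in targets:
                nxt[a] += share
        q = nxt
    return q

# Hypothetical 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
ranks = pagerank([[1, 2], [2], [0]])
```

Only the sparse structure of $M$ is touched in each round; the dense matrix $P$ is never formed.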

Implementation of PageRank
$M$ is sparse and can be stored (on many machines at Google). Its size is roughly $20n$, where $n$ is on the order of 1 billion.
But $P$ is dense, since $B$ is dense, so $P^t$ is also dense. Thus we can only iterate $q_{i+1} = P q_i$ (computed as $\beta M q_i + (1 - \beta) e / n$, so that $P$ is never materialized) instead of using the matrix powering method.
However, since $n$ is very large, we cannot store $M$ or $q_i$ on any single machine. We will see how to use MapReduce to accomplish the job.

PageRank on MapReduce
Idea 1: Partition $M$ into $k$ vertical stripes $M = [M_1, \ldots, M_k]$ and partition $q$ into $q^T = [q_1, \ldots, q_k]$ (a horizontal split), so that each $M_j$ and each $q_j$ fits on a single machine. Then $M q = \sum_j M_j q_j$.
Now in each round:
Mapper: for the $a$-th row of $M_j$, creates the pair $(a, \sum_l M_j[a, l] \, q_j[l])$.
Reducer: for each row $a$, gets $(a, v_1), (a, v_2), \ldots$ and outputs $\beta \sum_t v_t + (1 - \beta)/n$.
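One round of Idea 1 might look like the following single-process sketch. For readability the stripes are dense Python lists, whereas a real $M$ would be stored sparsely; the dictionary `groups` plays the role of the shuffle, and the names `mapper`, `reducer`, and `pagerank_round` are illustrative assumptions.

```python
from collections import defaultdict

def mapper(stripe, q_part):
    """For each row a of one stripe M_j, emit (a, sum_l M_j[a][l] * q_j[l])."""
    for a, row in enumerate(stripe):
        yield (a, sum(m * x for m, x in zip(row, q_part)))

def reducer(values, beta, n):
    """Combine the partial sums for one row: beta * sum_t v_t + (1 - beta) / n."""
    return beta * sum(values) + (1.0 - beta) / n

def pagerank_round(M, q, k, beta=0.85):
    n = len(q)
    width = -(-n // k)  # ceil(n / k) columns per stripe
    groups = defaultdict(list)  # stands in for the shuffle phase
    for start in range(0, n, width):
        stripe = [row[start:start + width] for row in M]  # vertical stripe M_j
        for a, v in mapper(stripe, q[start:start + width]):
            groups[a].append(v)
    return [reducer(groups[a], beta, n) for a in range(n)]

# Hypothetical column-stochastic matrix for the graph 0 -> {1, 2}, 1 -> 2, 2 -> 0.
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
q1 = pagerank_round(M, [1.0, 0.0, 0.0], k=2)
```

Each mapper only needs its own stripe $M_j$ and the matching piece $q_j$, which is the point of the partition; the cost is that every mapper emits a value for every row, so the shuffle moves up to $k$ values per row.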

PageRank on MapReduce (cont.)
Idea 2: Partition $M$ into $\ell \times \ell$ blocks; each machine gets $M_{i,j}$ and $q_i$. (Details on board.)

MapReduce is designed for one-shot (static) computation. For the dynamic case, we have a model called ActiveDHT.

ActiveDHT
The ActiveDHT model (Bahmani, Chowdhury & Goel 2010)
Operations: Update(key, value) and Query(key).
Used in Yahoo! S4 and Twitter Storm.
[Figure: machines in the DHT, each responsible for the keys whose hashes fall in its range, e.g. keys with hash = 4, 5.]

An example
Problem: a stream of data (e.g., tweets) arrives and needs to be farmed out to many users/feeds in real time.
Simple solution:
Map: (user $u$, string tweet, time $t$) → $(v_1, (\text{tweet}, t)), (v_2, (\text{tweet}, t)), \ldots, (v_k, (\text{tweet}, t))$, where $v_1, v_2, \ldots, v_k$ follow $u$.
Reduce: (user $v$, $(\text{tweet}_1, t_1), (\text{tweet}_2, t_2), \ldots$) → sort tweets in descending order of time or importance.
[Figure: Update(key, value) and Query(key) requests routed to machines responsible for keys with hash = 4, 5 and hash = 6, 7.]
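The tweet fan-out example can be mimicked with a toy, in-process key-value store whose keys are hashed to "machines". This is only a sketch of the Update/Query interface under stated assumptions: the class `ActiveDHTSketch`, the helper `post_tweet`, and the follower data are hypothetical, and nothing here reflects the actual S4 or Storm APIs.

```python
import hashlib

class ActiveDHTSketch:
    """Toy in-process stand-in for an ActiveDHT: keys are hashed to machines."""

    def __init__(self, num_machines=4):
        self.machines = [dict() for _ in range(num_machines)]

    def _machine_for(self, key):
        # Route a key to the machine responsible for its hash.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.machines[int.from_bytes(digest[:4], "big") % len(self.machines)]

    def update(self, key, value):
        """Update(key, value): append a (tweet, t) pair to the feed under key."""
        self._machine_for(key).setdefault(key, []).append(value)

    def query(self, key):
        """Query(key): return the feed sorted in descending order of time."""
        feed = self._machine_for(key).get(key, [])
        return sorted(feed, key=lambda item: item[1], reverse=True)

def post_tweet(dht, followers, user, tweet, t):
    """Map step: issue one Update per follower v of the posting user."""
    for v in followers.get(user, []):
        dht.update(v, (tweet, t))

followers = {"alice": ["bob", "carol"], "dave": ["bob"]}
dht = ActiveDHTSketch()
post_tweet(dht, followers, "alice", "hello", 1)
post_tweet(dht, followers, "dave", "world", 2)
bob_feed = dht.query("bob")
```

Here the sort happens at query time; sorting at update time instead would trade cheaper queries for more work per update, which is the kind of choice the performance measures for this model are meant to capture.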

ActiveDHT (cont.)
Performance measures:
- Number of network calls per update
- Size of network data transfer per update
- Maximum size of a key-value pair
- Total size of all key-value pairs
- Maximum number of requests that go to a particular key-value pair
A challenge: we need new (distributed) data structures and algorithms.

Underlying membership query mechanism
Underlying membership query mechanism: Distributed Hash Table (DHT).
Popular realization of DHT: Chord (see Stoica's notes).

PageRank in ActiveDHT
PageRank in ActiveDHT (Bahmani, Chowdhury, Goel)
1. Do $r$ random walks starting at each node of the network ($r$ depends on the network topology; $O(\log(n/\epsilon))$ is good enough). Each random walk has stop probability $\epsilon$ and thus expected length $1/\epsilon$.
2. At time $t$, for every random walk passing through node $u_t$, shift it to use the new edge $(u_t, v_t)$ with probability $1/d_t(u_t)$.
3. Time for each re-routing: $O(1/\epsilon)$.
4. Time to decide whether any walk will get rerouted: $O(1)$.
5. Claim: this faithfully maintains the $r$ random walks per node after arbitrary edge arrivals.
Observe that we need the graph and the stored random walks to be available in an ActiveDHT; this is a reasonable assumption for social networks, though not necessarily for the web graph.
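The rerouting steps can be sketched on a single machine as follows. This is a simplified illustration, not the authors' implementation: it assumes $d_t(u_t)$ means the out-degree of $u_t$ after the new edge is inserted, it scans all stored walks instead of using the $O(1)$ test for whether any walk is affected, and it does not resume walks that had stopped at a dead end. All function names and the small graph are made up.

```python
import random

def random_walk(graph, start, eps, rng):
    """Walk from start; at each step stop with probability eps (or at a dead end)."""
    walk = [start]
    while graph[walk[-1]] and rng.random() >= eps:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

def add_edge(graph, walks, u, v, eps, rng):
    """Insert edge (u, v) and reroute stored walks that step out of u."""
    graph[u].append(v)
    d = len(graph[u])  # assumed meaning of d_t(u_t): out-degree after insertion
    for idx, walk in enumerate(walks):
        for pos in range(len(walk) - 1):  # visits to u that are followed by a step
            if walk[pos] == u and rng.random() < 1.0 / d:
                # Reroute through the new edge and resample the rest of the walk.
                walks[idx] = walk[:pos + 1] + random_walk(graph, v, eps, rng)
                break

def estimate_pagerank(walks, n):
    """Monte Carlo estimate: fraction of all walk visits that land on each node."""
    visits = [0] * n
    for walk in walks:
        for node in walk:
            visits[node] += 1
    total = sum(visits)
    return [c / total for c in visits]

rng = random.Random(0)
graph = {0: [1], 1: [2], 2: [0], 3: [0]}  # hypothetical graph, no dead ends
eps, r = 0.2, 5
walks = [random_walk(graph, s, eps, rng) for s in graph for _ in range(r)]
add_edge(graph, walks, 1, 3, eps, rng)
ranks = estimate_pagerank(walks, len(graph))
```

The reason rerouting with probability $1/d$ works is that a step out of $u$ that was uniform over the old $d - 1$ neighbours becomes uniform over all $d$ neighbours once a $1/d$ fraction of those steps is redirected to $v$.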

PageRank in ActiveDHT, analysis
On board. Need $O((r n \log m)/\epsilon^2)$ time.
Comparison with other approaches:
Naive approach 1: run the power iteration method from scratch. Total time over $m$ edge arrivals is $O(r m^2)$.
Naive approach 2: run the Monte Carlo method from scratch. Total time over $m$ edge arrivals is $O(r m n/\epsilon)$.

Thank you!

LSH in ActiveDHT
Recall LSH. Definition: $h$ is $(l, u, p_l, p_u)$-sensitive with respect to a distance function $d$ if
$\Pr[h(a) = h(b)] > p_l$ if $d(a, b) < l$,
$\Pr[h(a) = h(b)] < p_u$ if $d(a, b) > u$.
Given an LSH family, one can design an algorithm for the $(l, u)$-Near Neighbor problem that uses $O(n^\rho)$ hash functions, where $n$ is the number of points and $\rho = \log p_l / \log p_u$.
$(l, u)$-Near Neighbor: given a set $X$ of $n$ points in some metric space with distance function $d(\cdot, \cdot)$ and a query point $q$, return
- all points $p \in X$ such that $d(p, q) \le l$;
- no points $p \in X$ such that $d(p, q) \ge u$;
- arbitrary decisions for points $p \in X$ such that $l < d(p, q) < u$.
(Next few slides are borrowed from Ashish Goel's talk.)
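To make the definition concrete, here is a small sketch using bit sampling, a standard LSH family for Hamming distance on binary vectors (a choice made for this example; the slides do not fix a particular family). Each of the `L` tables hashes a point by `k` randomly sampled coordinates; a query collects colliding points and discards any at distance at least $u$. The class name and the parameter values are illustrative assumptions, not tuned settings.

```python
import random

def hamming(a, b):
    """Hamming distance between two equal-length tuples."""
    return sum(x != y for x, y in zip(a, b))

class BitSamplingLSH:
    """L hash tables; each hash concatenates k randomly sampled coordinates."""

    def __init__(self, dim, k, L, seed=0):
        rng = random.Random(seed)
        self.projections = [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]
        self.tables = [dict() for _ in range(L)]

    def _keys(self, point):
        return [tuple(point[i] for i in proj) for proj in self.projections]

    def insert(self, point):
        for table, key in zip(self.tables, self._keys(point)):
            table.setdefault(key, set()).add(point)

    def query(self, q, u):
        """Return points colliding with q in some table, dropping any at distance >= u."""
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates |= table.get(key, set())
        return {p for p in candidates if hamming(p, q) < u}

index = BitSamplingLSH(dim=8, k=3, L=10)
points = [(0,) * 8, (1,) * 8, (0, 0, 0, 0, 0, 0, 1, 1)]
for p in points:
    index.insert(p)
result = index.query((0,) * 8, u=4)
```

The final distance filter is what enforces "no points at distance at least $u$"; whether a near point is found depends on the random projections, which is why several tables are used.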
