MapReduce
Outline
- Distributed File System
- Map-Reduce
- The Computational Model
- Map-Reduce Algorithm Evaluation
- Computing Joins
Cluster Architecture
- Each rack contains 16-64 nodes (each with its own CPU, memory, and disk)
- 1 Gbps bandwidth between any pair of nodes in a rack
- 2-10 Gbps backbone between racks
[diagram: racks of nodes connected by per-rack switches, which connect to a backbone switch]
Stable storage
- First-order problem: if nodes can fail, how can we store data persistently?
- Answer: a distributed file system
  - Provides a global file namespace
  - Examples: Google GFS, Hadoop HDFS, Kosmix KFS
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common
Distributed File System
- Chunk servers
  - A file is split into contiguous chunks, typically 16-64 MB each
  - Each chunk is replicated (usually 2x or 3x)
  - Replicas are kept in different racks where possible
- Master node
  - Called the Name Node in HDFS
  - Stores metadata; may itself be replicated
- Client library for file access
  - Talks to the master to find the chunk servers
  - Connects directly to the chunk servers to access data
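As a rough illustration of the chunking and placement policy above (a sketch, not the actual GFS/HDFS code; the rack layout and rotation rule are assumptions for the example):

  CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, at the top of the 16-64 MB range
  REPLICAS = 3                    # typical replication factor

  def split_into_chunks(file_size):
      # Return (offset, length) pairs covering the file in contiguous chunks.
      return [(off, min(CHUNK_SIZE, file_size - off))
              for off in range(0, file_size, CHUNK_SIZE)]

  def place_replicas(chunk_id, racks):
      # Assign a chunk's replicas to servers, one per rack, so that replicas
      # land in different racks. `racks` maps rack name -> list of servers.
      rack_names = sorted(racks)
      chosen = []
      for i in range(REPLICAS):
          rack = rack_names[(chunk_id + i) % len(rack_names)]
          servers = racks[rack]
          chosen.append(servers[chunk_id % len(servers)])
      return chosen

For example, place_replicas(0, {"rack1": ["s1", "s2"], "rack2": ["s3"], "rack3": ["s4"]}) returns one server from each rack.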
Outline
- Distributed File System
- Map-Reduce
- The Computational Model
- Map-Reduce Algorithm Evaluation
- Computing Joins
MapReduce: The Map Step
[diagram: a map function is applied to each input key-value pair (k, v), emitting a list of intermediate key-value pairs]
MapReduce: The Reduce Step
[diagram: intermediate key-value pairs are grouped by key into (k, list-of-values) groups; a reduce function is applied to each group, producing the output key-value pairs]
MapReduce
- Input: a set of key-value pairs
- User supplies two functions:
  - map(k, v) → list(k1, v1)
  - reduce(k1, list(v1)) → v2
- (k1, v1) is an intermediate key-value pair
- Output is the set of (k1, v2) pairs
Word Count using MapReduce

  map(key, value):
    // key: document name; value: text of document
    for each word w in value:
      emit(w, 1)

  reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
      result += v
    emit(key, result)
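For readers who want to run this, here is a minimal single-machine sketch of the same logic in Python. The run_mapreduce driver is a stand-in for the distributed runtime (group-by-key done in memory), not part of any real framework:

  from collections import defaultdict

  def map_fn(doc_name, text):
      # key: document name; value: text of the document
      for word in text.split():
          yield (word, 1)

  def reduce_fn(word, counts):
      # key: a word; values: an iterable of counts
      yield (word, sum(counts))

  def run_mapreduce(inputs, map_fn, reduce_fn):
      # Toy in-memory driver: apply map, group by key, apply reduce.
      groups = defaultdict(list)
      for k, v in inputs:
          for k1, v1 in map_fn(k, v):
              groups[k1].append(v1)
      out = []
      for k1, values in groups.items():
          out.extend(reduce_fn(k1, values))
      return out

  print(run_mapreduce([("doc1", "the quick the")], map_fn, reduce_fn))
  # [('the', 2), ('quick', 1)]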
Distributed Execution Overview
[diagram: the user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits (Split 0, Split 1, Split 2) and write intermediate files to local disk; reduce workers perform a remote read and sort, then write the final output files (Output File 0, Output File 1)]
Data flow
- Input and final output are stored on a distributed file system
  - The scheduler tries to schedule map tasks close to the physical storage location of their input data
- Intermediate results are stored on the local file systems of the map and reduce workers
- Output is often the input to another MapReduce task
Coordination
- Master data structures
  - Task status: (idle, in-progress, completed)
  - Idle tasks get scheduled as workers become available
- When a map task completes, it sends the master the locations and sizes of its R intermediate files, one per reducer
  - The master pushes this information to the reducers
- The master pings workers periodically to detect failures
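A rough model of the bookkeeping described above (a sketch, not the actual Google or Hadoop master; the reducer.add_input hook is hypothetical):

  from dataclasses import dataclass, field
  from typing import Optional

  @dataclass
  class Task:
      kind: str                      # "map" or "reduce"
      status: str = "idle"           # idle -> in-progress -> completed
      worker: Optional[str] = None   # worker currently running the task
      # For a completed map task: (location, size) of each of its R
      # intermediate files, one per reducer.
      outputs: list = field(default_factory=list)

  def on_map_complete(task, reducers, outputs):
      # When a map task finishes, record its R files and push their
      # locations to the reducers.
      task.status = "completed"
      task.outputs = outputs
      for reducer, (location, size) in zip(reducers, outputs):
          reducer.add_input(location, size)   # hypothetical hook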
Failures
- Map worker failure
  - Map tasks completed or in-progress at that worker are reset to idle (completed map tasks must be redone because their output lives on the failed worker's local disk)
  - Reduce workers are notified when a task is rescheduled on another worker
- Reduce worker failure
  - Only in-progress tasks are reset to idle (completed reduce output is already on the distributed file system)
- Master failure
  - The MapReduce task is aborted and the client is notified
How many Map and Reduce jobs?
- M map tasks, R reduce tasks
- Rule of thumb: make M and R much larger than the number of nodes in the cluster
  - One map task per DFS chunk is common
  - Improves dynamic load balancing and speeds up recovery from worker failure
- Usually R is smaller than M, because the output is spread across R files
Combiners
- Often a map task will produce many pairs (k, v1), (k, v2), ... for the same key k
  - E.g., popular words in Word Count
- Can save network time by pre-aggregating at the mapper:
  - combine(k1, list(v1)) → v2
  - Usually the same as the reduce function
- Works only if the reduce function is commutative and associative
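In Word Count the combiner is just the reduce function applied to each mapper's local output. A sketch building on the earlier Python example (the local_combine helper is illustrative, not a framework API):

  from collections import defaultdict

  def local_combine(mapper_output, combine_fn):
      # Pre-aggregate one mapper's (key, value) pairs before the network
      # shuffle; valid here because word-count's reduce is commutative
      # and associative.
      groups = defaultdict(list)
      for k, v in mapper_output:
          groups[k].append(v)
      for k, values in groups.items():
          yield from combine_fn(k, values)

  # Reusing reduce_fn from the Word Count sketch as the combiner:
  pairs = [("the", 1), ("quick", 1), ("the", 1)]
  print(list(local_combine(pairs, reduce_fn)))  # [('the', 2), ('quick', 1)]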
Exercise 1: Host size
- Suppose we have a large web corpus
- Let's look at the metadata file
  - Lines of the form (URL, size, date, ...)
- For each host, find the total number of bytes, i.e., the sum of the page sizes for all URLs from that host
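One possible solution in the same toy Python style; the field layout of the metadata records is an assumption, and these functions plug into the run_mapreduce driver from the Word Count sketch:

  from urllib.parse import urlparse

  def map_fn(_, record):
      # record: (URL, size, date, ...) from the metadata file;
      # assumes URLs include a scheme so urlparse can extract the host
      url, size = record[0], record[1]
      yield (urlparse(url).netloc, int(size))   # key = host, value = bytes

  def reduce_fn(host, sizes):
      # total bytes = sum of page sizes for all URLs from this host
      yield (host, sum(sizes))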
Exercise 2: Graph reversal
- Given a directed graph as an adjacency list:
    src1: dest11, dest12, ...
    src2: dest21, dest22, ...
- Construct the graph in which all the links are reversed
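One possible solution: map emits each edge with source and destination swapped, and reduce rebuilds the adjacency lists (runnable with the run_mapreduce driver from the Word Count sketch):

  def map_fn(src, dests):
      # one adjacency-list entry: src -> [dest1, dest2, ...]
      for dest in dests:
          yield (dest, src)          # the reversed edge

  def reduce_fn(node, sources):
      # one adjacency-list entry of the reversed graph
      yield (node, sorted(sources))

  print(run_mapreduce([("a", ["b", "c"]), ("b", ["c"])], map_fn, reduce_fn))
  # [('b', ['a']), ('c', ['a', 'b'])]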
Implementations
- Google
  - Not available outside Google
- Hadoop
  - An open-source implementation in Java
  - Uses HDFS for stable storage
- Aster Data
  - Cluster-optimized SQL database that also implements MapReduce
Outline
- Distributed File System
- Map-Reduce
- The Computational Model
- Map-Reduce Algorithm Evaluation
- Computing Joins
Overview
- There is a new computing environment available: massive files, many computing nodes
- Map-reduce allows us to exploit this environment easily
- But not everything is map-reduce
- What else can we do in the same environment?
Files
- Stored in a dedicated file system
- Treated like relations
  - Order of elements does not matter
- Massive chunks (e.g., 64 MB)
  - Chunks are replicated
  - Parallel read/write of chunks is possible
Processes
- Each process operates at one node
- Assume an effectively infinite supply of nodes
- Communication among processes can be via the file system or special communication channels
  - Example: the master controller assembling the output of Map processes and passing it to Reduce processes
Algorithms
- An algorithm is described by an acyclic graph:
  1. A collection of processes (nodes)
  2. Arcs from node a to node b, indicating that (part of) the output of a goes to the input of b
Example: A Map-Reduce Graph
[diagram: map processes on one side, reduce processes on the other, with an arc from each map process to each reduce process]
Outline
- Distributed File System
- Map-Reduce
- The Computational Model
- Map-Reduce Algorithm Evaluation
- Computing Joins
Algorithm Design
- Goal: algorithms should exploit as much parallelism as possible
- To encourage parallelism, we put a limit s on the amount of input or output that any one process can have
- s could be:
  - What fits in main memory
  - What fits on local disk
  - No more than a process can handle before cosmic rays are likely to cause an error
Cost Measures for Algorithms
1. Communication cost = total I/O of all processes
2. Elapsed communication cost = max of I/O along any path
3. (Elapsed) computation costs are analogous, but count only the running time of processes
Example: Cost Measures
- For a map-reduce algorithm:
  - Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
  - Elapsed communication cost = the largest input + output of any one Map process, plus the same for any one Reduce process
What Cost Measures Mean
- Either the I/O (communication) cost or the processing (computation) cost dominates
  - Ignore one or the other
- Total costs tell you what you pay in rent to your friendly neighborhood cloud
- Elapsed costs are wall-clock time, reduced by exploiting parallelism
Outline
- Distributed File System
- Map-Reduce
- The Computational Model
- Map-Reduce Algorithm Evaluation
- Computing Joins
Join By Map-Reduce
- Compute the natural join R(A,B) ⋈ S(B,C)
- R and S are each stored in files
- Tuples are pairs (a,b) or (b,c)
Map-Reduce Join (2)
- Use a hash function h from B-values to 1..k
- A Map process turns each input tuple R(a,b) into the key-value pair (b,(a,r)) and each input tuple S(b,c) into (b,(c,s))
Map-Reduce Join (3)
- Map processes send each key-value pair with key b to Reduce process h(b)
  - Hadoop does this automatically; just tell it what k is
- Each Reduce process matches all the pairs (b,(a,r)) with all the pairs (b,(c,s)) and outputs (a,b,c)
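A sketch of the join in the same toy Python style; the tags "r" and "s" are the per-tuple labels recording which relation a tuple came from, as described above:

  def map_r(_, pair):
      a, b = pair                # tuple (a, b) from R
      yield (b, ("r", a))

  def map_s(_, pair):
      b, c = pair                # tuple (b, c) from S
      yield (b, ("s", c))

  def reduce_join(b, tagged):
      # Match every R-tuple with every S-tuple sharing this B-value.
      a_vals = [x for tag, x in tagged if tag == "r"]
      c_vals = [x for tag, x in tagged if tag == "s"]
      for a in a_vals:
          for c in c_vals:
              yield (a, b, c)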
Cost of Map-Reduce Join
- Total communication cost = O(|R| + |S|)
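To see why, apply the convention from the "Example: Cost Measures" slide (input read once, intermediate files counted twice, output counted once), noting that each input tuple becomes exactly one intermediate pair:

  Total communication = (|R| + |S|)        read the input files
                      + 2 × (|R| + |S|)    intermediate pairs, written then read
                      + |R ⋈ S|            output of the Reduce processes
                      = 3 × (|R| + |S|) + |R ⋈ S|

which is O(|R| + |S|) apart from the size of the output itself.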
Three-Way Join
- Consider a simple join of three relations: the natural join R(A,B) ⋈ S(B,C) ⋈ T(C,D)
- One way: a cascade of two 2-way joins, each implemented by map-reduce
- Fine, unless the 2-way joins produce large intermediate relations
Large Intermediate Relations
- A = "good" pages; B, C = all pages; D = "spam" pages
- R, S, and T each represent links
- 3-way join = paths of length 3 from a good page to a spam page
- R ⋈ S = paths of length 2 from a good page to any page; S ⋈ T = paths of length 2 from any page to a spam page
Another 3-Way Join
- Reduce processes use hash values of entire S(B,C) tuples as keys
- Choose a hash function h that maps B- and C-values to k buckets
- There are k² Reduce processes, one for each (B-bucket, C-bucket) pair
Mapping for 3-Way Join
- We map each tuple S(b,c) to ((h(b), h(c)), (S, b, c))
- We map each tuple R(a,b) to ((h(b), y), (R, a, b)) for all y = 1, 2, ..., k
- We map each tuple T(c,d) to ((x, h(c)), (T, c, d)) for all x = 1, 2, ..., k
- (The first component of each pair is the key; the second is the value)
Job of the Reducers
- Each reducer gets, for certain B-values b and C-values c:
  1. All tuples from R with B = b,
  2. All tuples from T with C = c, and
  3. The tuple S(b,c), if it exists
- Thus it can create every tuple of the form (a, b, c, d) in the join
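A sketch of the three map functions under the scheme above; K and h are the bucket count and hash function (illustrative choices, using buckets 0..K-1 rather than 1..k):

  K = 4                       # buckets per attribute, so K² reducers

  def h(value):
      return hash(value) % K  # illustrative hash into 0..K-1

  def map_s(_, t):
      b, c = t
      yield ((h(b), h(c)), ("S", b, c))       # goes to exactly one reducer

  def map_r(_, t):
      a, b = t
      for y in range(K):                      # replicate across all C-buckets
          yield ((h(b), y), ("R", a, b))

  def map_t(_, t):
      c, d = t
      for x in range(K):                      # replicate across all B-buckets
          yield ((x, h(c)), ("T", c, d))

Each S-tuple is sent once, while each R- and T-tuple is replicated K ways, which is the price of avoiding the intermediate 2-way join.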
Comments 不适合迭代计算 不能随机读取 性能问题 SPARK STORM 42