MapReduce and Friends
Craig C. Douglas, University of Wyoming
With thanks to Mookwon Seo
Why was it invented?
- MapReduce is essentially a distributed mergesort for large distributed-memory computers.
- It was developed at Google as the basis for building its web search index, where pages are ranked.
- It is most useful for embarrassingly parallel applications.
- It has become a paradigm for implementing parallel algorithms.
Is it efficient?
- Yes:
  - When communication time can be managed so that it does not become overwhelming.
  - It encourages reconsidering how standard algorithms are implemented on distributed parallel machines.
- No:
  - It frequently uses files instead of pipes to communicate.
  - It encourages bad algorithm implementations.
Distributed file system
- We assume huge amounts of storage spread across a compute cluster built from commodity PCs.
- Files are read and appended, not edited.
- Files are distributed in chunks (64 MB is a typical size).
- Chunks are replicated on multiple compute nodes (3 replicas is typical).
- A master maintains an index of chunk locations.
Compute clusters
- Typically racks of 1U boards with multiple CPUs per board plus local storage.
- Intra-rack links are usually 1-40 Gb/s (Ethernet or InfiniBand).
- The inter-rack connection is usually something fast.
- Multiple communication fabrics are fairly common on very big clusters.
Distributed file systems
- Common ones: IBM GPFS, Google GFS, Lustre, Apache HDFS.
- Wikipedia lists 19 systems (and 3 of the above are not on their list!).
- Both proprietary and open source systems exist.
- Some are easy to set up; others are an ongoing nightmare.
- Reconfiguring a DFS is usually painful.
How does it all interact? The software stack, from top to bottom:
- SQL layer (Pig, Oink, Hive, etc.) or NoSQL layer (Cassandra, Dynamo, etc.)
- MapReduce (Hadoop, MR-MPI, etc.)
- Object store (key-value): BigTable, HBase, etc.
- Distributed file system
- Compute cluster
MapReduce systems
- Common ones:
  - Google's MapReduce (C++, proprietary)
  - Apache's Hadoop (Apache License, Java, open source)
  - MR-MPI (Sandia National Laboratories, BSD license, C/C++, open source)
- Important features:
  - Specialized parallel computing tools.
  - Typically, the user writes just two serial functions.
  - Avoids restarting the whole job if a compute node fails.
Key-value systems
- Popular ones:
  - Google BigTable
  - Apache HBase and Cassandra (NoSQL)
  - Amazon Dynamo
- Each row is associated with a key.
- The number of columns in a row can vary.
- Each (row, column) pair has a set of values.
SQL-like systems
Many, many are available. Some popular ones:
- Apache Hive (open source): implements HiveQL, a restricted subset of the SQL standard; sits on top of Hadoop.
- Yahoo! Pig: implements a relational algebra.
- Microsoft Scope
- Google Sawzall: implements parallel select + aggregation.
NoSQL systems
- NoSQL = Not Only Structured Query Language: a non-relational database.
- Key-value store structure internally.
- Examples: Apache HBase and Cassandra, Dynamo, CouchDB.
- Eventually consistent over quiet periods.
- Many systems now exist.
SQL versus NoSQL
- SQL: every record in a table has the same sequence of fields (though not every field need be populated).
- NoSQL: documents in a collection may have completely different fields. Documents are addressed by a unique key. Queries can select by document type.
Why mergesort? Comparisons and performance:
- Serial computer: O(n log n).
- Parallel computer with n processors: O(log n).
- Cache-aware versions of mergesort exist.
- n/2 auxiliary storage is standard, but only O(1) if a linked list is used.
- Too much copying unless a linked list is used.
- Lots of communication on parallel computers.
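As a serial baseline for the counts above, here is a minimal textbook mergesort sketch (array-based, so it uses O(n) auxiliary storage rather than the linked-list variant the slide mentions):

```python
def merge_sort(a):
    """Sort a list with O(n log n) comparisons; returns a new sorted list."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    # Merge the two sorted halves into auxiliary storage.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out
```

The recursive splits are what parallelize naturally: with n processors, each level of the recursion tree runs concurrently, which is where the O(log n) parallel bound comes from.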
Other sorting methods
- Heapsort:
  - Usually faster on serial computers.
  - Impractical for linked lists.
  - O(1) auxiliary storage is standard.
- Quicksort (serial computers with caches):
  - If mergesort's maximum is C n log n comparisons, quicksort averages about 1.39 C n log n.
  - Even so, quicksort on average is much faster by clock time.
MapReduce paradigm
- The MapReduce system creates a large number of tasks for each of two functions and divides the work among the tasks precisely.
- Two functions only:
  - Map tasks convert inputs from the DFS into key-value pairs, where the keys are not necessarily unique. Output is sorted by key.
  - Reduce tasks combine the key-value pairs for a given key, usually one Reduce task per key. Output goes to the DFS.
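The paradigm above can be sketched serially in a few lines; the function names here are illustrative, not any particular framework's API. The user supplies only `map_fn` and `reduce_fn`; the framework handles the sort-by-key shuffle in between:

```python
from itertools import groupby

def map_reduce(inputs, map_fn, reduce_fn):
    # Map: each input record becomes zero or more (key, value) pairs.
    pairs = [kv for rec in inputs for kv in map_fn(rec)]
    # Shuffle: sort by key (keys need not be unique) and group.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce: combine all values for each key.
    return {k: reduce_fn(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=lambda kv: kv[0])}

# The classic word-count usage example.
counts = map_reduce(["to be or", "not to be"],
                    lambda line: [(w, 1) for w in line.split()],
                    lambda k, vs: sum(vs))
# counts == {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In a real system the pairs would be partitioned across many parallel Map and Reduce tasks with the intermediate data written to the DFS, but the two serial user functions look the same.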
What is MapReduce good for?
- Matrix-vector and matrix-matrix multiplication.
  - The iterative form of PageRank uses these operations extensively.
- General relational algebra operations.
- Join operations in databases.
- Almost anything embarrassingly parallel that uses lots of data from a DFS.
- Dealing with failures efficiently.
Failure techniques
- Re-execute failed tasks, not whole jobs.
- Some systems checkpoint and then restart at the last checkpoint.
  - Very expensive to dump everything to disk: it adds the cost of extra disk drives that must sit on another compute rack, and heavy use can lead to early disk failure.
  - Moving the data across the inter-rack network is slow and may take minutes or fractions of hours.
What is the obvious Map function?
- A hash function h(x)!
  - Produce h(x) as the key; the value is x, placed in the h(x) bucket.
  - When mapping is finished, send each h(x) bucket to its Reduce task for combining.
- An efficient hash table code is imperative: use memory-cache tricks and sophisticated hash table implementations, not textbook ones.
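A minimal sketch of that Map step, assuming k Reduce tasks and using Python's built-in `hash` as a stand-in for the tuned hash function the slide calls for:

```python
def hash_map(values, k):
    """Map step: key = h(x) mod k, value = x; each key selects a Reduce bucket."""
    buckets = {}
    for x in values:
        key = hash(x) % k          # stand-in for a cache-friendly hash h(x)
        buckets.setdefault(key, []).append(x)
    return buckets
```

Each bucket is then shipped whole to the Reduce task responsible for that key, so a hash that distributes keys evenly also balances the Reduce-side load.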
MapReduce variant of Join
- Suppose we have a chunked file containing many edges from one vertex to another in a graph.
- We want to find all pairs of edges of the form R(a,b) and S(b,c) and join them to create T(a,c) if it does not already exist.
- Map: key on the b value, using a hash function h from b values to k buckets.
- Reduce: each task deals with one bucket.
MapReduce variant of Join
- Tuple R(a,b) goes to Reduce task h(b) with key = b, value = R(a,b).
- Tuple S(b,c) goes to Reduce task h(b) with key = b, value = S(b,c).
- If R(a,b) joins with S(b,c), then both edges are sent to the same Reduce task h(b).
- Their join (a,b,c) is appended to the output file on the DFS.
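A minimal serial sketch of this join, with the k Reduce tasks simplified to a dict of buckets and Python's `hash` standing in for h (tags distinguish which relation a tuple came from):

```python
def mr_join(R, S, k=4):
    """Join R(a,b) with S(b,c) on b, producing the set of pairs (a,c)."""
    # Map: route each tuple to bucket h(b) mod k, tagged by relation.
    buckets = {}
    for (a, b) in R:
        buckets.setdefault(hash(b) % k, []).append(("R", a, b))
    for (b, c) in S:
        buckets.setdefault(hash(b) % k, []).append(("S", b, c))
    # Reduce: within each bucket, match R and S tuples on the shared b.
    T = set()
    for tuples in buckets.values():
        rs = [(a, b) for tag, a, b in tuples if tag == "R"]
        ss = [(b, c) for tag, b, c in tuples if tag == "S"]
        for (a, b) in rs:
            for (b2, c) in ss:
                if b == b2:
                    T.add((a, c))
    return T

# With R = S = the edge set {(1,2),(1,3),(1,4),(2,3),(3,4)},
# mr_join finds the length-2 paths: {(1,3), (1,4), (2,4)}.
```

Note the explicit `b == b2` check inside each bucket: with k smaller than the number of distinct b values, unrelated keys share a bucket, so the Reduce task must still compare the join attribute.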
Example of Join
- Suppose we have a directed graph on vertices 1-4 with edges (1,2), (1,3), (1,4), (2,3), (3,4).
Example of Join
Map output, grouped by key b (here R and S are both the edge set):

  Key | R(a,b) tuples | S(b,c) tuples
  ----|---------------|---------------------
  1   | -             | (1,2), (1,3), (1,4)
  2   | (1,2)         | (2,3)
  3   | (1,3), (2,3)  | (3,4)
  4   | (1,4), (3,4)  | -

Reduce output: T(1,3), T(1,4), T(2,4)
Matrix-vector multiplication
- Compute y = Mx. For an NxN matrix M = [m_ij] and N-vectors x = [x_j] and y = [y_i],
  y_i = m_i1 x_1 + m_i2 x_2 + ... + m_iN x_N.
- Simplest Map function: key = i, value = m_ij x_j (optimization: skip zero entries).
- Works for dense and sparse matrices.
- The Reduce function adds up the products for a given i.
- Inexcusably inefficient in this form, however.
Matrix-vector multiplication example
Let M = [m11 m12 m13; m21 m22 m23; m31 m32 m33] and x = [x1; x2; x3].

Map output:
  Key | Values
  1   | m11 x1, m12 x2, m13 x3
  2   | m21 x1, m22 x2, m23 x3
  3   | m31 x1, m32 x2, m33 x3

Reduce output:
  Key | Value
  1   | m11 x1 + m12 x2 + m13 x3
  2   | m21 x1 + m22 x2 + m23 x3
  3   | m31 x1 + m32 x2 + m33 x3
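The simple scheme above can be written as a serial sketch in a few lines; the list-of-pairs intermediate makes the per-entry Map output explicit, which is exactly why this form is so inefficient:

```python
def matvec(M, x):
    """y = M x via the naive MapReduce scheme: one pair per nonzero entry."""
    # Map: emit (i, m_ij * x_j) for every nonzero m_ij.
    pairs = [(i, M[i][j] * x[j])
             for i in range(len(M))
             for j in range(len(x))
             if M[i][j] != 0]
    # Reduce: sum the products for each row index i.
    y = [0.0] * len(M)
    for i, v in pairs:
        y[i] += v
    return y
```

In a distributed run every one of those pairs would be sorted and shuffled through the DFS, which is the overhead the row-block variant on the next slide avoids.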
Matrix-vector multiplication
- Better approach when x and y are small enough to fit on every node:
  - Input M by sets of rows and assign the key k based on the row sets.
  - Compute whole y_i's as value elements; store a subvector of y as the value for key k.
  - The Reduce task just writes the subvectors out to the DFS.
- When x and y are too big, apply 2D domain decomposition methods to M, x, and y.
Matrix-matrix multiplication
- Compute C = AB. For an NxL matrix A = [a_ij], an LxM matrix B = [b_ij], and an NxM matrix C = [c_ij], C is formed from NxM inner products of length L.
- Simplest Map function: apply the matrix-vector product formulation with key = (i,j), value = an individual product a_il b_lj.
- Works for dense and sparse matrices.
- The Reduce function adds up the products for a given (i,j).
Matrix-matrix multiplication
- Unbelievably inefficient in that form, but seen in practice.
- Better approach for the Map function:
  - Assume each matrix is stored in blocks of size nxn (pad with zeroes at the right and bottom of a matrix), where n is convenient for your DFS.
  - Do the matrix-matrix multiplication using a block scheme.
  - Never, ever do a formal transpose on a DFS.
- Still works for dense and sparse matrices.
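A minimal serial sketch of the block scheme; the block size n is illustrative (in a real job each nxn block would be one DFS chunk read whole), and the dimensions are assumed already padded to multiples of n:

```python
def block_matmul(A, B, n):
    """C = A B computed block by block; dimensions must be multiples of n."""
    N, L, M = len(A), len(B), len(B[0])
    C = [[0.0] * M for _ in range(N)]
    for bi in range(0, N, n):              # block row of A and C
        for bj in range(0, M, n):          # block column of B and C
            for bk in range(0, L, n):      # inner block dimension
                # Multiply block A[bi:bi+n, bk:bk+n] by B[bk:bk+n, bj:bj+n]
                # and accumulate into C[bi:bi+n, bj:bj+n].
                for i in range(bi, bi + n):
                    for k in range(bk, bk + n):
                        a = A[i][k]
                        for j in range(bj, bj + n):
                            C[i][j] += a * B[k][j]
    return C
```

Because B's blocks are read in their stored row order, no transpose of B is ever materialized, which is the point of the "never do a formal transpose on a DFS" rule.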
Security and scaling
- Most MapReduce systems (e.g., Hadoop) provide no security or firewalling abilities.
  - All users have access to everything in the databases.
  - No encryption by default.
- This allows far better scaling on parallel systems, but security is extremely difficult to add later while still scaling.
- Medical record systems in the USA using Hadoop were put on notice that they would not be in compliance with privacy laws in effect on January 1, 2014. Big disaster.
Quick summary
- MapReduce is a distributed mergesort that uses disk files as intermediaries.
- Replace the mergesort with a fast sorting algorithm in each Reduce task.
- There is no reason to be restricted to slow disk files if everything fits in the global memory of the compute cluster.
- Use MPI or OpenMP communication techniques from traditional supercomputing.