INTRODUCTION TO DATA SCIENCE MapReduce and the New Software Stacks(MMDS2)

Big-Data Hardware: Computer Clusters. Computation: a large number of computers/CPUs. Network: Ethernet switching. Storage: a large collection of distributed disks. Commodity hardware: cheap but relatively unreliable. Computer node: network-connected CPU(s), disk(s) and RAM. How to manage distributed computation tasks and storage? How to protect against frequent failures?

New Software Stack Distributed File System Data is duplicated and distributed across multiple locations MapReduce programming paradigm A computational model for performing parallel computations A software infrastructure that manages all the boring tasks (failures, scheduling etc.)

2.1 Distributed File Systems

Computer Nodes Cluster computing: nodes are placed on racks (8-64 nodes per rack) connected by Ethernet; racks are connected by a switch.

Failures Typical failures: loss of a node, loss of a rack. Some computations take hours; we can't restart the whole process for each failure (it would never complete). Solution: redundant storage of files, so work can continue on the same file chunk on another node; computation is divided into tasks, and a task can be restarted on another node without affecting other tasks.

Large-Scale File-System Organization Assumptions: enormous files (TBs); files are rarely updated; data is read or appended.

DFS: Master Node Files are divided into large chunks (e.g. 64MB); chunks are replicated on different nodes/racks. To find a required chunk, locate the name node or master node for the file; the name node itself and the file index are replicated, and all participants know how to find the directory.

2.2 MapReduce

MapReduce systems A model and a software system for large-scale fault-resilient computation. Powerful and simple. Multiple implementations: Google MapReduce, open-source Hadoop and HDFS. Code all the logic/algorithm in two functions only, a Map function and a Reduce function, and let the system handle the rest: failures, duplication, restarts, scheduling, resources, monitoring, etc.

MapReduce Computation Map: a number of Map tasks on multiple nodes are given file chunks from a DFS and produce key-value pairs according to the logic coded in a map function; each data element might produce zero or more key-value pairs (custom logic, coded by the user). Group: the master controller collects the key-value pairs, sorts them by key, and divides the result into chunks to submit to a number of Reduce tasks; all key-value pairs with the same key go to the same chunk (standard logic, implemented by the system). Reduce: a number of Reduce tasks, each working on one key at a time, combine all the values for a single key (custom logic, coded by the user).

MapReduce Computation

Function, Task, Node Map or Reduce function: logic coded by a user. Mapper or reducer: a Map or Reduce function applied to a single input, e.g. the reducer for key w. Map or Reduce task: a Map/Reduce function applied to a chunk (a list of key-value pairs); a Reduce task runs a number of reducers. Map or Reduce node: a computer that currently runs one or more tasks (Map or Reduce); tasks might be scheduled to different nodes (there can be more tasks than nodes).

The Map Function Input file elements: a tuple, a line, a document; a chunk is a collection of elements, and each Map task works on one chunk at a time. Technically, the inputs to Map are key-value pairs, which allows composition of several MapReduce processes; usually the key is not relevant for the Map task (e.g. the line number in the input file). Output: each element is converted to zero or more key-value pairs; keys are not unique, and several identical key-value pairs from the same element are possible.

Example: Word Count Input: a repository of documents. Output: the number of appearances of each word. Each document is an input element. The Map function reads a document, breaks it into a sequence of words, and produces a key-value pair (w, 1) for each word w.

Grouping by key Collect the outputs from all maps into a single list and combine the values with the same key into a (key, list-of-values) pair. The system divides all keys into buckets, with as many buckets as there are Reduce tasks, using an appropriate hash function, and sends each bucket to a Reduce task. The input to each Reduce task is a list of key-value pairs.

Example: Word Count Input to the Reduce function: the key is a word, the value is a list of ones; sum all the ones to get the count. The output of all Reduce tasks is a sequence of pairs (w, m), where w is a word (the key) and m is its number of appearances.
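
A minimal pure-Python sketch of the word-count logic above; the in-memory grouping stands in for the system's group-by-key step, and all function names are illustrative:

    from collections import defaultdict

    def word_count_map(document):
        # Map: emit (word, 1) for every word in the document element
        for word in document.split():
            yield (word, 1)

    def word_count_reduce(word, counts):
        # Reduce: sum all the ones associated with a single key (word)
        return (word, sum(counts))

    def run_word_count(documents):
        groups = defaultdict(list)          # simulate grouping by key
        for doc in documents:
            for key, value in word_count_map(doc):
                groups[key].append(value)
        return [word_count_reduce(k, v) for k, v in groups.items()]

    print(run_word_count(["to be or not to be", "to do is to be"]))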

Combiners Optimize the MapReduce process, provided the reduce operation is associative and commutative, so values can be combined in any order. Push some of the Reducer logic into the Map tasks. Word count example: apply the reduce step in each map, combining key-value pairs with the same key into a single (w, m); the grouping and reduce steps are still necessary. A Combiner function is called after each map task and works on the local files produced by that map task, before all the maps' outputs are collected and shuffled.
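
A sketch, under the same assumptions as the word-count example, of a combiner that pre-aggregates counts locally inside a Map task before the shuffle:

    from collections import Counter

    def word_count_map_with_combiner(document):
        # Combiner: apply the (associative, commutative) summing locally,
        # so the Map task emits one (word, partial_count) pair per distinct word
        for word, count in Counter(document.split()).items():
            yield (word, count)

    # Grouping and the final reduce (summing partial counts) are still needed.
    print(list(word_count_map_with_combiner("to be or not to be")))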

MapReduce with Combiners

Parallelism We could execute each reducer in a dedicated Reduce task (a single key per process), but the overhead of creating tasks would be too high. Skew: differences in computing time between reducers, because different keys have different numbers of values; with different computation times, some nodes become idle early. Control the number of Reduce tasks: run several reducers per Reduce task to average the load, and use more tasks than nodes to balance the node load.

Details of MapReduce Execution Fork a Master controller process and a number of Worker processes; a Worker handles Map or Reduce tasks, but not both. Create a number of Map and Reduce tasks, usually one Map task for each input chunk. Select the number of Reduce tasks carefully: more tasks mean more communication, but also more parallelism. The Master keeps track of all tasks.

MapReduce Execution

Coping With Node Failures Failure of the master node: restart the whole MapReduce job (the worst case). Failure of a worker node: the Master monitors workers, detects the failure, and restarts only the tasks that ran on that worker.

2.3 Algorithms Using MapReduce

Usage Original usage: Google uses it for very large matrix-vector multiplication (to compute PageRank); the matrix represents links between web pages, and the vector represents the importance of each web page. It makes sense here: very large files, not updated in place, batch processing.

Matrix-Vector Multiplication An n-by-n matrix M with elements m_{i,j} (n is around 10 billion for Web pages) and a vector v of length n with elements v_j. The matrix-vector product is a vector x of length n with x_i = Σ_j m_{i,j} v_j. Assume the matrix and the vector are stored in the DFS with easily discoverable row-column coordinates, for instance as triples (i, j, m_{i,j}).

Case I: Vector fits into RAM The whole vector is available to each mapper. The Map function: for each element (i, j, m_{i,j}), output the key-value pair (i, m_{i,j} v_j). The Reduce function: sum all the values associated with a key i to obtain x_i.
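
A sketch of Case I, assuming the matrix is stored as (i, j, m_ij) triples and the whole vector v is in memory at every mapper (names and the in-memory grouping are illustrative):

    from collections import defaultdict

    def mv_map(triple, v):
        i, j, m_ij = triple                 # matrix element (i, j, m_ij)
        yield (i, m_ij * v[j])              # key is the row index i

    def mv_reduce(i, products):
        return (i, sum(products))           # x_i = sum of partial products

    def matrix_vector(triples, v):
        groups = defaultdict(list)
        for t in triples:
            for key, value in mv_map(t, v):
                groups[key].append(value)
        return dict(mv_reduce(i, vals) for i, vals in groups.items())

    # M = [[1, 2], [3, 4]], v = [1, 1]  ->  x = {0: 3, 1: 7}
    triples = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
    print(matrix_vector(triples, [1, 1]))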

Case II: Vector cannot fit in RAM Partition the matrix and the vector into stripes; each Map task gets a chunk of a matrix stripe together with the entire corresponding stripe of the vector.

Relational-Algebra Operations Data is frequently stored in tables: Relational Database Management Systems (RDBMS), the query language SQL, and an underlying theory of relations and operations over them. In MapReduce, data is stored in files, which frequently contain tables or key-value pairs; we need to perform SQL-like queries using MapReduce.

Relations A relation is a table (a set) with column headers. Attribute: a column header. Tuple: a row in the table; there are no duplicates(!). Schema: the set of all attributes of a particular relation. Relation example.

Relational Algebra Relational algebra is a set of standard operations; operations usually produce other relations from one or more input relations. Operations (queries) are often written in SQL and executed by an RDBMS. We need some formalism to describe/define similar operations to be executed by MapReduce.

Relational Algebra Selection R' = σ_C(R): apply a Boolean condition C to every tuple and produce the relation of tuples that satisfy C. Projection R' = π_S(R): select the attributes that are in a given subset S; the new relation contains only the selected attributes. Union, Intersection, Difference: easy to define for same-schema relations.

Relational Algebra Natural Join R' = R ⋈ S: join two relations into a single relation (table) by merging tuples that agree on the intersecting attributes. Grouping and Aggregation R' = γ_X(R): partition the tuples according to the values of the grouping attributes and compute an aggregation (MAX, SUM, ...) per group for the other attributes; X is a list of elements, each of which is either a grouping attribute or an aggregation θ(A) for an attribute A that is not a grouping attribute.

Examples Web Links: find paths of length two using the relation Links, i.e. triples (u, v, w); natural join of Links with itself using two copies L1(U1, U2) and L2(U2, U3). Social Network: a Friends relation of tuples; compute the number of friends for each user.

MapReduce: Selection and Projection
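
A minimal sketch of how selection and projection can be written as Map and Reduce functions (tuples are Python tuples; the condition and the kept attribute positions are parameters; the in-memory grouping stands in for the shuffle):

    from collections import defaultdict

    def selection_map(t, condition):
        # Selection: emit the tuple (as key and value) only if it satisfies C;
        # the Reduce function is the identity
        if condition(t):
            yield (t, t)

    def projection_map(t, keep):
        # Projection: emit the projected tuple; duplicates may now appear
        p = tuple(t[i] for i in keep)
        yield (p, p)

    def projection_reduce(key, values):
        return key                          # collapse duplicates into one tuple

    R = [(1, "a", 10), (2, "b", 20), (3, "a", 30)]
    print([k for t in R for k, _ in selection_map(t, lambda x: x[2] > 15)])
    groups = defaultdict(list)
    for t in R:
        for k, v in projection_map(t, [1]):
            groups[k].append(v)
    print([projection_reduce(k, v) for k, v in groups.items()])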

MapReduce: Union, Intersection, Difference
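
A sketch of the standard MapReduce formulations of union, intersection and difference for same-schema relations R and S, tagging each tuple with its source relation (the in-memory grouping is illustrative):

    from collections import defaultdict

    def set_op_map(t, source):
        yield (t, source)                   # key is the tuple, value is "R" or "S"

    def union_reduce(t, sources):
        return t                            # t appears in R, S, or both

    def intersection_reduce(t, sources):
        return t if {"R", "S"} <= set(sources) else None

    def difference_reduce(t, sources):      # R - S
        return t if set(sources) == {"R"} else None

    def run(R, S, reduce_fn):
        groups = defaultdict(list)
        for relation, name in ((R, "R"), (S, "S")):
            for t in relation:
                for key, value in set_op_map(t, name):
                    groups[key].append(value)
        return [r for t, v in groups.items() if (r := reduce_fn(t, v)) is not None]

    R, S = [(1,), (2,)], [(2,), (3,)]
    print(run(R, S, union_reduce), run(R, S, intersection_reduce), run(R, S, difference_reduce))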

MapReduce: Natural Join Start with the simple form R(A,B) ⋈ S(B,C); a similar approach works for joining by a group of attributes.
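
A sketch of the R(A,B) ⋈ S(B,C) join: the join attribute B is the key, and each value is tagged with its relation of origin (in-memory grouping again stands in for the shuffle):

    from collections import defaultdict

    def join_map_R(a, b):
        yield (b, ("R", a))                 # key is the join attribute B

    def join_map_S(b, c):
        yield (b, ("S", c))

    def join_reduce(b, values):
        # Pair every A from R with every C from S that share this B value
        a_vals = [v for tag, v in values if tag == "R"]
        c_vals = [v for tag, v in values if tag == "S"]
        return [(a, b, c) for a in a_vals for c in c_vals]

    R = [(1, "x"), (2, "x"), (3, "y")]
    S = [("x", 10), ("y", 20), ("z", 30)]
    groups = defaultdict(list)
    for a, b in R:
        for k, v in join_map_R(a, b):
            groups[k].append(v)
    for b, c in S:
        for k, v in join_map_S(b, c):
            groups[k].append(v)
    print([t for k, v in groups.items() for t in join_reduce(k, v)])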

MapReduce: Grouping and Aggregation Simple case R(A,B,C): group by A, aggregate over B.
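
A sketch of the simple case above (group by A, SUM over B, C is dropped):

    from collections import defaultdict

    def group_map(a, b, c):
        yield (a, b)                        # key is the grouping attribute A

    def group_reduce(a, b_values):
        return (a, sum(b_values))           # apply the aggregation (SUM) per group

    R = [(1, 10, "u"), (1, 5, "v"), (2, 7, "w")]
    groups = defaultdict(list)
    for a, b, c in R:
        for k, v in group_map(a, b, c):
            groups[k].append(v)
    print([group_reduce(k, v) for k, v in groups.items()])   # [(1, 15), (2, 7)]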

Matrix Multiplication using Relational Algebra M, N are matrices with elements m_{i,j} and n_{j,k}; the multiplication is P = MN. Represent the matrices by relations M(I, J, V) and N(J, K, W); this is especially efficient if the matrices are sparse (zeroes are omitted). Product: the natural join produces tuples (i, j, k, v, w) representing m_{i,j} and n_{j,k}; transform each to (i, j, k, v·w); then group by I and K with SUM aggregation over J.

MapReduce: Matrix Multiplication 1st phase (create all products m_{i,j} n_{j,k}): Map: for each m_{i,j} produce (j, (M, i, m_{i,j})); for each n_{j,k} produce (j, (N, k, n_{j,k})). Reduce: for each key j, output all possible combinations of M and N values as pairs with key (i, k) and value m_{i,j}·n_{j,k}. 2nd phase (combine all products for (i, k)): Map is the identity; Reduce sums the values for each key (i, k).
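
A sketch of the two-phase algorithm above, with the first phase keyed by j and the second by (i, k); the dictionaries simulate the two shuffles:

    from collections import defaultdict

    def phase1_map(tag, x, y, value):
        # m_ij arrives as ("M", i, j, m_ij); n_jk arrives as ("N", j, k, n_jk)
        if tag == "M":
            yield (y, ("M", x, value))      # key j, value (M, i, m_ij)
        else:
            yield (x, ("N", y, value))      # key j, value (N, k, n_jk)

    def phase1_reduce(j, values):
        # Emit ((i, k), m_ij * n_jk) for every M/N combination sharing this j
        ms = [(i, v) for tag, i, v in values if tag == "M"]
        ns = [(k, v) for tag, k, v in values if tag == "N"]
        return [((i, k), mv * nv) for i, mv in ms for k, nv in ns]

    def phase2_reduce(ik, products):
        return (ik, sum(products))          # 2nd phase: identity map, sum reduce

    # M = [[1, 2], [3, 4]], N = [[5, 6], [7, 8]]  ->  P = [[19, 22], [43, 50]]
    M = [("M", 0, 0, 1), ("M", 0, 1, 2), ("M", 1, 0, 3), ("M", 1, 1, 4)]
    N = [("N", 0, 0, 5), ("N", 0, 1, 6), ("N", 1, 0, 7), ("N", 1, 1, 8)]
    g1, g2 = defaultdict(list), defaultdict(list)
    for rec in M + N:
        for k, v in phase1_map(*rec):
            g1[k].append(v)
    for j, vals in g1.items():
        for ik, prod in phase1_reduce(j, vals):
            g2[ik].append(prod)
    print(sorted(phase2_reduce(ik, prods) for ik, prods in g2.items()))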

Single Step MapReduce Matrix Multiplication Map function: create multiple copies of each input element: ((i, k), (M, j, m_{i,j})) for k = 1, ..., n, and ((i, k), (N, j, n_{j,k})) for i = 1, ..., n. Reduce: the input for each key (i, k) consists of the pairs (M, j, m_{i,j}) and (N, j, n_{j,k}) for all j; multiply the matching pairs and sum them up for each key.
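
A sketch of the one-pass version: each m_ij is replicated to all keys (i, k), each n_jk to all keys (i, k), and one reducer per (i, k) matches the copies on j and sums the products:

    from collections import defaultdict

    def onepass_map(record, n):
        tag, a, b, value = record
        if tag == "M":                      # m_ab is sent to keys (a, k) for all k
            for k in range(n):
                yield ((a, k), ("M", b, value))
        else:                               # n_ab is sent to keys (i, b) for all i
            for i in range(n):
                yield ((i, b), ("N", a, value))

    def onepass_reduce(ik, values):
        m = {j: v for tag, j, v in values if tag == "M"}
        nmat = {j: v for tag, j, v in values if tag == "N"}
        return (ik, sum(m[j] * nmat[j] for j in m if j in nmat))

    n = 2
    records = [("M", 0, 0, 1), ("M", 0, 1, 2), ("M", 1, 0, 3), ("M", 1, 1, 4),
               ("N", 0, 0, 5), ("N", 0, 1, 6), ("N", 1, 0, 7), ("N", 1, 1, 8)]
    groups = defaultdict(list)
    for rec in records:
        for k, v in onepass_map(rec, n):
            groups[k].append(v)
    print(sorted(onepass_reduce(ik, vals) for ik, vals in groups.items()))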

2.4 Map-Reduce Extensions

Workflow systems Will be discussed later, when we talk about streams.

Recursive Extensions Recursive tasks are difficult to compute using MapReduce: MapReduce relies on independent restarts of failed tasks, but what if a parent in the recursion chain fails? We need some other mechanism for implementing recursive workflows, represented by flow graphs with cycles; convert the recursion to iteration.

Example: Path relation in a graph Assume a directed graph is represented by the relation E(X, Y). Compute the path relation P(X, Y): there is a path from node X to node Y. P_n(X, Y) = π_{X,Y}(P_{n−1}(X, Z) ⋈ E(Z, Y)). Iterative algorithm for computing P(X, Y): start from P(X, Y) = E(X, Y) and update/add new pairs until there is no change.

Algorithm
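
A minimal iterative sketch of the fixed-point computation described above; each round of the loop corresponds to one join-plus-deduplication MapReduce iteration, run here in memory:

    def transitive_closure(E):
        # Start with P = E, then repeatedly add pi_{X,Y}(P(X,Z) join E(Z,Y))
        # until no new pairs appear (a fixed point is reached)
        P = set(E)
        while True:
            new_pairs = {(x, y) for (x, z) in P for (z2, y) in E if z == z2}
            if new_pairs <= P:
                return P
            P |= new_pairs

    E = {(1, 2), (2, 3), (3, 4)}
    print(sorted(transitive_closure(E)))
    # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]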

Workflow Implementation Two types of tasks: Join and Dup-elim. n Join tasks create new candidate pairs; m Dup-elim tasks remove duplicates and resend new pairs to the Join tasks. Route/partition the data by hash functions: each Join task handles pairs (a,b), (b,c) according to the hash value h(b); each Dup-elim task handles pairs (a,c) according to the hash value g(a,c).

Join Tasks Join task #i receives all pairs (a, b) for which h(a) = i or h(b) = i, so each pair can go to two tasks, h(a) = i and h(b) = j. Each Join task stores P(a,b) locally until the end of the computation. If h(a) = i, it matches the new P(a,b) against the locally stored pairs P(x,a) and produces the output P(x,b); if h(b) = i, it matches the new P(a,b) against the locally stored pairs P(b,y) and produces the output P(a,y). Each resulting pair (c,d) is sent to a Dup-elim task according to the hash value g(c,d).

Dup-elim task Dup-elim task #j stores all pairs (c,d) with hash value g(c,d) = j. On receiving a new pair, it checks it against the locally stored pairs; if it is indeed new, it is stored and sent to the Join tasks according to h(c) and h(d).

Workflow

Details Every Join task writes to m output files, one file for each Dup-elim task; every Dup-elim task writes to n output files, one file for each Join task. Start by sending the E(a,b) pairs to the appropriate Dup-elim tasks according to g(a,b). Wait until all Join tasks finish before starting the Dup-elim phase, so that all Dup-elim tasks have their complete input files.

Failures It is not strictly necessary to have two types of tasks: whenever a Join task produces a new candidate (a,c), it could transmit it to two other Join tasks according to h(a), h(c), and before a Join task uses a new pair to search for candidates, it could check it against the locally stored pairs and discard it if it already exists. Failure: if a single task fails, everything with its hash value is lost. Having two types of tasks makes single failures recoverable: a failed Join task recreates its data from the relevant Dup-elim tasks, and a failed Dup-elim task recreates its data from the relevant Join tasks. It is not a problem if a restarted task produces duplicate input for other tasks.

Graph processing systems Handle computations where the input is a very large graph. Google's Pregel and Apache's Giraph. Facebook: 200 machines running Giraph, a graph of 1 trillion edges, 4 minutes of processing time.

Example Given a graph, compute the shortest distance between each pair of nodes. Assign a task per graph node; group several tasks on a single compute node. Each task receives messages, processes them, and sends out other messages. Computation proceeds in supersteps: all nodes process their messages, then all nodes issue their messages.

Algorithm Initially, node #a stores every edge from a: an edge from a to b of weight w is stored as (b, w). Node #a sends messages (a, b, w) to all other nodes. When a message (c, d, w) arrives at node #a, it considers the new paths (a, d) as a→c→d and (c, a) as c→d→a, and updates the stored weights if a shorter path is discovered; it then sends out messages with the newly discovered paths to all other nodes.
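
A simplified, in-memory sketch of the superstep idea (only the forward path extension a→c→d is applied; a real Pregel/Giraph job would partition the per-node tasks across workers):

    def shortest_paths_supersteps(nodes, edges):
        # dist[a] maps a reachable node x to the best known distance from a to x
        dist = {a: {} for a in nodes}
        for a, b, w in edges:
            dist[a][b] = min(w, dist[a].get(b, float("inf")))
        messages = [(a, x, w) for a in nodes for x, w in dist[a].items()]
        while messages:                      # one superstep per loop iteration
            new_messages = []
            for a in nodes:                  # every message is broadcast to all nodes
                for c, d, w in messages:
                    if c in dist[a] and dist[a][c] + w < dist[a].get(d, float("inf")):
                        dist[a][d] = dist[a][c] + w
                        new_messages.append((a, d, dist[a][d]))
            messages = new_messages
        return dist

    print(shortest_paths_supersteps([0, 1, 2, 3], [(0, 1, 1), (1, 2, 1), (2, 3, 1), (0, 3, 5)]))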

Handling Failure By checkpoints: every node saves its entire state every few supersteps. On failure, all nodes are restarted from the last checkpoint.

2.5 The Communication Cost Model

Measuring Quality of Algorithms For many algorithms the performance bottleneck is moving data between tasks (nodes): each task is usually simple and linear in its data, while transmitting data and reading it into memory is slow, so the communication cost dominates. To measure/estimate the communication cost, describe the algorithm as an acyclic workflow, a graph of tasks and the communication between them, and measure/estimate the data transmitted along each edge.

Communication Cost The communication cost of a task is the size of the input to the task, measured in bytes or tuples. The communication cost of an algorithm is the sum of the communication costs of all its tasks. Why not count outputs? They are counted as input to other tasks, unless they are the output of the entire algorithm; and if the output of the entire algorithm is large, it is most probably the input to a next stage, so count it as input to that stage.

Example: Natural Join Algorithm reminder: Map: for each (a,b) of R create the pair (b, (R,a)); for each (b,c) of S create the pair (b, (S,c)). Reduce: for a key b, combine all pairs (R,a), (S,c) into joined tuples. Communication cost: assume R and S are of size r and s. Input to all Maps: r+s; output of all Maps and input to all Reduces: r+s; total 2r+2s tuples, or O(r+s) bytes. The computation time is small. We do not count the output of the Reduce tasks (potentially of size r·s).

Wall-Clock Time We could assign all the work to a single task to minimize the communication cost, but the running time of the algorithm (wall-clock time) matters too: we need to divide the work fairly among the tasks while minimizing the communication cost. More on this later.

Multiway Joins: Cascade Example: a 3-way join. Cascade two MapReduce jobs: do the first join and then the second, or the second and then the first. Communication cost (p is the proportion/probability of a match): first then second: O((r+s) + (prs + t)) bytes; second then first: O((s+t) + (pst + r)) bytes.

Multiway Join: Single Step A single MapReduce job that joins all three relations at once. The key to a Reducer is a pair (i, j); it receives the tuples R(u, v), S(v, w), T(w, x) such that h(v) = i and g(w) = j. The total number of reducers is k = b·c, where b and c are the numbers of buckets for h and g. Send each S(v, w) to only a single Reducer; send each R(u, v) to c Reducers and each T(w, x) to b Reducers.

Single Step: Communication Cost To the Reduce tasks: s tuples to move (one copy of each S tuple), cr tuples to move (c copies of each R tuple), bt tuples to move (b copies of each T tuple). To the Map tasks: r + s + t input tuples to all Map tasks. How do you select b and c subject to cb = k? The Map communication cost is the same for all choices.

Optimization Problem Minimize s + cr + bt under the constraint cb = k. Lagrange multipliers: minimize s + cr + bt − λ(cb − k). Setting the derivatives with respect to c and b to zero: r − λb = 0, t − λc = 0. Rearranging and multiplying: rt = λ²k, so λ = √(rt/k). Substituting, b = r/λ = √(kr/t) and c = t/λ = √(kt/r). Substituting into the objective, s + cr + bt = s + 2√(krt). Adding the Map cost gives r + 2s + t + 2√(krt).
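
A quick numerical sanity check of the closed-form optimum b = √(kr/t), c = √(kt/r), compared against a brute-force search over integer factorizations of k (the sizes r, s, t and k below are arbitrary illustrations):

    import math

    r, s, t, k = 1000.0, 2000.0, 500.0, 64

    def reduce_cost(b, c):
        return s + c * r + b * t            # communication into the Reduce tasks

    b_opt, c_opt = math.sqrt(k * r / t), math.sqrt(k * t / r)
    print(reduce_cost(b_opt, c_opt), s + 2 * math.sqrt(k * r * t))   # both ~13313.7

    best = min(((b, k // b) for b in range(1, k + 1) if k % b == 0),
               key=lambda bc: reduce_cost(*bc))
    print(best, reduce_cost(*best))         # no integer split beats the closed form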

Example: Facebook 1B users with 300 friends each on average; the size of the relation is r = 3·10^11. Friends-of-friends relationship R ⋈ R: the maximum size is about 300r, but assuming friends tend to form cliques, roughly 30r. We want to compute the friends-of-friends-of-friends relationship R ⋈ R ⋈ R, to start marketing to those who have a large number of friends of friends of friends. What is the best way to compute this with MapReduce: cascading a pair of two-way joins, or a single 3-way join workflow?

Example: Communication cost Cascade computation: first join, Map 2r and Reduce 2r, total 4r; second join, Map r + 30r and Reduce r + 30r, total 62r; grand total 4r + 62r = 66r ≈ 1.98·10^13. 3-way join: r + r + 2r + 2r√k = 1.2·10^12 + 6·10^11·√k. Compare: 1.98·10^13 > 1.2·10^12 + 6·10^11·√k when √k < 1.86·10^13 / (6·10^11) = 31, i.e. k < 961. Result: if the number of reducers is less than 961, use the 3-way join.
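
The arithmetic above spelled out in a few lines (r = 3·10^11, the numbers from the slide):

    import math

    r = 3e11                                 # size of the Friends relation

    def cascade_cost():
        return 66 * r                        # 4r (first join) + 62r (second join)

    def threeway_cost(k):
        return 4 * r + 2 * r * math.sqrt(k)  # r + 2s + t + 2*sqrt(krt) with r = s = t

    print(cascade_cost())                    # 1.98e13
    print(threeway_cost(961))                # equals the cascade cost when sqrt(k) = 31
    print(threeway_cost(900) < cascade_cost())  # True: the 3-way join wins for k < 961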

Star join: Walmart Example Fact table: business facts; each sale is kept in the fact table, a relation F(A_1, A_2, ...). The attributes are the important components of the sale: item, store, branch, customer id. It is a very large table. Dimension tables: for each key attribute A_i a table D(A_i, B_{i,1}, B_{i,2}, ...) of descriptive attributes/fields; for Customer Id, attributes such as phone number, address, age. Many small tables. Analytical queries: join the fact table with one or more dimension tables (a star join) and aggregate the results into a useful form, e.g. aggregated sales by region and color for each month.

Star Joins: MapReduce Computation Don't wait for a specific query; prepare the Reduce nodes for all possibilities. Send the dimension tables to the Reduce nodes and store them locally, using the same hash values as would be used for a multiway join of the fact table with every dimension table. Store the fact table on the Map nodes; run the multiway join and aggregate.

2.6 Complexity Theory for MapReduce

Parameters of a MapReduce Algorithm Reducer size q: an upper bound on the number of values for a single key. We want it to be small, so that the input for a single reducer fits into memory and there is a high degree of parallelism. Replication rate r: the average number of key-value pairs per input element, i.e. the average communication from Map to Reduce per input element. Usually there is a tradeoff between the communication cost factor (r) and the computational efficiency factor (q).

Example: 1-Pass Matrix Multiplication Recall the 1-pass matrix multiplication: one reducer per output (matrix element), and Map sends a copy of each input to every relevant reducer. The replication rate is r = n (each element is duplicated exactly n times) and the reducer size is q = 2n (n values from each matrix). A whole family of algorithms can be designed along this tradeoff, all with q·r ≥ 2n².

Example: Similarity Joins Assume a collection X of 1M images. Input key-value pairs: index plus image, (i, P_i). Some measure of similarity s(x, y), symmetric: s(x, y) = s(y, x). Find all pairs (x, y) such that s(x, y) > t.

Obvious algorithm One reducer per pair: the reducer for key (i, j) evaluates s on the pair and produces output if it is greater than the threshold. Map: duplicate each input (i, P_i) for each j, producing ((i, j), P_i). The replication rate is r = 999,999 and the reducer size is q = 2. Communication (Map + Reduce): 10^6 + 10^6·10^6/2 ≈ 5·10^11 key-value pairs; for 1MB images the total size is about 10^18 bytes, an exabyte, which takes about 300 years over gigabit Ethernet.

Similarity Joins Using Groups Select g, the number of groups; each group has 10^6/g images. Use a hash function with g values to define the groups. The Map function: for each input element (i, P_i) generate g − 1 key-value pairs; each key is an unordered set {u, v}, where u is the group of this image and v ranges over all the other groups. The Reduce function (key {u, v}): compare the two groups. For each key {u, v} there are 2·10^6/g elements; compare instances from different groups. We also need to choose a reducer to compare within each group, for instance comparing within group u at the reducer for {u, u+1}.

Analysis Replication rate r = g − 1 ≈ g (assume g is large). Reducer size q = 2·10^6/g; for 1MB images the total input to a reducer is 2·10^12/g bytes. The number of reducers is k = g(g−1)/2 ≈ g²/2. For instance, with g = 1000: the input to a single reducer is 2GB, and the communication cost is 10^6 · 999 · 1MB ≈ 10^15 bytes, 1000 times less than before (about 4 months vs. 300 years), with roughly 500K reducers, which can be balanced well.

Graph model for a MapReduce problem A MapReduce problem has a set of inputs, a set of outputs, and a many-many relationship between inputs and outputs: which inputs are needed to produce which outputs. Example: the similarity join for 4 pictures.

Example: Matrix Multiplication Multiply two n-by-n matrices: 2n² inputs m_{i,j} and n_{j,k}, and n² outputs p_{i,k}. Each output p_{i,k} is related to 2n inputs: m_{i,1}, m_{i,2}, ... and n_{1,k}, n_{2,k}, .... Each input is related to n outputs: m_{i,j} to p_{i,1}, p_{i,2}, ..., and n_{j,k} to p_{1,k}, p_{2,k}, ....

Implicit inputs/outputs Example: the natural join of R(A,B) and S(B,C). Assume A, B, C have finite domains, so there is a finite number of possible inputs and outputs. Not all possible inputs (tuples) are present, and not all outputs are produced. For the purposes of the analysis, consider the complete graph for the problem: it is a model of the problem, not of a specific input instance.

Mapping Schemas Each algorithm is defined by a mapping schema: how outputs are produced from inputs (by reducers). Given a reducer size q, a mapping schema is an assignment of inputs to one or more reducers such that no reducer is assigned more than q inputs, and for every output there is at least one reducer assigned all of the inputs related to that output.

Example: Similarity Join The number of inputs is p; the number of outputs is (p² − p)/2 ≈ p²/2. Assign (g² − g)/2 ≈ g²/2 reducers; a reducer gets the inputs of 2 groups, so q = 2p/g. This is a valid mapping schema: the reducer size is q and every output is covered. The replication rate is r = g − 1 ≈ g, i.e. r = 2p/q: an inverse relation between r and q.

When Not All Inputs Are Present Not all inputs are present: there are always 1M images, but how many tuples does a relation contain? How should we assign inputs? Example: assume only 5% of the possible data is present; for a reducer of size q, only q/20 inputs will actually arrive. So do the analysis, account for the 5%, and replace q by 20q: design the algorithm for reducer size 20q, while in reality only q inputs will arrive.

Lower Bound On Replication Rate Similarity join: we select the reducer size to trade communication against parallelism, or against the ability to execute in RAM. How do we know that we got the best tradeoff between q and r, i.e. the minimum possible communication (r) for a given q? Prove a matching lower bound: find a lower bound for the problem (for a given q) and show that the tradeoff achieved by the assignment matches it.

Steps to Prove Lower Bounds (1) Bound the outputs of a single reducer: given an input of size q, it can cover only g(q) outputs, for any mapping. (2) Calculate the total number of outputs for the problem; this does not depend on the particular mapping. (3) Bound the number of outputs covered by all reducers: note the inequality Σ_{i=1..k} g(q_i) ≥ (total number of outputs). (4) Manipulate the inequality, using the reducer size bound q_i ≤ q, to bound the total communication Σ_{i=1..k} q_i. (5) The replication rate is Σ_{i=1..k} q_i divided by the number of inputs.

Example: Similarity Join A reducer with q inputs cannot cover more than q²/2 outputs. The total number of outputs is p²/2, so Σ_{i=1..k} q_i²/2 ≥ p²/2. Since q_i ≤ q, we get (q/2)·Σ_{i=1..k} q_i ≥ p²/2, i.e. Σ_{i=1..k} q_i ≥ p²/q. The replication rate is therefore r = Σ_{i=1..k} q_i / p ≥ p/q.

Case Study: Matrix Multiplication Consider an improvement to the 1-pass matrix multiplication: analyze the tradeoff of the algorithm and show a matching lower bound. Idea: group rows and columns into bands; each reducer gets a band of rows from the first matrix and a band of columns from the second matrix, and produces a square of elements of the output matrix.

Matrix Multiplications

Matrix Multiplication Compute P = MN, where all matrices are n-by-n. Group the rows of M into g groups/bands of n/g rows each, and likewise the columns of N. Keys correspond to pairs of bands (one from M and one from N), so the Map function uses g² keys: pairs of row-band/column-band numbers. Duplicate each M input for all possible column bands of N, and each N input for all possible row bands of M. The Reduce function computes a square of output elements.

Analysis A reducer gets n·n/g elements from each matrix, so in total q = 2n²/g. The replication rate is r = g. Combining the two, we obtain the tradeoff r = 2n²/q. One cannot get a better tradeoff: no 1-pass mapping achieves a lower r for the same q.

Analysis A reducer has to receive a full row and a full column to produce a single output. If its input consists of a rows and b columns, its size is q = (a + b)·n and it produces s = ab outputs; coverage is maximized when a = b. Then q = 2na and g(q) = a² = q²/(4n²). The total number of outputs is n², therefore Σ_{i=1..k} q_i²/(4n²) ≥ n².

Analysis From the previous inequality, Σ_{i=1..k} q_i² ≥ 4n⁴. Using q_i ≤ q: q·Σ_{i=1..k} q_i ≥ 4n⁴. The total communication is Σ_{i=1..k} q_i = r·2n², so q·r·2n² ≥ 4n⁴, i.e. r ≥ 2n²/q. The total communication is r·2n² = 4n⁴/q.

Matrix Multiplication II Recall the two-pass algorithm: the 1st step combines m_{i,j} and n_{j,k} at reducers keyed by j; the 2nd step sends all partial products to be summed up by key (i, k). Generalization: partition the rows and columns of both matrices into groups, giving g² squares of n²/g² elements each. Square (I, J) of M and square (J, K) of N are needed to compute square (I, K), where I, J, K are sets of indices (groups).

Matrix Multiplication

Matrix Multiplication: 1st step The Map function: keys are triples (I, J, K) of group numbers. Duplicate the M inputs for all K (group numbers) and the N inputs for all I (group numbers); the replication rate is g. The Reduce function (key (I, J, K)): compute the partial products needed for each p_{i,k}: x_{i,k,J} = Σ_{j∈J} m_{i,j} n_{j,k}, for all i in I and k in K.

Matrix Multiplication: 2nd step The Map function: for input x_{i,k,J}, the key is (i, k). The Reduce function: sum all the x_{i,k,J} to obtain p_{i,k}.

Analysis 1st step: replication rate g, total communication 2gn², reducer size q = 2n²/g². Substituting g = √(2n²/q), the total communication is 2gn² = 2n³√(2/q). 2nd step: g values for each (i, k), so the total communication is gn² = n³√(2/q). Total communication: the 2-pass algorithm (1st + 2nd steps) costs 3n³√(2/q), versus 4n⁴/q for the 1-pass algorithm (from the analysis above).

Summary Cluster computing: disks, CPUs, memory in racks of nodes. Distributed file system: large duplicated chunks. Complexity: communication cost, reducer size and replication rate, the problem as an input-output graph. MapReduce: parallelize, manage failures, logic in two custom functions; Hadoop, workflows. Relational operations: natural join, multiway and star joins; mapping schemas. Matrix multiplication: 1-pass vs. 2-pass, generalization to bands and squares, analysis.