Map-Reduce Applications: Counting, Graph Shortest Paths


Map-Reduce Applications: Counting, Graph Shortest Paths. Adapted from UMD Jimmy Lin's slides, which are licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Overview: simple counting and averaging; graph problems and representations; parallel breadth-first search.

MapReduce: Recap. Programmers must specify: map(k1, v1) → <k2, v2>* and reduce(k2, <v2>*) → <k3, v3>*; all values with the same key are reduced together. Optionally, also: partition(k2, number of partitions) → partition for k2, often a simple hash of the key, e.g., hash(k2) mod n, which divides up the key space for parallel reduce operations; and combine(k2, v2) → <k2, v2>*, mini-reducers that run in memory after the map phase, used as an optimization to reduce network traffic. The execution framework handles everything else.

Everything Else. The execution framework handles everything else. Scheduling: assigns workers to map and reduce tasks. Data distribution: moves processes to data. Synchronization: gathers, sorts, and shuffles intermediate data. Errors and faults: detects worker failures and restarts. This leaves limited control over data and execution flow: all algorithms must be expressed in m, r, c, p. You don't know where mappers and reducers run, when a mapper or reducer begins or finishes, which input a particular mapper is processing, or which intermediate key a particular reducer is processing.

[Figure: MapReduce dataflow — four mappers emit intermediate pairs (a 1, b 2, c 3, c 6, a 5, c 2, b 7, c 8); combiners pre-aggregate locally (c 3 and c 6 become c 9); partitioners route keys to reducers; shuffle-and-sort aggregates values by key (a {1, 5}, b {2, 7}, c {2, 9, 8}); three reducers emit final pairs (r1, s1), (r2, s2), (r3, s3).]

Word Count: Baseline. What's the impact of combiners?

Word Count: Version 1

Word Count: Version 2
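The word-count code on these slides did not survive the transcription. As a rough sketch of the baseline pattern (plain Python standing in for Hadoop's Java API; all names here are illustrative, not Hadoop's), the mapper emits (term, 1) per occurrence and the reducer sums the partial counts:

```python
from collections import defaultdict

def mapper(doc):
    for word in doc.split():
        yield word, 1                  # emit (term, 1) for every occurrence

def reducer(word, counts):
    return word, sum(counts)           # sum the partial counts for each term

def word_count(docs):
    groups = defaultdict(list)
    for doc in docs:
        for k, v in mapper(doc):
            groups[k].append(v)        # shuffle/sort: group values by key
    return dict(reducer(k, vs) for k, vs in groups.items())

print(word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

A combiner would run the same summation on each mapper's local output before the shuffle, shrinking the (term, 1) stream to one partial count per term per mapper — that is its impact on network traffic.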

Design Pattern for Local Aggregation: in-mapper combining. Fold the functionality of the combiner into the mapper by preserving state across multiple map calls. Advantages: speed (faster than actual combiners). Disadvantages: explicit memory management required, and potential for order-dependent bugs.
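The in-mapper combining idea can be sketched as follows (illustrative Python, not Hadoop's Mapper API; the `close()` hook mimics Hadoop's end-of-task cleanup method):

```python
from collections import Counter

class InMapperCombiningMapper:
    """Accumulates counts across map() calls; emits once at the end."""

    def __init__(self):
        self.counts = Counter()        # state preserved across map calls

    def map(self, doc):
        self.counts.update(doc.split())  # aggregate locally, emit nothing yet

    def close(self):
        # emit one (term, partial count) pair per term seen by this mapper
        return list(self.counts.items())

m = InMapperCombiningMapper()
for doc in ["a b a", "b c"]:
    m.map(doc)
print(sorted(m.close()))  # [('a', 2), ('b', 2), ('c', 1)]
```

The explicit-memory-management caveat is visible here: `self.counts` grows with the number of distinct terms, so a real implementation must flush it periodically.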

Combiner Design. Combiners and reducers share the same method signature, so sometimes reducers can serve as combiners; often they cannot. Remember: combiners are optional optimizations. They should not affect algorithm correctness, and may be run 0, 1, or multiple times. Example: find the average of all integers associated with the same key.

Computing the Mean: Version 1
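The code for this slide was not captured. The point of the example is that a reducer which averages its inputs cannot double as a combiner, because a mean of means is not the overall mean; the standard fix is to have the combiner emit partial (sum, count) pairs. A sketch (illustrative Python, not Hadoop code):

```python
def combiner(key, values):
    # Emit a partial (sum, count) pair, NOT a partial mean:
    # mean(mean(1, 2), 6) = 3.75, but mean(1, 2, 6) = 3.
    s = sum(v for v, _ in values)
    n = sum(c for _, c in values)
    yield key, (s, n)

def reducer(key, values):
    # Combine the partial sums and counts, then divide once at the end.
    s = sum(v for v, _ in values)
    n = sum(c for _, c in values)
    return key, s / n

# the mapper emits each integer as a (value, count=1) pair
part1 = list(combiner("k", [(1, 1), (2, 1)]))   # one mapper's output: (3, 2)
part2 = list(combiner("k", [(6, 1)]))           # another mapper's:     (6, 1)
print(reducer("k", [part1[0][1], part2[0][1]])) # ('k', 3.0) = (3+6)/(2+1)
```

Because (sum, count) pairs are associative to merge, this combiner may safely run 0, 1, or many times without changing the result.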

Single Source Shortest Path. Problem: find the shortest path from a source node to one or more target nodes. Shortest might also mean lowest weight or cost. First, a refresher: Dijkstra's Algorithm.

Dijkstra's Algorithm Example. [Figure: six animation frames of Dijkstra's algorithm running on the example graph (edge weights 10, 5, 1, 2, 3, 9, 4, 6, 7, 2); the source starts at distance 0 and all other nodes at ∞, and the distance estimates are relaxed step by step until they settle at 0, 8, 5, 9, 7.] Example from CLR.

Single Source Shortest Path. Problem: find the shortest path from a source node to one or more target nodes. Shortest might also mean lowest weight or cost. Single processor machine: Dijkstra's Algorithm. MapReduce: parallel breadth-first search (BFS).

Source: Wikipedia (Wave)

Finding the Shortest Path. Consider the simple case of equal edge weights. The solution to the problem can be defined inductively. Here's the intuition. Define: b is reachable from a if b is on the adjacency list of a. DISTANCETO(s) = 0. For all nodes p reachable from s, DISTANCETO(p) = 1. For all nodes n reachable from some other set of nodes M, DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M). [Figure: node n reached from nodes m1, m2, m3 with path costs d1, d2, d3.]

Visualizing Parallel BFS. [Figure: an example graph with nodes n0 through n9, showing the BFS frontier expanding outward from the source one hop at a time.]

From Intuition to Algorithm. Data representation — Key: node n; Value: d (distance from start) plus the adjacency list (list of nodes reachable from n). Initialization: for all nodes except the start node, d = ∞. Mapper: for each m in the adjacency list, emit (m, d + 1). Sort/Shuffle: groups distances by reachable nodes. Reducer: selects the minimum-distance path for each reachable node; additional bookkeeping is needed to keep track of the actual path.

Multiple Iterations Needed Each MapReduce iteration advances the known frontier by one hop Subsequent iterations include more and more reachable nodes as frontier expands Multiple iterations are needed to explore entire graph Preserving graph structure: Problem: Where did the adjacency list go? Solution: mapper emits (n, adjacency list) as well

BFS Pseudo-Code
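The pseudo-code on this slide did not survive the transcription. A minimal simulation of one iteration (plain Python standing in for the Hadoop job; all names illustrative) that follows the mapper/reducer description above, including re-emitting the graph structure so it survives the iteration:

```python
from collections import defaultdict

INF = float("inf")

def mapper(node, value):
    d, adj = value
    yield node, ("graph", adj)     # re-emit graph structure so it is preserved
    yield node, ("dist", d)        # keep the node's current best distance
    for m in adj:
        yield m, ("dist", d + 1)   # tentative distance reaching m via this node

def reducer(node, values):
    adj, best = [], INF
    for tag, v in values:
        if tag == "graph":
            adj = v                # recover the adjacency list
        else:
            best = min(best, v)    # select the minimum distance seen
    return node, (best, adj)

def bfs_iteration(graph):
    """One MapReduce pass. graph maps node -> (distance, adjacency list)."""
    groups = defaultdict(list)
    for node, value in graph.items():
        for k, v in mapper(node, value):
            groups[k].append(v)    # shuffle/sort: group values by node key
    return dict(reducer(n, vs) for n, vs in groups.items())

g = {"A": (0, ["B", "D"]), "B": (INF, ["C"]), "C": (INF, []), "D": (INF, ["C"])}
g = bfs_iteration(g)               # the known frontier advances one hop
print(g["B"], g["D"])
```

After one pass, B and D (one hop from the source A) have distance 1 while C is still at infinity; a second pass would discover C, which is why an external driver must rerun the job until distances stop changing.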

Example: SSSP Parallel BFS in MapReduce

Adjacency matrix:
     A   B   C   D   E
A        10       5
B             1   2
C                      4
D        3    9        2
E    7        6

Adjacency list:
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)

[Figure: the example graph; source A starts at distance 0, all other nodes at ∞.]

Example: SSSP Parallel BFS in MapReduce

Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>

Map output: <dest node ID, dist>
<B, 10> <D, 5> <C, inf> <D, inf> <E, inf> <B, inf> <C, inf> <E, inf> <A, inf> <C, inf>
plus the unchanged graph-structure records
<A, <0, <(B, 10), (D, 5)>>> <B, <inf, <(C, 1), (D, 2)>>> <C, <inf, <(E, 4)>>> <D, <inf, <(B, 3), (C, 9), (E, 2)>>> <E, <inf, <(A, 7), (C, 6)>>>
Flushed to local disk!

Example: SSSP Parallel BFS in MapReduce

Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>> <A, inf>
<B, <inf, <(C, 1), (D, 2)>>> <B, 10> <B, inf>
<C, <inf, <(E, 4)>>> <C, inf> <C, inf> <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, inf>
<E, <inf, <(A, 7), (C, 6)>>> <E, inf> <E, inf>

Example: SSSP Parallel BFS in MapReduce

Reduce output: <node ID, <dist, adj list>> = Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Flushed to DFS!

Map output: <dest node ID, dist>
<B, 10> <D, 5> <C, 11> <D, 12> <E, inf> <B, 8> <C, 14> <E, 7> <A, inf> <C, inf>
plus the graph-structure records, flushed to local disk!

Example: SSSP Parallel BFS in MapReduce

Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>> <A, inf>
<B, <10, <(C, 1), (D, 2)>>> <B, 10> <B, 8>
<C, <inf, <(E, 4)>>> <C, 11> <C, 14> <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, 12>
<E, <inf, <(A, 7), (C, 6)>>> <E, inf> <E, 7>

Example: SSSP Parallel BFS in MapReduce

Reduce output: <node ID, <dist, adj list>> = Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
Flushed to DFS! ... the rest omitted ...

Stopping Criterion. How many iterations are needed in parallel BFS (equal edge weight case)? Convince yourself: when a node is first discovered, we've found the shortest path. Now answer the question... Six degrees of separation? Practicalities of implementation in MapReduce.

Comparison to Dijkstra. Dijkstra's algorithm is more efficient: at any step it only pursues edges from the minimum-cost path inside the frontier. MapReduce explores all paths in parallel — lots of waste, since useful work is only done at the frontier. Why can't we do better using MapReduce?

Weighted Edges. Now add positive weights to the edges. (Why can't edge weights be negative?) Simple change: the adjacency list now includes a weight w for each edge. In the mapper, emit (m, d + w) instead of (m, d + 1) for each node m.
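In code terms the change is a one-liner. A sketch of the weighted mapper (illustrative Python, not Hadoop code; the adjacency list is assumed to hold (neighbor, weight) pairs):

```python
INF = float("inf")

def mapper(node, value):
    d, adj = value                 # adj holds (neighbor, weight) pairs now
    yield node, ("graph", adj)     # preserve the graph structure
    yield node, ("dist", d)        # keep the node's current best distance
    for m, w in adj:
        yield m, ("dist", d + w)   # d + w instead of d + 1

out = {}
for k, v in mapper("A", (0, [("B", 10), ("D", 5)])):
    out.setdefault(k, []).append(v)
print(out["B"], out["D"])  # [('dist', 10)] [('dist', 5)]
```

Everything else — the shuffle, the min-selecting reducer, and the iteration driver — stays exactly as in the equal-weight case.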

Stopping Criterion. How many iterations are needed in parallel BFS (positive edge weight case)? Graph diameter D. Convince yourself: when a node is first discovered, have we found the shortest path?

Additional Complexities. [Figure: a search frontier around source s containing nodes p, q, r and n1 through n9, illustrating that with weighted edges a node already inside the frontier may later be reached by a cheaper path through nodes outside it.]

Stopping Criterion How many iterations are needed in parallel BFS (positive edge weight case)? Practicalities of implementation in MapReduce

Graphs and MapReduce. Graph algorithms typically involve: performing computations at each node, based on node features, edge features, and local link structure; and propagating computations by traversing the graph. Generic recipe: represent graphs as adjacency lists; perform local computations in the mapper; pass along partial results via outlinks, keyed by destination node; perform aggregation in the reducer on the inlinks to a node; iterate until convergence, controlled by an external driver. Don't forget to pass the graph structure between iterations.
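The "iterate until convergence, controlled by an external driver" step of the recipe can be sketched as a plain loop (illustrative Python; the toy `iteration` function here simply propagates the maximum value one hop along outlinks per pass, standing in for one MapReduce job):

```python
def run_until_convergence(state, iteration, max_iters=100):
    """Rerun one MapReduce-style pass until the state stops changing."""
    for _ in range(max_iters):
        new_state = iteration(state)   # one map + shuffle + reduce pass
        if new_state == state:         # the external driver checks convergence
            return new_state
        state = new_state
    return state

def iteration(state):
    nodes, best = state
    new_best = dict(best)
    for n, outlinks in nodes.items():  # local computation at each node...
        for m in outlinks:             # ...propagated via outlinks
            new_best[m] = max(new_best[m], best[n])
    return (nodes, new_best)           # graph structure passed along unchanged

nodes = {"A": ["B"], "B": ["C"], "C": []}
final = run_until_convergence((nodes, {"A": 3, "B": 0, "C": 0}), iteration)
print(final[1])  # {'A': 3, 'B': 3, 'C': 3}
```

In a real Hadoop driver the convergence check is typically done with counters or by comparing iteration outputs on HDFS, but the control flow is the same: the framework runs one pass, the driver decides whether to run another.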

http://famousphil.com/blog/2010/06/a-hadoop-mapreduce-solution-to-dijkstra%e2%80%99s-algorithm/

public class Dijkstra extends Configured implements Tool {
  public static String OUT = "outfile";
  public static String IN = "inputlarger";

  public static class TheMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      Text word = new Text();
      String line = value.toString();             // input line looks like: 1 0 2:3:
      String[] sp = line.split(" ");              // splits on space
      int distanceadd = Integer.parseInt(sp[1]) + 1;
      String[] PointsTo = sp[2].split(":");
      for (int i = 0; i < PointsTo.length; i++) {
        word.set("VALUE " + distanceadd);         // tells me to look at distance value
        context.write(new LongWritable(Integer.parseInt(PointsTo[i])), word);
        word.clear();
      }
      // pass in current node's distance (if it is the lowest distance)
      word.set("VALUE " + sp[1]);
      context.write(new LongWritable(Integer.parseInt(sp[0])), word);
      word.clear();
      word.set("NODES " + sp[2]);                 // tells me to append on the final tally
      context.write(new LongWritable(Integer.parseInt(sp[0])), word);
      word.clear();
    }
  }

  public static class TheReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    public void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
      String nodes = "UNMODED";
      Text word = new Text();
      int lowest = 10009;                         // stands in for infinity
      for (Text val : values) {                   // values look like "NODES 2:3:" or "VALUE 10"
        String[] sp = val.toString().split(" ");  // splits on space
        // look at the first token to decide what this value carries
        if (sp[0].equalsIgnoreCase("NODES")) {
          nodes = null;
          nodes = sp[1];
        } else if (sp[0].equalsIgnoreCase("VALUE")) {
          int distance = Integer.parseInt(sp[1]);
          lowest = Math.min(distance, lowest);
        }
      }
      word.set(lowest + " " + nodes);
      context.write(key, word);
      word.clear();
    }
  }

  public int run(String[] args) throws Exception {
    // http://code.google.com/p/joycrawler/source/browse/netflixchallenge/src/org/niubility/learning/knn/knndriver.java?r=2
    getConf().set("mapred.textoutputformat.separator", " "); // make the key -> value space-separated (for iterations)
    while (isdone == false) {
      Job job = new Job(getConf());
      job.setJarByClass(Dijkstra.class);
      job.setJobName("dijkstra");
      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      job.setMapperClass(TheMapper.class);
      job.setReducerClass(TheReducer.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(infile));
      FileOutputFormat.setOutputPath(job, new Path(outputfile));
      success = job.waitForCompletion(true);
      // remove the input file
      // http://eclipse.sys-con.com/node/28780/mobile
      if (infile != IN) {
        String indir = infile.replace("part-r-00000", "");
        Path ddir = new Path(indir);
        FileSystem dfs = FileSystem.get(getConf());
        dfs.delete(ddir, true);
      }