MapReduce in Erlang. Tom Van Cutsem


Context

Master's course on Multicore Programming, with a focus on concurrent, parallel and functional programming. Didactic implementation of Google's MapReduce algorithm in Erlang. Goal: teach both Erlang and the MapReduce style.

What is MapReduce?

A programming model to formulate large data-processing jobs in terms of map and reduce computations, with a parallel implementation on a large cluster of commodity hardware.

Characteristics: jobs take input records, process them, and produce output records; there is a massive amount of I/O (reading, writing and transferring large files), while the computations themselves are typically not CPU-intensive.

Dean and Ghemawat (Google), "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004.

MapReduce: why?

Example: index the WWW. 20+ billion web pages x 20 KB = 400+ TB. One computer can read 30-35 MB/sec from disk, so it would take roughly four months just to read the web, and roughly 1000 hard drives to store it. Good news: spread over 1000 machines, this takes less than 3 hours. Bad news: that requires programming work, repeated for every problem. (Source: Michael Kleber, The MapReduce Paradigm, Google Inc.)

MapReduce: fundamental idea

Separate the application-specific computations from the messy details of parallelisation, fault tolerance, data distribution and load balancing. The application-specific computations are expressed as functions that map or reduce data. The functional model allows for easy parallelisation and allows re-execution to be used as the primary mechanism for fault tolerance.

MapReduce: key phases

Read lots of data (key/value records).
Map: extract useful data from each record, generate intermediate key/value pairs.
Group intermediate key/value pairs by key.
Reduce: aggregate, summarize, filter or transform intermediate values with the same key.
Write output key/value pairs.

The same general structure applies to all problems; only Map and Reduce are problem-specific (a sequential sketch of the phases follows below). (Source: Michael Kleber, The MapReduce Paradigm, Google Inc.)
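
To make the phases concrete, here is a minimal sequential sketch in Erlang; it has no parallelism, and the name seq_mapreduce is hypothetical (not from the original slides). The parallel implementation appears later in the deck.

%% A sequential sketch of the key phases (hypothetical helper, not from the slides).
%% Map(K1, V1) returns a list of {K2, V2} pairs; Reduce(K2, [V2]) returns a list of values.
seq_mapreduce(Input, Map, Reduce) ->
    % Map phase: apply Map to every input record
    Mapped = lists:flatmap(fun({K, V}) -> Map(K, V) end, Input),
    % Group phase: collect all intermediate values per key
    Grouped = lists:foldl(fun({K, V}, D) -> dict:append(K, V, D) end,
                          dict:new(), Mapped),
    % Reduce phase: aggregate the grouped values per key
    dict:map(fun(K, Vs) -> Reduce(K, Vs) end, Grouped).

For example, with Map = fun(K, V) -> [{K, V * V}] end and Reduce = fun(_K, Vs) -> [lists:sum(Vs)] end, seq_mapreduce([{a,1},{a,2},{b,3}], Map, Reduce) yields a dictionary mapping a to [5] and b to [9].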

MapReduce: inspiration

In functional programming (e.g. in Clojure; similar in other FP languages):

(map (fn [x] (* x x)) [1 2 3]) => [1 4 9]
(reduce + 0 [1 4 9]) => 14

The Map and Reduce functions of MapReduce are inspired by, but not the same as, the map and fold/reduce operations from functional programming.
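
The same idea expressed in Erlang, the language used in the rest of these slides (a sketch using the standard lists module):

> lists:map(fun(X) -> X * X end, [1, 2, 3]).
[1,4,9]
> lists:foldl(fun(X, Acc) -> X + Acc end, 0, [1, 4, 9]).
14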

Map and Reduce functions

Map takes an input key/value pair and produces a list of intermediate key/value pairs. Input keys/values are not necessarily from the same domain as output keys/values.

map: (K1, V1) -> List[(K2, V2)]
reduce: (K2, List[V2]) -> List[V2]
mapreduce: (List[(K1, V1)], map, reduce) -> Map[K2, List[V2]]

Map and Reduce functions

All values vi with the same key ki are reduced together (remember the invisible grouping step).

map: (k, v) -> [ (k2, v2), (k2', v2'), ... ]
reduce: (k2, [v2, v2', ...]) -> [ v2'', ... ]

[Diagram, built up over three slides: input pairs (a,x), (b,y), (c,z) are mapped to intermediate pairs (u,1), (u,2), (v,1), (v,3), which are grouped by key and reduced to (u,3) and (v,4).] (Source: Michael Kleber, The MapReduce Paradigm, Google Inc.)

Example: word frequencies in web pages

(K1, V1) = (document URL, document contents)
(K2, V2) = (word, frequency)

Map: ("document1", "to be or not to be") emits ("to", 1), ("be", 1), ("or", 1), ...

Reduce:
("be", [1, 1]) -> [2]
("not", [1]) -> [1]
("or", [1]) -> [1]
("to", [1, 1]) -> [2]

Output: ("be", 2), ("not", 1), ("or", 1), ("to", 2)

(Source: Michael Kleber, The MapReduce Paradigm, Google Inc.)
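
In the Emit-callback style used by the Erlang implementation later in this deck, the word-frequency Map and Reduce could look as follows (a sketch; count_words/3 and sum_counts/3 are hypothetical names, not from the original slides):

% Map: emit {Word, 1} for every word in the document contents
count_words(_Url, Contents, Emit) ->
    Words = string:tokens(Contents, " "),
    lists:foreach(fun(Word) -> Emit(Word, 1) end, Words).

% Reduce: sum all counts emitted for the same word
sum_counts(Word, Counts, Emit) ->
    Emit(Word, lists:sum(Counts)).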

More Examples

Count URL access frequency. Map: process logs of web page requests and output <URL, 1>. Reduce: add together all values for the same URL and output <URL, total>.

Distributed grep. Map: emit a line if it matches the pattern. Reduce: the identity function. A sketch of this pair in Erlang follows below.
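
A possible Erlang rendering of the distributed-grep pair, again in the Emit-callback style (grep_map/3, grep_reduce/3 and the hard-coded pattern are hypothetical, for illustration only):

% Map: emit {FileName, Line} for every line that matches the pattern
grep_map(FileName, Lines, Emit) ->
    lists:foreach(fun(Line) ->
        case re:run(Line, "needle") of
            {match, _} -> Emit(FileName, Line);
            nomatch    -> ok
        end
    end, Lines).

% Reduce: the identity function, just pass the matching lines through
grep_reduce(FileName, Lines, Emit) ->
    lists:foreach(fun(Line) -> Emit(FileName, Line) end, Lines).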

More Examples

Inverted index for a collection of (text) documents. Map: emits a sequence of <word, documentID> pairs. Reduce: accepts all pairs for a given word, sorts the documentIDs and returns <word, List(documentID)>. An implementation in Erlang follows later.

Conceptual execution model

map: (K1, V1) -> List[(K2, V2)]
reduce: (K2, List[V2]) -> List[V2]
mapreduce: (List[(K1, V1)], map, reduce) -> Map[K2, List[V2]]

[Diagram: the master assigns work to mappers (one per input pair), collects and sorts their output according to K2, assigns work to reducers (one per intermediate key), and collects the reduced values. The map phase turns input (K1, V1) pairs into intermediate (K2, V2) pairs; the reduce phase turns each intermediate key and its list of values into the output list of values.]

The devil is in the details!

How to partition the data, and how to balance the load among workers?
How to efficiently route all that data between master and workers?
Overlapping the map and the reduce phase (pipelining).
Dealing with crashed workers (the master pings workers and re-assigns their tasks).
Infrastructure: a distributed file system is needed, e.g. GFS.

Erlang in a nutshell

Erlang fact sheet

Invented at Ericsson Research Labs, Sweden. Declarative (functional) core language, inspired by Prolog. Support for concurrency: processes with isolated state and asynchronous message passing. Support for distribution: processes can be distributed over a network.

Sequential programming: factorial

-module(math1).
-export([factorial/1]).

factorial(0) -> 1;
factorial(N) -> N * factorial(N-1).

> math1:factorial(6).
720
> math1:factorial(25).
15511210043330985984000000

Example: an echo process

An echo process echoes any message sent to it.

-module(echo).
-export([start/0, loop/0]).

start() -> spawn(echo, loop, []).

loop() ->
    receive
        {From, Message} ->
            From ! Message,
            loop()
    end.

Id = echo:start(),
Id ! {self(), hello},
receive
    Msg -> io:format("echoed ~w~n", [Msg])
end.

Processes can encapsulate state

Example: a counter process. Note the use of tail recursion.

-module(counter).
-export([start/0, loop/1]).

start() -> spawn(counter, loop, [0]).

loop(Val) ->
    receive
        increment ->
            loop(Val + 1);
        {From, value} ->
            From ! {self(), Val},
            loop(Val)
    end.
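
A possible interaction with the counter process (a usage sketch, not part of the original slides):

Pid = counter:start(),
Pid ! increment,
Pid ! increment,
Pid ! {self(), value},
receive
    {Pid, Val} -> io:format("counter is ~w~n", [Val])   % prints: counter is 2
end.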

MapReduce in Erlang

A naive parallel implementation

The Map and Reduce functions will be applied in parallel: a mapper worker process is spawned for each {K1, V1} in the input, and a reducer worker process is spawned for each intermediate {K2, [V2]}.

%% Input = [{K1, V1}]
%% Map(K1, V1, Emit) -> emits a stream of {K2, V2} tuples
%% Reduce(K2, List[V2], Emit) -> emits a stream of {K2, V2} tuples
%% Returns a Map[K2, List[V2]]
mapreduce(Input, Map, Reduce) ->

A naive parallel implementation

% Input = [{K1, V1}]
% Map(K1, V1, Emit) -> emits a stream of {K2, V2} tuples
% Reduce(K2, List[V2], Emit) -> emits a stream of {K2, V2} tuples
% Returns a Map[K2, List[V2]]
mapreduce(Input, Map, Reduce) ->
    S = self(),
    Pid = spawn(fun() -> master(S, Map, Reduce, Input) end),
    receive
        {Pid, Result} -> Result
    end.

A naive parallel implementation

[Diagram: the master spawns a mapper (M) per {K1, V1} input pair via spawn_workers, collects the emitted {K2, V2} pairs with collect_replies into groups {K2, [V2, V2', ...]}, then spawns a reducer (R) per intermediate key via spawn_workers and collects the reduced {K2, V2''} pairs with collect_replies.]

A naive parallel implementation

master(Parent, Map, Reduce, Input) ->
    process_flag(trap_exit, true),
    MasterPid = self(),
    % Create the mapper processes, one for each element in Input
    spawn_workers(MasterPid, Map, Input),
    M = length(Input),
    % Wait for M Map processes to terminate
    Intermediate = collect_replies(M, dict:new()),
    % Create the reducer processes, one for each intermediate Key
    spawn_workers(MasterPid, Reduce, dict:to_list(Intermediate)),
    R = dict:size(Intermediate),
    % Wait for R Reduce processes to terminate
    Output = collect_replies(R, dict:new()),
    Parent ! {self(), Output}.

A naive parallel implementation

% Pairs is a list of the form [{K, V}, ...]
spawn_workers(MasterPid, Fun, Pairs) ->
    lists:foreach(fun({K, V}) ->
        spawn_link(fun() -> worker(MasterPid, Fun, {K, V}) end)
    end, Pairs).

% A worker must send {K2, V2} messages to the master and then terminate.
% Fun calls Emit(K2, V2) for each pair it wants to produce.
worker(MasterPid, Fun, {K, V}) ->
    Fun(K, V, fun(K2, V2) -> MasterPid ! {K2, V2} end).

A naive parallel implementation

% Collect and merge {Key, Value} messages from N processes.
% When N processes have terminated, return a dictionary
% of {Key, [Value]} pairs.
collect_replies(0, Dict) ->
    Dict;
collect_replies(N, Dict) ->
    receive
        {Key, Val} ->
            Dict1 = dict:append(Key, Val, Dict),
            collect_replies(N, Dict1);
        {'EXIT', _Who, _Why} ->
            collect_replies(N-1, Dict)
    end.
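
Putting the pieces together, a small end-to-end usage sketch (assuming the functions above are compiled into a module, here hypothetically called mapreduce):

% Square every value per key, then sum the squares per key
Square = fun(K, V, Emit) -> Emit(K, V * V) end,
Sum    = fun(K, Vs, Emit) -> Emit(K, lists:sum(Vs)) end,
Result = mapreduce:mapreduce([{a, 1}, {a, 2}, {b, 3}], Square, Sum),
dict:to_list(Result).
% => [{a,[5]},{b,[9]}]   (key order may vary)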

Example: text indexing

Example input files:

Filename      Contents
/test/dogs    [rover,jack,buster,winston].
/test/cats    [zorro,daisy,jaguar].
/test/cars    [rover,jaguar,ford].

Input to mapreduce: a list of {Idx, FileName} pairs:

Idx   Filename
1     /test/dogs
2     /test/cats
3     /test/cars

Example: text indexing

Goal: build an inverted index:

Word      File Index
rover     dogs, cars
jack      dogs
buster    dogs
winston   dogs
zorro     cats
daisy     cats
jaguar    cats, cars
ford      cars

Querying the index by word is now straightforward.

Example: text indexing

Building the inverted index using mapreduce:
Map(Idx, File): emit a {Word, Idx} tuple for each Word in File.
Reduce(Word, Files): filter out duplicate Files.

[Diagram: Map(1, "dogs") emits [{rover,1}, ...], Map(2, "cats") emits [{zorro,2}, ...], Map(3, "cars") emits [{rover,3}, ...]; reducers such as Reduce(zorro, [2]) and Reduce(rover, [1,3]) then produce the inverted index [{rover, ["dogs","cars"]}, {zorro, ["cats"]}, ...].]

Text indexing using the parallel implementation

index(DirName) ->
    NumberedFiles = list_numbered_files(DirName),
    mapreduce(NumberedFiles, fun find_words/3, fun remove_duplicates/3).

% the Map function
find_words(Index, FileName, Emit) ->
    {ok, [Words]} = file:consult(FileName),
    lists:foreach(fun (Word) -> Emit(Word, Index) end, Words).

% the Reduce function
remove_duplicates(Word, Indices, Emit) ->
    UniqueIndices = sets:to_list(sets:from_list(Indices)),
    lists:foreach(fun (Index) -> Emit(Word, Index) end, UniqueIndices).
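
The slides use list_numbered_files/1 without showing it; a plausible sketch (an assumption, not the original helper) that pairs each file in the directory with a numeric index:

% Return [{Idx, FileName}] for every file in DirName (hypothetical helper)
list_numbered_files(DirName) ->
    {ok, Files} = file:list_dir(DirName),
    FullPaths = [filename:join(DirName, F) || F <- Files],
    lists:zip(lists:seq(1, length(FullPaths)), FullPaths).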

Text indexing using the parallel implementation

> dict:to_list(index(test)).
[{rover,["test/dogs","test/cars"]},
 {buster,["test/dogs"]},
 {jaguar,["test/cats","test/cars"]},
 {ford,["test/cars"]},
 {daisy,["test/cats"]},
 {jack,["test/dogs"]},
 {winston,["test/dogs"]},
 {zorro,["test/cats"]}]

Summary

MapReduce: a programming model that separates application-specific map and reduce computations from parallel-processing concerns. The functional model is easy to parallelise and supports fault tolerance via re-execution.
Erlang: a functional core language with concurrent processes and asynchronous message passing.
MapReduce in Erlang: a didactic implementation. A simple idea, with arbitrarily complex implementations.