Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA)

Overview
Motivation: the need for handling big data; a new programming paradigm
Review of functional programming: MapReduce uses this abstraction
MapReduce details

Motivation: Large Scale Data Processing Want to process lots of data ( > 1 TB) Want to parallelize across hundreds/thousands of CPUs Want to make this easy


Divide and Conquer
[Diagram: the work is partitioned into units w1, w2, w3; each unit is handled by a worker producing results r1, r2, r3; the partial results are combined into the final result]

Parallelization Challenges How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? What is the common theme of all of these problems?

Common Theme? Parallelization problems arise from: Communication between workers (e.g., to exchange state) Access to shared resources (e.g., data) Thus, we need a synchronization mechanism


Managing Multiple Workers
Difficult because:
We don't know the order in which workers run
We don't know when workers interrupt each other
We don't know the order in which workers access shared data
Thus, we need:
Semaphores (lock, unlock)
Condition variables (wait, notify, broadcast)
Barriers
Still, lots of problems: deadlock, livelock, race conditions...
Dining philosophers, sleeping barbers, cigarette smokers...
Moral of the story: be careful!

Current Tools
Programming models: shared memory (pthreads), message passing (MPI)
Design patterns: master-slaves, producer-consumer flows, shared work queues
[Diagrams: memory shared by processes P1-P5; message passing among processes P1-P5; a master coordinating slaves; producers and consumers connected by a shared work queue]

Concurrency Challenge!
Concurrency is difficult to reason about
Concurrency is even more difficult to reason about:
At the scale of datacenters (even across datacenters)
In the presence of failures
In terms of multiple interacting services
Not to mention debugging
The reality: lots of one-off solutions and custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything

What's the point?
It's all about the right level of abstraction
The von Neumann architecture has served us well, but is no longer appropriate for the multi-core/cluster environment
Hide system-level details from the developers
No more race conditions, lock contention, etc.
Separating the what from the how
Developer specifies the computation that needs to be performed
Execution framework ("runtime") handles actual execution
The datacenter is the computer!

Big Ideas
Scale out, not up: limits of SMP and large shared-memory machines
Move processing to the data: clusters have limited bandwidth
Process data sequentially, avoid random access: seeks are expensive, disk throughput is reasonable
Seamless scalability: from the mythical man-month to the tradable machine-hour

Overview
Motivation: the need for handling big data; a new programming paradigm
Review of functional programming: MapReduce uses this abstraction
MapReduce details

MapReduce: Roots in Functional Programming
[Diagram: Map applies a function f independently to each element of the input; Reduce applies a function g to aggregate groups of intermediate results]

Functional Programming Review Functional operations do not modify data structures They always create new ones Original data still exists in unmodified form Data flows are implicit in program design Order of operations does not matter

Functional Programming Review
fun foo(l: int list) = sum(l) + mul(l) + length(l)
Order of sum() and mul(), etc. does not matter: they do not modify l

Updates Don't Modify Structures
fun append(x, lst) = let val lst' = reverse lst in reverse (x :: lst') end
The append() function above reverses a list, adds the new element to the front, and returns all of that reversed again, which appends the item to the end. But it never modifies lst!
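The same idea can be sketched in Python (an illustration added here, not from the slides): build and return a new list, and the argument list is never modified.

def append(x, lst):
    # Return a new list with x at the end; lst itself is untouched.
    return lst + [x]

a = [1, 2]
b = append(3, a)
print(a, b)  # [1, 2] [1, 2, 3]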

Functions Can Be Used As Arguments fun DoDouble(f, x) = f (f x) It does not matter what f does to its argument; DoDouble() will do it twice.

Map
map f lst: ('a -> 'b) -> ('a list) -> ('b list)
Creates a new list by applying f to each element of the input list; returns output in order.

Reduce (Fold)
fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b
Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list.

fold left vs. fold right
Order of list elements can be significant
Fold left moves left-to-right across the list
Fold right moves right-to-left
SML implementation:
fun foldl f a [] = a
  | foldl f a (x::xs) = foldl f (f(x, a)) xs
fun foldr f a [] = a
  | foldr f a (x::xs) = f(x, (foldr f a xs))
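For comparison, a minimal Python sketch of the same two folds (the helper names foldl and foldr are illustrative, mirroring the SML definitions above):

def foldl(f, acc, xs):
    # Combine from the left, threading the accumulator forward: f(x3, f(x2, f(x1, acc))).
    for x in xs:
        acc = f(x, acc)
    return acc

def foldr(f, acc, xs):
    # Combine from the right: f(x1, f(x2, f(x3, acc))).
    for x in reversed(xs):
        acc = f(x, acc)
    return acc

# With a non-commutative f the difference is visible:
print(foldl(lambda x, a: a + [x], [], [1, 2, 3]))  # [1, 2, 3]
print(foldr(lambda x, a: a + [x], [], [1, 2, 3]))  # [3, 2, 1]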

Example
fun foo(l: int list) = sum(l) + mul(l) + length(l)
How can we implement this?

Example (Solved)
fun foo(l: int list) = sum(l) + mul(l) + length(l)
fun sum(lst) = foldl (fn (x,a) => x+a) 0 lst
fun mul(lst) = foldl (fn (x,a) => x*a) 1 lst
fun length(lst) = foldl (fn (x,a) => 1+a) 0 lst

A More Complicated Fold Problem
Given a list of numbers, how can we generate a list of partial sums?
e.g.: [1, 4, 8, 3, 7, 9] → [0, 1, 5, 13, 16, 23, 32]
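One possible solution, sketched in Python with a left fold (an illustration added here; an SML foldl solution would be analogous): carry the list of partial sums as the accumulator and append the running total at each step.

from functools import reduce

def partial_sums(xs):
    # The accumulator holds the partial sums so far, starting from [0];
    # each step appends (last partial sum + next element).
    return reduce(lambda acc, x: acc + [acc[-1] + x], xs, [0])

print(partial_sums([1, 4, 8, 3, 7, 9]))  # [0, 1, 5, 13, 16, 23, 32]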

Implicit Parallelism in Map
In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
Because the applications of f to different elements are independent of one another, we can reorder or parallelize their execution
This is the secret that MapReduce exploits
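A minimal Python illustration of this point (added here as a sketch, not part of the original slides): because the function being mapped is pure, the elements can be processed by a pool of worker processes and the result is the same as a sequential map.

from multiprocessing import Pool

def square(x):
    # A pure function: no shared state, so calls can run in any order or in parallel.
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))  # identical to list(map(square, range(10)))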

MapReduce Details

Typical Large-Data Problem
Iterate over a large number of records
Extract something of interest from each (Map)
Shuffle and sort intermediate results
Aggregate intermediate results (Reduce)
Generate final output
Key idea: provide a functional abstraction for these two operations (Dean and Ghemawat, OSDI 2004)

Word Count Example
[Diagram: input lines "the quick brown fox", "the fox ate the mouse", and "how now brown cow" pass through Map, Shuffle & Sort, and Reduce; the mappers emit pairs such as (the, 1), (quick, 1), (brown, 1), (fox, 1); the reducers output brown, 2; fox, 2; how, 1; now, 1; quick, 1; the, 3; ate, 1; cow, 1; mouse, 1]

Programming Model
Borrows from functional programming
Users implement an interface of two functions:
map (in_key, in_value) -> (out_key, intermediate_value) list
reduce (out_key, intermediate_value list) -> out_value list

MapReduce
Programmers specify two functions:
map (k, v) → [(k', v')]
reduce (k', [v']) → [(k', v')]
All values with the same key are sent to the same reducer
The execution framework handles everything else

Map Records from the data source (lines out of files, rows of a database, etc) are fed into the map function as key*value pairs: e.g., (filename, line). map() produces one or more intermediate values along with an output key from the input.

Map map (in_key, in_value) -> (out_key, intermediate_value) list

Reduce After the map phase is over, all the intermediate values for a given output key are combined together into a list reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key)

Reduce
reduce (out_key, intermediate_value list) -> out_value list

[Diagram: input key*value pairs from data store 1 through data store n are fed to map tasks; each map emits (key 1, values...), (key 2, values...), (key 3, values...); a barrier aggregates intermediate values by output key; each reduce task consumes one key's intermediate values (key 1, key 2, key 3) and produces the final values for that key]

Parallelism
map() functions run in parallel, creating different intermediate values from different input data sets
reduce() functions also run in parallel, each working on a different output key
All values are processed independently
Bottleneck: the reduce phase can't start until the map phase is completely finished.

Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);
reduce(String output_key, Iterator<int> intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += v;
  Emit(result);
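The same example can be simulated end to end in a few lines of Python (a self-contained sketch of the programming model only, not of the distributed runtime; map_fn, reduce_fn, and mapreduce are illustrative names):

from collections import defaultdict

def map_fn(doc_name, doc_text):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in doc_text.split()]

def reduce_fn(word, counts):
    # Sum the partial counts for one word.
    return (word, sum(counts))

def mapreduce(documents):
    # Map phase: run map_fn over every input record.
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_fn(name, text))
    # Shuffle & sort: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce_fn call per key, keys visited in sorted order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

docs = {"d1": "the quick brown fox", "d2": "the fox ate the mouse"}
print(mapreduce(docs))
# [('ate', 1), ('brown', 1), ('fox', 2), ('mouse', 1), ('quick', 1), ('the', 3)]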

Optimizations
No reduce can start until map is complete: a single slow disk controller can rate-limit the whole process
The master redundantly executes slow-moving map tasks and uses the results of whichever copy finishes first
Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?

Combining Phase
Runs on mapper nodes after the map phase
A mini-reduce, only on local map output
Used to save bandwidth before sending data to the full reducer
The reducer can be used as the combiner if the operation is commutative and associative
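A sketch of what the combining phase buys us for word count (illustrative only, not from the slides): on each mapper, repeated keys in the local map output are pre-aggregated before anything crosses the network.

from collections import defaultdict

def combine(local_map_output):
    # Mini-reduce over a single mapper's output: sum counts per key locally.
    combined = defaultdict(int)
    for word, count in local_map_output:
        combined[word] += count
    return list(combined.items())

# This mapper saw "the" twice, so it ships ('the', 2) once instead of ('the', 1) twice.
print(combine([("the", 1), ("fox", 1), ("ate", 1), ("the", 1), ("mouse", 1)]))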

Combiner, graphically
[Diagram: on one mapper machine, the combiner replaces the raw map output, which contains repeated keys, with locally aggregated pairs before they are sent to the reducer]

[Diagram: input pairs (k1, v1) ... (k6, v6) pass through four map tasks, producing intermediate pairs a:1, b:2, c:3, c:6, a:5, c:2, b:7, c:8; shuffle and sort aggregates values by key, giving a:[1, 5], b:[2, 7], c:[2, 3, 6, 8]; three reduce tasks then produce the outputs (r1, s1), (r2, s2), (r3, s3)]

MapReduce
Programmers specify two functions:
map (k, v) → [(k', v')]
reduce (k', [v']) → [(k', v')]
All values with the same key are sent to the same reducer
The execution framework handles everything else
What's "everything else"?

MapReduce Runtime Handles scheduling Assigns workers to map and reduce tasks Handles data distribution Moves processes to data Handles synchronization Gathers, sorts, and shuffles intermediate data Handles errors and faults Detects worker failures and restarts Everything happens on top of a distributed FS

MapReduce
Programmers specify two functions:
map (k, v) → [(k', v')]
reduce (k', [v']) → [(k', v')]
All values with the same key are reduced together
The execution framework handles everything else
Not quite... usually, programmers also specify:
partition (k', number of partitions) → partition for k'
Often a simple hash of the key, e.g., hash(k') mod n
Divides up the key space for parallel reduce operations
combine (k', [v']) → [(k', v')]
Mini-reducers that run in memory after the map phase
Used as an optimization to reduce network traffic
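A sketch of the default partitioning idea just described, hash(k') mod n (the function name and the choice of MD5 as a stable hash are assumptions for illustration, not Hadoop's actual partitioner code):

import hashlib

def partition(key, num_partitions):
    # A stable hash of the key (Python's built-in hash() is randomized per process
    # for strings), taken modulo the number of reduce partitions.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

print(partition("the", 3))  # every occurrence of "the" maps to the same reducer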

[Diagram: input pairs (k1, v1) ... (k6, v6) pass through four map tasks producing a:1, b:2, c:3, c:6, a:5, c:2, b:7, c:8; a combiner on each mapper pre-aggregates local values (e.g., c:3 and c:6 become c:9); a partitioner assigns keys to reducers; shuffle and sort aggregates values by key, giving a:[1, 5], b:[2, 7], c:[2, 9, 8]; three reduce tasks produce (r1, s1), (r2, s2), (r3, s3)]

Two more details Barrier between map and reduce phases But we can begin copying intermediate data earlier Keys arrive at each reducer in sorted order No enforced ordering across reducers
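The sorted-keys behavior can be imitated with a key-sort followed by grouping (an illustrative sketch, not the framework's code): each reducer walks its keys in order, seeing all values for one key before moving to the next.

from itertools import groupby
from operator import itemgetter

intermediate = [("fox", 1), ("the", 1), ("ate", 1), ("the", 1)]
# Sort by key, then group: the reducer sees 'ate', then 'fox', then 'the'.
for key, pairs in groupby(sorted(intermediate, key=itemgetter(0)), key=itemgetter(0)):
    print(key, [v for _, v in pairs])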

MapReduce can refer to:
The programming model
The execution framework (aka "runtime")
The specific implementation
Usage is usually clear from context!

"Hello World": Word Count
Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);
Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);

MapReduce Automatic parallelization & distribution Fault-tolerant Provides status and monitoring tools Clean abstraction for programmers

Word Count Execution
[Diagram: inputs "the quick brown fox", "the fox ate the mouse", and "how now brown cow" flow through Map, Shuffle & Sort, and Reduce; one reducer outputs brown, 2; fox, 2; how, 1; now, 1; the, 3; the other outputs ate, 1; cow, 1; mouse, 1; quick, 1]

An Optimization: The Combiner
A combiner is a local aggregation function for repeated keys produced by the same map
For associative operations like sum, count, max
Decreases the size of intermediate data
Example: local counting for Word Count:
def combiner(key, values):
    output(key, sum(values))

Word Count with Combiner
[Diagram: same inputs as before, but each mapper runs a combiner on its local output; the mapper for "the fox ate the mouse" emits the, 2 instead of two separate the, 1 pairs, reducing the data shuffled to the reducers; the final output is unchanged]

[Diagram, adapted from Dean and Ghemawat, OSDI 2004: (1) the user program submits the job to the master; (2) the master schedules map and reduce tasks onto workers; (3) map workers read the input splits (split 0 through split 4); (4) map workers write intermediate files to local disk; (5) reduce workers remotely read that intermediate data; (6) reduce workers write the output files (output file 0, output file 1)]

MapReduce Conclusions MapReduce has proven to be a useful abstraction Greatly simplifies large-scale computations at Google Functional programming paradigm can be applied to large-scale applications Fun to use: focus on problem, let library deal w/ messy details

MapReduce in the Real World Implementations Who works with this approach

MapReduce Implementations Google has a proprietary implementation in C++ Bindings in Java, Python Hadoop is an open-source implementation in Java Development led by Yahoo, used in production Now an Apache project Rapidly expanding software ecosystem Lots of custom research implementations For GPUs, cell processors, etc.

Hadoop History
Dec 2004: Google GFS paper published
July 2005: Nutch uses MapReduce
Feb 2006: Becomes a Lucene subproject
Apr 2007: Yahoo! runs it on a 1000-node cluster
Jan 2008: Becomes an Apache top-level project
Jul 2008: A 4000-node test cluster
Sept 2008: Hive becomes a Hadoop subproject
Feb 2009: The Yahoo! Search Webmap is a Hadoop application running on a Linux cluster with more than 10,000 cores, producing data used in every Yahoo! Web search query
June 2009: On June 10, 2009, Yahoo! made available the source code of the version of Hadoop it runs in production
2010: Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage; on July 27, 2011 they announced the data had grown to 30 PB

Who uses Hadoop?
Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet, Veoh, Yahoo!

Typical Hadoop Cluster
Aggregation switch connecting per-rack switches
40 nodes/rack, 1000-4000 nodes in cluster
1 Gbps bandwidth within a rack, 8 Gbps out of the rack
Node specs (Yahoo terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)

Typical Hadoop Cluster Image from http://wiki.apache.org/hadoop-data/attachments/hadooppresentations/attachments/aw-apachecon-eu-2009.pdf

Example vs. Actual Source Code Example is written in pseudo-code Actual implementation is in C++, using a MapReduce library Bindings for Python and Java exist via interfaces True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)

Locality Master program divvies up tasks based on location of data: tries to have map() tasks on same machine as physical file data, or at least same rack map() task inputs are divided into 64 MB blocks: same size as Google File System chunks

Fault Tolerance Master detects worker failures Re-executes completed & in-progress map() tasks Re-executes in-progress reduce() tasks Master notices particular input key/values cause crashes in map(), and skips those values on re-execution. Effect: Can work around bugs in third-party libraries!