Introduction to MapReduce

Introduction to MapReduce. Jacqueline Chame, CS503 Spring 2014. Slides based on: MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI 2004; and MapReduce: The Programming Model and Practice, Jerry Zhao and Jelena Pjesivac-Grbovic, SIGMETRICS 2009 tutorial, research.google.com/pubs/archive/36249.pdf

MapReduce is a programming model and implementation developed at Google for processing and generating large datasets. Many real-world applications can be expressed in this model. Parallelism: the same computation is performed by different CPUs on different pieces of the input dataset. Programs are automatically parallelized and executed on a large cluster of machines.

Motivation. Before MapReduce, Google developers implemented hundreds of special-purpose computations to process large amounts of data. Most were conceptually simple, but the input data was so large that it had to be distributed across hundreds or thousands of machines, and developers had to figure out how to parallelize the computation, distribute the data, deal with hardware failures, and so on. MapReduce is an abstraction that allows programmers to write simple computations while the run-time system (the MapReduce library) hides the details of parallelization, data distribution, load balancing, and fault tolerance.

MapReduce Programming Model. Inspired by (but not equivalent to!) the map and reduce primitives of functional programming languages such as Lisp: map takes as input a function and a sequence of values and applies the function to each value in the sequence; reduce takes as input a sequence of values and combines all values using a binary operator.
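To make the functional-programming inspiration concrete, here is a minimal C++ sketch (not part of the original slides): std::transform plays the role of map and std::accumulate the role of reduce.

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
      std::vector<int> values = {1, 2, 3, 4};

      // "map": apply a function (here, squaring) to each value in the sequence
      std::vector<int> squares(values.size());
      std::transform(values.begin(), values.end(), squares.begin(),
                     [](int x) { return x * x; });

      // "reduce": combine all values with a binary operator (here, +)
      int sum = std::accumulate(squares.begin(), squares.end(), 0);
      std::cout << sum << std::endl;  // prints 30
    }

MapReduce borrows this vocabulary but, as the slide notes, is not equivalent: its map and reduce operate on key/value pairs, and the library interposes a grouping step between them.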

MapReduce Programming Model. A computation takes a set of input <key, value> pairs and produces a set of output <key, value> pairs. map (written by the user) takes an input <key, value> pair and produces a set of intermediate <key, value> pairs. The MapReduce library groups all intermediate values associated with the same intermediate key I and sorts the intermediate pairs by key. reduce (written by the user) takes an intermediate key I and the set of values for that key (<I, v1, v2, ..., vn>) and merges the values together to form a smaller set of values; typically it produces zero or one output value per invocation.

MapReduce execution model: map(k:v) → list(k′:v′); reduce(k′: list(v′)) → list(v″). The run-time system (the MapReduce library) feeds the input to the map tasks, groups the resulting (k′:v′) pairs by k′, and feeds each group to a reduce task, which produces the output.
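As a rough illustration (not from the slides), the following self-contained C++17 sketch runs the whole model in a single process: a user map function emits intermediate pairs, a std::map stands in for the library's group-by-key step, and a user reduce function folds each value list.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, int>;

    // Toy single-process MapReduce: map phase, group-by-key, reduce phase.
    std::map<std::string, int> RunMapReduce(
        const std::vector<std::string>& inputs,
        const std::function<std::vector<KV>(const std::string&)>& map_fn,
        const std::function<int(const std::vector<int>&)>& reduce_fn) {
      // Map phase: each input record yields intermediate (k', v') pairs.
      std::map<std::string, std::vector<int>> groups;
      for (const auto& record : inputs)
        for (const auto& [k, v] : map_fn(record)) groups[k].push_back(v);
      // Reduce phase: one call per intermediate key, over all its values.
      std::map<std::string, int> output;
      for (const auto& [k, vs] : groups) output[k] = reduce_fn(vs);
      return output;
    }

    int main() {
      std::vector<std::string> docs = {"the quick brown fox",
                                       "jumps over the lazy fox"};
      // User map: emit (word, 1) for every word in the document.
      auto mapper = [](const std::string& doc) {
        std::vector<KV> pairs;
        std::istringstream words(doc);
        std::string w;
        while (words >> w) pairs.emplace_back(w, 1);
        return pairs;
      };
      // User reduce: sum the counts for one word.
      auto reducer = [](const std::vector<int>& counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
      };
      for (const auto& [word, count] : RunMapReduce(docs, mapper, reducer))
        std::cout << word << ": " << count << "\n";
    }

Running it prints each word with its count, mirroring the shuffle-and-sort behavior of the real library minus the distribution and fault tolerance.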

Example: Word Count. Count the number of occurrences of each word in a large collection of documents:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

The user also writes code to fill in a MapReduce specification object with the names of the input and output files and optional tuning parameters. The user code is linked with the MapReduce library.

Execution example. Input: "The quick brown fox jumps over the lazy fox", divided into splits, each processed by a mapper. Map: the mappers emit (brown,1), (fox,1), (quick,1), (the,1), (jumps,1), (over,1), (lazy,1), (fox,1), (the,1). Shuffle and sort (done by the MapReduce library). Reduce: one reducer outputs (brown,1), (fox,2), (jumps,1), (lazy,1); the other outputs (over,1), (quick,1), (the,2).

Word count code:

    #include "mapreduce/mapreduce.h"

    // User's map function
    class WordCounter : public Mapper {
     public:
      virtual void Map(const MapInput& input) {
        const string& text = input.value();
        const int n = text.size();
        for (int i = 0; i < n; ) {
          // Skip past leading whitespace
          while ((i < n) && isspace(text[i])) i++;
          // Find word end
          int start = i;
          while ((i < n) && !isspace(text[i])) i++;
          if (start < i)
            Emit(text.substr(start, i - start), "1");
        }
      }
    };
    REGISTER_MAPPER(WordCounter);

    // User's reduce function
    class Adder : public Reducer {
      virtual void Reduce(ReduceInput* input) {
        // Iterate over all entries with the same
        // key and add the values
        int64 value = 0;
        while (!input->done()) {
          value += StringToInt(input->value());
          input->NextValue();
        }
        // Emit sum for input->key()
        Emit(IntToString(value));
      }
    };
    REGISTER_REDUCER(Adder);

    int main(int argc, char** argv) {
      ParseCommandLineFlags(argc, argv);
      MapReduceSpecification spec;
      // Store list of input files into "spec"
      for (int i = 1; i < argc; i++) {
        MapReduceInput* input = spec.add_input();
        input->set_format("text");
        input->set_filepattern(argv[i]);
        input->set_mapper_class("WordCounter");
      }
      // Specify the output files:
      //   /gfs/test/freq-00000-of-00100
      //   /gfs/test/freq-00001-of-00100
      //   ...
      MapReduceOutput* out = spec.output();
      out->set_filebase("/gfs/test/freq");
      out->set_num_tasks(100);
      out->set_format("text");
      out->set_reducer_class("Adder");
      // Optional: do partial sums within map tasks
      // to save network bandwidth
      out->set_combiner_class("Adder");
      // Tuning parameters: use at most 2000 machines
      // and 100 MB of memory per task
      spec.set_machines(2000);
      spec.set_map_megabytes(100);
      spec.set_reduce_megabytes(100);
      // Now run it
      MapReduceResult result;
      if (!MapReduce(spec, &result)) abort();
      // Done: 'result' structure contains info about
      // counters, time taken, number of machines etc.
      return 0;
    }

More examples. Distributed grep: map emits a line if it matches a supplied pattern; reduce is the identity function (it just copies the intermediate data to the output). Count of URL access frequency: map processes logs of web page requests and emits <URL, 1>; reduce adds together all values for the same URL and emits a <URL, total_count> pair. Reverse web-link graph: map emits <target, source> pairs for each link to a target URL found in a page named source; reduce concatenates the list of all source URLs associated with a given target URL and emits <target, list(source)>.
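To make the reverse web-link graph example concrete, here is a single-process C++ sketch (the page names and links are invented for illustration): the map step emits <target, source> for each link, and grouping by target collects exactly the list the reducer would emit.

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
      // (source page, target URL) pairs, as found while parsing pages.
      std::vector<std::pair<std::string, std::string>> links = {
          {"pageA", "http://x"}, {"pageB", "http://x"}, {"pageA", "http://y"}};

      // map: emit <target, source>; the shuffle groups pairs by target.
      std::map<std::string, std::vector<std::string>> by_target;
      for (const auto& [source, target] : links)
        by_target[target].push_back(source);

      // reduce: emit <target, list(source)>.
      for (const auto& [target, sources] : by_target) {
        std::cout << target << " <- ";
        for (const auto& s : sources) std::cout << s << " ";
        std::cout << "\n";
      }
    }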

A MapReduce implementation is targeted to its execution environment. At Google: large clusters of commodity PCs connected by Ethernet; a cluster has hundreds or thousands of machines, so machine failures are common. Storage: inexpensive disks attached directly to individual machines, with an in-house distributed file system on top. Jobs (sets of tasks) are mapped by a scheduler to available machines within a cluster. Different environments call for different implementations.

Execution overview. [Figure: the user program forks a master and workers; the input is divided into splits 0-4; map workers write intermediate files on their local disks; reduce workers write output files 0-1.]

Execution. The MapReduce library splits the input data into a set of M pieces and starts up copies of the user program on the cluster (fork). [Same figure.]

Execution. The master copy assigns map and reduce tasks to worker machines (assign map / assign reduce). [Same figure.]

Execution: map. A map task reads from its input split and passes each <key, value> pair to the user's Map function. [Same figure.]

Execution. Intermediate <key, value> pairs produced by Map are written to local disk (local write), partitioned into R pieces by intermediate key. [Same figure.]

Execution: reduce. Reducers read the intermediate data from the local disks of the mappers (remote read) and sort it by key into <key_i, list(v)> groups. [Same figure.]

Execution: reduce. Reduce iterates over the sorted intermediate data and passes each <key_i, list(values)> to the user's Reduce function. [Same figure.]

Execution: output. The outputs of the user's Reduce function are appended to the final output files. [Same figure.]

Data locality. Input data is stored on the local disks of the cluster machines, with several copies of each block of data on different machines. The MapReduce master tries to assign a map task to a machine that contains a copy of the task's input data, or to a machine near one (on the same network switch). As a result, most input data is read locally and consumes no network bandwidth.

Fault tolerance. The MapReduce library is designed to help process very large data sets, so it must handle machine failures. Worker failure: completed map tasks are reset to their initial idle state, because their output data lives on the failed machine's local disk and is unavailable; in-progress map and reduce tasks are also reset to idle; idle tasks are eligible for rescheduling. Reduce tasks are notified when a map task is rescheduled, so they read its data from the new worker.

Worker failure. [Figure: the execution diagram with a failed worker.]

Recovery by re-execution. [Figure: the failed worker's tasks are re-executed on another machine.]

Fault tolerance: master failure. The master performs periodic checkpoints of its data structures; upon failure, a new master copy can be started from the checkpointed state. Master data: the status of each task (idle, in-progress, completed) and the machine id of each non-idle task; the locations and sizes of the R intermediate data files generated by each map task.

Master failure. [Figure: the execution diagram with a failed master.]

Recovery from the master's execution log. [Figure: the execution diagram with the master's execution log stored on the Google File System.]

Some useful extensions: partitioning. The default partitioning function uses hashing (e.g., hash(key) mod R). The library also supports user-provided partitioning functions, e.g., hash(hostname(urlkey)) mod R, so that all URLs from the same host end up in the same output file.
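A small C++ sketch of the two schemes (illustrative only: Hostname() here is a simplified stand-in, not the library's actual API):

    #include <functional>
    #include <iostream>
    #include <string>

    // Default partitioning: hash(key) mod R.
    int DefaultPartition(const std::string& key, int R) {
      return std::hash<std::string>{}(key) % R;
    }

    // Simplified hostname extraction: strip the scheme and the path.
    std::string Hostname(const std::string& url) {
      size_t start = url.find("://");
      start = (start == std::string::npos) ? 0 : start + 3;
      size_t end = url.find('/', start);
      return url.substr(start, end - start);
    }

    // Custom partitioning: hash(hostname(urlkey)) mod R, so all URLs
    // from the same host land in the same reduce partition / output file.
    int UrlPartition(const std::string& url, int R) {
      return std::hash<std::string>{}(Hostname(url)) % R;
    }

    int main() {
      const int R = 4;  // number of reduce tasks
      std::cout << UrlPartition("http://example.com/a", R) << "\n";
      std::cout << UrlPartition("http://example.com/b", R) << "\n";  // same value
    }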

Some useful extensions: combiner function. A combiner does partial merging of the data produced by a map function before it is sent to the reducers, which decreases the amount of data that needs to be read (over the network) by reduce tasks. Example: in word count, Map typically emits hundreds or thousands of <the, 1> pairs, all of which would be sent over the network and added by a Reduce function. The combiner function is executed on each machine that performs a map task, and its output is stored in the intermediate files. This speeds up some classes of MapReduce computations.
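A toy C++ illustration (with made-up pairs) of what a combiner buys: partial sums on the map machine shrink the intermediate data before it ever crosses the network.

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
      // Intermediate pairs emitted by one map task for word count.
      std::vector<std::pair<std::string, int>> emitted = {
          {"the", 1}, {"quick", 1}, {"the", 1}, {"the", 1}, {"fox", 1}};

      // Combiner: merge pairs with the same key locally (partial sums).
      std::map<std::string, int> combined;
      for (const auto& [word, count] : emitted) combined[word] += count;

      std::cout << emitted.size() << " pairs before combining, "
                << combined.size() << " after\n";  // 5 before, 3 after
      for (const auto& [word, count] : combined)
        std::cout << word << ", " << count << "\n";  // e.g. the, 3
    }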

Example: word frequency. Input: a large number of text documents. Task: compute the frequency of each word across all documents, where frequency is the word's count divided by the total word count. A naive solution with the basic MapReduce model requires two MapReduce passes: MR1 counts the total number of words in the documents (using combiners); MR2 counts the occurrences of each word and divides by the total count from MR1.

Word frequency: can we do better? Two nice features of Google's MapReduce implementation help: an ordering guarantee on reduce keys, and the auxiliary functionality EmitToAllReducers(k, v). A nice trick to compute the total number of words in all documents: every map task sends its total word count with key "" to ALL reducer splits. Key "" will be the first key processed by each reducer, and the sum of its values is the total number of words!

Word frequency - mapper:

    map(String key, String value):
      // key: document name, value: document contents
      int word_count = 0;
      for each word w in value:
        EmitIntermediate(w, "1");
        word_count++;
      EmitIntermediateToAllReducers("", AsString(word_count));

    combine(String key, Iterator values):
      // Combiner for map
      // key: a word, values: a list of counts
      int partial_word_count = 0;
      for each v in values:
        partial_word_count += ParseInt(v);
      Emit(key, AsString(partial_word_count));

Word frequency - reducer:

    reduce(String key, Iterator values):
      // Actual reducer
      // key: a word, values: a list of counts
      if (is_first_key):
        assert("" == key);  // sanity check
        total_word_count_ = 0;
        for each v in values:
          total_word_count_ += ParseInt(v);
      else:
        assert("" != key);  // sanity check
        int word_count = 0;
        for each v in values:
          word_count += ParseInt(v);
        Emit(key, AsString(word_count / total_word_count_));

Example: average income in a city.
SSTable 1: (SSN, {Personal Information})
  123456: (John Smith; Sunnyvale, CA)
  123457: (Jane Brown; Mountain View, CA)
  123458: (Tom Little; Mountain View, CA)
SSTable 2: (SSN, {year, income})
  123456: (2007,$70000),(2006,$65000),(2005,$6000),...
  123457: (2007,$72000),(2006,$70000),(2005,$6000),...
  123458: (2007,$80000),(2006,$85000),(2005,$7500),...
Task: compute the average income in each city in 2007. Note: both inputs are sorted by SSN.

Average income solution.
Mapper 1a: input: SSN → Personal Information; output: (SSN, City).
Mapper 1b: input: SSN → Annual Incomes; output: (SSN, 2007 Income).
Reducer 1: input: SSN → {City, 2007 Income}; output: (SSN, [City, 2007 Income]).
Mapper 2: input: SSN → [City, 2007 Income]; output: (City, 2007 Income).
Reducer 2: input: City → 2007 Incomes; output: (City, AVG(2007 Incomes)).

Average income, joined solution (inputs are sorted; custom input readers).
Mapper: input: SSN → Personal Information and Incomes; output: (City, 2007 Income).
Reducer: input: City → 2007 Incomes; output: (City, AVG(2007 Incomes)).
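Because both inputs are sorted by SSN, the joined mapper can merge them in a single pass. Below is a minimal single-process C++ sketch (records made up, mirroring the slide's example) of that merge-join followed by the averaging reducer:

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
      // Both inputs sorted by SSN, as the slide assumes.
      std::vector<std::pair<int, std::string>> person = {   // (SSN, city)
          {123456, "Sunnyvale"}, {123457, "Mountain View"}, {123458, "Mountain View"}};
      std::vector<std::pair<int, int>> income2007 = {       // (SSN, 2007 income)
          {123456, 70000}, {123457, 72000}, {123458, 80000}};

      // "Mapper": merge-join the two sorted streams on SSN, emit (city, income).
      std::map<std::string, std::vector<int>> by_city;
      for (size_t i = 0, j = 0; i < person.size() && j < income2007.size();) {
        if (person[i].first < income2007[j].first) { ++i; continue; }
        if (person[i].first > income2007[j].first) { ++j; continue; }
        by_city[person[i].second].push_back(income2007[j].second);
        ++i; ++j;
      }

      // "Reducer": average the 2007 incomes per city.
      for (const auto& [city, incomes] : by_city) {
        long sum = 0;
        for (int v : incomes) sum += v;
        std::cout << city << ": " << sum / static_cast<long>(incomes.size()) << "\n";
      }
    }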

Summary. MapReduce is a flexible programming framework: many applications can be expressed through a couple of restricted Map()/Reduce() constructs. Google invented and implemented MapReduce around its infrastructure to allow its engineers to scale with the growth of the Internet and of Google's products and services. Open-source implementations of MapReduce, such as Hadoop, are creating a new ecosystem that enables large-scale computing over off-the-shelf clusters.

More examples at: MapReduce: The Programming Model and Practice. Jerry Zhao, Jelena Pjesivac-Grbovic. SIGMETRICS 2009 tutorial. research.google.com/pubs/archive/36249.pdf

Hadoop. An open-source, top-level Apache project. HDFS (the Hadoop Distributed File System), the open-source counterpart of GFS, is designed to store very large files across the machines of a large cluster. Used by Yahoo!, Facebook, eBay, Amazon, Twitter, ...

Hadoop. Scalable: thousands of nodes; petabytes of data in over 10M files; single files ranging from gigabytes to terabytes. Economical: open source; runs on commercial off-the-shelf hardware (but master nodes should be reliable). Well-suited to bag-of-tasks applications (including many bioinformatics apps). Files are split into blocks and distributed across nodes, giving high-throughput access to huge datasets.

Hadoop architecture. The client sends a job request to the job tracker (the MapReduce master); the job tracker queries the name node (the HDFS master) about the physical locations of the data blocks. The input stream is split among the desired number of map tasks, and map tasks are scheduled closest to where the data reside. The HDFS slaves are the data nodes; the MapReduce slaves are the task trackers. Slide from Simone Leo, Python MapReduce Programming with Pydoop, EuroPython 2011.

Hadoop Distributed File System. The client sends a filename to the name node, which replies with block IDs and data node locations; the client then reads the data directly from the data nodes. A secondary name node handles namespace checkpointing and logs. Each block is replicated n times (3 by default): one replica on the same rack, the others on different racks. The user has to provide the network topology. Slide from Simone Leo, Python MapReduce Programming with Pydoop, EuroPython 2011.

Other Hadoop MapReduce components. Combiner: a local reducer. RecordReader: translates the byte-oriented view of input files into the record-oriented view required by the Mapper; directly accesses HDFS files; its processing unit is the InputSplit (filename, offset, length). Partitioner: decides which Reducer receives which key; typically uses a hash function of the key. RecordWriter: writes the key/value pairs output by the Reducer; directly accesses HDFS files.

More on Hadoop. For the homework (and the optional Hadoop final project) we will provide instructions for downloading and configuring Hadoop and for running your MapReduce application on your laptop/machine.