Introduction to MapReduce

732A54 Big Data Analytics
Introduction to MapReduce
Christoph Kessler, IDA, Linköping University

Towards Parallel Processing of Big Data
- Big Data is too large to be read and processed in reasonable time by one server only, and too large to fit in main memory at a time.
- It usually resides on secondary storage (local or remote).
- Storage on a single hard disk (sequential access) would prevent parallel processing.
- Therefore: use lots of servers with lots of hard disks, i.e., standard server nodes in a cluster, and distribute the data across the nodes to allow for parallel access.

Distributed File System
- MapReduce works atop a distributed file system.
- Large files are sharded, i.e., split into blocks (shards) of e.g. 64 MB and spread out across the cluster nodes, so that parallel access is possible.
- Access to local chunks is faster (higher bandwidth, lower latency).
- Replicas provide fault tolerance, e.g. 3 copies of each block on different servers.
- Examples: Google GFS, Hadoop HDFS.
- The data may first need to be copied from the ordinary (host) file system into HDFS.

MapReduce is designed to operate on LARGE input data sets. Its data elements are key-value pairs. Processing proceeds in three phases:
- Map Phase: record reader, mapper, combiner, partitioner
- Shuffle Phase: shuffle and sort
- Reduce Phase: reduce, output formatter
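To make the flow of key-value pairs through these three phases concrete, the following is a minimal single-process Python sketch, not Hadoop code; the driver function map_reduce and the tiny wordcount example data are assumptions made only for illustration:

from itertools import groupby

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record, collecting (key, value) pairs.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: sort by key so that pairs with equal keys become adjacent groups.
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: apply reduce_fn once per key group.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=lambda kv: kv[0])]

# Wordcount over two small example "documents":
docs = ["ABC DEF ABC", "GHI DEF"]
print(map_reduce(docs,
                 map_fn=lambda doc: [(word, 1) for word in doc.split()],
                 reduce_fn=lambda word, counts: (word, sum(counts))))
# -> [('ABC', 2), ('DEF', 2), ('GHI', 1)]

On a real cluster, the same three phases run distributed over many mapper and reducer tasks, as described below.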

MapReduce is a generalization of the data-parallel MapReduce skeleton:
- Map function:    K1 x V1 -> List(K2 x V2)
- Reduce function: K2 x List(V2) -> List(V2)

Record Reader
- Parses an input file block from stdin into key-value pairs that define the input data records.
- The key (in K1) is typically positional information (the location in the file).
- The value (in V1) is the chunk of input data that composes a record.

Mapper
- Applies a user-defined function to each element (i.e., each key/value pair coming from the record reader). Examples:
  - Filter function: drop elements that do not fulfill a constraint.
  - Transformation function: a calculation on each element.
- Produces a list of zero or more new key/value pairs, the intermediate elements:
  - The key (in K2) is an index for the grouping of the data.
  - The value (in V2) is the data to be forwarded to the reducer.
- The intermediate elements are buffered in memory.

Combiner
- An optional local reducer, run in the mapper task as a postprocessor.
- Applies a user-provided function to aggregate the values in the intermediate elements of one mapper task.
- The reduction/aggregation could also be done by the reducer, but local reduction can improve performance considerably:
  - Data locality: the key/value pairs are still in the cache or memory of the same node.
  - Data reduction: the aggregated information is often smaller.
- Applicable if the user-defined Reduce function is commutative and associative.
- Recommended if there is significant repetition of intermediate keys produced by each mapper task.

Partitioner
- Splits the intermediate elements from the mapper/combiner into shards (64 MB blocks stored in local files), one shard per reducer.
- Default: an element goes to shard hashcode(element.key) modulo R, for an even distribution of elements; usually good for load balancing.
- Writes the shards to the local file system.

Shuffle-and-sort
- Downloads the files written by the partitioners that are needed on the node on which the reducer is running.
- Sorts the received (key, value) pairs by key into one list; pairs with equivalent keys are now next to each other (groups), to be handled by the reducer.
- No customization here beyond how to sort and group by keys.
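To illustrate the default partitioning rule described above, here is a small Python sketch; the number of reducers R and the sample intermediate pairs are assumptions, and Python's built-in hash stands in for Hadoop's key hash code:

R = 3  # number of reducer tasks (assumed for this example)

def partition(key, num_reducers=R):
    # Default rule from the slide: hash code of the key modulo R.
    return hash(key) % num_reducers

intermediate = [("ABC", 1), ("DEF", 1), ("ABC", 1), ("GHI", 1), ("DEF", 1)]
shards = {r: [] for r in range(R)}
for key, value in intermediate:
    shards[partition(key)].append((key, value))
# All pairs with the same key land in the same shard and hence at the same reducer.
print(shards)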

Reducer
- Runs a user-defined reduce function once per key grouping.
- Can aggregate, filter, and combine data.
- Output: 0 or more key/value pairs, sent to the output formatter.

Output Formatter
- Translates the final (key, value) pairs from the reduce function and writes them to stdout, to a file in HDFS.
- The default formatting (key <TAB> value <NEWLINE>) can be customized.

Python code for the Mapper task:

import sys

for line in sys.stdin:
    # remove leading and trailing whitespace:
    line = line.strip()
    # split the line into words:
    words = line.split()
    # increase counters:
    for word in words:
        print('%s\t%s' % (word, 1))

(Slide figure: small example documents built from the words ABC, DEF, GHI and the (word, 1) pairs the mapper emits for each document.)

Python code adapted from MapReduce tutorial, Princeton U., 2015

Python code for the Combiner task:

import sys

for line in sys.stdin:
    # for each document (one per input line) create a dictionary of words:
    wordcounts = dict()
    words = line.split()
    for word in words:
        if word not in wordcounts:
            wordcounts[word] = 1
        else:
            wordcounts[word] += 1
    # emit key-value pairs only for distinct words per document:
    for w in wordcounts:
        print('%s\t%s' % (w, wordcounts[w]))

Effect of Shuffle-And-Sort:
(Slide figure: the mapper/combiner output pairs after sorting by key, so that pairs with equal keys are adjacent.)

Python code for the Reducer task:

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    # remove leading and trailing whitespace:
    line = line.strip()
    # parse the input we got from the mapper:
    word, count = line.split('\t', 1)
    # convert count from string to int:
    try:
        count = int(count)
    except ValueError:
        # silently ignore invalid lines:
        continue
    # NB: words arrive in sorted order; if the word is the same as the last one,
    # just add its count:
    if current_word == word:
        current_count += count
    else:
        # new word: print the tuple for the previous one to stdout:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# loop done, write the last tuple:
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
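The shuffle-and-sort step that the reducer above relies on can be simulated locally in a few lines of Python; the sample mapper output below is made up for illustration:

# Local simulation of what shuffle-and-sort does to the mapper's output
# (the sample lines mirror the ABC/DEF/GHI wordcount example):
mapper_output = ["ABC\t1", "DEF\t1", "ABC\t1", "GHI\t1", "DEF\t1", "ABC\t1"]
for pair in sorted(mapper_output):
    print(pair)
# Pairs with equal keys are now adjacent, which is exactly the property that
# the reducer's current_word/current_count logic depends on.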

Effect of Reducer:
(Slide figure: the final word counts produced by the reducer for the example documents, e.g. DEF<tab>3 and GHI<tab>4.)

Special Cases of MapReduce
- Map only (Reduce is the identity function):
  - Data filtering, e.g. distributed grep
  - Data transformation
- Shuffle-and-sort only: sorting values by key
  - The mapper extracts the key from each record and forms <key, record> pairs.
  - The shuffle-and-sort phase does the sorting by key.
- Reduce only (Map is the identity function, Combiner for local reduction):
  - Reductions (summarizations): find the global maximum/minimum, global sum, average, median, standard deviation, ...
  - Find the top-10.

Further Examples for MapReduce
- Count URL frequencies (a variant of wordcount)
  - Input: logs of web page requests.
  - The mapper emits <URL, 1> for each request.
  - The Reduce function adds together all values for the same URL.
- Construct the reverse web-link graph
  - Input: <sourceURL, targetURL> pairs.
  - The mapper reverses them: <targetURL, sourceURL>.
  - Shuffle-and-sort yields <targetURL, list of all URLs pointing to targetURL>.
  - No reduction needed; the Reduce function is the identity function.
- Indexing web documents
  - Input: list of documents (e.g. web pages).
  - The mapper parses the documents and builds sequences <word, documentID>*.
  - Shuffle-and-sort produces for each word the list of all documentIDs where the word occurs (the Reduce function is the identity).

MapReduce Implementation / Execution Flow
- The user application calls MapReduce and waits.
- The MapReduce library implementation splits the input data (if not already done) into M blocks (of e.g. 64 MB) and creates P MapReduce processes on different cluster nodes: 1 master and P-1 workers.
- The master creates M mapper tasks and R reducer tasks and dispatches them to idle workers (dynamic scheduling).
- A worker executing a mapper task reads its block of input, applies the Map (and local Combine) function, and buffers the (key, value) pairs in memory. The buffered pairs are periodically written to local disk, and the locations of these files are sent to the master.
- A worker executing a reducer task is notified by the master about the locations of the intermediate data to shuffle and sort, fetches them with remote read requests, and then sorts them by key (K2). It applies the Reduce function to the sorted data and appends its output to a local file.
- When all mapper and reducer tasks have completed, the master wakes up the user program and returns the locations of the R output files.

MapReduce Implementation: Fault Tolerance
- Worker failure:
  - The master pings every worker periodically.
  - The master marks a dead worker's tasks for re-execution; they are eventually reassigned to other workers. This applies to completed map tasks (since their local files with intermediate data are no longer accessible) and to unfinished map and reduce tasks.
  - Reducer tasks using data from a failed map task are informed by the master about the new worker.
- Master failure:
  - Less likely (there is 1 master but P-1 workers).
  - Use checkpointing; a new master can restart from the latest checkpoint.

MapReduce Implementation: Data Locality
- For data storage fault tolerance, there are 3 copies of each 64 MB data block, each stored on a different cluster node.
- The master uses locality-aware scheduling:
  - Schedule a mapper task on a worker node holding one copy of its input data block,
  - or on a node that is near a copy holder (e.g. a neighbor node in the network topology).
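As a sketch of the reverse web-link graph example, a mapper in the same Hadoop-streaming style as the wordcount code above could look as follows; the tab-separated input format sourceURL<TAB>targetURL is an assumption made for this illustration:

import sys

# Reverse web-link graph mapper (sketch): for every input pair
# sourceURL<TAB>targetURL, emit targetURL<TAB>sourceURL.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    source_url, target_url = line.split('\t', 1)
    # Shuffle-and-sort then groups, per targetURL, all source URLs pointing
    # to it, so the reducer can simply be the identity function.
    print('%s\t%s' % (target_url, source_url))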

MapReduce Implementation: Granularity
- The numbers M and R and the work per task (block size) can be tuned.
- Default: M = input file size / block size; the user can set another value.
- M and R should be >> P, for flexibility in dynamic load balancing. Hadoop recommends roughly 10-100 mappers per cluster node, or more if they are lightweight.
- They should not be too large, though: the master makes on the order of M+R scheduling decisions.
- The block size should be reasonably large (e.g. 64 MB) to keep the relative impact of communication and task overhead low.

References
- J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. Proc. OSDI 2004. Also in: Communications of the ACM 51(1), 2008.
- D. Miner, A. Shook: MapReduce Design Patterns. O'Reilly, 2012.
- Apache Hadoop: https://hadoop.apache.org

Questions for Reflection
- A MapReduce computation should process 12.8 TB of data in a distributed file with block (shard) size 64 MB. How many mapper tasks will be created, by default? (Hint: 1 TB (terabyte) = 10^12 bytes.)
- Discuss the design decision to offer just one MapReduce construct that covers mapping, shuffle+sort, and reducing. Wouldn't it be easier to provide one separate construct for each phase? What would be the performance implications of such a design operating on distributed files?
- Reformulate the wordcount example program to use no Combiner.
- Consider the local reduction performed by a Combiner: why should the user-defined Reduce function be associative and commutative? Give examples of reduce functions that are associative and commutative, and of such that are not.
- Extend the wordcount program to discard words shorter than 4 characters.
- Write a wordcount program that only counts all words of odd and of even length. There are several possibilities.
- Show how to calculate a database join with MapReduce.
- Sometimes workers might be temporarily slowed down (e.g. by repeated disk read errors) without being broken. Such workers could delay the completion of an entire MapReduce computation considerably. How could the master speed up the overall MapReduce processing if it observes that some worker is late?