Map Reduce Group Meeting


Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presentation has been adapted from the original MapReduce paper in OSDI 2004

What is Map Reduce?
- A programming paradigm/model for processing large data sets in parallel on large distributed clusters
- The framework or runtime system takes care of partitioning the input data, scheduling, fault tolerance, and communication
- Programs are written in a functional style
- Basically, an abstraction

Programming Model of MapReduce
- Input and output are sets of key-value pairs
- The programmer expresses the computation as two functions: Map and Reduce
- Map transforms input K-V pairs into intermediate K-V pairs

Programming Model of MapReduce (cont'd)
- Map function: takes an input K-V pair and produces a set of intermediate K-V pairs
- The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function
- Reduce function: accepts an intermediate key and a set of values for that key, and merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation

Example: Pseudocode to count the number of occurrences of words [O.P.]
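The paper's word-count pseudocode is not reproduced in this transcription. A minimal Python sketch of the same idea follows; the function names and the in-memory shuffle are illustrative, not any real framework's API:

```python
from collections import defaultdict

def map_fn(document_name, contents):
    # Map: emit (word, 1) for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all partial counts for one word.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Simulate the shuffle phase: group intermediate values by key,
    # then hand each key and its value list to the Reduce function.
    groups = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return dict(kv for key in groups for kv in reduce_fn(key, groups[key]))

counts = run_mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```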

Execution Overview [O.P.]

Execution Overview (cont'd)
1. The MapReduce library splits the input files into M pieces, then starts up multiple workers. One of them is the master.
2. The master picks idle workers and assigns each one a map or reduce task: M map tasks and R reduce tasks in total.
3. A map worker, when assigned a map task:
- Parses its share of key-value pairs and passes each pair to the Map function
- The output of Map is intermediate K-V pairs, buffered in memory and periodically partitioned into R regions on local disk (using a hashing function)
- The locations on disk are passed to the master, which forwards them to the reduce workers
4. When a reduce worker is notified by the master, it:
- Uses remote procedure calls to read the data from the local disks of the map workers
- Sorts the data by intermediate keys, so all occurrences of the same key are grouped together
- Iterates over the sorted data and, for each key, calls the Reduce function
- Appends the output of the Reduce function to the final output file for this reduce partition
5. When all map and reduce tasks complete, the master wakes up the user program and control returns to user code.
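The "hashing function" in step 3 is, in the paper's default, hash(key) mod R, which guarantees that every occurrence of a key lands in the same reduce partition. A small sketch (real implementations use a deterministic hash across machines; Python's built-in hash is only stable within one process):

```python
def partition(key, R):
    # Default partitioning function: hash(key) mod R sends every
    # occurrence of a given key to the same reduce task.
    return hash(key) % R

R = 4
pairs = [("apple", 1), ("pear", 1), ("apple", 1)]
regions = {r: [] for r in range(R)}
for key, value in pairs:
    regions[partition(key, R)].append((key, value))
# Both ("apple", 1) pairs land in the same region.
```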

Fault Tolerance
To tolerate failing machines:
- Worker failure: the master pings each worker periodically. No response means the worker is marked as failed; its map tasks, whether completed or in progress, become eligible for re-scheduling, and workers executing reduce tasks are notified of the re-execution. In case of large-scale failures (like network maintenance on a cluster), the master re-executes the tasks done by unreachable workers.
- Master failure: unlikely, but can be handled by writing periodic checkpoints and starting another master.
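The asymmetry in the worker-failure rule (completed map tasks are re-run, completed reduce tasks are not) is because map output lives on the failed worker's local disk, while reduce output is already in the global file system. A sketch of the master-side logic, with an illustrative task representation:

```python
def handle_worker_failure(failed_worker, tasks):
    # Re-schedule tasks after a worker failure. Map tasks are reset even
    # if completed, because their output was on the failed worker's local
    # disk. Completed reduce output survives in the global file system.
    for task in tasks:
        if task["worker"] == failed_worker:
            if task["type"] == "map":
                task["state"] = "idle"      # eligible for re-execution
                task["worker"] = None
            elif task["type"] == "reduce" and task["state"] == "in_progress":
                task["state"] = "idle"
                task["worker"] = None
    return tasks

tasks = [
    {"type": "map", "state": "completed", "worker": "w1"},
    {"type": "reduce", "state": "completed", "worker": "w1"},
    {"type": "reduce", "state": "in_progress", "worker": "w1"},
]
tasks = handle_worker_failure("w1", tasks)
```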

Map Reduce Libraries
- Hadoop: open source by Apache; written in Java; can be used with C++, Java, and Python; uses the Hadoop Distributed File System (HDFS); the most popular. Interfaces on top of it include Amazon Elastic Map Reduce (to use Amazon cloud compute), Hive, and Cloudera
- MARS: on GPUs, in CUDA
- Others, including open and closed source

Other Examples
- Distributed Grep: the Map function emits a line if it matches the pattern; the Reduce function just copies the supplied intermediate data to the output.
- Distributed Sort: the Map function extracts the key from each record and produces a (key, record) pair; the Reduce function emits all pairs unchanged. It depends on the partitioning and ordering facilities described in the execution overview. Including startup overhead, it performed similarly to the best reported result for the TeraSort benchmark at the time.
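Distributed grep shows how little user code these examples need: the map emits matching lines and the reduce is the identity. A Python sketch with an illustrative pattern and a tiny single-process driver:

```python
import re

PATTERN = re.compile(r"error")  # illustrative pattern

def grep_map(filename, line):
    # Map: emit the line itself (as the key) only if it matches.
    if PATTERN.search(line):
        yield (line, "")

def grep_reduce(key, values):
    # Reduce: identity -- copy the matched line to the output unchanged.
    yield key

lines = ["disk error on /dev/sda", "all good", "timeout error"]
matches = [out
           for i, line in enumerate(lines)
           for k, v in grep_map(f"log{i}", line)
           for out in grep_reduce(k, [v])]
# matches == ["disk error on /dev/sda", "timeout error"]
```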

Google used it for:
- Large-scale machine learning problems
- Indexing
- Clustering problems for Google News
- Large-scale graph computations

Task Granularity: M & R
- The map phase is split into M pieces and the reduce phase into R pieces
- M, R >> number of machines: this improves dynamic load balancing and speeds up recovery when a worker fails
- Practical bounds: the master makes O(M+R) scheduling decisions and keeps O(M*R) state in memory (~one byte each)
- R is usually constrained by users because the output of each reduce task ends up in a separate file
- In practice (2004): M was chosen so that each individual task gets ~16 to 64 MB of input data, so that the locality optimization is most effective (M=200K, R=5K on 2,000 workers)
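Plugging the 2004 numbers into those bounds makes the practical limits concrete:

```python
M = 200_000    # map tasks
R = 5_000      # reduce tasks
machines = 2_000

# Master scheduling decisions: O(M + R)
decisions = M + R                  # 205,000 decisions

# Master state: O(M * R), at roughly one byte per (map, reduce) pair
state_bytes = M * R                # 1,000,000,000 bytes, about 1 GB

# Map tasks per machine, explaining the load-balancing headroom
tasks_per_machine = M / machines   # 100 map tasks per worker
```

So even at 2004 scale, the O(M*R) state is around a gigabyte, which is why M and R cannot be grown arbitrarily.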

Map Reduce
- Proposed by Google in 2003, later open sourced
- According to a Data Center Knowledge article in June 2014, Google abandoned it and is using Cloud Dataflow instead, because MapReduce didn't work well once the data size reached a few petabytes: http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreducefavor-new-hyper-scale-analytics-system/

References
[O.P.] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI 2004
[G.V.] Cluster Computing and Map Reduce course by Google: https://www.youtube.com/watch?v=vd6pudf3js&list=pld20c9de1e63e1617&index=2
http://www.slideshare.net/tugrulh/distributed-computing-seminarlecture-2-mapreduce-theory-and-implementation?related=1
[J.E.G] Examples: http://www.java2s.com/code/jar/h/hadoop-mapreduce.htm

BACKUP

Locality
We conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS) is stored on the local disks of the machines that make up the cluster. GFS divides each file into 64 MB blocks and stores several copies of each block (typically 3) on different machines. The master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule the map task near a replica of that task's input data (e.g., on a worker machine on the same network switch as the machine containing the data).
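That scheduling preference (replica holder first, then same rack, then anywhere) can be sketched as follows; the worker and rack names are illustrative, and the real master tracks much more state:

```python
def pick_worker_for_map(replica_holders, idle_workers, rack_of):
    # 1st choice: an idle worker that holds a replica of the input split.
    for w in idle_workers:
        if w in replica_holders:
            return w
    # 2nd choice: an idle worker on the same rack (network switch)
    # as some replica of the split.
    replica_racks = {rack_of[w] for w in replica_holders}
    for w in idle_workers:
        if rack_of[w] in replica_racks:
            return w
    # Fallback: any idle worker, paying the cross-rack network cost.
    return idle_workers[0] if idle_workers else None

# w1 holds the replica; w2 shares its rack; w3 sits on another rack.
rack_of = {"w1": "r1", "w2": "r1", "w3": "r2"}
```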

Functional Programming
- MapReduce is inspired by functional programming languages like LISP
- Functional programming vs. imperative languages: functional programs do not modify data, they create new data; the order of operations doesn't matter, since each operation creates a copy
- Declarative programming (programming with expressions)
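The same map-then-fold pattern exists in miniature in Python's built-in map and functools.reduce, which illustrates the functional style the model borrows:

```python
from functools import reduce

words = ["map", "reduce", "cluster"]

# map: apply a pure function to each element, producing a new
# sequence instead of modifying the input in place.
lengths = list(map(len, words))                   # [3, 6, 7]

# reduce: fold the results together. With an associative, commutative
# operation like +, the order of combination does not affect the answer.
total = reduce(lambda a, b: a + b, lengths, 0)    # 16
```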

How does it work? Most computations consist of applying a map operation to each logical record in the input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that share the same key, in order to combine the derived data appropriately.

Why?
- Large input data
- Computations are often straightforward but need to be distributed
- MapReduce is an abstraction that allows programmers to express the simple computations but hides the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library
