MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

Similar documents
September 2013 Alberto Abelló & Oscar Romero 1

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Introduction to Data Management CSE 344

Lecture 11 Hadoop & Spark

Parallel Computing: MapReduce Jin, Hai

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

Map Reduce Group Meeting

Introduction to Data Management CSE 344

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

HadoopDB: An open source hybrid of MapReduce

Large-Scale GPU programming

Databases 2 (VU) ( / )

L22: SC Report, Map Reduce

Improving the MapReduce Big Data Processing Framework

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

CompSci 516: Database Systems

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Hadoop An Overview. - Socrates CCDH

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

CSE 444: Database Internals. Lecture 23 Spark

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

Big Data Hadoop Course Content

Data Partitioning and MapReduce

The MapReduce Abstraction

Introduction to MapReduce

Batch Processing Basic architecture

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

Big Data Management and NoSQL Databases

HADOOP FRAMEWORK FOR BIG DATA

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Introduction to Hadoop and MapReduce

A MapReduce Relational-Database Index-Selection Tool

Distributed Computation Models

CSE 190D Spring 2017 Final Exam Answers

Hadoop Map Reduce 10/17/2018 1

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Introduction to BigData, Hadoop:-

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

MI-PDB, MIE-PDB: Advanced Database Systems

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

MIT805 BIG DATA MAPREDUCE

CS427 Multicore Architecture and Parallel Computing

Distributed computing: index building and use

MapReduce. U of Toronto, 2014

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

Jaql. Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata. IBM Almaden Research Center

MapReduce and Friends

Database System Architectures Parallel DBs, MapReduce, ColumnStores

Chapter 5. The MapReduce Programming Model and Implementation

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

CSE 190D Spring 2017 Final Exam

Introduction to Data Management CSE 344

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

MapReduce Simplified Data Processing on Large Clusters

MapReduce and Hadoop. Debapriyo Majumdar Indian Statistical Institute Kolkata

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

MapReduce and Hadoop

I am: Rana Faisal Munir

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Understanding NoSQL Database Implementations

A BigData Tour HDFS, Ceph and MapReduce

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

The MapReduce Framework

Hadoop Development Introduction

Part A: MapReduce. Introduction Model Implementation issues

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

.. Cal Poly CPE/CSC 369: Distributed Computations Alexander Dekhtyar..

BigData and Map Reduce VITMAC03

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Innovatus Technologies

Cloud Computing & Visualization

A Review Paper on Big data & Hadoop

Distributed computing: index building and use

Advanced Data Management Technologies

Introduction to MapReduce (cont.)

Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

CS 345A Data Mining. MapReduce

Transcription:

MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1

Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3. Explain how to decide the number of mappers and reducers 4. Explain the fault tolerance mechanisms in place in case of a) Worker failure b) Master failure 5. Explain the main problems of the MapReduce framework 6. Explain six potential improvement to the MapReduce framework 7. Name two kinds of generic tasks that can easily fit in a MapReduce framework September 2013 Alberto Abelló & Oscar Romero 2

Application Objectives 1. Provide the pseudo-code of map, reduce and combine functions for a simple problem September 2013 Alberto Abelló & Oscar Romero 3

Activity: MapReduce Objective: Understand the algorithm underneath MapReduce Tasks: 1.(5 ) Individually read the problem statement 2.(30 ) Reproduce step by step the MapReduce execution 3.Hand in the details of the execution 4.(5 ) Class sharing of the results Febrer 2014 Alberto Abelló & Oscar Romero 4

Processes J. Dean et al. September 2013 Alberto Abelló & Oscar Romero 5

Architectural decisions Users submit jobs to a master scheduling system There is one master and many workers Jobs are decomposed into a set of tasks Tasks are assigned to available workers within the cluster/cloud by the master O(M + R) scheduling decisions Try to benefit from locality As computation comes close to completion, master assigns the same task to multiple workers The master keeps all relevant information a) Map and Reduce tasks Worker state (i.e., idle, in-progress, or completed) Identity of the worker machine b) Intermediate file regions Location and size of each intermediate file region produced by each map task Stores O(M * R) states in memory September 2013 Alberto Abelló & Oscar Romero 6

Design decisions Number of Mappers Mappers should take at least a minute to execute One per chunk in the input Ideally, to exploit data locality, 10*N < M <100*N Number of Reducers Many can produce an explosion of intermediate files For immediate launch: 0.95*N*MaxTasks For load balancing: 1.75*N*MaxTasks September 2013 Alberto Abelló & Oscar Romero 7

Fault tolerance mechanisms Worker failure Master pings workers periodically (heartbeat) Assumes failure if no response Completed/in-progress map and in-progress reduce tasks on failed worker are rescheduled on a different worker node Use chunk replicas Master failure Since there is only one, it is less likely it fails Keep checkpoints of data structure September 2013 Alberto Abelló & Oscar Romero 8

Tasks and Data Flows START-UP TIME Single-point failure Reassignments if needed WRITING INTERMEDIATE RESULTS DATA TRANSFER DATA TRANSFER S. Abiteboul et al. September 2013 Alberto Abelló & Oscar Romero 9

Performance Problems Defines the execution plan on the fly Schedules one block at a time Which allows to adapt to workload and performance differences between nodes in the cluster Heavy start-up process Writes intermediate results to disk Reduce tasks pull intermediate data Which improves fault tolerance The Map and Reduce code are black boxes The system knows nothing about what is going on there Although query-shipping-oriented, the shuffle phase of the reducer implies massive data transfer Does not benefit from compression Without any reason inherent to the model In general, it hardly beats a RDBMS until a cluster of >8 nodes is considered Distributed scans compensate the inherent overheads September 2013 Alberto Abelló & Oscar Romero 10

Improvements Direct access to disk The API of the storage system generates overhead Implementing a query optimizer The Map & Reducer should define pre- postconditions (Stratosphere project) Index usage No index is used currently Record parsing with mutable objects E.g., users should avoid using Java String Implement different grouping algorithms Merge-sort is not always the best option Avoid fine grain scheduling Blocks may be too small September 2013 Alberto Abelló & Oscar Romero 11

High Level Languages Is the de facto standard for robust execution of large data-oriented tasks Has been massively criticized for being too low-level APIs for Ruby, Python, Java, C++, etc. There are not better solutions Support in HBase, MongoDB, CouchDB, etc. Attempts to build declarative languages on top of it Hive Pig Latin Cassandra Query Language (CQL) Resembles SQL September 2013 Alberto Abelló & Oscar Romero 12

MapReduce at First Sight The MapReduce programming paradigm is computationally complete Any program can be adapted to it Not necessarily efficient MapReduce s signature is closed MapReduce iterations can be nested Fault tolerance is not guaranteed in between Some tasks better adapt to it than others Matrix-vector multiplication Relational algebra September 2013 Alberto Abelló & Oscar Romero 13

Relational operations: Projection October 2010 Alberto Abelló 14

Relational operations: Cross Product October 2010 Alberto Abelló 15

Activity Objective: Practice MapReduce paradigm Tasks: 1. (10 ) Individually solve two exercises a) Selection & Difference b) Join & Union c) Aggregation & Intersection 2. (20 ) Explain the solution to the others 3. Hand in the six solutions Roles for the team-mates during task 2: a) Explains his/her material b) Asks for clarification of blur concepts c) Mediates and controls time September 2013 Alberto Abelló & Oscar Romero 16

Summary Processes in MapReduce Fault tolerance mechanisms in MapReduce Data transfer in MapReduce MapReduce expressivity September 2013 Alberto Abelló & Oscar Romero 17

Bibliography J. Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI 04 J. Dittrich et al. Hadoop++: Making a yellow elefant run like a cheetah (without it even noticing). Proceedings of VLDB 3(1), 2010 A. Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD, 2009 A. Rajaraman et al. Mining massive data sets. Cambridge University Press, 2012 September 2013 Alberto Abelló & Oscar Romero 18

Resources http://hadoop.apache.org http://hive.apache.org http://pig.apache.org http://www.cloudera.com http://flink.incubator.apache.org Former http://stratosphere.eu https://spark.apache.org September 2013 Alberto Abelló & Oscar Romero 19