Introduction to MapReduce

Introduction to MapReduce
April 19, 2012
Jinoh Kim, Ph.D., Computer Science Department, Lock Haven University of Pennsylvania

Research Areas: Datacenter Energy Management, Exa-scale Computing, Network Performance Estimation, Healthcare System Performance Optimization, Network Security, Network Management

Outline: Background, MapReduce functions, MapReduce system aspects, Summary

Some Technical Terms: distributed computing, parallel computing (parallelism), failure-prone, data replication

Distributed Computing? Distributed computing utilizes multiple computers over a network. Examples: SETI@home, web crawling (search engines), parallel rendering (computer graphics), and many more. Distributed computing systems provide a platform for executing distributed computing jobs. Examples: datacenter clusters (hundreds to tens of thousands of computers), PlanetLab (1,000+ computers), and BOINC (for the @home projects; 500,000+ participants).

SETI@home (Search for Extraterrestrial Intelligence, http://setiathome.berkeley.edu/index.php) is volunteer-based distributed computing: it uses donated computers whenever they are not in use (i.e., idle). You can donate your own computer, and it provides a fancy screen saver. Computing power: about 600 teraFLOPS (for comparison, the Cray Jaguar supercomputer delivers about 1,800 teraFLOPS).

SETI@home screen saver (screenshot)

Parallel Computing? Suppose a job consists of 1,000 data blocks, e.g., rendering 1,000 individual movie frames, and the average processing time per block on one node is 5 seconds. Total processing time on a single node = 1,000 * 5 seconds = 5,000 seconds, or roughly 1.4 hours.

Using Parallelism: Use 1,000 nodes in parallel, one block per node; rendering individual frames, for example, can be done independently. Total processing time = 5 seconds (ideally!).
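To make the speedup arithmetic concrete, here is a minimal Python sketch. The hypothetical process_block() is a stand-in for real per-block work, its cost is scaled down from 5 seconds to 5 milliseconds so the script finishes quickly, and the 8-worker pool stands in for 1,000 nodes.

```python
# Sketch of embarrassingly parallel block processing (illustrative only).
import time
from multiprocessing import Pool

BLOCK_TIME = 0.005   # stand-in for the 5-second per-block cost
NUM_BLOCKS = 1000

def process_block(block_id):
    time.sleep(BLOCK_TIME)           # pretend to render frame `block_id`
    return block_id

if __name__ == "__main__":
    # Sequential: roughly NUM_BLOCKS * BLOCK_TIME
    start = time.time()
    for b in range(NUM_BLOCKS):
        process_block(b)
    print("sequential:", round(time.time() - start, 2), "s")

    # Parallel: ideally NUM_BLOCKS * BLOCK_TIME / number_of_workers
    start = time.time()
    with Pool(processes=8) as pool:
        pool.map(process_block, range(NUM_BLOCKS))
    print("parallel:  ", round(time.time() - start, 2), "s")
```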

Failure-prone? Many clusters use commodity PCs rather than high-end servers, so failures are a part of everyday life. A PlanetLab report (HotMetrics 2008) counts failures per node over three months: 2.1 hardware failures and 41 software failures. Failure happens every 4 days for a node; now think about 1,000 nodes in a cluster.

Data Replication? Place data on multiple machines redundantly: with four nodes, for example, each of the blocks D1, D2, and D3 is stored on three different nodes. Why replicate data? Data availability under failures and load balancing. The price: more storage is required, plus consistency and management overheads.

MapReduce Functions

Motivation: We want to process lots of data (> 1 TB); at the usual 64 MB block size, 1 TB is 16,384 data blocks. We want to parallelize across thousands of CPUs, and we want to make this easy. MapReduce provides a small set of functions for large-scale data processing.

Who uses MapReduce? The New York Times converted 4 TB of raw TIFF image data into PDFs for about $240 of Amazon cloud computing; Google uses MapReduce for sort, grep, indexing, and so on. (Source: MapReduce Tutorial, SIGMETRICS 2009)

Programming Model: MapReduce borrows from functional programming. Users implement an interface of two functions: map(key, value) -> list of (key', value') pairs, and reduce(key', list of value') -> output value. (Source: MapReduce Tutorial, SIGMETRICS 2009)
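As a hedged rendering of this two-function interface in Python (the type aliases below are my own shorthand, not part of any MapReduce library):

```python
# Illustrative type aliases for the MapReduce programming contract.
from typing import Callable, Iterable, Tuple, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")   # input key/value types
K2 = TypeVar("K2"); V2 = TypeVar("V2")   # intermediate key/value types

# map: (k1, v1) -> a list of intermediate (k2, v2) pairs
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]

# reduce: (k2, all v2 values grouped under that key) -> reduced output values
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]
```

The word-count example on the next slide is one concrete instance of this contract.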

Example: Word Count
map(key=document name, val=contents): for each word w in contents, emit (w, "1")
reduce(key=word, values=list of counts): sum all the "1"s in the values list and emit (word, sum)

Input documents: "Hey Jude", "Hey soul sister", "Sister act"
mapper 1: (hey, 1) (Jude, 1)
mapper 2: (hey, 1) (soul, 1) (sister, 1)
mapper 3: (sister, 1) (act, 1)
reducer 1 (keys a-m): (act, 1) (hey, 2) (Jude, 1)
reducer 2 (keys n-z): (sister, 2) (soul, 1)

Word Count Pseudo-code (figure from Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI '04)
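The pseudo-code figure itself is not reproduced here. As a stand-in, here is a minimal single-machine Python sketch of the same word-count job; the in-memory grouping step is an illustrative substitute for the distributed shuffle that the real library performs between the map and reduce phases.

```python
# Single-machine word count in the MapReduce style (illustrative sketch).
from collections import defaultdict

def map_fn(doc_name, contents):
    # emit (word, 1) for each word in the document (lowercased)
    for word in contents.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    # sum all the counts emitted for this word
    return (word, sum(counts))

def run_wordcount(documents):
    # "shuffle": group intermediate (word, count) pairs by key
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    # reduce each group independently
    return dict(reduce_fn(w, c) for w, c in groups.items())

if __name__ == "__main__":
    docs = {"doc1": "Hey Jude", "doc2": "Hey soul sister", "doc3": "Sister act"}
    print(run_wordcount(docs))
    # -> {'hey': 2, 'jude': 1, 'soul': 1, 'sister': 2, 'act': 1}
```

On the three example documents this matches the reducer outputs in the trace above (with words lowercased).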

MapReduce Systems

A (MapReduce) Cluster: thousands of commodity PCs connected by high-bandwidth network links and switches, running Google MapReduce or Hadoop MapReduce. A rack = nodes + a switch; a node = CPU, memory, disk, network card, and so on. (Pictures from http://www.scbi.uma.es/portal/ and picmg.org.)

A MapReduce System provides automatic parallelization and distribution, fault tolerance, status and monitoring tools, and a clean abstraction for programmers.

Execution Model (figure from Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI '04)

MapReduce Data Layout: MapReduce places three copies of each data block and often uses rack-awareness for replication: 1) no two replicas of a data block are stored on a single node, and 2) a data block can be found on at least two racks. Why? A node does not need to hold two copies of the same block, and a whole rack can fail at once (e.g., a power failure).
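A hedged sketch of these placement rules (this is not Hadoop's or GFS's actual placement code; the cluster layout and the one-local/two-remote split are illustrative assumptions):

```python
# Pick 3 nodes so that no node holds two replicas and at least two racks
# hold a copy (illustrative sketch; assumes each rack has >= 2 nodes).
import random

def place_replicas(racks):
    """racks: dict mapping rack id -> list of node ids."""
    first_rack = random.choice(list(racks))
    second_rack = random.choice([r for r in racks if r != first_rack])
    # one replica on the first rack, two on distinct nodes of a second rack
    chosen = [random.choice(racks[first_rack])]
    chosen += random.sample(racks[second_rack], 2)
    return chosen

if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
    print(place_replicas(cluster))   # e.g. ['n2', 'n4', 'n6']
```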

Locality: The master program allocates tasks based on the location of the data. The precedence rule for map task allocation: 1) a machine holding the physical file data, 2) a machine in the same rack, 3) anywhere else. Why? The network is the performance bottleneck, and the networking cost increases from the same machine (zero) to a machine in the same rack to any other machine.
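A hedged sketch of this preference order (the data structures are invented for illustration; a real scheduler tracks much more state):

```python
# Choose a node for a map task: data-local, then rack-local, then anywhere.
def pick_node_for_map_task(block_locations, free_nodes, node_rack):
    """block_locations: nodes storing the block; node_rack: node -> rack id."""
    block_racks = {node_rack[n] for n in block_locations}
    for node in free_nodes:                 # 1) same machine as the data
        if node in block_locations:
            return node, "data-local"
    for node in free_nodes:                 # 2) same rack as the data
        if node_rack[node] in block_racks:
            return node, "rack-local"
    return free_nodes[0], "remote"          # 3) anywhere else

if __name__ == "__main__":
    node_rack = {"n1": "r1", "n2": "r1", "n3": "r2"}
    print(pick_node_for_map_task({"n1"}, ["n2", "n3"], node_rack))
    # -> ('n2', 'rack-local')
```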

Fault Tolerance: The master detects worker failures. It re-executes both completed and in-progress map() tasks. Why re-execute even completed map tasks? Because their intermediate results are stored on the local disks of the map workers, which are lost when the worker fails. Only in-progress reduce() tasks need to be re-executed, since completed reduce output is already stored in the global file system.
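A hedged sketch of that re-execution rule, with task records simplified to dictionaries for illustration:

```python
# Decide which tasks must be rescheduled after a worker failure:
# completed and in-progress maps, but only in-progress reduces.
def tasks_to_reexecute(tasks, failed_worker):
    redo = []
    for t in tasks:                          # t: {"id", "kind", "state", "worker"}
        if t["worker"] != failed_worker:
            continue
        if t["kind"] == "map" and t["state"] in ("in-progress", "completed"):
            redo.append(t)                   # intermediate output lived on the failed disk
        elif t["kind"] == "reduce" and t["state"] == "in-progress":
            redo.append(t)                   # completed reduce output is in the global FS
    return redo

if __name__ == "__main__":
    tasks = [
        {"id": 1, "kind": "map",    "state": "completed",   "worker": "w1"},
        {"id": 2, "kind": "reduce", "state": "completed",   "worker": "w1"},
        {"id": 3, "kind": "reduce", "state": "in-progress", "worker": "w1"},
    ]
    print([t["id"] for t in tasks_to_reexecute(tasks, "w1")])   # -> [1, 3]
```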

Optimization for Group Performance: No reduce can start until the map phase is complete, so a single slow disk controller can rate-limit the whole job. Group performance is more important than individual performance, so the master redundantly executes slow-moving map tasks (i.e., it launches backup tasks) and uses the result of whichever copy finishes first.
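A hedged sketch of how a master might pick stragglers for backup execution; the elapsed-time heuristic and the 2x threshold are illustrative, not the paper's exact criterion:

```python
# Flag in-progress map tasks that are much slower than the current average.
def pick_backup_candidates(map_tasks, slow_factor=2.0):
    running = [t for t in map_tasks if t["state"] == "in-progress"]
    if not running:
        return []
    avg_elapsed = sum(t["elapsed"] for t in running) / len(running)
    return [t for t in running if t["elapsed"] > slow_factor * avg_elapsed]

if __name__ == "__main__":
    maps = [
        {"id": 1, "state": "in-progress", "elapsed": 10},
        {"id": 2, "state": "in-progress", "elapsed": 12},
        {"id": 3, "state": "in-progress", "elapsed": 95},   # straggler
        {"id": 4, "state": "completed",   "elapsed": 11},
    ]
    print([t["id"] for t in pick_backup_candidates(maps)])   # -> [3]
```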

Performance Results: Sort used 1,764 workers, 15,000 map tasks, and 4,000 reduce tasks; the figure reports completion times of 850, 1,283, and 933 seconds for its three configurations. (Figure from Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI '04)

Summary: MapReduce provides a clean abstraction for programmers doing large-scale data processing. It also provides a system implementation that offers automatic parallelization and distribution, fault tolerance, and status and monitoring tools. It is fun to use: you focus on the problem and let the library deal with the messy details.

Take-away: Try Hadoop! Download and install the software from http://hadoop.apache.org/, or just google for Hadoop.

References
1. Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI '04
2. Jerry Zhao and Jelena Pjesivac-Grbovic, "MapReduce: The Programming Model and Practice", Tutorial, SIGMETRICS 2009
3. Hadoop, http://hadoop.apache.org
4. Eric Tzeng, http://inst.cs.berkeley.edu/~cs61a/su11/lectures/Su11_CS61a_Lec09_print.pdf
5. SETI@home, http://setiathome.berkeley.edu/index.php
6. PlanetLab, http://www.planet-lab.org