MapReduce: A Programming Model for Large-Scale Distributed Computation


CSC 258/458: MapReduce: A Programming Model for Large-Scale Distributed Computation
University of Rochester, Department of Computer Science
Shantonu Hossain, April 18, 2011

Outline
- Motivation
- MapReduce overview
- A detailed example
- Implications for parallel/distributed computing
- Available implementations
- Usage examples
- Performance
- Summary

How big is the Web? (http://www.worldwidewebsize.com/)
- ~35 billion pages
- 35 billion pages × 20 KB/page ≈ 700 terabytes of data
- At an average read rate of 60 MB/s, it would take over four months just to read all the data!

How do we process the data?
- Most of this processing consists of embarrassingly parallel tasks!
- Distribute the work across a cluster of commodity machines (e.g., with MPI)
- Challenges:
  - Coordination and communication among nodes
  - Handling fault tolerance and recovering from failures
  - Debugging
  - The code is difficult to write!

What do we need?
- A programming model that lets users specify parallelism at a high level
- An efficient runtime system that handles:
  - Low-level mapping
  - Synchronization
  - Resource management
  - Fault tolerance and other issues
- Solution: MapReduce

MapReduce: Overview
- A programming model for large-scale data-parallel applications, introduced by Google
- Designed to process many terabytes of data on clusters of thousands of machines
- Programmers specify two primary methods:
  - Map: processes input data and generates intermediate key/value pairs
  - Reduce: aggregates and merges all intermediate values associated with the same key

Example: Count words in web pages
- Input: files with one URL per record
- Output: count of occurrences of each individual word
- Steps:
  1) Specify a Map function
     Input: <key=webpage URL, value=contents>
     Output: <key=word, value=partialCount>*
     Example: input <www.abc.com, "abc ab cc ab"> produces output <abc, 1>, <ab, 1>, <cc, 1>, <ab, 1>

  2) Specify a Reduce function: collects the partial counts and computes the total count of each individual word
     Input: <key=word, value=partialCount*>
     Output: <key=word, value=totalCount>
     Example: key=abc, values={1}; key=ab, values={1, 1}; key=cc, values={1} produces <abc, 1>, <ab, 2>, <cc, 1>

Example code: word count

  void map(String key, String value):
    // key: webpage URL
    // value: webpage contents
    for each word w in value:
      EmitIntermediate(w, "1");

  void reduce(String key, Iterator partialCounts):
    // key: a word
    // partialCounts: a list of aggregated partial counts
    int result = 0;
    for each pc in partialCounts:
      result += ParseInt(pc);
    Emit(AsString(result));
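The pseudocode above can be exercised end-to-end with a small in-memory driver. The following Python sketch is illustrative only (the `map_reduce` driver and all function names are mine, not part of the paper's API); it reproduces the shuffle-and-reduce behavior on the slide's example input:

```python
from collections import defaultdict

def map_fn(url, contents):
    # Map: emit an intermediate <word, 1> pair for every word on the page.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    # Reduce: sum the partial counts for one word.
    return (word, sum(partial_counts))

def map_reduce(inputs, mapper, reducer):
    # Shuffle: group intermediate values by key, as the runtime would.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in mapper(key, value):
            groups[ikey].append(ivalue)
    # Reduce: one call per distinct intermediate key.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

print(map_reduce([("www.abc.com", "abc ab cc ab")], map_fn, reduce_fn))
# {'ab': 2, 'abc': 1, 'cc': 1}
```

In a real deployment the shuffle runs across machines and the reduce calls are distributed; here everything happens in one process purely to make the data flow visible.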

How does MapReduce work? (execution-overview figure omitted)
- The input is partitioned into M splits of 16-64 MB each, with one map task per split
Ref: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.

Issues in distributed computing
- Fault tolerance
  - Worker failure: the master pings each worker periodically, keeps track of the state of each map and reduce task, and reschedules the tasks of failed workers
  - Master failure: periodic checkpointing of the master's data structures
- Data locality
  - GFS stores several copies of each input split on different machines
  - The master takes this location information into account when scheduling tasks

Key features
- Simplicity of the model
  - Programmers specify a few simple methods that focus on the functionality, not on the parallelism
  - Code is generic and portable across systems
- Scalability
  - Scales easily to large clusters with thousands of machines
- Applicability to a large variety of problems, including computation-intensive ones

Implications for parallel/distributed computing
- Lets programmers easily write parallel code and utilize large distributed systems by:
  - Providing a new abstraction for writing code that is automatically parallelized
  - Hiding all the messy details of data distribution, load balancing, and coordination/communication
- Also applies to multi-core/multi-processor systems, which are great candidates for data-parallel applications
  - The distributed cluster is replaced by a large number of cores sharing memory
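As a concrete illustration of the multi-core case, the word-count example can be run with the map and reduce phases spread across cores. This is a hedged Python sketch using `multiprocessing.Pool` (the function names and the serial shuffle step are simplifications of mine, not part of any implementation discussed here):

```python
from collections import defaultdict
from multiprocessing import Pool

def map_words(chunk):
    # Map task: emit <word, 1> pairs for one input split.
    return [(w, 1) for w in chunk.split()]

def reduce_count(item):
    # Reduce task: sum the partial counts for one word.
    word, counts = item
    return (word, sum(counts))

def parallel_word_count(chunks, workers=2):
    with Pool(workers) as pool:
        # Map phase: one task per split, scheduled across cores.
        intermediate = pool.map(map_words, chunks)
        # Shuffle: group values by key (done serially here for simplicity).
        groups = defaultdict(list)
        for pairs in intermediate:
            for key, value in pairs:
                groups[key].append(value)
        # Reduce phase: one parallel task per distinct key.
        return dict(pool.map(reduce_count, groups.items()))

if __name__ == "__main__":
    print(parallel_word_count(["abc ab", "cc ab"]))
```

Systems like Phoenix and Metis follow the same shape but replace the serial shuffle and the process pool with carefully tuned shared-memory data structures.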

Available implementations for clusters
- Hadoop
  - An open-source MapReduce implementation by Apache
  - Stores intermediate results of the computation on local disks
  - Does not support configuring map tasks over multiple iterations
- CGL-MapReduce
  - A streaming-based open-source implementation by Indiana University Bloomington
  - Intermediate results are transferred directly from map to reduce tasks
  - Supports both single-pass and iterative MapReduce computations

Available implementations for multi-cores
- Phoenix
  - A MapReduce implementation in C/C++ for shared-memory systems, by Stanford University
  - Stores intermediate key/value pairs in a matrix: each Map puts its results in the row of its input split, and each Reduce processes an entire column at a time
- Metis
  - An in-memory MapReduce library in C optimized for multi-cores, by MIT
  - Uses a hash table with a b-tree in each entry to store intermediate results
  - Includes several optimizations: a lock-free scheme to schedule map and reduce work, immutable strings for fast key comparisons, and the scalable Streamflow memory allocator
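To make Phoenix's intermediate layout concrete, here is a Python sketch of the row/column idea only (all names and the bucket count are illustrative, not Phoenix's actual API): map task i writes exclusively into row i, so map tasks never contend on the same cell, while each reduce task walks one whole column (one hash bucket of keys).

```python
NUM_REDUCE_BUCKETS = 4

def bucket(key):
    # Hash a key to a column index; Phoenix uses a similar hash partition.
    return hash(key) % NUM_REDUCE_BUCKETS

def run(splits):
    # One row per input split; one column per reduce bucket.
    matrix = [[[] for _ in range(NUM_REDUCE_BUCKETS)] for _ in splits]
    # Map phase: task i fills only row i, so no locking is needed per cell.
    for i, chunk in enumerate(splits):
        for word in chunk.split():
            matrix[i][bucket(word)].append((word, 1))
    # Reduce phase: each task walks one entire column across all rows.
    totals = {}
    for col in range(NUM_REDUCE_BUCKETS):
        for row in matrix:
            for word, count in row[col]:
                totals[word] = totals.get(word, 0) + count
    return totals

print(run(["abc ab", "cc ab"]))
```

The hash partition guarantees that all pairs for a given key land in the same column, so each reduce task sees every value for its keys without any cross-task merging.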

Usage examples
- MapReduce has been used within Google for:
  - Completely regenerating Google's index of the World Wide Web
  - Clustering problems for Google News
  - Extracting the data used to produce reports of popular queries
  - Large-scale graph computations
- It can be applied to a wide range of applications, such as distributed grep, distributed sort, document clustering, inverted indexing, and large-scale machine learning algorithms
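As one example from the list, an inverted index fits the model directly: map emits a <word, URL> pair for each distinct word on a page, and reduce collects the posting list for each word. A minimal in-memory Python sketch (all names and the single-process driver are illustrative):

```python
from collections import defaultdict

def map_index(url, contents):
    # Map: emit a <word, url> pair for each distinct word on the page.
    for word in dict.fromkeys(contents.split()):  # de-duplicate, keep order
        yield (word, url)

def reduce_index(word, urls):
    # Reduce: the posting list is the sorted list of pages containing the word.
    return (word, sorted(urls))

def build_inverted_index(pages):
    groups = defaultdict(list)
    for url, contents in pages:
        for word, u in map_index(url, contents):
            groups[word].append(u)
    return dict(reduce_index(w, us) for w, us in groups.items())

pages = [("a.com", "cloud data"), ("b.com", "data model")]
print(build_inverted_index(pages))
# {'cloud': ['a.com'], 'data': ['a.com', 'b.com'], 'model': ['b.com']}
```

Distributed grep and distributed sort have the same shape: only the map and reduce bodies change, while the grouping machinery stays identical.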

Usage statistics at Google (figure omitted)
Ref: PACT '06 keynote slides by Jeff Dean, Google, Inc.

Candidate workloads
- I/O-intensive
  - Performance is limited by I/O bandwidth; the overhead introduced by the MapReduce implementation has a negligible effect on the overall computation
  - Examples: high-energy physics data analysis, remote-sensing data processing
- Memory-intensive
  - Performance is limited by memory access
  - Example: Successive Over-Relaxation (SOR)
- CPU-intensive
  - Performance is limited by the amount of computation
  - Example: the K-means algorithm

Performance evaluation: applications used
- Word count (WC)
- Matrix multiply (MM)
- Inverted index (II)
- K-means clustering (KMeans)
- String matching (SM)
- PCA
- Histogram (Hist)
- Linear regression (LR)

Speedup (Metis) (figure omitted)
Ref: Y. Mao, R. Morris, and F. Kaashoek. Optimizing MapReduce for Multicore Architectures. Technical Report MIT-CSAIL-TR-2010-020, MIT, 2010.

Comparison to Pthreads (Phoenix) (figure omitted)
Ref: C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. HPCA, 2007.

Metis vs. Phoenix (figure omitted)
Ref: Y. Mao, R. Morris, and F. Kaashoek. Optimizing MapReduce for Multicore Architectures. Technical Report MIT-CSAIL-TR-2010-020, MIT, 2010.

Summary
- MapReduce is a programming model for data-parallel applications
- It simplifies parallel programming: programmers write code in a functional style, and the runtime library hides the burden of synchronization and coordination
- Many real-world, large-scale tasks can be expressed in the MapReduce model
- It is applicable to both distributed clusters and multi-core systems

References
- J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI), Dec 2004.
- Wikipedia article on MapReduce, http://en.wikipedia.org/wiki/mapreduce
- C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proc. of the 13th International Symposium on High-Performance Computer Architecture (HPCA), Feb 2007.
- Y. Mao, R. Morris, and F. Kaashoek. Optimizing MapReduce for Multicore Architectures. Technical Report MIT-CSAIL-TR-2010-020, MIT, 2010.

- PACT '06 keynote slides by Jeff Dean, Google, Inc.
- J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for Data Intensive Scientific Analyses. In IEEE Fourth International Conference on eScience, 2008.