Introduction to MapReduce Algorithms and Analysis

Jeff M. Phillips, October 25, 2013

Trade-Offs
Massive parallelism that is very easy to program, and cheaper than HPC-style clusters (which use top-of-the-line everything). In exchange: an assumption about the data (key-value pairs) and restrictions on how computation is done.

Cluster of Commodity Nodes (2003)
Big Iron Box: 8 x 2 GHz Xeon processors, 64 GB RAM, 8 TB disk, 758,000 USD.
Google Rack: 176 x 2 GHz Xeon processors, 176 GB RAM, 7 TB disk, 278,000 USD.

Google File System SOSP 2003 (Ghemawat, Gobioff, Leung)
Key-Value Pairs: all files are stored as key-value pairs, for example:
key: log id, value: actual log
key: web address, value: html and/or outgoing links
key: document id in set, value: list of words
key: word in corpus, value: how often it appears
Blocking: all files are broken into blocks (often 64 MB). Each block has a replication factor (say 3), with replicas stored on separate nodes. No locality on compute nodes: no neighboring blocks or replicas on the same node (but often in the same rack).

No locality? Really?
Resiliency: if one replica dies, use another (on big clusters this happens all the time).
Redundancy: if one is slow, use another (...the curse of the last reducer).
Heterogeneity: the format is quite flexible, and 64 MB blocks are often still large enough (recall: I/O-efficiency).

MapReduce OSDI 04 (Dean, Ghemawat)
Each processor has a full hard drive; data items are ⟨key, value⟩ pairs.
Parallelism proceeds in rounds:
Map: assigns items to processors by key.
Reduce: processes all items with the same key using their values, usually combining many items per key.
Repeat Map+Reduce a constant number of times, often only one round. Optional post-processing step.
Pro: robust (via duplication) and simple; can harness locality.
Con: a somewhat restrictive model.
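To make the round structure concrete, here is a minimal single-machine sketch in Python. It is not Hadoop or Google MapReduce code; run_round, map_fn, and reduce_fn are illustrative names for a toy simulation of one Map+Reduce round.

from collections import defaultdict

def run_round(records, map_fn, reduce_fn):
    """Simulate one MapReduce round on a single machine.

    records:   iterable of (key, value) input pairs
    map_fn:    (key, value) -> iterable of (key2, value2) pairs
    reduce_fn: (key2, list of values) -> iterable of output pairs
    """
    # Map phase: apply map_fn to every input record.
    intermediate = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            intermediate[k2].append(v2)   # shuffle: group by intermediate key

    # Reduce phase: one call per distinct intermediate key.
    output = []
    for k2, values in intermediate.items():
        output.extend(reduce_fn(k2, values))
    return output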

MapReduce / Granularity and Pipelining OSDI 04 (Dean, Ghemawat) [figure slides]

Last Reducer
Typically the Map phase is linear on blocks; the Reducers are more variable. There is no answer until the last one is done, and some machines get slow or crash! Solution: automatically run back-up copies of straggling tasks and take the first to complete.
Scheduled by a Master Node, which organizes the computation but does not process data. If the master fails, everything goes down.

Example 1: Word Count
Given a text corpus as ⟨doc id, list of words⟩ pairs, count how many times each word appears.
Map: for each word w, emit ⟨w, 1⟩.
Reduce: given {⟨w, c_1⟩, ⟨w, c_2⟩, ⟨w, c_3⟩, ...}, emit ⟨w, Σ_i c_i⟩.
But w = "the" is 7% of all words!
Combine (applied to each mapper's output before it goes to the Shuffle phase): given {⟨w, c_1⟩, ⟨w, c_2⟩, ⟨w, c_3⟩, ...}, emit ⟨w, Σ_i c_i⟩ locally.

Example 1: Word Count - Actual Code
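The original slides show the actual Hadoop Java listing, which is not reproduced here. As a stand-in, here is the same logic in the toy Python sketch above (reusing the hypothetical run_round helper); wc_map, wc_reduce, and the tiny corpus are made up for illustration.

def wc_map(doc_id, text):
    # Emit <word, 1> for every word occurrence in the document.
    for word in text.lower().split():
        yield (word, 1)

def wc_reduce(word, counts):
    # Sum all partial counts for the same word.
    yield (word, sum(counts))

# A combiner could reuse wc_reduce unchanged, since addition is associative
# and commutative; the toy run_round above has no separate combine step.

corpus = [("doc1", "the cat sat on the mat"),
          ("doc2", "the dog sat")]
print(sorted(run_round(corpus, wc_map, wc_reduce)))
# [('cat', 1), ('dog', 1), ('mat', 1), ('on', 1), ('sat', 2), ('the', 3)]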

Example 2: Inverted Index
Given all of Wikipedia (all webpages), for each word list all pages it appears on.
Map: for page p and each word w on it, emit ⟨w, p⟩.
Combine: given {⟨w, p_1⟩, ⟨w, p_2⟩, ⟨w, p_3⟩, ...}, emit ⟨w, list of all p_i⟩ locally.
Reduce: given {⟨w, p_1⟩, ⟨w, p_2⟩, ⟨w, p_3⟩, ...}, emit ⟨w, list of all p_i⟩.
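Again reusing the hypothetical run_round helper from above, only the map and reduce functions change; ii_map, ii_reduce, and the two sample pages are made up for illustration.

def ii_map(page, text):
    # Emit <word, page> once for each distinct word on the page.
    for word in set(text.lower().split()):
        yield (word, page)

def ii_reduce(word, pages):
    # Collect the sorted list of pages that contain the word.
    yield (word, sorted(pages))

pages = [("p1", "spark keeps data in memory"),
         ("p2", "mapreduce writes data to disk")]
print(sorted(run_round(pages, ii_map, ii_reduce)))
# 'data' maps to ['p1', 'p2']; every other word maps to a single page.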

Hadoop
Open-source version of MapReduce (and related systems, e.g. HDFS). Begun in 2005 (Cutting + Cafarella), supported by Yahoo!. Stable enough for large scale around 2008; source code released 2009. Written in Java (Google's MapReduce is in C++). Led to widespread adoption in industry and academia!

Rounds
Many algorithms are iterative, especially in machine learning / data mining: Lloyd's algorithm for k-means, gradient descent, singular value decomposition. These may require on the order of log_2 n rounds, and log_2(n = 1 billion) ≈ 30.
MapReduce puts rounds at a premium: Hadoop can have a several-minute delay between rounds, since each round writes to HDFS for resiliency (the same holds in Google's MapReduce).
The MRC model (Karloff, Suri, Vassilvitskii; SODA 2010) stresses the number of rounds.
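To see why rounds matter, here is one Lloyd's (k-means) iteration phrased as a single Map+Reduce round in the same toy Python sketch (again using the hypothetical run_round helper; the points and centers are made up). Every pass of the outer loop is one full round, so tens of iterations mean tens of rounds.

def make_kmeans_map(centers):
    def kmeans_map(point_id, point):
        # Key each point by its nearest current center.
        x, y = point
        j = min(range(len(centers)),
                key=lambda i: (x - centers[i][0])**2 + (y - centers[i][1])**2)
        yield (j, point)
    return kmeans_map

def kmeans_reduce(center_id, pts):
    # New center = mean of the points assigned to it.
    n = len(pts)
    yield (center_id, (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n))

points = [(0, (0.0, 0.0)), (1, (0.0, 1.0)), (2, (10.0, 10.0)), (3, (10.0, 11.0))]
centers = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):                     # 5 iterations = 5 rounds
    new = dict(run_round(points, make_kmeans_map(centers), kmeans_reduce))
    centers = [new[i] for i in range(len(centers))]
print(centers)                         # converges to [(0.0, 0.5), (10.0, 10.5)]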

Bulk Synchronous Parallel
Les Valiant [1989]: BSP creates barriers in a parallel algorithm.
1. Each processor computes on its data.
2. Processors send/receive data.
3. Barrier: all processors wait for communication to end globally.
This allows for easy synchronization, and it is easier to analyze, since the model handles many messy synchronization details when it is emulated.
Reduction from MapReduce (Goodrich, Sitchinava, Zhang; ISAAC 2011).
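A rough single-machine sketch of the BSP superstep pattern, using Python threads and threading.Barrier to stand in for the global barrier; the ring-shaped communication and all names here are invented for illustration, not taken from the slides.

import threading

NUM_WORKERS = 4
SUPERSTEPS = 3
barrier = threading.Barrier(NUM_WORKERS)
inboxes = [[] for _ in range(NUM_WORKERS)]
lock = threading.Lock()
results = [0.0] * NUM_WORKERS

def worker(wid):
    value = float(wid)                      # each worker's local data
    for step in range(SUPERSTEPS):
        value = value * 2 + 1               # 1. local computation
        with lock:                          # 2. send to the next worker (ring)
            inboxes[(wid + 1) % NUM_WORKERS].append(value)
        barrier.wait()                      # 3. barrier: all sends finished
        with lock:
            value = sum(inboxes[wid])
            inboxes[wid].clear()
        barrier.wait()                      # all inboxes drained before next superstep
    results[wid] = value

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)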

Replication Rate
Consider a join of two sets R and S, each of size n = 10,000: list the pair (r, s) ∈ R × S if f(r, s) = 1, for some function f.
Option 1: create n² reducers, one per pair. Replication rate g = n.
Option 2: create 1 reducer. The reducer has size 2n and performs n² = 100 million operations. Replication rate g = 1, but no parallelism.
Option 3: split R and S into g groups of size n/g each and create g² reducers, one per pair of groups. Each reducer has size 2n/g and performs (n/g)² operations (with g = 10, only 1 million). Replication rate g.
(Afrati, Das Sarma, Salihoglu, Ullman 2013), (Beame, Koutris, Suciu 2013)
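A toy Python sketch of Option 3, hash-partitioning R and S into g groups so that each element is sent to g of the g² reducers; grouped_join and the example predicate are made up for illustration.

from itertools import product

def grouped_join(R, S, g, f):
    """Option 3: g*g reducers, replication rate g (toy single-machine sketch)."""
    reducers = {(i, j): ([], []) for i in range(g) for j in range(g)}
    # "Map": r goes to the g reducers (i, *) for its group i,
    #        s goes to the g reducers (*, j) for its group j.
    for r in R:
        i = hash(r) % g
        for j in range(g):
            reducers[(i, j)][0].append(r)      # r is replicated g times
    for s in S:
        j = hash(s) % g
        for i in range(g):
            reducers[(i, j)][1].append(s)      # s is replicated g times
    # "Reduce": each reducer sees about n/g elements from each set
    # and does about (n/g)^2 comparisons; each (r, s) pair meets exactly once.
    out = []
    for Ri, Sj in reducers.values():
        out.extend((r, s) for r, s in product(Ri, Sj) if f(r, s))
    return out

# Example: join numbers whose difference is divisible by 7.
R, S = range(0, 100), range(50, 150)
pairs = grouped_join(R, S, g=10, f=lambda r, s: (r - s) % 7 == 0)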

Sawzall / Dremel / Tenzing
Google's solutions to the many-to-few problem: compute statistics on massive distributed data, separating local computation from aggregation.

Better with Iteration

Berkeley Spark: Processing in Memory
Zaharia, Chowdhury, Das, Ma, McCauley, Franklin, Shenker, Stoica (HotCloud 2010, NSDI 2012). Keeps relevant information in memory, so it is much faster on iterative algorithms (machine learning, SQL queries), but requires careful work to retain resiliency. Key idea: RDDs (Resilient Distributed Datasets), which can be stored in memory without replication and rebuilt from their lineage if lost.
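As a hedged illustration of the programming model, a PySpark-style sketch assuming a local SparkContext and made-up input paths (corpus.txt, points.txt): a cached RDD stays in memory across iterations instead of being re-read each round, and its lineage of transformations is what lets Spark rebuild lost partitions without replication.

from pyspark import SparkContext

sc = SparkContext("local[*]", "spark-sketch")

# Word count as RDD transformations; the lineage
# textFile -> flatMap -> map -> reduceByKey records how to recompute it.
counts = (sc.textFile("corpus.txt")                  # assumed input path
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))

# Cache a dataset in memory and reuse it across iterations,
# instead of paying a full read from distributed storage per round.
points = (sc.textFile("points.txt")                  # assumed input path
            .map(lambda l: [float(x) for x in l.split()])
            .cache())
for _ in range(10):
    # ... one iteration of an algorithm such as Lloyd's k-means would go here,
    # reusing `points` from memory each time ...
    pass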