MapReduce. Stony Brook University CSE545, Fall 2016


Classical Data Mining [diagram: a single machine with CPU, Memory (64 GB), and Disk]

IO Bounded
Reading a word from disk versus main memory: ~10^5 slower!
Reading many contiguously stored words is faster per word, but even fast modern disks only reach about 150 MB/s for sequential reads.
IO Bound: the biggest performance bottleneck is reading / writing to disk (this starts to matter around 100 GB of data: ~10 minutes just to read it).

Classical Big Data Analysis [diagram: CPU, Memory, Disk]
Often focused on efficiently utilizing the disk, e.g. Apache Lucene / Solr.
Still IO bounded when needing to process all of a large file.

IO Bound: How to solve?

Distributed Architecture (Cluster) [diagram: a top-level switch (~10 Gbps) connects racks; within each rack, a ~1 Gbps switch connects the nodes, each with its own CPU, memory, and disk]

Distributed Architecture (Cluster)
In reality, modern setups often have multiple CPUs and disks per server, but we will model them as if there were one machine per CPU-disk pair. [diagram: nodes connected by a ~1 Gbps switch, each modeled as one CPU, memory, and disk]

Challenges for IO Cluster Computing
1. Nodes fail: 1 in 1000 nodes fails per day. => Duplicate the data.
2. The network is a bottleneck: typically 1-10 Gb/s throughput. => Bring computation to the nodes, rather than data to the nodes.
3. Traditional distributed programming is often ad-hoc and complicated. => Stipulate a programming system that can easily be distributed.
MapReduce to the rescue!

Common Characteristics of Big Data
- Large files (i.e. >100 GB to TBs)
- No need to update in place (append preferred)
- Reads are most common (e.g. web logs, social media, product purchases)

Distributed File System (e.g. Apache Hadoop DFS, GoogleFS) [diagram: two different files, C and D, split into chunks spread across chunk servers 1 through n] (Leskovec et al., 2014; http://www.mmds.org/)

Components of a Distributed File System
- Chunk servers (on Data Nodes)
  - File is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk is replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Name node (aka master node)
  - Stores metadata about where files are stored
  - Might be replicated or distributed across data nodes
- Client library for file access
  - Talks to the master to find chunk servers
  - Connects directly to chunk servers to access data
(Leskovec et al., 2014; http://www.mmds.org/)
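As a toy illustration of what the name node tracks, here is a minimal sketch of chunked, replicated file metadata; CHUNK_SIZE_MB, REPLICATION, and name_node_metadata are invented names for this sketch, not part of HDFS or GFS.

CHUNK_SIZE_MB = 64        # chunks are typically 16-64 MB
REPLICATION = 3           # each chunk stored 2x or 3x

def name_node_metadata(file_name, file_size_mb, chunk_servers):
    # Toy model of name-node metadata: for each chunk of a file,
    # record which chunk servers hold a replica.
    num_chunks = -(-file_size_mb // CHUNK_SIZE_MB)    # ceiling division
    metadata = {}
    for i in range(num_chunks):
        # Rotate through the servers so the replicas of a chunk land on
        # different servers (a real DFS would also prefer different racks).
        metadata[(file_name, i)] = [chunk_servers[(i + r) % len(chunk_servers)]
                                    for r in range(REPLICATION)]
    return metadata

servers = ["rack1/server1", "rack2/server1", "rack3/server1", "rack1/server2"]
print(name_node_metadata("weblog.txt", 200, servers))
# e.g. {('weblog.txt', 0): ['rack1/server1', 'rack2/server1', 'rack3/server1'], ...}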

Challenges for IO Cluster Computing
1. Nodes fail: 1 in 1000 nodes fails per day. => Duplicate the data (Distributed FS).
2. The network is a bottleneck: typically 1-10 Gb/s throughput. => Bring computation to the nodes, rather than data to the nodes.
3. Traditional distributed programming is often ad-hoc and complicated. => Stipulate a programming system that can easily be distributed.

What is MapReduce?
1. A style of programming: input chunks => map tasks | group_by keys | reduce tasks => output
   (| is the Linux pipe symbol: it passes stdout from the first process to stdin of the next.)
   E.g. counting words: tokenize(document) | sort | uniq -c
2. A system that distributes MapReduce-style programs across a distributed file system (e.g. Google's internal MapReduce, or apache.hadoop.mapreduce with HDFS).
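For reference, here is a minimal single-machine sketch of that counting pipeline in Python; the tokenize helper is an assumption (the slides never define it), implemented here as a simple lowercase word splitter.

import re
from collections import Counter

def tokenize(document):
    # Simple stand-in tokenizer: lowercase the text and split on non-word characters.
    return re.findall(r"\w+", document.lower())

# Rough equivalent of: tokenize(document) | sort | uniq -c
document = "the quick brown fox jumps over the lazy dog the end"
counts = Counter(tokenize(document))       # count each token
for word, n in sorted(counts.items()):     # sorted output, like `sort | uniq -c`
    print(n, word)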

What is MapReduce?
Map: extract what you care about.
Sort and shuffle.
Reduce: aggregate, summarize.

What is MapReduce? (Leskovec et al., 2014; http://www.mmds.org/)

The Map Step (Leskovec et al., 2014; http://www.mmds.org/)

The Reduce Step (Leskovec et al., 2014; http://www.mmds.org/)

What is MapReduce?
Map: (k, v) -> (k', v')*   (written by programmer)
Group by key: (k1, v1), (k2, v2), ... -> (k1, (v1, v2, ...)), (k2, (v1, v2, ...)), ...   (system handles)
Reduce: (k, (v1, v2, ...)) -> (k', v')*   (written by programmer)
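To make these signatures concrete, here is a minimal single-machine sketch of the computational model (not of the distributed system); run_mapreduce and its signature are illustrative only, not part of Hadoop or any real MapReduce API.

from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    # map_fn(k, v)     : yields zero or more (k', v') pairs   (written by programmer)
    # reduce_fn(k, vs) : returns/yields the output for key k  (written by programmer)
    # inputs           : iterable of (k, v) pairs, e.g. one per chunk
    pairs = []
    for k, v in inputs:                 # map phase
        pairs.extend(map_fn(k, v))

    groups = defaultdict(list)          # group by key ("sort and shuffle"),
    for k, v in pairs:                  # which the system normally handles
        groups[k].append(v)

    return [(k, reduce_fn(k, vs))       # reduce phase
            for k, vs in groups.items()]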

Example: Word Count
tokenize(document) | sort | uniq -c
Map: extract what you care about. Sort and shuffle. Reduce: aggregate, summarize.

Example: Word Count (Leskovec et al., 2014; http://www.mmds.org/)


Example: Word Count (version 1)

def map(k, v):
    # v is the text of one chunk: emit (word, 1) for every token
    for w in tokenize(v):
        yield (w, 1)

def reduce(k, vs):
    # k is a word; vs is the list of 1s emitted for it
    return len(vs)

Example: Word Count (version 2)

def map(k, v):
    # counts each word within the chunk
    # (try/except is faster than "if w in counts")
    counts = dict()
    for w in tokenize(v):
        try:
            counts[w] += 1
        except KeyError:
            counts[w] = 1
    for item in counts.items():
        yield item

def reduce(k, vs):
    # sum of counts from the different chunks
    return sum(vs)
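For illustration, the version 2 functions can be run through the run_mapreduce sketch from above; the chunk contents are invented, and tokenize is the stand-in helper sketched earlier (note that the slide's names map and reduce shadow Python built-ins, which is harmless inside a small script).

chunks = [
    ("chunk1", "the quick brown fox"),
    ("chunk2", "the lazy dog and the quick fox"),
]
# Each chunk is mapped independently; per-chunk counts are then summed per word.
for word, count in sorted(run_mapreduce(map, reduce, chunks)):
    print(word, count)   # e.g. "the 3", "quick 2", "fox 2", ...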

Example: Relational Algebra
Select, Project, Union, Intersection, Difference, Natural Join, Grouping

Example: Relational Algebra (Select)
R(A1, A2, A3, ...): relation R with attributes A_i.
Return only those attribute tuples of R for which condition C is true.

def map(k, v):
    # v is a list of attribute tuples
    for t in v:
        if C(t):          # t satisfies condition C
            yield (t, t)

def reduce(k, vs):
    for v in vs:
        yield (k, v)
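A hypothetical run through the run_mapreduce sketch, selecting tuples whose second attribute exceeds 10; the relation contents and the condition C are made up, and the functions are renamed select_map / select_reduce only to avoid clashing with the word-count ones.

R = [("a", 5), ("b", 12), ("c", 30)]    # toy relation with attributes (A1, A2)
C = lambda t: t[1] > 10                 # hypothetical selection condition

def select_map(k, v):                   # same logic as the slide's map
    for t in v:
        if C(t):
            yield (t, t)

def select_reduce(k, vs):               # same logic as the slide's reduce
    for v in vs:
        yield (k, v)

result = run_mapreduce(select_map, select_reduce, [("R", R)])
print([k for k, _ in result])           # selected tuples: [('b', 12), ('c', 30)]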

Example: Relational Algebra (Natural Join)
Given R1 and R2, return R_join: the union of all pairs of tuples that match on the shared attributes.

def map(k, v):
    # v is (R1 = (A, B), R2 = (B, C)); B is the matched attribute
    for (a, b) in R1:
        yield (b, ('R1', a))
    for (b, c) in R2:
        yield (b, ('R2', c))

def reduce(k, vs):
    r1, r2 = [], []
    for (S, x) in vs:              # separate values by source relation
        if S == 'R1':
            r1.append(x)
        else:
            r2.append(x)
    for a in r1:                   # join: one output tuple per pairing
        for c in r2:
            yield ('R_join', (a, k, c))    # k is the shared attribute B
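A hypothetical run with tiny, invented relations, again via the run_mapreduce sketch. As in the slide, the map closes over R1 and R2 directly rather than reading them from its v argument, so the single input pair below is just a placeholder.

R1 = [("alice", 1), ("bob", 2)]                            # tuples (A, B)
R2 = [(1, "engineering"), (2, "sales"), (2, "support")]    # tuples (B, C)

def join_map(k, v):
    for (a, b) in R1:
        yield (b, ("R1", a))
    for (b, c) in R2:
        yield (b, ("R2", c))

def join_reduce(k, vs):
    r1 = [x for (S, x) in vs if S == "R1"]
    r2 = [x for (S, x) in vs if S == "R2"]
    return [(a, k, c) for a in r1 for c in r2]     # joined tuples (A, B, C)

for key, tuples in run_mapreduce(join_map, join_reduce, [(None, None)]):
    for t in tuples:
        print(t)
# ('alice', 1, 'engineering'), ('bob', 2, 'sales'), ('bob', 2, 'support')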

Data Flow

Data Flow: In Parallel [figure: chunks are processed by map tasks in parallel; a hash of each key assigns it to a reduce task; the Map and Reduce steps are programmed, the shuffle is handled by the system] (Leskovec et al., 2014; http://www.mmds.org/)

Data Flow: DFS => Map => Map's Local FS => Reduce => DFS

Data Flow
The MapReduce system handles:
- Partitioning
- Scheduling map / reducer execution
- Group by key
- Restarts from node failures
- Inter-machine communication

Data Flow: DFS => MapReduce => DFS
- Schedule map tasks near the physical storage of their chunk.
- Intermediate results are stored locally.
- The Master / Name Node coordinates:
  - Task status: idle, in-progress, complete
  - Receives the locations of intermediate results and schedules them with reducers
  - Checks nodes for failures and restarts tasks when necessary
    - All map tasks on a failed node must be completely restarted
    - Reduce tasks only need to be restarted if the reduce task itself failed (completed reduce output is already in the DFS)
- Jobs can be chained: DFS => MapReduce => DFS => MapReduce => DFS

Data Flow
Skew: the degree to which certain tasks end up taking much longer than others. Handled with:
- More reducers (keys) than reduce tasks, so each task averages over many keys.
- More reduce tasks than nodes, so faster nodes can take on extra tasks.

Data Flow
Key Question: How many Map and Reduce tasks?
M: number of map tasks, R: number of reduce tasks
A: If possible, one chunk per map task, with:
- M >> number of nodes (better handling of node failures, better load balancing)
- R < M (reduces the number of files stored in DFS)

Communication Cost Model
How to assess performance?
(1) Computation: Map + Reduce + System Tasks
- Mappers and reducers are often single pass, O(n) within a node.
- System: sorting the keys is usually the most expensive part.
- In any case, you can add more nodes.
(2) Communication: Moving (key, value) pairs
- Often dominates computation.
- Connection speeds: 1-10 gigabits per sec; HD read: 50-150 megabytes per sec.
- Even reading from disk to memory typically takes longer than operating on the data.
Communication Cost = input size + (sum of the sizes of all map-to-reducer files)
- Output from the reducer is ignored because it is either small (it has finished summarizing the data) or being passed to another MapReduce job.
Ultimate Goal: wall-clock time.

Example: Natural Join
R1, R2: Relations (Tables)
Communication Cost = input size + (sum of the sizes of all map-to-reducer files)
                   = (|R1| + |R2|) + (|R1| + |R2|)
                   = O(|R1| + |R2|)
For instance, if |R1| = 200 GB and |R2| = 100 GB, the map tasks read 300 GB of input and emit roughly 300 GB of (key, value) pairs to the reducers, for a total communication cost of about 600 GB.

Exercise: Calculate Communication Cost for Matrix Multiplication with One MapReduce Step (see MMDS section 2.3.10)

Last Notes: Performance Refinements
- Backup tasks (aka speculative tasks): schedule multiple copies of the remaining tasks when the job is close to the end, to mitigate certain nodes running slow.
- Combiners (like word count version 2): do some reducing within map before passing data to reduce; reduces communication cost.
- Override the partition hash function: e.g. instead of hash(url), use hash(hostname(url)) so that all URLs from the same host end up at the same reducer (see the sketch below).
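As a sketch of that last refinement, here is what such a partition function might look like; partition_by_host and num_reduce_tasks are illustrative names, not part of any real MapReduce API.

from urllib.parse import urlparse

def partition_by_host(key, num_reduce_tasks):
    # Route every URL from the same host to the same reduce task,
    # instead of hashing the full URL.
    # (A real system would use a stable hash; Python's built-in hash of
    # strings is randomized per process.)
    host = urlparse(key).netloc
    return hash(host) % num_reduce_tasks

# Both URLs land on the same reduce task:
print(partition_by_host("http://example.com/a", 8))
print(partition_by_host("http://example.com/b", 8))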