Algorithms for MapReduce: Combiners, Partition and Sort, Pairs vs Stripes

Assignment 1 released. Due 16:00 on 20 October. Correctness is not enough! Most marks are for efficiency.

Combining, Sorting, and Partitioning... and algorithms exploiting these options.

Important: learn and apply optimization tricks.
Less important: these specific examples.

Last lecture: hash table has unbounded size

#!/usr/bin/python3
import sys

def spill(cache):
    for word, count in cache.items():
        print(word + "\t" + str(count))

cache = {}
for line in sys.stdin:
    for word in line.split():
        cache[word] = cache.get(word, 0) + 1
spill(cache)

Solution: bounded size

#!/usr/bin/python3
import sys

def spill(cache):
    for word, count in cache.items():
        print(word + "\t" + str(count))

cache = {}
for line in sys.stdin:
    for word in line.split():
        cache[word] = cache.get(word, 0) + 1
        if len(cache) >= 10:  # Limit 10 entries
            spill(cache)
            cache.clear()
spill(cache)

Combiners

Combiners formalize the local aggregation we just did:

[Diagram: on each map machine, the Mapper's output flows through a Combiner before being written to local disk]

Specifying a Combiner

Hadoop has built-in support for combiners:

hadoop jar hadoop-streaming-2.7.3.jar          # Run Hadoop
    -files count_map.py,count_reduce.py        # Copy to workers
    -input /data/assignments/ex1/websmall.txt  # Read text file
    -output /user/$user/combined               # Write here
    -mapper count_map.py                       # Simple mapper
    -combiner count_reduce.py                  # Combiner sums
    -reducer count_reduce.py                   # Reducer sums

How is this implemented?

Mapper's Initial Sort

[Diagram: Map → Partition (aka shard, assigns the destination reducer) → RAM buffer (remember what fits in RAM) → Sort (sort the batch in RAM) → Combine (optional combiner) → Disk]

Merge Sort

When the mapper runs out of RAM, it spills to disk, producing chunks of sorted data called spills. Mappers merge their spills into one file per reducer. Reducers merge input from multiple mappers.

Example of the combiner running during a merge:

    Spill 0:  a 3, c 4, d 2
    Spill 1:  a 5, b 9, c 6
    Combined: a 8, b 9, c 10, ...
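To make the mechanism concrete, here is a minimal local sketch of a sort-based combine over two already-sorted spills (the variable names and data are hypothetical, matching the example above):

#!/usr/bin/python3
# Merge already-sorted spills and combine equal keys by summing.
import heapq
from itertools import groupby

spill0 = [("a", 3), ("c", 4), ("d", 2)]
spill1 = [("a", 5), ("b", 9), ("c", 6)]

# heapq.merge streams both spills in sorted order without loading everything;
# groupby then clusters runs of equal keys so each run can be summed.
merged = heapq.merge(spill0, spill1, key=lambda kv: kv[0])
for word, run in groupby(merged, key=lambda kv: kv[0]):
    print(word, sum(count for _, count in run))
# Prints: a 8, b 9, c 10, d 2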

Combiner Summary

Combiners optimize merge sort and reduce network traffic. They may run in:
- Mapper initial sort
- Mapper merge
- Reducer merge

Combiner FAQ

Hadoop might not run your combiner at all! Combiners will see a mix of mapper and combiner output. Hadoop won't partition or sort combiner output again, so don't change the key.

Combiner Efficiency: Sort vs Hash Table

Hadoop sorts before combining, so duplicate keys are sorted before they are merged: slow. Our in-mapper implementation used a hash table instead, which also reduces the Java ↔ Python overhead. In-mapper combining is usually faster, but we'll let you use either one.

Problem: Averaging

We're given temperature readings from cities:

    Key             Value
    San Francisco   22
    Edinburgh       14
    Los Angeles     23
    Edinburgh       12
    Edinburgh       9
    Los Angeles     21

Find the average temperature in each city.

Map: (city, temperature) → (city, temperature)
Reduce: Count, sum temperatures, and divide.

Problem: Averaging (attempt)

Same readings as above. Find the average temperature in each city.

Map: (city, temperature) → (city, temperature)
Combine: Same as reducer?
Reduce: Count, sum temperatures, and divide.

Problem: Averaging (solution)

No: an average of averages is not the overall average, so the combiner cannot simply be the reducer. Emit counts so that partial results can be merged:

Map: (city, temperature) → (city, (count = 1, temperature))
Combine: Sum the count and temperature fields.
Reduce: Sum counts, sum temperatures, and divide.
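A streaming-style sketch of this design (a sketch under assumptions: tab-separated fields, and input to the combiner/reducer already grouped by city; this is not the course's reference solution). The mapper would simply print the city, a count of 1, and the temperature; the script below then serves as either combiner (keep counts and sums) or reducer (divide at the end):

#!/usr/bin/python3
# Combiner/reducer sketch for averaging: merge (city, count, sum) triples.
import sys
from itertools import groupby

def parse(stream):
    for line in stream:
        city, count, total = line.rstrip("\n").split("\t")
        yield city, int(count), float(total)

def merge(stream, final):
    # Input is sorted by city, so records for a city are adjacent.
    for city, rows in groupby(parse(stream), key=lambda t: t[0]):
        rows = list(rows)
        count = sum(r[1] for r in rows)
        total = sum(r[2] for r in rows)
        if final:
            print(city + "\t" + str(total / count))  # reducer: divide once
        else:
            print(city + "\t" + str(count) + "\t" + str(total))  # combiner

if __name__ == "__main__":
    merge(sys.stdin, final=("--final" in sys.argv))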

Pattern: Combiners

Combiners reduce communication by aggregating locally. Often they are the same as the reducer (e.g. summing)... but not always (e.g. averaging).

Custom Partitioner and Sorting Function

Mapper's Initial Sort (revisited)

[Diagram as before: Map → Partition (now a custom partitioner) → RAM buffer → Sort (now a custom sort function) → Combine → Disk]

Problem: Comparing Output

Alice's word counts: a 20, hi 2, i 13, the 31, why 12
Bob's word counts:   a 20, why 12, hi 2, i 13, the 31

The two outputs list the same words in different orders. How do we check that they match?

Problem: Comparing Output

[Diagram: Map sends each word from Alice's and Bob's counts to a consistent place, a reducer, so that matching entries meet: a 20 | a 20, hi 2 | hi 2, the 31 | the 31, i 13 | i 13, why 12 | why 12. Within a reducer, Alice's and Bob's values arrive unordered.]

Send words to a consistent place: reducers.

Comparing Output Detail

Map: (word, count) → (word, student, count) [1]
Partition: by word.
Sort: by (word, student).
Reduce: Verify both values are present and match. Deduct marks from Alice/Bob as appropriate.

Exploit the sort to control input order.

[1] The mapper can tell Alice and Bob apart by the input file name.
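In Hadoop Streaming, one plausible way to get "partition by word, sort by (word, student)" is the key-field partitioner: declare the first two fields as the key, but partition on the first field only, so the sort still covers the whole key. This is a sketch: the script names are hypothetical, and the option names should be checked against your Hadoop version.

hadoop jar hadoop-streaming-2.7.3.jar
    -D stream.num.map.output.key.fields=2                # key = (word, student)
    -D mapreduce.partition.keypartitioner.options=-k1,1  # partition on word only
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
    -files compare_map.py,compare_reduce.py
    -mapper compare_map.py
    -reducer compare_reduce.py
    -input ... -output ...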

Problem: Comparing Output (with the custom sort)

[Diagram as before, but now within each reducer the values for a word arrive in a fixed order: Alice's and Bob's entries are ordered by the (word, student) sort.]

Send words to a consistent place: reducers.

Pattern: Exploit the Sort

Without a custom sort: the reducer buffers all students in RAM, so it might run out of RAM.
With a custom sort: the TA appears first, and the reducer streams through the students. Constant reducer memory.
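A minimal reducer sketch under this custom sort, assuming lines of the form word<TAB>student<TAB>count, sorted by (word, student), with a student label "TA" that sorts before the others (the input format and the "TA" label are assumptions):

#!/usr/bin/python3
# Streams through records sorted by (word, student) in constant memory:
# the TA's reference count arrives first, so only one value is held per word.
import sys
from itertools import groupby

def records(stream):
    for line in stream:
        word, student, count = line.rstrip("\n").split("\t")
        yield word, student, int(count)

for word, group in groupby(records(sys.stdin), key=lambda r: r[0]):
    reference = None
    for _, student, count in group:
        if student == "TA":  # uppercase sorts before lowercase names
            reference = count
        elif reference is None or count != reference:
            print(word + "\t" + student + "\tMISMATCH")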

Problem: Word Co-occurrence

Count pairs of words that appear in the same line.

First try: pairs

Each mapper takes a sentence:
- Generate all co-occurring term pairs
- For all pairs, emit ((a, b), count)

Reducers sum up the counts associated with these pairs. Use combiners!

Pairs: pseudo-code

class Mapper
    method map(docid a, doc d)
        for all w in d do
            for all u in neighbours(w) do
                emit(pair(w, u), 1)

class Reducer
    method reduce(pair p, counts [c1, c2, ...])
        sum = 0
        for all c in [c1, c2, ...] do
            sum = sum + c
        emit(p, sum)
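A runnable streaming-style sketch of the pairs mapper (a sketch, not the course's reference code; it treats each line as a sentence and every other word on the line as a neighbour):

#!/usr/bin/python3
# Pairs mapper sketch: emit ((w, u), 1) for every co-occurring ordered pair.
import sys

for line in sys.stdin:
    words = line.split()
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                print(w + "," + u + "\t1")  # key "w,u", count 1

The summing reducer is the same one used for word count, which is also why combiners apply unchanged.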

Analysing pairs

Advantages:
- Easy to implement, easy to understand

Disadvantages:
- Lots of pairs to sort and shuffle around (upper bound?)
- Not many opportunities for combiners to work

Another try: stripes

Idea: group pairs together into an associative array:

    (a, b) 1
    (a, c) 2
    (a, d) 5    →    a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
    (a, e) 3
    (a, f) 2

Each mapper takes a sentence:
- Generate all co-occurring term pairs
- For each term w, emit w → { b: count_b, c: count_c, d: count_d, ... }

Reducers perform an element-wise sum of the associative arrays:

      a → { b: 1,       d: 5, e: 3 }
    + a → { b: 1, c: 2, d: 2,       f: 2 }
    = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }

A cleverly-constructed data structure brings together partial results.

Stripes: pseudo-code

class Mapper
    method map(docid a, doc d)
        for all w in d do
            H = associative_array(string → integer)
            for all u in neighbours(w) do
                H[u]++
            emit(w, H)

class Reducer
    method reduce(term w, stripes [H1, H2, ...])
        Hf = associative_array(string → integer)
        for all H in [H1, H2, ...] do
            sum(Hf, H)   // sum same-keyed entries
        emit(w, Hf)
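A self-contained local demo of the stripes pattern (illustrative only; the input lines are made up, and collections.Counter stands in for the associative array):

#!/usr/bin/python3
# Local stripes demo: build one stripe per word per line (the map step),
# then element-wise-sum the stripes for each word (the reduce step).
from collections import Counter, defaultdict

lines = ["a b c b", "a d b"]

summed = defaultdict(Counter)
for line in lines:
    words = line.split()
    for i, w in enumerate(words):
        stripe = Counter(u for j, u in enumerate(words) if j != i)  # map: one stripe
        summed[w].update(stripe)  # reduce: element-wise sum of stripes

for w in sorted(summed):
    print(w, dict(summed[w]))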

Stripes analysis

Advantages:
- Far less sorting and shuffling of key-value pairs
- Can make better use of combiners

Disadvantages:
- More difficult to implement
- Underlying object is more heavyweight
- Fundamental limitation in terms of the size of the event space

Pairs vs stripes: experimental setup. Cluster size: 38 cores. Data source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed).
