MapReduce Algorithm Design


Contents
- Combiner and in-mapper combining
- Complex keys and values
- Secondary sorting

Combiner and in-mapper combining

Purpose:
- Carry out local aggregation before the shuffle-and-sort phase
- Reduce the communication volume between the map and reduce stages

Combiner example (word count)
- Use the reducer as the combiner
- Integer addition is both associative and commutative
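A minimal sketch of this reuse, assuming the standard Hadoop word-count reducer (class and method names here are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountCombining {

  // Sums counts for a word; associativity and commutativity of
  // addition make this safe to run map-side as a combiner.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  static void configure(Job job) {
    job.setReducerClass(IntSumReducer.class);
    job.setCombinerClass(IntSumReducer.class);  // same class, reused as combiner
  }
}

Because each combine invocation only pre-sums a subset of a key's counts, running it zero, one, or many times leaves the final reducer output unchanged.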

MapReduce with Combiner

map: (k1, v1) -> [(k2, v2)]
combine: (k2, [v2]) -> [(k2, v2)]
reduce: (k2, [v2]) -> [(k3, v3)]

The combiner input and output key-value types must match the mapper output key-value types.

Combiner is an optimization, not a requirement
- The combiner is optional: a particular MapReduce implementation may choose to execute the combine method many times or not at all
- Calling the combine method zero, one, or many times should produce the same output from the reducer
- The correctness of a MapReduce program should not rely on the assumption that the combiner is always carried out

In-method combining

Use a Java Map to implement the associative array
- A Map is an object that maps keys to values; a map cannot contain duplicate keys, and each key can map to at most one value
- The Java platform contains three general-purpose Map implementations: HashMap, TreeMap, and LinkedHashMap
- Basic operations:
  - boolean containsKey(Object key)
  - V get(Object key)
  - V put(K key, V value)
  - Set<K> keySet(): returns a Set view of the keys contained in this map

In-mapper combining
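A minimal sketch of an in-mapper combining word-count mapper, assuming Hadoop's mapreduce API, TextInputFormat input, and whitespace tokenization (class names are illustrative); it uses a plain HashMap as the associative array from the previous slide:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Per-task associative array: word -> local count
  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        counts.merge(word, 1, Integer::sum);  // local aggregation, no emit yet
      }
    }
  }

  // Runs once per task, after the last input record: emit the totals.
  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}

Because the HashMap persists across map() calls and is flushed in cleanup(), the aggregation spans the whole input split; combining inside a single method call would only aggregate within one record.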

Demo and some online resources
- Demo on in-method and in-mapper combining
- Online resources:
  - https://vangjee.wordpress.com/2012/03/07/the-in-mapper-combining-design-pattern-for-mapreduce-programming/
  - codingjunkie.net: a blog by Bill Bejeck, who provides a lot of MapReduce-on-Hadoop source code

Complex keys and values
- Both keys and values can be complex data structures (pairs, stripes)
- Serialization and deserialization for complex structures:
  - After the map stage, structures need to be serialized to be written to storage
  - Complex keys and values need to be deserialized on the reducer side

Motivation
An example: compute the mean of the values associated with each key

With combiner

Revised mapper
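A minimal sketch of the revised design, assuming the standard fix of emitting (sum, count) pairs, which a combiner can safely pre-aggregate (partial means cannot simply be averaged when group sizes differ). Encoding the pair as Text "sum,count" is a simplification for brevity; a real job would typically define a custom pair Writable:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanExample {

  // Revised mapper: emit (key, "value,1") so every partial result
  // carries both a sum and a count. Assumes "key<TAB>integer" lines.
  public static class MeanMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      ctx.write(new Text(f[0]), new Text(f[1] + ",1"));
    }
  }

  // Combiner: sums partial (sum, count) pairs; safe to run zero or
  // more times because pairwise addition is associative and commutative.
  public static class MeanCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (Text v : values) {
        String[] p = v.toString().split(",");
        sum += Long.parseLong(p[0]);
        count += Long.parseLong(p[1]);
      }
      ctx.write(key, new Text(sum + "," + count));  // still a pair, not a mean
    }
  }

  // Reducer: only here is the division performed.
  public static class MeanReducer
      extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (Text v : values) {
        String[] p = v.toString().split(",");
        sum += Long.parseLong(p[0]);
        count += Long.parseLong(p[1]);
      }
      ctx.write(key, new DoubleWritable((double) sum / count));
    }
  }
}

Registering MeanCombiner as the combiner is now safe because summing (sum, count) pairs is associative and commutative; only the final reducer divides.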

In-mapper combining

A running example
- Build word co-occurrence matrices for large corpora, e.g., the word co-occurrence matrix for all the works of Shakespeare
- Co-occurrence within a specific context:
  - a sentence
  - a paragraph
  - a document
  - a window of m words

Pairs
- In each map task, for each unique pair (a, b), emit [pair (a, b), count]
- Use a combiner or in-mapper combining to reduce the volume of intermediate pairs
- Reducers sum up the counts associated with these pairs

Pairs approach
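A minimal sketch of the pairs mapper, assuming co-occurrence means appearing in the same input line; encoding the pair as a Text key "a,b" is a simplification (a custom WritableComparable pair is the usual choice):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairsMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(Object key, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] words = line.toString().toLowerCase().split("\\W+");
    for (String a : words) {
      for (String b : words) {
        if (a.isEmpty() || b.isEmpty() || a.equals(b)) continue;
        ctx.write(new Text(a + "," + b), ONE);  // emit ((a, b), 1)
      }
    }
  }
}

The reducer is a plain integer sum (the word-count reducer works unchanged), and the same class can be registered as the combiner.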

Stripes
- Idea: group pairs together into an associative array:
    (a, b) 1; (a, c) 2; (a, d) 5; (a, e) 3; (a, f) 2
  becomes
    a -> { b: 1, c: 2, d: 5, e: 3, f: 2 }
- Mappers emit [word, associative array]
- Reducers perform an element-wise sum of associative arrays:
      a -> { b: 1, d: 5, e: 3 }
    + a -> { b: 1, c: 2, d: 2, f: 2 }
    = a -> { b: 2, c: 2, d: 7, e: 3, f: 2 }

Stripes approach
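A minimal sketch of the stripes mapper and reducer, using Hadoop's built-in MapWritable as the associative array, under the same same-line co-occurrence assumption as the pairs sketch:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Stripes {

  public static class StripesMapper
      extends Mapper<Object, Text, Text, MapWritable> {
    @Override
    protected void map(Object key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] words = line.toString().toLowerCase().split("\\W+");
      for (String a : words) {
        if (a.isEmpty()) continue;
        MapWritable stripe = new MapWritable();  // neighbors of a
        for (String b : words) {
          if (b.isEmpty() || a.equals(b)) continue;
          Text neighbor = new Text(b);
          IntWritable cnt = (IntWritable) stripe.get(neighbor);
          stripe.put(neighbor, new IntWritable(cnt == null ? 1 : cnt.get() + 1));
        }
        if (!stripe.isEmpty()) ctx.write(new Text(a), stripe);
      }
    }
  }

  // Element-wise sum of stripes; also usable as the combiner.
  public static class StripesReducer
      extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text key, Iterable<MapWritable> stripes, Context ctx)
        throws IOException, InterruptedException {
      MapWritable sum = new MapWritable();
      for (MapWritable stripe : stripes) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          IntWritable cur = (IntWritable) sum.get(e.getKey());
          int add = ((IntWritable) e.getValue()).get();
          sum.put(e.getKey(), new IntWritable(cur == null ? add : cur.get() + add));
        }
      }
      ctx.write(key, sum);
    }
  }
}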

Relative frequency
- What proportion of the time does B appear in the context of A? Whenever there is a co-occurrence (A, *), for what percentage is * equal to B?
- The total count of co-occurrences (A, *) is called the marginal:
    f(B|A) = count(A, B) / count(A, *) = count(A, B) / Σ_B' count(A, B')
- Example (the reducer holds the marginal in memory):
    (A, *)   32
    (A, B1)   3   ->  3/32
    (A, B2)  12   -> 12/32
    (A, B3)   7   ->  7/32
    (A, B4)   1   ->  1/32
    (A, B5)   4   ->  4/32
    (A, B6)   5   ->  5/32

Relative frequency with stripes
- It is easy: one pass over the stripe computes the marginal count(A, *), and another pass directly computes f(B|A)
- May have scalability issues for really large data: the final associative array holds all the neighbors of word A and their co-occurrence counts
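A minimal sketch of those two passes inside the stripes reducer, reusing the MapWritable encoding from the stripes sketch; both passes run over the summed stripe, since Hadoop's value iterable can only be traversed once:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class RelFreqStripesReducer
    extends Reducer<Text, MapWritable, Text, MapWritable> {
  @Override
  protected void reduce(Text key, Iterable<MapWritable> stripes, Context ctx)
      throws IOException, InterruptedException {
    // Element-wise sum, as in the plain stripes reducer.
    MapWritable sum = new MapWritable();
    for (MapWritable stripe : stripes) {
      for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
        IntWritable cur = (IntWritable) sum.get(e.getKey());
        int add = ((IntWritable) e.getValue()).get();
        sum.put(e.getKey(), new IntWritable(cur == null ? add : cur.get() + add));
      }
    }
    // Pass 1: the marginal count(A, *) is the sum over the stripe.
    long marginal = 0;
    for (Writable v : sum.values()) marginal += ((IntWritable) v).get();
    // Pass 2: divide each count by the marginal.
    MapWritable freqs = new MapWritable();
    for (Map.Entry<Writable, Writable> e : sum.entrySet()) {
      double f = ((IntWritable) e.getValue()).get() / (double) marginal;
      freqs.put(e.getKey(), new DoubleWritable(f));
    }
    ctx.write(key, freqs);
  }
}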

Relative frequency with pairs
- Must emit an extra (A, *) for every (A, B) in the mapper (B can be any word)
- Must make sure all pairs (A, *) and (A, B) get sent to the same reducer (use a custom partitioner)
- Must make sure (A, *) comes first (define the sort order)
- Must hold state in the reducer across different key-value pairs
- This pattern is called order inversion
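A minimal sketch of the partitioner piece of order inversion, assuming the pair key is encoded as Text "left,right" as in the pairs sketch above; partitioning on the left word alone sends (A, *) and every (A, B) to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable count, int numReducers) {
    // Hash only the left element so (A, *) and all (A, B) co-locate.
    String left = key.toString().split(",", 2)[0];
    return (left.hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}

With this illustrative encoding, the byte for '*' sorts before any letter, so under the default Text ordering the (A, *) marginal reaches the reducer before the (A, B) pairs; the reducer stores the marginal in memory and divides each subsequent count by it.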

Secondary sorting
A motivating example: the readings of m sensors are recorded over time.

Input, ordered by time:
  (t1, m1, r80521), (t1, m2, r14209), (t1, m3, r76042),
  (t2, m1, r21823), (t2, m2, r66508), (t2, m3, r98347)

Desired output, grouped by sensor:
  (m1, t1, r80521), (m1, t2, r21823),
  (m2, t1, r14209), (m2, t2, r66508),
  (m3, t1, r76042), (m3, t2, r98347)

First approach
- Map each record to an intermediate pair keyed by sensor id:
    (t1, m1, r80521) -> (m1, (t1, r80521))
    (t1, m2, r14209) -> (m2, (t1, r14209))
    (t1, m3, r76042) -> (m3, (t1, r76042))
    (t2, m1, r21823) -> (m1, (t2, r21823))
    (t2, m2, r66508) -> (m2, (t2, r66508))
    (t2, m3, r98347) -> (m3, (t2, r98347))
- However, Hadoop MapReduce sorts intermediate pairs by key only; the values can be arbitrarily ordered, e.g., (m1, [(t100, r23456), (t2, r21823), ..., (t234, r34870)])
- So: buffer the values in memory, then sort. Is there an issue with this approach? (A sensor with enough readings will overflow the reducer's memory.)

Second approach: value-to-key conversion
- Move part of the value into the intermediate key to form a composite key
  - Composite key: (m, t); intermediate pair: ((m, t), r)
- Let the execution framework do the sorting:
  - first by sensor id m (the left element of the key),
  - then by timestamp t (the right element of the key)
- Implement a custom partitioner so that all pairs with the same sensor are shuffled to the same reducer:
    ((m1, t1), r80521)
    ((m1, t2), r21823)
    ((m1, t3), r149625)
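A minimal sketch of that custom partitioner, again assuming a Text-encoded composite key "m,t"; a real job would use a custom WritableComparable plus a sort comparator and, if all readings of one sensor should reach a single reduce() call, a grouping comparator:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SensorPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text reading, int numReducers) {
    // Partition by the sensor id m only, ignoring the timestamp t,
    // so every ((m, t), r) pair for a sensor lands on one reducer.
    String sensorId = key.toString().split(",", 2)[0];
    return (sensorId.hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}

Registered via job.setPartitionerClass(SensorPartitioner.class), this routes by sensor while the framework's sort orders the composite keys. Note that lexicographic Text ordering matches numeric time order only if timestamps are zero-padded; a custom WritableComparable with a numeric comparator avoids that caveat.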