MapReduce: Algorithm Design for Relational Operations

Similar documents
MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

Anurag Sharma (IIT Bombay) 1 / 13

University of Maryland. Tuesday, March 23, 2010

MapReduce Design Patterns

1. Introduction to MapReduce

TI2736-B Big Data Processing. Claudia Hauff

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

CompSci 516: Database Systems

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

CS 345A Data Mining. MapReduce

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Epilog: Further Topics

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

MapReduce for Data Warehouses 1/25

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Introduction to Data Management CSE 344

Databases 2 (VU) ( / )

CS 345A Data Mining. MapReduce

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Database Systems CSE 414

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Introduction to Data Management CSE 344

Introduction to MapReduce

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

Relational Data Processing on MapReduce

Introduction to MapReduce

Introduction to Data Management CSE 344

MapReduce Algorithms

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Distributed computing: index building and use

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

L22: SC Report, Map Reduce

"Big Data" Open Source Systems. CS347: Map-Reduce & Pig. Motivation for Map-Reduce. Building Text Index - Part II. Building Text Index - Part I

HADOOP FRAMEWORK FOR BIG DATA

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Data Partitioning and MapReduce

Big Data Management and NoSQL Databases

Developing MapReduce Programs

Parallel Computing: MapReduce Jin, Hai

Distributed computing: index building and use

CS-2510 COMPUTER OPERATING SYSTEMS

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Introduction to MapReduce

Map Reduce Group Meeting

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

Introduction to MapReduce Algorithms and Analysis

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

On the Varieties of Clouds for Data Intensive Computing

Streaming Algorithms. Stony Brook University CSE545, Fall 2016

Introduction to MapReduce

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/15/18

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

How to Implement MapReduce Using. Presented By Jamie Pitts

Chapter 4: Apache Spark

Graphs (Part II) Shannon Quinn

TI2736-B Big Data Processing. Claudia Hauff

MapReduce for Graph Algorithms

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

MapReduce and Hadoop. Debapriyo Majumdar Indian Statistical Institute Kolkata

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Introduction to MapReduce

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

The MapReduce Abstraction

CS370 Operating Systems

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Parallel data processing with MapReduce

Clustering websites using a MapReduce programming model

BIG DATA & HADOOP: A Survey

Map Reduce. Yerevan.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

APPLICATION OF HADOOP MAPREDUCE TECHNIQUE TOVIRTUAL DATABASE SYSTEM DESIGN. Neha Tiwari Rahul Pandita Nisha Chhatwani Divyakalpa Patil Prof. N.B.

The MapReduce Framework

Clustering Lecture 8: MapReduce

This material is covered in the textbook in Chapter 21.

Outline. MapReduce Data Model. MapReduce. Step 2: the REDUCE Phase. Step 1: the MAP Phase 11/29/11. Introduction to Data Management CSE 344

CHAPTER 4 ROUND ROBIN PARTITIONING

Outline. What is Big Data? Hadoop HDFS MapReduce Twitter Analytics and Hadoop

Data-Intensive Distributed Computing

Distributed Systems. 18. MapReduce. Paul Krzyzanowski. Rutgers University. Fall 2015

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Research on the Application of Bank Transaction Data Stream Storage based on HBase Xiaoguo Wang*, Yuxiang Liu and Lin Zhang

Introduction to streams and stream processing. Mining Data Streams. Florian Rappl. September 4, 2014

Yuval Carmel Tel-Aviv University "Advanced Topics in Storage Systems" - Spring 2013

FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY

Introduction to MapReduce

Obtaining Rough Set Approximation using MapReduce Technique in Data Mining

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/7/17

MapReduce. Kiril Valev LMU Kiril Valev (LMU) MapReduce / 35

ML from Large Datasets

MapReduce Algorithm Design

Part A: MapReduce. Introduction Model Implementation issues

The Google File System

GOOGLE FILE SYSTEM: MASTER Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Transcription:

MapReduce: Algorithm Design for Relational Operations Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

Projection π

Projection in MapReduce Easy Map over tuples, emit new tuples with appropriate attributes No reducers, unless for regrouping or re-sorting tuples Alternative: do projection in reducer, after some other processing Limited by HDFS streaming speeds Speed of encoding/decoding tuples becomes important Semi-structured (XML) data? No problem!

Selection

Selection in MapReduce Easy Map over tuples, emit new tuples with appropriate attributes No reducers, unless for regrouping or re-sorting tuples Alternative: do projection in reducer, after some other processing Limited by HDFS streaming speeds Speed of encoding/decoding tuples becomes important Semi-structured (XML) data? No problem!

Aggregation in MapReduce Example Log analysis: What is the average time spent per URL? In SQL: SELECT url, AVG(time) FROM visits GROUP BY url In MapReduce: Map over tuples, emit time, keyed by url Framework automatically groups values by keys Compute average in reducer How can you make this more efficient?

Relational Joins

Running Example Two tables: S: User demographics (user_id, gender, age, income, etc.) R: User page visits (user_id, URL, time spent, etc.) Analyses we might want to perform: Statistics on demographic characteristics Statistics on page visits Statistics on page visits by URL Statistics on page visits by demographic characteristic Need to join S and R!

Reduce-side join Map-side join In-memory join Join Algorithms in MapReduce

Reduce-Side Join Basic idea: group by join key Map over both sets of tuples Emit tuple as value with join key as the intermediate key Execution framework brings together tuples sharing the same key Perform actual join in reducer Similar to a sort-merge join in database terminology Two variants 1-to-1 joins 1-to-many and many-to-many joins

Reduce-Side Join: 1-to-1 Map keys values R 4 R 4 S 2 S 2 Reduce keys values S 2 R 4 Note: no guarantee if R is going to come first or S

Reduce-Side Join: 1-to-1 At most one tuple from R and one tuple from S share the same join key Reducer will receive keys in the following format k23 [(r64,r 64 ),(s84,s 84 )] k37 [(r68, R 68 )] k59 [(s97,s 97 ),(r81,r 81 )] k61 [(s99, S 99 )] S i à other attr of S R i à other attr of R si,rià tuple id If there are two values associated with a key, one must be from R and the other from S What should the reduced do for k 37?

Map Reduce-Side Join: 1-to-many keys values S 2 S 2 S 9 S 9 Reduce keys values S 2 What s the problem? R is the one side, S is the many!

Reduce-Side Join: 1-to-many Reducer will receive keys in the following format k23 [(r64,r 64 ),(s84,s 84 ),(r65, R 65 )] k37 [(r68, R 68 )] k59 [(s97,s 97 ),(s98,s 98 ), (s99,s 99 ), k61 [(s99, S 99 )] (s100,s 100 ), (r81,r 81 )] S i à other attr of S R i à other attr of R si,rià tuple id There can be many values from S for a given key, we do not know when the value for R will be encountered

Reduce-Side Join: 1-to-many! Map keys values S 2 S 9 Buffer all values in memory, S 2 pick out the tuple from R, and then cross it with every tuple from S 9 S to perform the join Reduce keys values S 2 What s the problem? R is the one side, S is the many!

Reduce-Side Join: 1-to-many! Map keys values S 2 S 9 Buffer all values in memory, S 2 pick out the tuple from R, and then cross it with every tuple from S 9 S to perform the join Reduce keys values S 2 What s the problem? R is the one side, S is the many!

Sorting: Use Values Value-to-key conversion design pattern: form composite intermediate key, (k, v 1 ) Let execution framework do the sorting: Sort by join attribute, then sort all tuple ids from R before S Before k (s97,s 97 ),(s98,s 98 ), (s99,s 99 ), (s100,s 100 ), (r81,r 81 ) Values arrive in arbitrary order After (k, r81) (r81, R 81 ) (k, s97) (s97, S 97 ) (k, s98) (s98, S 98 ) (k, s99) (s99, S 99 ) Values arrive in sorted order of tuple id Process by preserving state across multiple keys Remember to partition correctly!

Reduce-Side Join: Value-to-Key Conversion In reducer keys values S 2 New key encountered: hold in memory Cross with records from other set S 9 R 4 New key encountered: hold in memory Cross with records from other set S 7 https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-8/example-8-9

Reduce-Side Join Many-to-Many In reducer keys values R 5 Hold in memory R 8 S 2 Cross with records from other set S 9 What s the problem? R is the smaller dataset!

Reduce-Side Join Many-to-Many In reducer keys values R 5 Hold in memory R 8 S 2 Cross with records from other set S 9 What s the problem? R is the smaller dataset!

Reduce-Side Join What are the limitations? Both datasets are transferred over the network!

Map-Side Join: Basic Idea Assume two datasets are sorted by the join key: S 2 R 2 S 4 R 4 R 3 S 1 A sequential scan through both datasets to join (called a merge join in database terminology)

Map-Side Join: Parallel Scans If datasets are sorted by join key, join can be accomplished by a scan over both datasets How can we accomplish this in parallel? Partition and sort both datasets in the same manner In MapReduce: Map over one dataset, read from other corresponding partition No reducers necessary (unless to repartition or resort) Consistently partitioned datasets: realistic to expect? Depends on the workflow For ad hoc data analysis, reduce-side are more general, although less efficient

References Data Intensive Text Processing with MapReduce, Lin and Dyer (Chapter 3) Mining of Massive Data Sets, Rajaraman et al. (Chapter 2) Hadoop tutorial: https://developer.yahoo.com/hadoop/tutorial/ module4.html Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters http://labs.google.com/papers/mapreduce.html Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System http://labs.google.com/papers/gfs.html