Guoping Wang and Chee-Yong Chan, Department of Computer Science, School of Computing, National University of Singapore. VLDB '14.

Introduction & Notations
Multi-Job Optimization
Evaluation
Conclusion

MapReduce can scale to thousands of commodity machines. It is fault tolerant and supports parallel computing. It has been widely embraced! But it is still not simple and convenient enough!

MRQL: high-level languages simplify the execution of MR programs, so we can use SQL or a script instead of writing the MR Java methods. But such high-level languages lead to a new problem.

Native Java program vs. SQL script

For job1,2: a 10 a 10 and b > 20 And b 20 and a < 10 b 20

J_1: F → Mapper M_1 → Reducer R_1
J_2: F → Mapper M_2 → Reducer R_2

The overhead can be reduced by: $\mathrm{Cost}(M_1) + \mathrm{Cost}(M_2) - \mathrm{Cost}(M_1 \cup M_2) + \mathrm{Cost}(F)$

Some techniques for this issue:
MRShare's grouping technique (MRGT), by MRShare (VLDB '10)
Generalized Grouping Technique (GGT), by the authors
Materialization Technique (MT), by the authors

Condition: Job 1 and Job 2 have the same schema of input KVs.
J_1: F → Mapper M_1 → Reducer R_1
J_2: F → Mapper M_2 → Reducer R_2
Map output: (key, (tag, value))
Main idea: share the map input scan and share the map output.
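The tagged map output (key, (tag, value)) can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not MRShare's implementation: the job names, the predicates, and the tuple fields (a, b, dim, value) are hypothetical, and both jobs are assumed to sum value grouped by dim.

```python
from collections import defaultdict

# Hypothetical jobs: both emit (dim, value), so they share one KV schema.
JOBS = {
    "J1": {"pred": lambda t: t["a"] >= 10, "kv": lambda t: (t["dim"], t["value"])},
    "J2": {"pred": lambda t: t["b"] > 20, "kv": lambda t: (t["dim"], t["value"])},
}

def shared_map(tuples, jobs=JOBS):
    """Scan the input once and emit (key, (tag, value)) records.

    The tag lists every job whose mapper would have produced this record,
    so identical map output is emitted only once.
    """
    out = []
    for t in tuples:
        jobs_by_kv = defaultdict(list)
        for name, job in jobs.items():
            if job["pred"](t):
                jobs_by_kv[job["kv"](t)].append(name)
        for (key, value), names in jobs_by_kv.items():
            out.append((key, (tuple(names), value)))
    return out

def shared_reduce(records):
    """Route each tagged value to the jobs named in its tag and sum per job and key."""
    sums = defaultdict(int)
    for key, (tag, value) in records:
        for name in tag:
            sums[(name, key)] += value
    return dict(sums)
```

A tuple that satisfies both predicates is scanned once and emitted once with the tag ("J1", "J2"), which is where both the scan sharing and the output sharing come from in this toy model.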

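Example 1: But the condition is restrictive!
[Figure: mapper output over T split into three parts: tuples for J_1 (t.a 10, t.b > 20), tuples shared by J_1 and J_4 (J_{1,4}: t.a 10, t.b 20), and tuples for J_4 (t.a < 10, t.b > 20), feeding the reducer of J_1 and the reducer of J_4.]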

Condition: Job i and Job j satisfy that K_i is contained in K_j, e.g. ((a,b),d) and ((a,b,c),d), or that the projection of M_j onto A equals M_i, e.g. ((a,b,c),d) projected onto {a,b} = ((a,b),d).
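The projection in this condition can be checked mechanically. The sketch below is a minimal illustration, assuming a map-output schema is written as a pair (key attributes, value attribute) and that containment means set containment of key attributes; the function names are hypothetical and the slides may intend a stricter relation.

```python
def project(kv_schema, attrs):
    """Project a schema ((key attrs...), value) onto the key attributes in attrs.

    The original order of the key attributes is preserved.
    """
    key, value = kv_schema
    return (tuple(k for k in key if k in attrs), value)

def key_contained(k_i, k_j):
    """True if every key attribute of k_i also appears in k_j (assumed meaning)."""
    return set(k_i) <= set(k_j)

def can_derive(m_i, m_j):
    """True if m_i equals m_j projected onto m_i's key attributes."""
    key_i, _ = m_i
    return key_contained(key_i, m_j[0]) and project(m_j, set(key_i)) == m_i
```

With the slide's example, projecting ((a, b, c), d) onto {a, b} gives ((a, b), d), so the job with the coarser key can be served from the finer-keyed map output.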

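Example 2:
[Figure: mapper output over T split into three parts: tuples for J_1 (t.a 10, t.b > 20), tuples shared by J_1 and J_2 (J_{1,2}: t.a 10, t.b 20), and tuples for J_2 (t.a < 10, t.b 20), feeding the reducer of J_1 and the reducer of J_2.]
Note: the map output must be partitioned on a and sorted on (a, b).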

Two major parts: Map Output Materialization (MOM) and Reduce Input Materialization (RIM).

MOM
Condition: Job i and Job j can be processed in a specific sequence.
[Figure: from input F, the mapper output is split into the map output for J_i, which goes to the reducer of J_i, and the map output for J_j, which goes through HDFS to the reducer of J_j.]
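The idea of map output materialization can be sketched as follows. This is a toy single-process model, not the authors' implementation: a Python dict stands in for HDFS, both jobs are assumed to be sum aggregations, and all function names are hypothetical.

```python
from collections import defaultdict

def sum_by_key(records):
    """Stand-in reducer: sum the values of each key."""
    sums = defaultdict(int)
    for key, value in records:
        sums[key] += value
    return dict(sums)

def run_first_job(tuples, map_i, map_j, store):
    """Run J_i and, during the same scan, materialize J_j's map output."""
    map_out_i = []
    materialized = []
    for t in tuples:  # single scan of the input F
        map_out_i.extend(map_i(t))
        materialized.extend(map_j(t))
    store["map_output_J_j"] = materialized  # the dict stands in for HDFS
    return sum_by_key(map_out_i)

def run_second_job(store):
    """Run J_j from the materialized map output, without rescanning F."""
    return sum_by_key(store["map_output_J_j"])
```

The jobs run one after the other, which matches the condition that they can be processed in a specific sequence: the second job depends on what the first one wrote.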

RIM
Extra condition: Job i and Job j satisfy that K_j is contained in K_i, e.g. ((a,b),d) and ((a,b,c),d), or that the projection of M_i onto A equals M_j, e.g. ((a,b,c),d) projected onto {a,b} = ((a,b),d).
J_i: F → Mapper M_i → Reducer R_i
J_j: F → Mapper M_j → Reducer R_j
[Figure: the results of M_j that can be derived from M_i over K_j are materialized while J_i runs and reused by J_j.]

Example 3:
J_2: F → Mapper M_i → Reducer R_i
J_1: F → Mapper M_j → Reducer R_j
[Figure: the results of M_1 that can be derived from M_2 projected onto {a} are materialized (labels: t.a 10, t.b 20).]

Algorithms:
1 NA: naive approach
2 MRGT: MRShare's grouping technique
3 GGT: generalized grouping technique
4 MT: materialization technique
5 GGTMT: combination of GGT and MT

Data:
1 Schema: Data(key char(8), dim1 char(20), dim2 char(20), dim3 char(20), dim4 char(20), range int, value int)
2 Size: 1.7 billion tuples with a size of 100GB
3 Template: select T, sum(value) from Data where a ≤ range ≤ b group by T
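To make the query template concrete, here is a small Python sketch of what one instantiated query computes. It assumes that T is a list of dimension columns and that the bounds on range are inclusive; the function name run_template and the sample rows are hypothetical.

```python
from collections import defaultdict

def run_template(rows, group_attrs, a, b):
    """Evaluate: select T, sum(value) from Data where a <= range <= b group by T.

    group_attrs plays the role of T; inclusive bounds are an assumption.
    """
    sums = defaultdict(int)
    for row in rows:
        if a <= row["range"] <= b:
            group = tuple(row[attr] for attr in group_attrs)
            sums[group] += row["value"]
    return dict(sums)
```

Different queries in a batch then differ only in T, a and b, which is what makes their map input scans and map outputs candidates for sharing.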

Experimental Results

Experimental Environment:
1 Env: Hadoop 1.0.1
2 Processor: Intel Xeon X3430 2.4GHz
3 RAM: 8GB
4 OS: CentOS 5.5
5 Default cluster size: 1 master, 40 slaves
6 Disk: 2 × 500GB SATA

Hadoop Configuration:
1 Heap size of JVM: 1024MB
2 Default split size of HDFS: 512MB
3 Data replication: 3
4 I/O buffer size: 128KB

1 Effect of number of queries:
[Figure (a): Effect of number of queries]
GGT outperforms NA by 105% on average and by up to 167% when the number of queries is 30, and outperforms MRGT by 85% on average and by up to 107% when the number of queries is 30.

2 Effect of data size:
[Figure (b): Effect of data size]
GGT outperforms NA by 103% on average and by up to 128% when the data size is 320GB, and outperforms MRGT by 82% on average and by up to 93% when the data size is 320GB.

3 Effect of cluster size:
[Figure (c): Effect of cluster size]

4 Effect of data size and cluster size:
[Figure (d): Effect of data size and cluster size]

5 Effect of split size:
[Figure (e): Effect of split size]

6 Analysis of MT:
[Figure (f): Analysis of MT]

Primarily with MRShare

Notations (2)

[Figure: MapReduce data flow for Job1 and Job2: input splits → map → sort → copy → merge → reduce → output parts (part 0, part 1) with HDFS replication. Phases: Load, Parse, Process, Sort, Shuffle, Merge, Reduce. Some map input can be shared; some map output can be shared.]

Partitioning Algorithm
$(G_i, T_i)$: a group of jobs $G_i$ being processed by a technique $T_i$.
Merging benefit: $\mathrm{Cost}(G_1, T_1) + \mathrm{Cost}(G_2, T_2) - \mathrm{Cost}(G_1 \cup G_2, T_3)$, where $G_1 \cap G_2 = \emptyset$ and $T_3 \in \{\mathrm{GGT}, \mathrm{MT}\}$.
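The merging benefit can be computed directly once a cost function is available. The sketch below is illustrative only: toy_cost is a made-up cost model (a fixed scan cost plus a per-job cost), not the cost model from the paper, and the function names are hypothetical.

```python
def merging_benefit(cost, g1, t1, g2, t2, t3):
    """Cost(G1, T1) + Cost(G2, T2) - Cost(G1 ∪ G2, T3) for disjoint groups."""
    if g1 & g2:
        raise ValueError("G1 and G2 must be disjoint")
    if t3 not in {"GGT", "MT"}:
        raise ValueError("T3 must be GGT or MT")
    return cost(g1, t1) + cost(g2, t2) - cost(g1 | g2, t3)

def toy_cost(group, technique):
    """Hypothetical cost model, used only to exercise merging_benefit.

    "NA" pays a full scan (100) plus processing (10) for every job;
    any sharing technique pays one scan for the whole group.
    """
    if technique == "NA":
        return 110 * len(group)
    return 100 + 10 * len(group)
```

Under this toy model, merging two single-job groups that were run naively into one GGT group saves exactly one input scan; a positive benefit is the signal that two groups are worth merging.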