Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case

Similar documents
Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Introduction to Hadoop and MapReduce

Data Partitioning and MapReduce

Dremel: Interactice Analysis of Web-Scale Datasets

Guoping Wang and Chee-Yong Chan Department of Computer Science, School of Computing National University of Singapore VLDB 14.

HadoopDB: An open source hybrid of MapReduce

Introduction to Data Management CSE 344

Hadoop/MapReduce Computing Paradigm

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Lecture 11 Hadoop & Spark

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

Distributed computing: index building and use

Improving the MapReduce Big Data Processing Framework

Database System Architectures Parallel DBs, MapReduce, ColumnStores

MI-PDB, MIE-PDB: Advanced Database Systems

Greenplum Architecture Class Outline

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Distributed computing: index building and use

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Management and NoSQL Databases

Shark: Hive (SQL) on Spark

Large-Scale GPU programming

Hadoop Map Reduce 10/17/2018 1

Developing MapReduce Programs

VIAF: Verification-based Integrity Assurance Framework for MapReduce. YongzhiWang, JinpengWei

Accelerate Big Data Insights

Jie Pan. Modelling and executing multidimensional data analysis applications over distributed architectures..

Chapter 5. The MapReduce Programming Model and Implementation

Mixing and matching virtual and physical HPC clusters. Paolo Anedda

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Achieving Horizontal Scalability. Alain Houf Sales Engineer

Data Storage Infrastructure at Facebook

MapReduce: A Programming Model for Large-Scale Distributed Computation

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Backtesting with Spark

Introduction to Data Management CSE 344

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CSE 344 MAY 2 ND MAP/REDUCE

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

Big Data for Engineers Spring Resource Management

A Fast and High Throughput SQL Query System for Big Data

Programming Systems for Big Data

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source

Introduction to Data Management CSE 344

Distributed Filesystem

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Cloud Computing & Visualization

Lecture 20: WSC, Datacenters. Topics: warehouse-scale computing and datacenters (Sections )

September 2013 Alberto Abelló & Oscar Romero 1

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

Big Data Hadoop Stack

MapReduce programming model

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

How to Implement MapReduce Using. Presented By Jamie Pitts

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

Database Applications (15-415)

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

The Stratosphere Platform for Big Data Analytics

CS 61C: Great Ideas in Computer Architecture. MapReduce

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

HIGH AVAILABILITY AND DISASTER RECOVERY FOR IMDG VLADIMIR KOMAROV, MIKHAIL GORELOV SBERBANK OF RUSSIA

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Big Data 7. Resource Management

Embedded Technosolutions

Analysis in the Big Data Era

Hybrid MapReduce Workflow. Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

Advanced Database Technologies NoSQL: Not only SQL

MapReduce Design Patterns

Designing a Parallel Query Engine over Map/Reduce

Databases 2 (VU) ( / )

Dremel: Interactive Analysis of Web-Scale Database

Parallel Computing: MapReduce Jin, Hai

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Data Analysis Using MapReduce in Hadoop Environment

DATA SCIENCE USING SPARK: AN INTRODUCTION

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Clustering Lecture 8: MapReduce

a linear algebra approach to olap

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Batch Processing Basic architecture

Parallel Programming Concepts

HADOOP FRAMEWORK FOR BIG DATA

Transcription:

1 / 39 Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case PAN Jie 1 Yann LE BIANNIC 2 Frédéric MAGOULES 1 1 Ecole Centrale Paris-Applied Mathematics and Systems Department (MAS) 2 SAP BusinessObjects Innovation Center June 22, 2010

2 / 39 Outline 1 Introduction 2 Related works 3 Multiple Group by Query 4 Data Preparation 5 MapReduce jobs 6 Experimental Results 7 Cost Estimation Model 8 Conclusions and Future works

3 / 39 Outline 1 Introduction 2 Related works 3 Multiple Group by Query 4 Data Preparation 5 MapReduce jobs 6 Experimental Results 7 Cost Estimation Model 8 Conclusions and Future works

4 / 39 Introduction Objective: Cheap solution for multidimensional data analysis applications Challenges: Large data volume Short response time (ex. within 5 seconds) Use cheap commodity computers scalability and fault-tolerance Advantages of MapReduce: Automatic scalability and fault-tolerance Minimize the need of locking Wide range of applicative environments: cluster, multi-core hardware,

5 / 39 Outline 1 Introduction 2 Related works 3 Multiple Group by Query 4 Data Preparation 5 MapReduce jobs 6 Experimental Results 7 Cost Estimation Model 8 Conclusions and Future works

6 / 39 Related works Related works MapReduce s implementations: GridGain, Hadoop... Data query languages based on MapReduce: PigLatin, HiveQL MapReduce-based analytic data warehouses: Greenplum, Aster Combination of MapReduce and parallel databases: HadoopDB This work focuses on: MapReduce-based multiple group by query

7 / 39 Outline 1 Introduction 2 Related works 3 Multiple Group by Query 4 Data Preparation 5 MapReduce jobs 6 Experimental Results 7 Cost Estimation Model 8 Conclusions and Future works

8 / 39 Multiple Group by Query Multiple group-by query s form: select X, SUM(*), from R where condition group by X Example: select B,sum(A) from R where I>i group by B; select C,sum(A) from R where I>i group by C; select D,sum(A) from R where I>i group by D; select E,sum(A) from R where I>i group by E; select F,sum(A) from R where I>i group by F;...

9 / 39 Outline 1 Introduction 2 Related works 3 Multiple Group by Query 4 Data Preparation 5 MapReduce jobs 6 Experimental Results 7 Cost Estimation Model 8 Conclusions and Future works

10 / 39 Dataset used in multidimensional data analysis applications

11 / 39 Dataset used in multidimensional data analysis applications

12 / 39 Data partitioning Horizontal Partitioning Putting different entire rows into different partitions. In our work : 1 horizontal partition = a certain number of entire rows (13 Dimensions + 2 Measures). Vertical Partitioning Putting different columns into different partitions. In our work : 1 vertical partition = 1 Dimension + 2 Measures.

13 / 39 Data Distribution With horizontal partitioning: Each worker node holds equal number of partitions (in this work, 1, 2, 10 and 20 partitions/worker) With vertical partitioning: Small worker number: replicate all partitions over every worker node Large worker number: 1 partitions are divided into regions 2 worker nodes are grouped into regions 3 replicate the vertical partitions over nodes within region (hybrid partitioning)

14 / 39 Data Distribution under Vertical Partitioning Figure: 3 partitions in 1 region vs 3 partitions in 3 regions

15 / 39 Indexation: Lucene Inverted Index Figure: Lucene Index

16 / 39 Outline 1 Introduction 2 Related works 3 Multiple Group by Query 4 Data Preparation 5 MapReduce jobs 6 Experimental Results 7 Cost Estimation Model 8 Conclusions and Future works

17 / 39 GridGain s MapReduce Framework Multiple mappers one reducer Figure: GridGain s MapReduce framework

18 / 39 GridGain MapReduce Data Flow Figure: Data flow of GridGain MapReduce

19 / 39 Master s Processings: Start-up and Closure Figure: Start-up and closure processing on Master node

20 / 39 Workers Processings Figure: The processing on worker node

Mapper s Calculations under Horizontal Partitioning Mapper: Filter + Aggregate over all the group by columns List<FacetCollector 1 > 1 1 FacetCollector: aggregated values over 1 column s distinct values 21 / 39

Mapper s Calculations under Horizontal Partitioning Mapper: Filter + Aggregate over all the group by columns List<FacetCollector 1 > 1 1 FacetCollector is a set of aggregated measure values over 1 dimension s distinct values 22 / 39

23 / 39 Reducer s Calculations under Horizontal Partitioning Reducer: aggregate over List<List<FacetCollector>>

24 / 39 Mapper s Calculations under Vertical Partitioning Filter + Aggregate over one group by column FacetCollector

25 / 39 Reducer s Calculations under Vertical Partitioning Reducer: aggregate over List<FacetCollector>

26 / 39 Outline 1 Introduction 2 Related works 3 Multiple Group by Query 4 Data Preparation 5 MapReduce jobs 6 Experimental Results 7 Cost Estimation Model 8 Conclusions and Future works

27 / 39 Environment Configuration Softwares: GridGain version 2.1.1 Java version 1.6.0 JVM heap 1536M maximum Hardwares: cluster of a Grid5000 s Sophia site CPU 2 single-core at 2.0GHz Memory 2GB RAM Disk 80GB a An infrastructure distributed in 9 sites around France, for research in large-scale parallel and distributed systems (https://www.grid5000.fr/) Data set 1.2GB 10 000 000 rows 15 columns = 13 D + 2 M Data preparations Horizontal partitioning or vertical partitioning Indexed with Lucene MG-Query selectivities: 1.05%, 9.96%, 18.6%, 43.1% Aggregate over: 5 group by columns

Speed-up Measurements: MapReduce-based IMPL. over Horizontally Partitioned Data Query Sequential Execution Time Query Execution selectivity time(ms) 0.0105 869 0.0996 1259 0.186 2097 0.431 4098 Observations: lrg. slectivity> sm. selectivity sm. nb job /node >lrg. nb job /node Figure: Speed-up with horizontally partitioned data without combiner. 28 / 39

Speed-up Measurements: MapCombineReduce-based IMPL. over Horizontally Partitioned data About combiner: Function: pre-aggregating Advantage: reduce communication cost Observations: For sm. job no. (eg. 1, 2), MR > MCR For lrg. job no. (eg. 10, 20), MCR > MR Figure: Speed-up with horizontally partitioned data with combiner. 29 / 39

30 / 39 Speed-up measurements: Vertically Partitioned Data Figure: Speed-up with vertically partitioned data.

31 / 39 Speed-up measurements: Vertically Partitioned Data Observations: speed-up when worker nb MR-based IMPL. > MCR-based IMPL. lrg. selectivity queries > sm. selectivity queries

32 / 39 Performance Affecting Factors Selectivity decides the workload of aggregation. When // aggregation, lrg. selectivity queries > sm. selectivity queries nb job /node Running multiple mappers on one node may degrade performance because of resource contention. Job number 1 Mapper s on Average Execution 1 node time (ms) 1 170 2 208 3 354 4 417 5 537 Table: Query selectivity=0.01, vertical partitioning, 3 regions

33 / 39 Outline 1 Introduction 2 Related works 3 Multiple Group by Query 4 Data Preparation 5 MapReduce jobs 6 Experimental Results 7 Cost Estimation Model 8 Conclusions and Future works

34 / 39 Cost Estimation Model (MapReduce-based IMPL.) Startup cost (Master) Mapping jobs to nodes: C m nb mapper Serialization: C s size mapper nb mapper Cost startup = C m nb mapper + C s size mapper nb mapper Each Mapper Job (Worker) De-serialization: C d size mapper Mapper cost: Cost mapper Serialization: C s size intermediate Cost worker = f (nb mapper/node ) (C d size mapper + Cost mapper + C s size intermediate ) Closure cost (Master) De-serialization: C d Pnb mapper i=1 size intermediate i Reducer: Cost reducer Cost closure = C d Pnb mapper i=1 size intermediate i + Cost reducer Communication Cost comm = C n (nb mapper size mapper + P nb mapper i=1 size intermediate i )

35 / 39 Cost Estimation for Horizontal Partitioning-based IMPL. Mapper Cost Cost mapper = Selectivity size block [C slct + (D + M) C read + M nb GB C add ] Intermediate output size size intermediate = P nb GB i=1 nb DV i size aggregate Reducer cost Cost reducer = C add M nb block Pnb GB i=1 nb DV i Total cost Cost hp = Selectivity size block f (nb mapper/node ) [C slct + (M + D)C read + M nb GB C add ]+ P nb GB i=1 nb DV i [f (nb mapper/node ) (C s +C d +C n) size aggregate +C add M nb block ]

36 / 39 Cost Estimation for Vertical Partitioning-based IMPL. Mapper cost Cost mapper = Selectivity size region [C slct + (M + 1)C read + M C add ] Intermediate output size size intermediate = nb DVcurrentGB size aggregate Reducer cost Cost reducer = C add M nb region Pnb GB i=1 nb DV i Total cost Cost vp = f (nb mapper/node ) Selectivity size region [C slct + (M + 1) C read + M C add )]f (nb mapper/node ) C s nb DVcurrentGB size aggregate + (C d + C add M + C n) nb region Pnb GB i=1 nb DV i

37 / 39 Outline 1 Introduction 2 Related works 3 Multiple Group by Query 4 Data Preparation 5 MapReduce jobs 6 Experimental Results 7 Cost Estimation Model 8 Conclusions and Future works

38 / 39 Conclusions We realized multiple group by query with MapReduce (GridGain). Prepare data by horizontal and vertical partitioning and indexing. Measure speedup performance on Grid 5000. Formal cost analysis. Larger job number better performance. Performance affecting factors: selectivity and nb job /node

39 / 39 Future works Find more performance affecting factors (e.g. filtered data distribution). Intermediate data compression. Separate the index searching from filter and aggregate; Determine mapper number on the fly, considering query selectivity. nb job /node factor to be examined on multi-core hardware.