Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

Similar documents
MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY

Multi-indexed Graph Based Knowledge Storage System

LeanBench: comparing software stacks for batch and query processing of IoT data

Load Balancing Through Map Reducing Application Using CentOS System

THE data generated and stored by enterprises are in the

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

Query processing on raw files. Vítor Uwe Reus

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Improved MapReduce k-means Clustering Algorithm with Combiner

A Fast and High Throughput SQL Query System for Big Data

Pivoting Entity-Attribute-Value data Using MapReduce for Bulk Extraction

Jumbo: Beyond MapReduce for Workload Balancing

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Evaluation of Multiple Fat-Btrees on a Parallel Database

Recurring Job Optimization for Massively Distributed Query Processing

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Dremel: Interactive Analysis of Web-Scale Datasets

SURVEY ON BIG DATA TECHNOLOGIES

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Combining MapReduce with Parallel DBMS Techniques for Large-Scale Data Analytics

Introduction to MapReduce Algorithms and Analysis

Map/Reduce. Large Scale Duplicate Detection. Prof. Felix Naumann, Arvid Heise

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

SMCCSE: PaaS Platform for processing large amounts of social media

Delft University of Technology Parallel and Distributed Systems Report Series

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

Processing Load Prediction for Parallel FP-growth

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

P-Codec: Parallel Compressed File Decompression Algorithm for Hadoop

Exploiting Bloom Filters for Efficient Joins in MapReduce

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Dremel: Interactice Analysis of Web-Scale Datasets

Global Journal of Engineering Science and Research Management

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

Hash Joins for Multi-core CPUs. Benjamin Wagner

SQL Query Optimization on Cross Nodes for Distributed System

High Performance Computing on MapReduce Programming Framework

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Big Data Hive. Laurent d Orazio Univ Rennes, CNRS, IRISA

An Efficient Distributed B-tree Index Method in Cloud Computing

SSD. DEIM Forum 2014 D8-6 SSD I/O I/O I/O HDD SSD I/O

A What-if Engine for Cost-based MapReduce Optimization

DCODE: A Distributed Column-Oriented Database Engine for Big Data Analytics

ABSTRACT I. INTRODUCTION

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

MRBench : A Benchmark for Map-Reduce Framework

Twitter data Analytics using Distributed Computing

Join Processing for Flash SSDs: Remembering Past Lessons

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

Comparative Analysis of Range Aggregate Queries In Big Data Environment

Adaptive Join Plan Generation in Hadoop

Things To Know. When Buying for an! Alekh Jindal, Jorge Quiané, Jens Dittrich

Dremel: Interactive Analysis of Web- Scale Datasets

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

class 5 column stores 2.0 prof. Stratos Idreos

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

A Survey: Dynamic Allocation of Slots for Map Reduce Workload.

ANALYZING CHARACTERISTICS OF PC CLUSTER CONSOLIDATED WITH IP-SAN USING DATA-INTENSIVE APPLICATIONS

Apache Drill. Interactive Analysis of Large-Scale Datasets. Tomer Shiran

Journal of East China Normal University (Natural Science) Data calculation and performance optimization of dairy traceability based on Hadoop/Hive

A Cloud Storage Adaptable to Read-Intensive and Write-Intensive Workload

MapReduce: A Programming Model for Large-Scale Distributed Computation

Guoping Wang and Chee-Yong Chan Department of Computer Science, School of Computing National University of Singapore VLDB 14.

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

On The Fly Mapreduce Aggregation for Big Data Processing In Hadoop Environment

Big Data Management and NoSQL Databases

I. Introduction. FlashQueryFile: Flash-Optimized Layout and Algorithms for Interactive Ad Hoc SQL on Big Data Rini T Kaushik 1

Replica Parallelism to Utilize the Granularity of Data

Adaptive Cache Mode Selection for Queries over Raw Data

A Comparison of MapReduce Join Algorithms for RDF

class 17 updates prof. Stratos Idreos

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Modelling of Map Reduce Performance for Work Control and Supply Provisioning

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

Distributed Bottom up Approach for Data Anonymization using MapReduce framework on Cloud

LITERATURE SURVEY (BIG DATA ANALYTICS)!

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

Data Quality for Web Log Data Using a Hadoop Environment

Uday Kumar Sr 1, Naveen D Chandavarkar 2 1 PG Scholar, Assistant professor, Dept. of CSE, NMAMIT, Nitte, India. IJRASET 2015: All Rights are Reserved

Improving Efficiency of Parallel Mining of Frequent Itemsets using Fidoop-hd

Introduction to MapReduce

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS. Presented by: Ahmed Elbagoury

Efficient Map Reduce Model with Hadoop Framework for Data Processing

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

A Database System Performance Study with Micro Benchmarks on a Many-core System

Shark: Hive on Spark

The amount of data increases every day Some numbers ( 2012):

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

SDA: Software-Defined Accelerator for general-purpose big data analysis system

2/26/2017. The amount of data increases every day Some numbers ( 2012):

September 2013 Alberto Abelló & Oscar Romero 1

Jignesh M. Patel. Blog:

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Transcription:

DEIM Forum 213 F2-1 Adaptive indexing 153 855 4-6-1 E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment Shohei OKUDERA, Daisaku YOKOYMA, Miyuki NAKANO, and Masaru KITSUREGAWA Abstract Institute of Industrial Science the University of Tokyo Komaba 4 6 1 Meguro-ku Tokyo 153 855 Japan E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp We are required to process large-scale data efficiently as the volume of digital data we generate increases rapidly. MapReduce environment is suitable for large-scale distributed data processing and getting more important as a ad-hoc data analysis platform. However, all data scan is required every query when performing ad-hoc analysis in MapReduce environment, so the I/O cost will be badly huge. In this paper, in order to reduce the I/O cost, we propose an integrated MapReduce framework with adaptive index. This framework gradually shrinks the processing cost, as those queries are issued, each of which has a selection condition on the same column but different selection values. And the framework can keep the performance when data updates happens. We applied our proposed adaptive indexing technique into basic data operations projection, aggregation, join and evaluated the reduction of the execution time. We conducted the simulation and compared the accumulated execution time in the workload where the queries with the selection conditions slightly changed are repeatedly issued. As a result of simulation, our proposed framework showed higher performance than those processing without adaptive indexing. Key words MapReduce Adaptive indexing Ad-hoc query Simulation 1. SNS Facebook 28 8 1 213 1 1 3 [1], [2], ebay 2 3 5

Web. MapReduce [3] Hadoop [4] MapReduce [5] MapReduce MapReduce MapReduce 2 MapReduce 3 4 MapReduce 5 6 7 2. MapReduce MapReduce 3 discount (1) SELECT id, discount FROM R WHERE x <= discount < y ; (2) (3) SELECT S. partid, S. partname, R. discount FROM R, S WHERE x <= R. discount < y AND R. partid = S. partid 2. 1 Hadoop MapReduce. MapReduce MapReduce Map Reduce 2 Map Hadoop HDFS map Reduce Map Reduce HDFS 1 Map 2 < = discount < 8 Map Map Map Collect Spill Merge Reduce. Reduce Map Reduce HDFS 1 2. 2 1 3 < = discount < 5 MapReduce SELECT partid, AVE( d i s c o u n t ) FROM R WHERE x <= discount < y GROUP BY parrid

3. MapReduce Adaptive index [6], [7] MapReduce Adaptive indexing I/O 3. 1 2 2 MapReduce Map Map Map Map HDFS HDFS Adaptive indexing Database Cracking [6] 2 2 2 2 8 8 3 2 3. 2 3 Map 1. 2. Map 2 < = discount < 8 discount < 2, 2 < = discount < 8, 8 < = discount 3 2, 8 ) HDFS 3 3. 2. 1 5 7 4 2 8

Map Map HDFS 4 3. 2. 2 HDFS 5 3 1 5 1.all-delete HDFS 2 5 2.split-deleteHDFS HDFS hash hash hash hash 5 3.partial-update 3. 2. 3 2 1 5 3 HDFS 2 1 2 a ) Map b )

4. [8] Hadoop [8] I/O CPU Map 1 Scan,Map, Collect,Spill,Merge 5 Map Scan I/O CPU Intel Xeon E553 2.4GHz 2P/8C DDR3 24GB, HDD 1 HDD 115MBps OS Linux2.6.32, Hadoop.2.3 1GBps 1 4 1 1 5. 4. 1 MapRedcue 2 5. 1 2 17 17 R 1TB 1 168 S 1GB R id S partid 2 2 x, y.1% 8% x, y.1% discount MapReduce Map 512MB, Reduce 1 5. 2 1 5. 2. 1 2 (1) 6 6 (no-index) 691 2 13,8 (adaptiveindex) 1 725 2 2382 1 2, 1 5% 2

6 15 1 5 no-index adaptive index 6 5. 2. 2 2 (2) 1% 7 (no-index) 692 2 13,822 718 2 2,24 1 5% 2 6.1 15 1 5 no-index adaptive index 7 5. 2. 3 2 (3) (3) R S 8 (no-index) 1 2 774 15,478 1 81 2 3,895 3.5% 2 4 15 1 5 no-index adaptive index 8 5. 2. 4 R 1TB 5 2 ID 9 8%.1% 9 (no-index) 5 2 5 (partial-update) split-delete

(partial-update) (partial-update) 15 1 5 9 no-index all-delete split-delete partial-update 5. 3 2 2 4 [9] 4 3 5. 4 1 1 2 4 5 2 1.2 1 25 2 15 1 with-value 1stQuery sort 1stQuery index pointer 5. 5 11 2 MapReduce 2 11 6 4 2 with-value 1stQuery sort 1stQuery index pointer 5. 6 12 5 2 4

5 2 4 5 12 4 3 2 1 with-value 1stQuery sort 1stQuery index pointer 6. [9] [12] Dittrich [11] IO 12 Ritcher [9] Adaptive index [6], [7] Dremel [13] Dremel Dremel MapReduce I/O Map [1] M. Zuckerberg, One billion people on facebook, Oct. 212. One Billion People on Facebook [2] Big data: The next frontier for innovation, competition, and productivity, 211. [3] J. Dean and S. Ghemawat, Mapreduce: simplified data processing on large clusters, Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6, pp.1 1, OSDI 4, USENIX Association, 24. [4] hadoop http://hadoop.apache.org/. [5] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, Data warehousing and analytics infrastructure at facebook, Proceedings of the 21 ACM SIGMOD International Conference on Management of data, pp.113 12, SIGMOD 1, ACM, 21. [6] S. Idreos, M.L. Kersten, and S. Manegold, Database cracking, CIDR, pp.68 78, 27. [7] S. Idreos, M.L. Kersten, and S. Manegold, Self-organizing tuple reconstruction in column-stores, Proceedings of the 35th SIGMOD international conference on Management of data, pp.297 38, SIGMOD 9, ACM, New York, NY, USA, 29. http://doi.acm.org/1.1145/1559845.1559878 [8] H. Herodotou, Hadoop performance models, Dec. 211. [9] S. Richter, J.-A. Quiané-Ruiz, S. Schuh, and J. Dittrich, Towards zero-overhead adaptive indexing in hadoop, 212. [1] D. Jiang, B.C. Ooi, L. Shi, and S. Wu, The performance of mapreduce: an in-depth study, Proc. VLDB Endow., vol.3, no.1-2, pp.472 483, Sept. 21. [11] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, Hadoop++: making a yellow elephant run like a cheetah (without it even noticing), Proc. VLDB Endow., vol.3, no.1-2, pp.515 529, Sept. 21. http://dl.acm.org/citation.cfm?id=192841.19298 [12] J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad, Only aggressive elephants are fast elephants, Proc. VLDB Endow., vol.5, no.11, pp.1591 162, July 212. http://dl.acm.org/citation.cfm?id=235229.235272 [13] S. Melnik, A. Gubarev, J.J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, Dremel: Interactive analysis of web-scale datasets, Proc. of the 36th Int l Conf on Very Large Data Bases, pp.33 339, 21. 7. MapRedcue